gfr10598 opened this issue 7 years ago
This query seems to do the trick. It takes data from staging, and inserts it into a test table.
```
bq --project measurement-lab query \
  --destination_table 'mlab-sandbox:validation.bad_20170501' \
  --batch --nouse_legacy_sql --noflatten_results --allow_large_results --replace \
  'SELECT * EXCEPT (row_number) FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY test_id) row_number
    FROM `mlab-staging.staging.ndt_20170501`
  ) WHERE row_number = 1'
```
WARNING: after this query completes, bq stalls for about 5 minutes and then dumps a huge amount of data to the console. Pipe the output to /dev/null or find another way to suppress it.
The query takes about 3 minutes to run for a single day.
The pipeline can reprocess all of the data for a given day into a new table with a _YYYYMMDD suffix. We then need to be able to:
1. Find the diff between the new table and the original day's data in master.
2. Dedup the new data and use it to replace the corresponding rows in the master table.
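For the diff step, one option in standard SQL is a symmetric diff built from two `EXCEPT DISTINCT` subqueries. A minimal sketch that just assembles such a query string (the helper and the master table name are hypothetical, not part of the pipeline):

```python
def build_diff_query(new_table: str, old_table: str) -> str:
    """Build a standard SQL query returning rows present in exactly one table.

    Rows only in new_table are tagged 'new_only'; rows only in old_table
    are tagged 'old_only'. Assumes both tables share the same schema.
    """
    return (
        f"SELECT 'new_only' AS side, * FROM (\n"
        f"  SELECT * FROM `{new_table}`\n"
        f"  EXCEPT DISTINCT\n"
        f"  SELECT * FROM `{old_table}`)\n"
        f"UNION ALL\n"
        f"SELECT 'old_only' AS side, * FROM (\n"
        f"  SELECT * FROM `{old_table}`\n"
        f"  EXCEPT DISTINCT\n"
        f"  SELECT * FROM `{new_table}`)"
    )

# Hypothetical table names for illustration only.
query = build_diff_query("mlab-staging.staging.ndt_20170501",
                         "measurement-lab.public.ndt_20170501")
print(query)
```

The resulting string can be passed to `bq query --nouse_legacy_sql` the same way as the dedup query above; an empty result means the two tables hold identical row sets.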
The handler also needs to determine whether the new table is compatible with the old table, or whether it might cause failures, either during the replacement or with current or future streaming inserts.
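One possible shape for that compatibility check, operating on schemas represented as lists of `{"name", "type", "mode"}` dicts (the field shape `bq show --format=prettyjson` reports). The rule encoded here is an assumption, not the pipeline's actual policy: the new schema may add NULLABLE columns, but every existing column must keep its name and type so replacement and streaming inserts keep working.

```python
def is_compatible(old_schema, new_schema):
    """Return True if new_schema is a safe evolution of old_schema.

    Assumed rules (not from the pipeline): existing columns must keep
    their name and type; any added column must not be REQUIRED, so rows
    written against the old schema still load.
    """
    new_fields = {f["name"]: f for f in new_schema}

    # Every old column must still exist with the same type.
    for f in old_schema:
        nf = new_fields.get(f["name"])
        if nf is None or nf["type"] != f["type"]:
            return False

    # Added columns must be NULLABLE (or REPEATED), never REQUIRED.
    old_names = {f["name"] for f in old_schema}
    for f in new_schema:
        if f["name"] not in old_names and f.get("mode", "NULLABLE") == "REQUIRED":
            return False
    return True
```

A handler could fetch both schemas, run this check, and refuse the table swap when it returns False rather than discovering the incompatibility mid-replacement.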