m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0
22 stars 7 forks source link

Implement full day update and deduplication #82

Open gfr10598 opened 7 years ago

gfr10598 commented 7 years ago

The pipeline can reprocess all of the data for a given day into a new table with _YYYYMMDD suffix. We need then to be able to: 1) Find diff between the new table and the original day's data in master. 2) Dedup and replace the new data into the master table.

The handler also needs to determine whether the new table is compatible with the old table, or might cause failures either with the replacement, or with current or future streaming inserts.

gfr10598 commented 7 years ago

This query seems to do the trick. It takes data from staging, and inserts it into a test table.

bq --project measurement-lab query --destination_table 'mlab-sandbox:validation.bad_20170501' --batch --nouse_legacy_sql --noflatten_results --allow_large_results --replace 'SELECT EXCEPT (row_number) FROM (SELECT , ROW_NUMBER() OVER (PARTITION BY test_id) row_number FROM mlab-staging.staging.ndt_20170501 ) WHERE row_number = 1'

WARNING - After this query completes, it stalls for 5 minutes or so, then dumps a huge amount of data to the console. Pipe to /dev/null or find a way to suppress the output.

gfr10598 commented 7 years ago

The query takes about 3 minutes to run for a single day.