m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0
22 stars 7 forks source link

New pipeline skips some NDT tests successfully handled by legacy. #108

Open gfr10598 opened 7 years ago

gfr10598 commented 7 years ago

ETL pipeline ndt_daily_test_20100102 is missing 54 tests (out of 50K) that appear in legacy table.

For example: ndt.iupui.mlab1.mia01.measurement-lab.org is missing: 1262440956003875 2010/01/02/20100102T14:02:23.989680000Z_189.46.151.237:31159.c2s_snaplog 1262440970350858 2010/01/02/20100102T14:02:23.989680000Z_189.46.151.237:31160.s2c_snaplog 1262470263996630 2010/01/02/20100102T22:10:52.542906000Z_201.80.42.183:4685.c2s_snaplog

Use this query to generate a one day table from legacy: select web100_log_entry.log_time as t, web100_log_entry.snap.StartTimeStamp as start, test_id, connection_spec.* from plx.google:m_lab.ndt.all

where PARSE_UTC_USEC("2010-01-01") < 1000000web100_log_entry.log_time and 1000000web100_log_entry.log_time < PARSE_UTC_USEC("2010-01-04") and test_id contains "/20100102T" limit 1000000

Then use this query to compare to ETL table generated from the tar files: SELECT legacy.t, legacy.connection_spec.server_hostname, legacy.test_id, etl.test_id FROM (select web100_log_entry.snap.StartTimeStamp as t, test_id, task_filename, connection_spec. from mlab_sandbox.ndt_test_daily_20100102 ) AS etl FULL OUTER JOIN EACH (select start as t, test_id, connection_spec. from mlab_sandbox.legacy_20100102 ) AS legacy ON etl.t=legacy.t WHERE etl.t IS NULL OR legacy.t IS NULL order by legacy.connection_spec.server_hostname, etl.test_id, legacy.test_id

pboothe commented 7 years ago

3.5 nines!

gfr10598 commented 7 years ago

Actually failed to update the number after fixing query. It is only 3 nines. 8-(