ETL pipeline ndt_daily_test_20100102 is missing 54 tests (out of 50K) that appear in legacy table.
For example: ndt.iupui.mlab1.mia01.measurement-lab.org is missing:
1262440956003875 2010/01/02/20100102T14:02:23.989680000Z_189.46.151.237:31159.c2s_snaplog
1262440970350858 2010/01/02/20100102T14:02:23.989680000Z_189.46.151.237:31160.s2c_snaplog
1262470263996630 2010/01/02/20100102T22:10:52.542906000Z_201.80.42.183:4685.c2s_snaplog
Use this query to generate a one day table from legacy:
select web100_log_entry.log_time as t, web100_log_entry.snap.StartTimeStamp as start, test_id, connection_spec.*
from plx.google:m_lab.ndt.all
where PARSE_UTC_USEC("2010-01-01") < 1000000web100_log_entry.log_time
and 1000000web100_log_entry.log_time < PARSE_UTC_USEC("2010-01-04")
and test_id contains "/20100102T"
limit 1000000
Then use this query to compare to ETL table generated from the tar files:
SELECT
legacy.t,
legacy.connection_spec.server_hostname,
legacy.test_id,
etl.test_id
FROM
(select web100_log_entry.snap.StartTimeStamp as t, test_id, task_filename, connection_spec.
from mlab_sandbox.ndt_test_daily_20100102
) AS etl
FULL OUTER JOIN EACH
(select start as t, test_id, connection_spec.
from mlab_sandbox.legacy_20100102
) AS legacy
ON
etl.t=legacy.t
WHERE
etl.t IS NULL
OR legacy.t IS NULL
order by legacy.connection_spec.server_hostname, etl.test_id, legacy.test_id
ETL pipeline ndt_daily_test_20100102 is missing 54 tests (out of 50K) that appear in legacy table.
For example: ndt.iupui.mlab1.mia01.measurement-lab.org is missing: 1262440956003875 2010/01/02/20100102T14:02:23.989680000Z_189.46.151.237:31159.c2s_snaplog 1262440970350858 2010/01/02/20100102T14:02:23.989680000Z_189.46.151.237:31160.s2c_snaplog 1262470263996630 2010/01/02/20100102T22:10:52.542906000Z_201.80.42.183:4685.c2s_snaplog
Use this query to generate a one day table from legacy: select web100_log_entry.log_time as t, web100_log_entry.snap.StartTimeStamp as start, test_id, connection_spec.* from plx.google:m_lab.ndt.all
where PARSE_UTC_USEC("2010-01-01") < 1000000web100_log_entry.log_time and 1000000web100_log_entry.log_time < PARSE_UTC_USEC("2010-01-04") and test_id contains "/20100102T" limit 1000000
Then use this query to compare to ETL table generated from the tar files: SELECT legacy.t, legacy.connection_spec.server_hostname, legacy.test_id, etl.test_id FROM (select web100_log_entry.snap.StartTimeStamp as t, test_id, task_filename, connection_spec. from mlab_sandbox.ndt_test_daily_20100102 ) AS etl FULL OUTER JOIN EACH (select start as t, test_id, connection_spec. from mlab_sandbox.legacy_20100102 ) AS legacy ON etl.t=legacy.t WHERE etl.t IS NULL OR legacy.t IS NULL order by legacy.connection_spec.server_hostname, etl.test_id, legacy.test_id