m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0
22 stars 7 forks source link

Migrate ndt5 datatype to use standard columns #1046

Closed stephen-soltesz closed 2 years ago

stephen-soltesz commented 2 years ago
stephen-soltesz commented 2 years ago

While verifying the v2 parser tables in staging, I found that there is a 1-3% difference in the number of rows.

WITH v2 AS (

SELECT date, "v2" as label, COUNTIF(raw.C2S IS NOT NULL) as upload, COUNTIF(raw.S2C IS NOT NULL) as download
from `mlab-staging.ndt.ndt5`
WHERE date > date('2021-09-01')
group by date
order by date

), v1 AS (

SELECT DATE(_PARTITIONTIME) as date, "v1" as label, COUNTIF(result.C2S IS NOT NULL) as upload, COUNTIF(result.S2C IS NOT NULL) as download
from `mlab-staging.base_tables.ndt5`
WHERE _PARTITIONTIME > TIMESTAMP('2021-09-01')
group by date
order by date
)
SELECT * FROM v2
UNION ALL 
SELECT * FROM v1
Screen Shot 2022-02-28 at 2 04 57 PM

It appears that the missing rows are due to empty UUIDs in the raw results, which the v2 parser skips b/c these are not valid measurements.

SELECT * 
from `mlab-staging.base_tables.ndt5`
WHERE _PARTITIONTIME > TIMESTAMP('2021-09-01')
AND result.S2C.UUID = ""
AND _PARTITIONTIME = TIMESTAMP("2021-11-21")
LIMIT 10

Ultimately these rows should be in the raw table, but they need not block this initial deployment b/c these rows will not be present in the unified views.

stephen-soltesz commented 2 years ago

Deletion of v1 portions part of: https://github.com/m-lab/etl/issues/1074