m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Add synthetic UUIDs for legacy datatypes #998

Closed stephen-soltesz closed 3 years ago

stephen-soltesz commented 3 years ago

This change adds new id columns to the legacy ndt.web100 and sidestream datatypes, and reuses the UUID column for traceroute1 (.paris).

The id is a synthetic UUID, derived from the same data that gardener uses to dedup daily columns. Because all of these fields include dates, the ids are considered globally unique. The id also relies on the MD5 digest. The only collisions observed during manual inspection of these BQ tables come from duplicate data uploaded to multiple archives, so the "collisions" correctly identify the same data.
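
As a rough illustration of the derivation described above (not the code in this change), here is a minimal Go sketch that hashes a set of dedup fields with MD5 to produce a stable id. The function name `syntheticID`, the `|` separator, and the example field values (a date, a hostname, an address, a port) are all assumptions made for the example; the real field set is whatever gardener uses for each datatype.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// syntheticID is a hypothetical helper: it joins the dedup fields with a
// separator and returns the hex-encoded MD5 digest of the result.
// The same fields always yield the same id, so duplicate rows collide
// on purpose.
func syntheticID(fields ...string) string {
	sum := md5.Sum([]byte(strings.Join(fields, "|")))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Illustrative values only; including the date keeps ids distinct
	// across days.
	a := syntheticID("2019-01-01", "server.example", "192.0.2.1", "12345")
	b := syntheticID("2019-01-01", "server.example", "192.0.2.1", "12345")
	c := syntheticID("2019-01-02", "server.example", "192.0.2.1", "12345")

	fmt.Println("same input, same id:", a == b)
	fmt.Println("different date, same id:", a == c)
}
```

A plain separator like this is ambiguous if a field can itself contain `|`; a real implementation would need fields that cannot, or a length-prefixed encoding.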

These ids can be used directly during the annotation export process to generate synthetic uuid-annotations.

These ids will be preserved indefinitely to maintain compatibility with the synthetic uuid-annotations.

Part of milestone: https://docs.google.com/document/d/1seI56IGAZzfIhmkZH_Pp67fU11kynyO6mwf7gU3HeiM/edit#


coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 6498


Totals Coverage Status
Change from base Build 6488: 0.2%
Covered Lines: 3520
Relevant Lines: 5582
