m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Clear annotation.Region from annotation datatype #1069

Closed stephen-soltesz closed 2 years ago

stephen-soltesz commented 2 years ago

The Geo1 and Geo2 MaxMind formats share some fields, but each also has fields the other lacks because they follow different standards: Geo1 uses FIPS 10-4 and Geo2 uses ISO 3166-2.

The annotation-service reused the Geo1 "Region" field to provide continuity for users as we migrated to Geo2. Unfortunately, the two standards were never normalized, and during the Synthetic UUID Annotation export process we archived this reused "Region" field. This was not a problem until the unified views for NDT, which combine the annotations from web100, ndt5, and ndt7. The ndt7 annotations have come natively from the uuid-annotator since the beginning, because ndt7 was deployed after the uuid-annotator. The ndt5 annotations are a mixture of native annotations (from the uuid-annotator) and synthetic exported annotations.

Because the Geo2 format does not use the "Region" field, annotations collected by the uuid-annotator never set it. All synthetically generated annotations do.

The unified views can mask this field in the meantime, but the parser should eventually blank it for all "annotation" datatype values.
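As a rough illustration of what "blanking" could look like, here is a minimal Python sketch. It is not the etl parser's code: the function name `clear_region` is hypothetical, and the record layout (a `Client`/`Server` dict, each with a `Geo` dict holding `Region`) and the sample values are assumptions made for the example.

```python
from copy import deepcopy
from typing import Any, Dict


def clear_region(annotation: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an annotation record with Geo.Region blanked.

    Assumed layout (hypothetical, for illustration only):
    {"Client": {"Geo": {"Region": ...}}, "Server": {"Geo": {...}}}
    """
    cleaned = deepcopy(annotation)  # leave the caller's record untouched
    for side in ("Client", "Server"):
        # Either side, or its Geo block, may be missing or None.
        geo = (cleaned.get(side) or {}).get("Geo")
        if geo and "Region" in geo:
            geo["Region"] = ""
    return cleaned


# A synthetic-style record that carries the legacy Region value.
synthetic = {
    "Client": {"Geo": {"CountryCode": "US", "Region": "NY"}},
    "Server": {"Geo": None},
}
print(clear_region(synthetic))
```

With something along these lines applied to every row, synthetic and native annotations would agree on an empty Region, so the unified views would no longer need to mask it.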

stephen-soltesz commented 2 years ago

The v1 data pipeline cannot do this, and we plan to decommission it.

stephen-soltesz commented 2 years ago

Region cleared from static ndt web100 transformation in https://github.com/m-lab/etl-schema/pull/134