m-lab / annotation-service

Annotation integration service for M-Lab data
Apache License 2.0

Include ISO 3166-2 subdivisions in client region field annotation #217

Open critzo opened 5 years ago

critzo commented 5 years ago

The annotation service currently populates connection_spec.client_geolocation.region with the top-level ISO 3166-2 region code. To restore the region code granularity we had prior to 2017-05-11, we should add a field for the client region subdivision, annotated with the subdivision codes from the ISO 3166-2 standard.

To demonstrate the issue, I include the query below, and a trimmed result set for Great Britain. Prior to 2017-05-11 we annotated ~198 region codes in this field, whereas now it's 4:

SELECT partition_date, connection_spec.client_geolocation.region FROM `measurement-lab.release.ndt_all`
WHERE connection_spec.client_geolocation.country_code = 'GB'
AND partition_date BETWEEN '2017-05-10' AND '2017-05-12'
GROUP BY partition_date, connection_spec.client_geolocation.region
ORDER BY partition_date, connection_spec.client_geolocation.region

# Result:
partition_date  region
2017-05-10  A1
2017-05-10  A2
2017-05-10  A3
# + 195 more FIPS 10-4 regions in Great Britain
...
2017-05-11  England
2017-05-11  Northern Ireland
2017-05-11  Scotland
2017-05-11  Wales

critzo commented 5 years ago

Noting that the ISO 3166-2 region codes are part of the MaxMind GeoLite2-City Locations dataset.

The fields I believe we should consider adding to the ndt schema and annotator are:

critzo commented 4 years ago

Adding a link to the place in the code where the current region code is referenced: https://github.com/m-lab/annotation-service/blob/master/geolite2v2/geo-ip.go#L111

It should be straightforward to pull the second-level ISO 3166-2 subdivision code for the sub-regions from the City dataset.

stephen-soltesz commented 4 years ago

https://github.com/m-lab/etl-gardener/issues/281