m-lab / mlab-vis-pipeline

M-Lab Visualization Dataflow pipelines for transforming ndt.all into the needed aggregation tables in Bigtable.

[Data Cleaning] MaxMind regions are sometimes just the cities #11

Closed by pbeshai 8 years ago

pbeshai commented 8 years ago

"London, City of" is the region name. See: image

pbeshai commented 8 years ago

So that is the region London; the city London is inside this region. The problem seems to stem from the fact that the bocoup.location_region_codes table lists many cities as the region names for its region codes, at least in GB.

image
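One way to see how widespread the problem is would be to flag every region whose name matches a city name in the same country. The sketch below is only an illustration: the function names, the input shapes, and the sample codes are assumptions, not the actual schema of bocoup.location_region_codes.

```python
def normalize(name):
    """Reduce a place name to a comparable key, e.g. "London, City of" -> "london"."""
    return name.split(",")[0].strip().lower()


def regions_named_like_cities(region_names, city_rows):
    """Return the (country, region_code) keys whose region name is also a city name.

    region_names: dict mapping (country, region_code) -> region name (assumed shape)
    city_rows: iterable of (country, region_code, city) tuples (assumed shape)
    """
    # Collect every city name seen per country, in normalized form.
    cities = {(country, normalize(city)) for country, _code, city in city_rows}
    # A region is suspicious when its normalized name is also a city in that country.
    return sorted(
        key
        for key, name in region_names.items()
        if (key[0], normalize(name)) in cities
    )
```

Matching on the part before the first comma is a deliberately crude heuristic; it catches names like "London, City of" but would need tuning for other naming patterns.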

pbeshai commented 8 years ago

We get our data from http://dev.maxmind.com/faq/how-do-i-convert-region-codes-to-names/. It's not clear how we can easily fix this. Do we just get rid of region codes and names for GB? I think we can just accept it for now.

pbeshai commented 8 years ago

Note that #31 adds in support for removing the region by setting the new_region_code to null in the location_cleaning table. London is in there as an example.

pbeshai commented 8 years ago

On further investigation, it seems that region codes are necessary for our system to work, so setting the region code to null is not possible. Every city in the data currently has a region. If the region is wrong, one could either update the region code to a different existing one, or create a new region code and then use the Location Names resolver part of the pipeline to resolve that region code to a new region name.
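A rough sketch of that workaround, assuming the cleaning step is a lookup keyed on country, region code, and city. Only `new_region_code` and the location_cleaning table come from this thread; the function name, the row fields, and the sample codes are hypothetical, and the real pipeline runs in Dataflow rather than in plain Python.

```python
def apply_region_overrides(rows, overrides, region_names):
    """Remap region codes, then resolve each code to a region name.

    rows: list of dicts with country_code, region_code, city (assumed fields)
    overrides: dict mapping (country_code, region_code, city) -> new_region_code,
        standing in for the location_cleaning table
    region_names: dict mapping (country_code, region_code) -> region name,
        standing in for the Location Names resolver
    """
    cleaned = []
    for row in rows:
        key = (row["country_code"], row["region_code"], row["city"])
        # Fall back to the existing code when no override is registered.
        new_code = overrides.get(key, row["region_code"])
        if new_code is None:
            # Every city needs a region, so a null override is rejected.
            raise ValueError("region code is required for %r" % (key,))
        name = region_names[(row["country_code"], new_code)]
        cleaned.append(dict(row, region_code=new_code, region_name=name))
    return cleaned
```

Rejecting a null code up front makes the constraint explicit instead of letting rows without a region fail later in the aggregation.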

vlandham commented 8 years ago

Peter fixed this with an additional table.