maxmind / MaxMind-DB

Spec and test data for the MaxMind DB file format
https://maxmind.github.io/MaxMind-DB/
Apache License 2.0

The CSV files are missing geo subjects returned by the MMDB #61

Closed baygeldin closed 5 years ago

baygeldin commented 5 years ago

Problem

I've downloaded the latest updates of MaxMind's GeoLite2 City database (in both the MaxMind DB binary and CSV formats). When I looked up "88.184.98.0", here's what I got:

{"city":{"geoname_id":2982652,"names":{"de":"Rouen","en":"Rouen","es":"Ruan","fr":"Rouen","ja":"ルーアン","pt-BR":"Ruão","ru":"Руан","zh-CN":"鲁昂"}},"continent":{"code":"EU","geoname_id":6255148,"names":{"de":"Europa","en":"Europe","es":"Europa","fr":"Europe","ja":"ヨーロッパ","pt-BR":"Europa","ru":"Европа","zh-CN":"欧洲"}},"country":{"geoname_id":3017382,"is_in_european_union":true,"iso_code":"FR","names":{"de":"Frankreich","en":"France","es":"Francia","fr":"France","ja":"フランス共和国","pt-BR":"França","ru":"Франция","zh-CN":"法国"}},"location":{"accuracy_radius":5,"latitude":49.4431,"longitude":1.0993,"time_zone":"Europe/Paris"},"postal":{"code":"76100"},"registered_country":{"geoname_id":3017382,"is_in_european_union":true,"iso_code":"FR","names":{"de":"Frankreich","en":"France","es":"Francia","fr":"France","ja":"フランス共和国","pt-BR":"França","ru":"Франция","zh-CN":"法国"}},"subdivisions":[{"geoname_id":11071621,"iso_code":"NOR","names":{"de":"Normandie","en":"Normandy","es":"Normandía","fr":"Normandie"}},{"geoname_id":2975248,"iso_code":"76","names":{"de":"Seine-Maritime","en":"Seine-Maritime","es":"Sena Marítimo","fr":"Seine-Maritime","pt-BR":"Sena Marítimo"}}]}

However, the CSV files contain no corresponding geoname_id for the returned subdivisions (e.g. cat GeoLite2-City-Locations-en.csv | fgrep 11071621 returns nothing). This is very common for subdivisions (e.g. Novosibirsk Oblast, Scotland, etc.). Is it a bug or expected behaviour? What is the relation between the CSV files and the MMDB format?
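The same check can be scripted for several IDs at once. This is a minimal sketch, assuming the Locations file has a header row with a geoname_id column; the function name missing_geoname_ids and the trimmed sample rows are illustrative only, not real file contents.

```python
import csv
import io


def missing_geoname_ids(csv_file, wanted_ids):
    """Return the IDs from wanted_ids that have no row in the Locations CSV."""
    present = {int(row["geoname_id"]) for row in csv.DictReader(csv_file)}
    return sorted(set(wanted_ids) - present)


# Tiny stand-in for GeoLite2-City-Locations-en.csv (columns trimmed for brevity).
sample = io.StringIO(
    "geoname_id,locale_code,country_iso_code,subdivision_1_iso_code,city_name\n"
    "2982652,en,FR,NOR,Rouen\n"
    "3017382,en,FR,,\n"
)

# IDs from the lookup result above: city, country, and both subdivisions.
print(missing_geoname_ids(sample, [2982652, 3017382, 11071621, 2975248]))
```

With the real file, pass open("GeoLite2-City-Locations-en.csv", newline="", encoding="utf-8") instead of the sample.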

Why is it important

For services that target or filter traffic based on location, the geoname_id's are important. For example, if we have an ad serving network and want to allow users to restrict a particular ad to a set of geolocations, it makes sense to describe such a set using the respective geoname_id's from the CSV files. When deciding whether or not to serve the ad, we compare the geoname_id's returned by the MMDB format for the request's location against that set. However, if a geoname_id is absent from the CSV files, we can't restrict the ad to the respective location, even though the MMDB format returns it when resolving an IP address.
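The comparison described here could look roughly like the following. It is a minimal sketch that works on an already-decoded lookup result (a plain dict shaped like the JSON above); the helper names geoname_ids and should_serve are hypothetical, and the record is abbreviated.

```python
def geoname_ids(record):
    """Collect the geoname_id of the city, country, continent and subdivisions."""
    ids = set()
    for key in ("city", "country", "continent"):
        gid = record.get(key, {}).get("geoname_id")
        if gid is not None:
            ids.add(gid)
    for sub in record.get("subdivisions", []):
        if "geoname_id" in sub:
            ids.add(sub["geoname_id"])
    return ids


def should_serve(record, allowed_ids):
    """Serve the ad if any location in the record is in the allowed set."""
    return not geoname_ids(record).isdisjoint(allowed_ids)


# Abbreviated version of the lookup result for 88.184.98.0.
record = {
    "city": {"geoname_id": 2982652},
    "continent": {"geoname_id": 6255148},
    "country": {"geoname_id": 3017382, "iso_code": "FR"},
    "subdivisions": [
        {"geoname_id": 11071621, "iso_code": "NOR"},
        {"geoname_id": 2975248, "iso_code": "76"},
    ],
}

# An ad restricted to Normandy (11071621) matches only if that ID is known
# to whoever builds the allowed set, which is where the CSV gap hurts.
print(should_serve(record, {11071621}))
```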

Workaround

A workaround is to manually add the missing objects to the CSV files using the IDs returned by the MMDB format (although there are a lot of them to add). In that case, a very important question arises: are these geoname_id's reliable, or are they likely to change in the future?

klp2 commented 5 years ago

Each IP in both the CSV and MMDB file will map to exactly one geoname_id, and each of those geoname_id's will be found in the Locations files.

The subdivisions are values mapped to a particular location, not directly mapped to a given IP address. We do not include the geoname_id's of subdivisions in the CSV files by design, only the ISO code and name. If no IP's are mapped directly to a given subdivision, it is expected that the geoname_id for that subdivision will not appear in the Locations CSV files.
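Because the Locations files carry the ISO code and name of each subdivision rather than its geoname_id, one option is to key subdivision rules on the (country ISO code, subdivision ISO code) pair, which appears in both formats. A minimal sketch, assuming the CSV columns country_iso_code, subdivision_1_iso_code and subdivision_2_iso_code; the helper names and the single sample row are illustrative.

```python
import csv
import io


def csv_subdivision_keys(csv_file):
    """Collect (country ISO, subdivision ISO) pairs from a Locations CSV."""
    keys = set()
    for row in csv.DictReader(csv_file):
        for col in ("subdivision_1_iso_code", "subdivision_2_iso_code"):
            if row.get(col):
                keys.add((row["country_iso_code"], row[col]))
    return keys


def record_subdivision_keys(record):
    """Build the same pairs from a decoded MMDB lookup result."""
    country = record.get("country", {}).get("iso_code")
    return {
        (country, sub["iso_code"])
        for sub in record.get("subdivisions", [])
        if "iso_code" in sub
    }


# Illustrative row only; the real file has more columns.
sample = io.StringIO(
    "geoname_id,country_iso_code,subdivision_1_iso_code,"
    "subdivision_2_iso_code,city_name\n"
    "2982652,FR,NOR,76,Rouen\n"
)
record = {
    "country": {"iso_code": "FR"},
    "subdivisions": [{"iso_code": "NOR"}, {"iso_code": "76"}],
}

known = csv_subdivision_keys(sample)
print(record_subdivision_keys(record) <= known)
```

Pairing the subdivision code with the country code matters because subdivision ISO codes are only unique within a country.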

The geoname_id's are taken directly from https://www.geonames.org/, so you may be able to look there if you need data that isn't otherwise found in the CSV files.

baygeldin commented 5 years ago

@klp2 Thank you for the clarification! So, if I understand correctly, the same logic works for countries (e.g. if we have a country whose IP ranges are fully covered by its cities, the country will not be included in the CSV), right?

"The subdivisions are values mapped to a particular location"

How are the values mapped? Unfortunately, the hierarchy of geo-objects is quite difficult to reproduce from the files that GeoNames provides (even hierarchy.txt is not complete). Is there any chance that this mapping can be extracted from the MaxMind DB? This would at least show which locations are missing from the CSV files.

klp2 commented 5 years ago

All of the countries should show up in the Locations files.

I don't think there is a particularly easy way to extract the mapping out of the MMDB files.
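If a reader library can enumerate every (network, record) pair in the database, the subdivision data could in principle be collected in a single pass. Whether and how a given reader supports this is an assumption to verify against its documentation. The sketch below only assumes an iterable of such pairs; subdivision_index is a hypothetical helper and the sample entry uses a placeholder network key.

```python
def subdivision_index(entries):
    """Map each subdivision geoname_id to (country ISO, subdivision ISO, English name).

    entries is any iterable of (network, record) pairs, where record is a
    decoded MMDB record shaped like the lookup result earlier in this thread.
    """
    index = {}
    for _network, record in entries:
        country = record.get("country", {}).get("iso_code")
        for sub in record.get("subdivisions", []):
            gid = sub.get("geoname_id")
            if gid is not None:
                name = sub.get("names", {}).get("en")
                index[gid] = (country, sub.get("iso_code"), name)
    return index


# One hand-written entry; the network key is a placeholder, not real data.
entries = [
    (
        "placeholder-network",
        {
            "country": {"iso_code": "FR"},
            "subdivisions": [
                {"geoname_id": 11071621, "iso_code": "NOR",
                 "names": {"en": "Normandy"}},
                {"geoname_id": 2975248, "iso_code": "76",
                 "names": {"en": "Seine-Maritime"}},
            ],
        },
    )
]

for gid, info in sorted(subdivision_index(entries).items()):
    print(gid, info)
```

Walking a full City database this way touches every record, so it is better suited to a one-off export than to request-time use.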