IP Geo tools to verify progress

jimaek commented 1 year ago

We have GEO detection problems, but we also don't have a simple way to compare and benchmark different DBs and algorithms.

~~I believe @MartinKolarik wrote some scripts to simplify testing, maybe we can build on top of them to finally get a tool that can reliably tell us if a change resulted in improved data or not.~~

E.g. we could test using more DBs, going all the way up to 5, plus ipmap data as a sixth source (https://github.com/jsdelivr/globalping/issues/248). Later, a traceroute API could become a seventh source of data, and so on.

Example of an IP with wrong results, which is really in Krakow:

- Poznan https://ipinfo.io/37.47.67.95
- Krakow https://www.maxmind.com/en/geoip2-precision-demo?ip_address=37.47.67.95
- Dabrowa https://globalping-geoip.global.ssl.fastly.net/37.47.67.95
- Tarnow https://db-ip.com/37.47.67.95
- Krakow https://www.ip2location.com/demo/37.47.67.95

MartinKolarik commented 1 year ago

The script I wrote simply called the three resolvers we have here and logged when some gave different answers than others. As discussed, we'll start here by creating a list of IPs with verified locations.

jimaek commented 1 year ago

I prepared a Google Sheet with a list of IPs whose real location I know. We should expand it and build tools that would allow us to track quality changes based on different algorithms.

The first step would be for the tool to test the current algorithm as a baseline. Then we should discuss the next steps based on #265.

alexey-yarmosh commented 1 year ago

If I understand correctly, we need a script that will check different IP geo providers and accumulation algorithms against the 100% valid data we have. As a result we will get the data and a script to generate the data again.

Format:	ip	real city	ipinfo	maxmind	fastly	provider#4	algorithm#1
1.1.1.1	Warsaw	Warsaw	Krakow	Krakow	Warsaw	Warsaw	Krakow
2.2.2.2	Barcelona	Barcelona	Barcelona	Madrid	Madrid	Barcelona	Madrid
...	...	...	...	...	...	...	...
Accuracy:	1	1	0.5	0	0.5	1	0
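
A minimal sketch of how the accuracy row could be computed, assuming the verified data is loaded as one dict per IP. The `ROWS` sample, the `real` key and the `accuracy` helper are hypothetical names for illustration, not the actual script.

```python
from collections import defaultdict

# Hypothetical verified rows: the real city plus what each source returned.
ROWS = [
    {"ip": "1.1.1.1", "real": "Warsaw", "ipinfo": "Warsaw", "maxmind": "Krakow", "fastly": "Krakow"},
    {"ip": "2.2.2.2", "real": "Barcelona", "ipinfo": "Barcelona", "maxmind": "Barcelona", "fastly": "Madrid"},
]


def accuracy(rows, real_key="real", skip=("ip",)):
    """Return {source: share of rows where the source matches the real city}."""
    hits = defaultdict(int)
    for row in rows:
        expected = row[real_key].strip().lower()
        for source, city in row.items():
            if source == real_key or source in skip:
                continue
            # Adding 0 still registers the source, so sources with no hits are reported too.
            hits[source] += int(city.strip().lower() == expected)
    return {source: count / len(rows) for source, count in hits.items()}


print(accuracy(ROWS))
```

The comparison here is a plain case-insensitive string match, so differently spelled names for the same city would still count as misses.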

jimaek commented 1 year ago

That would be a good start, yes. The only problem is that the verified DB is too small for now.

alexey-yarmosh commented 1 year ago

Additionally, we will see the number of probes which have different locations from different providers, and understand the scale of the problem.

jimaek commented 1 year ago

yes, we can feed all connected probes into the script to see what we're dealing with
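
A rough sketch of that check, assuming each connected probe can be represented as a mapping from provider name to the city it returned. The `PROBES` sample and the `disagreeing_probes` helper are made up for illustration, and `192.0.2.1` is a documentation address rather than a real probe.

```python
def disagreeing_probes(probes):
    """Return the IPs whose providers do not all report the same city."""
    result = []
    for ip, cities in probes.items():
        # Compare case-insensitively and ignore providers with no answer.
        distinct = {city.strip().lower() for city in cities.values() if city}
        if len(distinct) > 1:
            result.append(ip)
    return result


# Illustrative sample, not real measurement data.
PROBES = {
    "37.47.67.95": {"ipinfo": "Poznan", "maxmind": "Krakow", "fastly": "Dabrowa"},
    "192.0.2.1": {"ipinfo": "Warsaw", "maxmind": "Warsaw", "fastly": "warsaw"},
}

mismatched = disagreeing_probes(PROBES)
print(f"{len(mismatched)} of {len(PROBES)} probes disagree: {mismatched}")
```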

alexey-yarmosh commented 1 year ago

I've updated the google doc with the latest values. Next I will add info from db-ip, ip2location and ipmap there to understand how valuable they are.

alexey-yarmosh commented 1 year ago

A lot of mismatches happen because a provider prefers the suburb city over the main city (e.g. Dadri and Delhi). That is a point to think about: do we want probes to be as accurate as possible (so a user will be able to filter by Dadri, but a filter by Delhi will not return that probe), or do we want to accumulate all suburb cities in the main city? A third option is to combine both and show that probe in both Dadri and Delhi searches. Other examples: Doddaballapura and Bangalore, Piscataway and New York.

Also, there are a lot of cases of different naming (e.g. Bengaluru and Bangalore, Rodos and Rhodes). It seems we should check them manually and add proper aliases, so they will be treated as equal by the geo IP algorithm and in user search.

jimaek commented 1 year ago

Yeah, suburbs seem to create a huge issue. Redbridge is another example, which should be just London. Ideally we want a probe to have multiple locations in this case, both London and Redbridge, but it seems impossible to automate.

Regarding naming, it's a different issue we need to solve regardless of geo logic.

It seems the main problem comes from different DBs using different naming and we're trying to combine them. A potential idea would be to use a single DB like ipinfo and try to augment it with different data sources like the traceroute stuff.

This would solve the naming issue and potentially increase accuracy.

alexey-yarmosh commented 1 year ago

What do you mean by "augment"? What might that look like?

jimaek commented 1 year ago

Merge data and overwrite DB values with traceroute result values

alexey-yarmosh commented 1 year ago

From 50 data items:

- 26 items had a result with cities different from the city provided by the DC
- 17 items had a district/suburb city result instead of the main city
- 6 items where different providers returned the same city but with a different name/spelling

more data can be found here.

What can be done:

- Top thing to do is to find the main city for a district/suburb. We can try the geonames.org API or similar tools for that. For example here under the Hierarchy tab we can see that Spanga refers to Stockholm.
- Alternatively it can be our own logic for finding main cities. E.g. if the population of a city is <200k and there is a big city within a 100km radius => save both cities for that probe.
- Our current algorithm compares cities using ===. So if provider1 = doddaballapura, provider2 = bengaluru and provider3 = bangalore, they are treated as 3 different cities and we fall back to just the first provider. Instead, provider2 and provider3 should be treated as equal and their result should win. That can be done using a 3rd party API. For example here we requested "Bengaluru" and under the Alternative names tab there is an EN "Bangalore".
- Use as many providers as possible. In a few cases using 5 providers instead of the current 3 will give a correct result. Need to review pricing of db-ip and ip2location and add them to the algorithm.
- Accuracy of db-ip (0.60) and ip2location (0.62) is the best, compared to the other providers. The current algorithm returns ipinfo (0.52) data if results from all providers are different. ipinfo should be replaced with the ip2location result.
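
To illustrate the naming point, here is a small sketch of grouping provider answers by a canonical city name instead of comparing raw strings. The `ALIASES` table is a hand-written assumption; in practice it might be built from GeoNames alternative names. `canonical` and `group_by_city` are hypothetical helpers.

```python
# Hypothetical alias table mapping alternative spellings to one canonical key.
ALIASES = {"bangalore": "bengaluru", "rodos": "rhodes"}


def canonical(city):
    """Lowercase the name and map known aliases to a single spelling."""
    key = city.strip().lower()
    return ALIASES.get(key, key)


def group_by_city(answers):
    """answers: {provider: city}. Returns {canonical city: [providers]}."""
    groups = {}
    for provider, city in answers.items():
        groups.setdefault(canonical(city), []).append(provider)
    return groups


groups = group_by_city({
    "provider1": "Doddaballapura",
    "provider2": "Bengaluru",
    "provider3": "Bangalore",
})
print(groups)
```

With this grouping, provider2 and provider3 land in the same bucket, so a vote over the buckets would no longer see three different cities.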

jimaek commented 1 year ago

I moved the suburbs thing into a separate task since it's unrelated to any logic I think, it can just be a final step after everything else is done.

> Accuracy of db-ip (0.60) and ip2location (0.62) is the best, compared to the other providers. The current algorithm returns ipinfo (0.52) data if results from all providers are different. ipinfo should be replaced with the ip2location result.

This is complicated. When manually reviewing the data, ipinfo does seem more accurate in some cases even when it's in the minority, meaning voting would make the data worse.

E.g. line 30, correct is Dallas, ipinfo says Irving. But it's basically a suburb of Dallas. All other DBs are way off.

At the same time the fastly DB is the worst one. Maybe a first step could be replacing it with the ip2location DB?

alexey-yarmosh commented 1 year ago

Yeah, we should clean the data first (fix diff naming + suburb) and only then compare the accuracy coefficients.

MartinKolarik commented 1 year ago

Some notes:

- In some cases, the data in the doc are not correct, e.g., Frankfurt am Main is the correct name, not Frankfurt. Realistically, they are both OK to use, but differences like this need to be considered when measuring accuracy. It's also a problem if we flip from one value to the other over time, which can easily happen when adding and removing geo providers.
- Suburbs - there are two possible cases:
  - sometimes, the DB may really use a district name even if the location is still within the main city (using the main city name seems clear here),
  - sometimes, the location is really NOT in the city, just close. E.g., "Piscataway" is not in the city of New York. If we want to be precise, New York is the wrong value here. More fuzzy matching may make sense but sometimes, also maybe not.
- @alexey-yarmosh not sure if you saw #163. I brought the "resolving" idea up before and after @ayuhito's research it seemed the results would be... questionable.

alexey-yarmosh commented 1 year ago

So it seems suburbs can be fixed pretty well using coordinates. I used the geonames.org API to get all cities in a 100km area around the coordinates provided by ipinfo, then picked the city with the biggest population. See google sheet. In most cases it fixed the location correctly. A few times 100km was too big, which led to a wrong result (valdivia -> osorno). A few times the DC itself provides a suburb as the location (milpitas, ashburn), so that can't be fixed.

I think we should move further with that solution, as fixing suburbs will bring the most value to the geo logic. Need to find the most accurate radius value and recalculate values and accuracy for different providers.

In terms of implementation we can try to store the geonames data in redis and use redis GEOSEARCH to retrieve cities, alternatively geonames API + redis cache can be used.
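
For reference, the "biggest city within a radius" idea can be sketched without any external service. This is only an approximation of the approach, assuming a small in-memory city list instead of the geonames data or Redis; the coordinates and populations in `CITIES` are rough illustrative values, and `haversine_km` and `main_city` are hypothetical helpers.

```python
import math

# Hypothetical stand-in for GeoNames data: (name, lat, lon, population).
# Values are rough and for illustration only.
CITIES = [
    ("Delhi", 28.65, 77.23, 10_900_000),
    ("Noida", 28.58, 77.33, 640_000),
    ("Dadri", 28.55, 77.55, 90_000),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def main_city(lat, lon, radius_km, cities=CITIES):
    """Return the most populous city within radius_km, or None if there is none."""
    nearby = [c for c in cities if haversine_km(lat, lon, c[1], c[2]) <= radius_km]
    if not nearby:
        return None
    return max(nearby, key=lambda c: c[3])[0]


print(main_city(28.55, 77.55, 60))  # coordinates near Dadri, wide radius
print(main_city(28.55, 77.55, 10))  # same point, narrow radius
```

The same point resolves to different cities depending on the radius, which is why the radius value needs tuning against the verified data.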

alexey-yarmosh commented 1 year ago

> In some cases, the data in the doc are not correct, e.g., Frankfurt am Main is the correct name, not Frankfurt. Realistically, they are both OK to use, but differences like this need to be considered when measuring accuracy. It's also a problem if we flip from one value to the other over time, which can easily happen when adding and removing geo providers.

I think we can store both values as correct city names so we will always be able to show one consistent value. At the same time the user will be able to search by either of the two.

> sometimes, the location is really NOT in the city, just close. E.g., "Piscataway" is not in the city of New York. If we want to be precise, New York is the wrong value here. More fuzzy matching may make sense but sometimes, also maybe not.

Yeah, we are choosing between accuracy and correctness. Maybe storing both the suburb and the main city will do the trick here too? Also, is there a use case where a user will need exactly "Piscataway" or similar instead of "New York"?

> @alexey-yarmosh not sure if you saw https://github.com/jsdelivr/globalping/issues/163. I brought the "resolving" idea up before and after @ayuhito's research it seemed the results would be... questionable.

I need some time to understand all the problems there, hope to get back with something soon :)

alexey-yarmosh commented 1 year ago

I just realised that if we use the "geonames nearby city" logic to fix suburbs, that will automatically convert all cities to their standard geonames value. Also, most of the providers return lat/long values, so that may be the only data we will need from them. So the final pipeline will look like:

lat, long of ip from provider
|
V
closest big city name from geonames
|
V
city name with fixed ascii characters from our current logic
|
V
final value
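
A hedged sketch of the last two steps of that pipeline. The existing ASCII logic is not shown in this thread, so `to_ascii` below is only one common way to strip diacritics, and `lookup` is a placeholder for whatever resolves coordinates to a city name.

```python
import unicodedata


def to_ascii(name):
    """Strip combining accents, e.g. turn 'Kraków' into 'Krakow'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def final_city(lat, lon, lookup):
    """lookup is any callable (lat, lon) -> city name or None, e.g. a geonames query."""
    name = lookup(lat, lon)
    return to_ascii(name) if name else None


# The lambda stands in for the real "closest big city" lookup.
print(final_city(50.06, 19.94, lambda lat, lon: "Kraków"))
```

Letters without a Unicode decomposition (for example "Ł") are left unchanged by this approach, so a real implementation may need an extra mapping for them.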

jimaek commented 1 year ago

Looks interesting. But which DB do we use for coordinates? Voting system or static ipinfo?

alexey-yarmosh commented 1 year ago

Voting, all providers have lat long info so it is lat/long of a winner.

MartinKolarik commented 1 year ago

> So it seems suburbs can be fixed pretty well using coordinates. I used the geonames.org API to get all cities in a 100km area around the coordinates provided by ipinfo, then picked the city with the biggest population. See google sheet. In most cases it fixed the location correctly. A few times 100km was too big, which led to a wrong result (valdivia -> osorno). A few times the DC itself provides a suburb as the location (milpitas, ashburn), so that can't be fixed.

I'm not quite sure about this because with such a big radius we'll only allow a few possible cities per state. Related:

> Yeah, we are choosing between accuracy and correctness. Maybe storing both the suburb and the main city will do the trick here too? Also, is there a use case where a user will need exactly "Piscataway" or similar instead of "New York"?

The way I see it, in a "global" use case where you want to e.g. measure latency from different locations, this difference doesn't matter. But in such a case, why would you use a city as a location in the first place (instead of e.g. country)? On the other hand, for someone debugging, let's say a connectivity issue in a specific location, the difference might matter.

It seems to me that by making the city location only approximate, we might lose part of the reason for the field being there.

alexey-yarmosh commented 1 year ago

> The way I see it, in a "global" use case where you want to e.g. measure latency from different locations, this difference doesn't matter. But in such a case, why would you use a city as a location in the first place (instead of e.g. country)? On the other hand, for someone debugging, let's say a connectivity issue in a specific location, the difference might matter.

Maybe we should apply the suburb logic only if there is no consensus between providers? E.g. pr1 = "Dadri", pr2 = "Noida", pr3 = "Gautam Buddha Nagar". We are not sure what to choose, so we use "Delhi". In the other case, when pr1 = pr2 = pr3 = "Piscataway", we use "Piscataway".

jimaek commented 1 year ago

Why not assign multiple cities to each probe? The one reported directly by the DB, which could be the suburb or small town, and also the second, resolved city.

This way everyone will benefit, you can pinpoint if you want or just get everything that is close by.

alexey-yarmosh commented 1 year ago

When storing multiple cities, the question is which small city should be stored. In the "Dadri", "Noida", "Gautam Buddha Nagar" example, what should we store along with "Delhi"?

jimaek commented 1 year ago

The winning DB provides a city and we try to resolve it. If that's possible, we set both. And again, if possible, the largest entity becomes "city" and the smallest becomes "city-extra" or something.

alexey-yarmosh commented 1 year ago

Yes, that should work. I think it's also worth leaving "city-extra" blank if no other provider returned the same value. If they are all different, that means the exact location is pretty random and it's better to provide only the main city without a potentially false suburb.

alexey-yarmosh commented 1 year ago

Here is a table with different search radii for the main city. Results may be too bound to ipinfo and the available data set, but that is what we have. A 50-80km radius is the best and improves the accuracy by 15%.

Another table shows the 60km suburb logic applied to all providers, along with their initial values. There are also a few naive voting algorithms: the current algorithm, an algorithm that uses all 6 providers, an algorithm that uses suburb-fixed values from all 6 providers, and an algorithm that uses the 6 provider values + 6 suburb-fixed values.

alexey-yarmosh commented 1 year ago

Updated algorithm:

providers return city values
|
V
values are normalized using #163
|
V
voting happens and a single winner is chosen
|
V
searching main city by lat/long of the winner in a specified radius
|
V
main city value is stored in a "city" field
|
V
if winner city !== main city, and winner city was returned by >1 provider then store winner city in a "city-extra" field ("area"?)
|
V
done. either only "city", or both "city" and "city-extra" fields are filled
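
The last two steps of the flow above could look roughly like this. It is a sketch only: `assign_fields` is a hypothetical helper, it assumes the voting and the main city lookup have already happened, and it assumes a main city was actually found.

```python
def assign_fields(winner_city, winner_votes, main_city):
    """Build the probe location fields from the vote winner and the resolved main city.

    winner_votes is the number of providers that returned winner_city.
    """
    fields = {"city": main_city}
    # Keep the suburb only when more than one provider agreed on it.
    if winner_city != main_city and winner_votes > 1:
        fields["city-extra"] = winner_city
    return fields


print(assign_fields("Piscataway", 3, "New York"))
print(assign_fields("Dadri", 1, "Delhi"))
```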

Regarding voting. Accuracy of providers is:

fastly: 0.45
ipinfo: 0.56
maxmind: 0.58
db-ip: 0.62
ip2location: 0.67
ipmap: 0.7

Let's remove fastly, give every provider 1 vote, and sort the priority by accuracy. The final list will look like: [ipmap, ip2location, db-ip, maxmind, ipinfo]

- E.g. [ipmap: A, ip2location: B, db-ip: C, maxmind: D, ipinfo: E] => A wins as ipmap has the highest priority
- E.g. [ipmap: A, ip2location: B, db-ip: A, maxmind: B, ipinfo: C] => 2 for A, 2 for B, 1 for C => A wins as ipmap goes before ip2location
- E.g. [ipmap: A, ip2location: B, db-ip: B, maxmind: C, ipinfo: C] => 1 for A, 2 for B, 2 for C => B wins as ip2location goes before maxmind

I can start the implementation by adding the new providers, and then continue down the algorithm list.
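
A small sketch of this voting rule, assuming answers arrive as a provider-to-city mapping. `PRIORITY` mirrors the list above; the `vote` function itself is an illustration, not the project's implementation.

```python
from collections import Counter

# Providers sorted by measured accuracy, best first (fastly removed).
PRIORITY = ["ipmap", "ip2location", "db-ip", "maxmind", "ipinfo"]


def vote(answers, priority=PRIORITY):
    """answers: {provider: city}. One vote per provider, ties broken by priority."""
    counts = Counter(answers[p] for p in priority if answers.get(p))
    if not counts:
        return None
    top = max(counts.values())
    # Walk providers from highest priority; the first one backing a top-voted city wins.
    for provider in priority:
        city = answers.get(provider)
        if city and counts[city] == top:
            return city


print(vote({"ipmap": "A", "ip2location": "B", "db-ip": "B", "maxmind": "C", "ipinfo": "C"}))
```

Missing or empty answers are simply skipped, so a provider that returns nothing neither votes nor breaks ties.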

jimaek commented 1 year ago

I have new probes and here are some false positives:

- 128.14.103.113 = Mexico City - GP says Los Angeles
- 156.59.61.242 = Manila - GP says Shanghai
- 128.14.105.249 = Singapore - GP says Los Angeles

Please check if new logic fixes them

MartinKolarik commented 1 year ago

> I have new probes and here are some false positives:
>
> - 128.14.103.113 = Mexico City - GP says Los Angeles
> - 156.59.61.242 = Manila - GP says Shanghai
> - 128.14.105.249 = Singapore - GP says Los Angeles
>
> Please check if new logic fixes them

Unfortunately, ipinfo is the only provider that got 2 of those correct and 1 at least close. All other providers are wrong. @alexey-yarmosh please add to the table.

alexey-yarmosh commented 1 year ago

> @alexey-yarmosh please add to the table

Sure. @jimaek let's also add new IPs with known correct locations. Maybe other providers were wrong there too, and that will affect the accuracy. You can DM me the list if you have one.

MartinKolarik commented 1 year ago

Regarding the radiuses:

The idea makes sense. Let's add it after #382.
I suggest we use a lower value than 60; 30 seems reasonable. This is because when you consider a typical radius of a city (check some on a map), the "suburbs" are typically 15-30 km from the center. At 60 km, it's so far we're really just guessing. Based on the data, 30 km provides almost the same improvement level and won't collapse correct smaller cities into the big ones (discussed this part with @jimaek last week).
For now, I wouldn't add any "extra" fields. If the radius is kept at 30 km, just use the value as the current city field.
Re "It depends on do we want to vote before the radius matching (using the original provider cities) or after (using the radius approximated cities). I was thinking about the first option, but second makes sense too." I'd go with second.

alexey-yarmosh commented 1 year ago

Sounds reasonable, additionally since radius approximation is a first step, city name normalisation will be done automatically. I'll focus on the implementation then.

alexey-yarmosh commented 1 year ago

The only geo ip thing to implement is city name normalisation. So we can close either this issue or https://github.com/jsdelivr/globalping/issues/163

jsdelivr / globalping

IP Geo tools to verify progress #264