mar-muel / local-geocode

Simple library for efficient geocoding without making API calls
MIT License

Odd number of countries retrieved during retrieval of data #4

Closed by nicostombros 9 months ago

nicostombros commented 9 months ago

Hey there, I was reinitialising the Geocode class, when I noticed that the number of countries returned by the file is much greater than what I would expect. Screenshot below of the data retrieved and the count of the countries. I would expect this number to be more like 200 so it's closer to the countries listed on this page https://www.geonames.org/countries/

[Screenshot 2024-02-04 at 18 19 45]

This is a fantastic library btw, thank you for providing it!

mar-muel commented 9 months ago

Hey there - If I remember correctly it's because country names have a ton of different variants (think e.g. US, USA, United States, etc...) and also have various spellings in different languages, e.g. Italy, Italia, Repubblica Italiana, etc.

If your application only relies on English country names there might be ways to filter this specifically 🤔 Would need to look into how the country aliases are annotated. Thing is that the "official" country names are interestingly rarely used by anyone (i.e. people don't tend to spell out the full name of the US normally), so we can't just ignore aliases.
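To make the alias effect concrete, here is a minimal sketch of why counting names gives a much larger number than counting countries. The records are hypothetical and only mimic the shape of the dicts that `gc.decode` returns (`name`, `official_name`, `country_code`); they are not taken from the library's data files, and `group_aliases_by_country` is an illustrative helper, not part of local-geocode.

```python
from collections import defaultdict

# Hypothetical records: one row per alias, shaped like gc.decode() results.
records = [
    {"name": "US", "official_name": "United States", "country_code": "US"},
    {"name": "USA", "official_name": "United States", "country_code": "US"},
    {"name": "United States", "official_name": "United States", "country_code": "US"},
    {"name": "Italy", "official_name": "Italy", "country_code": "IT"},
    {"name": "Italia", "official_name": "Italy", "country_code": "IT"},
    {"name": "Repubblica Italiana", "official_name": "Italy", "country_code": "IT"},
]


def group_aliases_by_country(rows):
    """Map each country code to the set of alias names that point to it."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["country_code"]].add(row["name"])
    return dict(grouped)


aliases = group_aliases_by_country(records)
print(len(records))  # number of alias rows
print(len(aliases))  # number of distinct countries
```

Counting distinct `country_code` values rather than rows should land close to the roughly 200 entries on the GeoNames country list, assuming every alias row carries a country code.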

nicostombros commented 9 months ago

Hey @mar-muel, that makes sense, thanks! Yes, it would be great to be able to filter by English only. I see in the code that the featureCodes_en.txt file is downloaded; does that not filter by English only? I'm using this as my reference for the available APIs https://www.geonames.org/export/ws-overview.html but I'm not sure if this is the correct source...

mar-muel commented 9 months ago

Unfortunately, it seems like the alternate names of places are not properly annotated. E.g. the alternate names of Toledo, Spain are given as a list of strings without any sort of language annotation:

'Taleda,Toledas,Tolede,Toledo,Toledo i Spania,Toledu,Toletum,Toleu,Tolède,XTJ,tlytlt,to le do,toledo,toleto,tolledo,toredo,tuo lai duo,tuo li duo,twldw,twldw  aspanya,twlydw,Τολέδο,Таледа,Толедо,Տոլեդո,טאלעדא,טולדו,تولدو، اسپانیا,توليدو,طليطلة,طلیطلہ,तोलेदो,ਤੋਲੇਦੋ,டொலேடோ,โตเลโด,ტოლედო,ቶሌዶ,トレド,托利多,托萊多,톨레도'

As I mentioned above, I cannot simply ignore these alternative names of places as they are sometimes more meaningful than the official names. Else something like this would not be possible:

>>> gc.decode("L.A.")
[{'name': 'L.A.', 'official_name': 'Los Angeles', 'country_code': 'US', 'longitude': -118.24368, 'latitude': 34.05223, 'geoname_id': '5368361', 'location_type': 'city', 'population': 3898747}]
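
Since the alternate names carry no language tag, one rough workaround is to filter by script instead of by language. The sketch below is an assumption-laden heuristic, not something local-geocode provides: `is_latin_script` is a hypothetical helper that keeps a name only if all of its letters are Latin according to `unicodedata`. The sample string is a shortened subset of the Toledo list above.

```python
import unicodedata


def is_latin_script(name: str) -> bool:
    """Return True if the name has letters and all of them are Latin."""
    letters = [ch for ch in name if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


# Shortened subset of the comma-separated alternate names for Toledo, Spain.
alternate_names = "Taleda,Toledas,Tolède,Τολέδο,Толедо,トレド,托利多"

latin_only = [n for n in alternate_names.split(",") if is_latin_script(n)]
print(latin_only)  # ['Taleda', 'Toledas', 'Tolède']
```

This is not an English-only filter: it still keeps Latin-script names from other languages (such as Tolède) as well as romanizations, so it reduces the number of variants without isolating the English ones.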

nicostombros commented 9 months ago

Unfortunate but understandable, thanks :)