mar-muel / local-geocode

Simple library for efficient geocoding without making API calls
MIT License

Swiss cities decoded as cantons #9

Open atemerev opened 3 weeks ago

atemerev commented 3 weeks ago

gc.decode("Geneva")

[{'name': 'Geneva', 'official_name': 'Canton de Genève', 'country_code': 'CH', 'longitude': 6.11044, 'latitude': 46.19673, 'geoname_id': '2660645', 'location_type': 'admin1', 'population': 506343}]

Should be decoded as a city (or at least included as an option)

mar-muel commented 3 weeks ago

Hi! Not a bug, but a configuration issue.

Geneva seems to be just below the 200k "large city" population cutoff.

See here: https://github.com/mar-muel/local-geocode?tab=readme-ov-file#configuration

Can you try again using a cutoff of 150k?

from geocode.geocode import Geocode

gc = Geocode(large_city_population_cutoff=150_000)
gc.load()
print(gc.decode("Geneva"))

atemerev commented 3 weeks ago

Yes, I tried, still doesn't work.

The problem is apparently in these two lines in encoder.py:

# drop name duplicates by keeping only the high priority elements
df['name_lower'] = df['name'].str.lower()
df = df.drop_duplicates('name_lower', keep='first')

Geneva is both a canton and a city (like Zurich, Bern, etc.), so both entries share the same lowercased name. Cantons are given higher priority here (i.e. a lower priority index), so the city entry is dropped as a duplicate and only the canton is returned.

I overrode the method to exclude these lines, so the encoder now returns both the city and the canton. Works for me, but perhaps we should think about how to apply this in general.
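
For anyone following along, here is a minimal sketch of the effect on a toy frame. The rows, the `location_type` labels and the `priority` column are made up for illustration; only the two deduplication lines come from encoder.py. The last line is one possible alternative (an assumption, not what the library does): deduplicate on name plus location type, so a city and a canton with the same name both survive.

```python
import pandas as pd

# Hypothetical rows, already sorted by priority (lower index = higher priority).
df = pd.DataFrame(
    [
        {"name": "Geneva", "location_type": "admin1", "priority": 2},
        {"name": "Geneva", "location_type": "place", "priority": 4},
        {"name": "Lausanne", "location_type": "place", "priority": 4},
    ]
)

# The two lines quoted above: only the first row per lowercased name is kept.
df["name_lower"] = df["name"].str.lower()
deduped = df.drop_duplicates("name_lower", keep="first")

# Possible alternative (assumption): treat (name, location_type) as the key.
kept_both = df.drop_duplicates(["name_lower", "location_type"], keep="first")

print(deduped[["name", "location_type"]])
print(kept_both[["name", "location_type"]])
```

In this toy data `deduped` keeps only the admin1 row for Geneva, while `kept_both` keeps all three rows.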

Thanks for the excellent project, BTW! It is a lifesaver.

mar-muel commented 3 weeks ago

I was hoping that lowering the large city cutoff would give the city of Geneva a higher priority than the admin1 level (see prioritization below). I would need to look into what's going on here. But I'm happy you found a solution that works for you!

        # Priorities
        # 1) Large cities (population size > large_city_population_cutoff)
        # 2) States/provinces (admin_level == 1)
        # 3) Countries (admin_level == 0)
        # 4) Places
        # 5) counties (admin_level > 1)
        # 6) continents
        # 7) regions
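
A rough sketch of how that ordering could be expressed, to show what I expected the lower cutoff to do. This is not the code in the library: the function name, the type labels and the 190k population are placeholders I picked for illustration (the real figure is only known here to be "just below 200k").

```python
# Hypothetical type labels mapped to the priority list above (1 = highest).
BASE_PRIORITY = {
    "admin1": 2,       # states/provinces
    "country": 3,
    "place": 4,
    "admin_other": 5,  # counties, admin_level > 1
    "continent": 6,
    "region": 7,
}


def priority(location_type, population, large_city_population_cutoff=200_000):
    """Return the priority index of a record; large places jump to the top."""
    if location_type == "place" and population > large_city_population_cutoff:
        return 1
    return BASE_PRIORITY[location_type]


city_population = 190_000  # placeholder value, not taken from the dataset

default_rank = priority("place", city_population)           # default cutoff
lowered_rank = priority("place", city_population, 150_000)  # lowered cutoff
canton_rank = priority("admin1", 506_343)

print(default_rank, lowered_rank, canton_rank)
```

Under these assumptions the city only outranks the canton once the cutoff is lowered. If it still loses with a 150k cutoff, the city record is probably not being classified the way this sketch assumes.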

atemerev commented 3 weeks ago

I think the reason is that in the original Geonames dataset, Geneva is classified as "administrative center of the corresponding canton" for some reason, not a "city". I'll take a closer look today.
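
A small sketch of how this could be checked, assuming the standard tab-separated GeoNames dump layout (19 columns, no header row). The helper names and the sample rows are hypothetical. The canton row reuses the id and population from the output above, the second row is a pure placeholder, and the feature codes (ADM1 for a first-order division, PPLA for its seat) are my assumption of what the dump would contain, not values I have verified.

```python
import csv
import io

# Column order of the GeoNames dump (tab-separated, no header).
COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def make_row(**fields):
    """Build one tab-separated line; columns that are not given stay empty."""
    return "\t".join(str(fields.get(column, "")) for column in COLUMNS)


def find_by_name(tsv_text, name):
    """Return every record whose name matches, compared case-insensitively."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    wanted = name.lower()
    return [
        dict(zip(COLUMNS, row))
        for row in reader
        if len(row) == len(COLUMNS) and row[1].lower() == wanted
    ]


# Illustrative sample only; the second row is a placeholder, not real data.
sample = "\n".join([
    make_row(geonameid="2660645", name="Geneva", feature_class="A",
             feature_code="ADM1", country_code="CH", population=506343),
    make_row(geonameid="0", name="Geneva", feature_class="P",
             feature_code="PPLA", country_code="CH"),
])

matches = find_by_name(sample, "Geneva")
codes = [(m["feature_class"], m["feature_code"]) for m in matches]
print(codes)
```

Running the same lookup against the real dump file would show which feature class and code each "Geneva" entry actually carries.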