mar-muel / local-geocode

Simple library for efficient geocoding without making API calls
MIT License

Way to change prioritisation #5

Open nicostombros opened 7 months ago

nicostombros commented 7 months ago

Hello, I'm having the following issue with a local-geocode search and would like to know how to work around it.

Input: Toledo, Spain
Expected Output: Toledo, Spain [documented on geonames here]
Actual Output: Toledo, Ohio [documented on geonames here]

I suspect this has to do with the prioritisation: both Toledos were discovered, but the Ohio one is being prioritised due to its larger population? If so, it would be great to have a mechanism to fall back to some other type of prioritisation, or even to filter on the country via a parameter in the decode function. Thanks!

mar-muel commented 7 months ago

Hi - right yeah, I was often surprised by some results. Let's take a look at this specific case to see why this is.

Let's first pull in the raw data. Please note that if you do this it will pull several GBs of data from geonames, extract it and load it all into a pandas dataframe. Be sure to have enough RAM/disk space. Maybe best to run it on a Google Colab.

from geocode.geocode import Geocode

gc = Geocode()
df = gc.get_geonames_data()

Now let's look for all places named Toledo in the US and Spain with a population above 30k (which is the default threshold):

>>> df[(df.name == 'Toledo') & (df.country_code.isin(['US', 'ES'])) & (df.population > 30_000)]
         geoname_id    name                                     alternatenames  latitude  longitude feature_code country_code  population
3029006     2510409  Toledo  Taleda,Toledas,Tolede,Toledo,Toledo i Spania,T...  39.85810   -4.02263         PPLA           ES       84282
3076342     6361828  Toledo                                              45168  39.86765   -4.00988         ADM3           ES       84019
11015695    5174035  Toledo  Fort Industry,Port Lawrence,TOL,Talida,Toledo,...  41.66394  -83.55521        PPLA2           US      279789

From this I can see that Toledo, Ohio actually has a much larger population (279k) than Toledo, Spain (84k).

Unfortunately, at this time the Spanish Toledo gets "swallowed" by the US one because of the population difference, and the library only outputs a single entry for "Toledo". I think the reason I decided to do this was because of the huge number of "collisions". There's a surprisingly large number of Toledos in the world (e.g. apparently there's an admin area Toledo in Brazil with 142k inhabitants):

         geoname_id    name                                     alternatenames  latitude  longitude feature_code country_code  population
231525      3834251  Toledo                                             Toledo -31.55378  -64.00742          PPL           AR        3046
810719      3446370  Toledo                                  TOW,Toledo,Толедо -24.71361  -53.74306          PPL           BR      119313
857867      6321886  Toledo                                                NaN -22.70569  -46.39041         ADM2           BR        5761
859000      6323019  Toledo                                                NaN -24.73518  -53.82059         ADM2           BR      142645
2442867     3666959  Toledo                                             Toledo   7.30984  -72.48295        PPLA2           CO        5911
2442869     3666961  Toledo                         Municipio de Toledo,Toledo   7.23217  -72.31127         ADM2           CO       17272
2442870     3666962  Toledo                         Municipio de Toledo,Toledo   7.02349  -75.71487         ADM2           CO        5697
3029006     2510409  Toledo  Taleda,Toledas,Tolede,Toledo,Toledo i Spania,T...  39.85810   -4.02263         PPLA           ES       84282
3076342     6361828  Toledo                                              45168  39.86765   -4.00988         ADM3           ES       84019
6459331     3981414  Toledo                                             Toledo  24.80061 -104.46510          PPL           MX         247
6605347     8885405  Toledo                                             Toledo  18.92681  -98.37497          PPL           MX          73
6621485     8901548  Toledo                                             Toledo  22.94361 -101.32222          PPL           MX          25
6659939     8940027  Toledo                                             Toledo  16.73667  -93.21861          PPL           MX           5
8130440     1681602  Toledo  Ciudad ti Toledo,Dakbayan sa Talisay,Dakbayan ...  10.37730  123.63860          PPL           PH      207314
8545351     3372516  Toledo                                             Toledo  38.70000  -28.15000          PPL           PT          60
10105657    4251330  Toledo          Majority Point,Prairie City,Toledo,Толедо  39.27365  -88.24365        PPLA2           US        1221
10721872    4878703  Toledo  Toledo,Tolehdo,Tolido,twldw  aywwa,twlydw,Толе...  41.99555  -92.57686        PPLA2           US        2202
11015695    5174035  Toledo  Fort Industry,Port Lawrence,TOL,Talida,Toledo,...  41.66394  -83.55521        PPLA2           US      279789
11596146    5757007  Toledo                                      Toledo,Толедо  44.62151 -123.93845          PPL           US        3511
11652493    5813681  Toledo  Cowlitz Landing,Plomondon's Landing,TDO,Toledo...  46.43983 -122.84678          PPL           US         727
12141864    3439838  Toledo                                             Toledo -34.73807  -56.09469          PPL           UY        4397

So if I wanted to show some Toledos and not others, there would need to be some general rule for which ones to include and in which order. Let me know if you have any suggestions to overcome this issue.
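To make the "swallowing" concrete, here is a minimal sketch using the three rows from the filtered output above. It assumes, purely for illustration, that collisions are resolved by keeping the most populous row per name; the library's real rule also involves priorities, so this is not its actual implementation.

```python
import pandas as pd

# Three rows copied from the filtered output above (subset of columns).
rows = pd.DataFrame({
    "geoname_id": [2510409, 6361828, 5174035],
    "name": ["Toledo", "Toledo", "Toledo"],
    "feature_code": ["PPLA", "ADM3", "PPLA2"],
    "country_code": ["ES", "ES", "US"],
    "population": [84282, 84019, 279789],
})

# Assumed collision rule for this sketch: keep only the most populous
# row for each name, so every other "Toledo" disappears.
winner = (
    rows.sort_values("population", ascending=False)
        .drop_duplicates(subset="name", keep="first")
)

print(winner[["name", "country_code", "population"]])
```

With only one row kept per name, the US entry wins and both Spanish entries are dropped.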

nicostombros commented 7 months ago

Makes sense, and thank you for the thorough explanation. I think there are two possible solutions here.

Kwargs for decode that can be used for filtering

I'd imagine having these on the decode function, so it's defined something like:

def decode(self, input_text, feature_code_in=None, country_code_in=None, population_between=(0, float('inf'))):
    matches = self.kp.extract_keywords(input_text)

    # Either doing this in the extract_keywords call or here
    # (pseudocode: assumes matches can be filtered like a dataframe):
    if feature_code_in:
        matches = matches[matches.feature_code.isin(feature_code_in)]
    if country_code_in:
        matches = matches[matches.country_code.isin(country_code_in)]
    ...
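The filtering idea above could be sketched on a plain pandas dataframe of candidate rows. This is only an illustration: `filter_candidates` and its parameters are hypothetical names, and it does not reflect how decode in local-geocode currently works.

```python
import pandas as pd

def filter_candidates(df, feature_code_in=None, country_code_in=None,
                      population_between=None):
    """Return the rows of df that pass every filter that was given."""
    mask = pd.Series(True, index=df.index)
    if feature_code_in:
        mask &= df["feature_code"].isin(feature_code_in)
    if country_code_in:
        mask &= df["country_code"].isin(country_code_in)
    if population_between is not None:
        low, high = population_between
        mask &= df["population"].between(low, high)  # inclusive bounds
    return df[mask]

# Rows taken from the geonames output earlier in this thread.
candidates = pd.DataFrame({
    "geoname_id": [2510409, 6361828, 5174035],
    "name": ["Toledo", "Toledo", "Toledo"],
    "feature_code": ["PPLA", "ADM3", "PPLA2"],
    "country_code": ["ES", "ES", "US"],
    "population": [84282, 84019, 279789],
})

# Restrict to Spain first, then pick the most populous remaining row.
spanish = filter_candidates(candidates, country_code_in=["ES"])
best = spanish.sort_values("population", ascending=False).iloc[0]
print(best["geoname_id"], best["country_code"])
```

Filtering before ranking means the larger US city never competes with the Spanish one.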

Prioritisation order passed to the Geocode init

This would probably be similar to the above, but where a parameter is passed to control the prioritisation. I'd imagine this may be good to have both statically (for repickling) and dynamically (at decode time). This could be a dictionary where the keys are the 7 priority groups mentioned in the code and the values are integer priorities. Something like:

{
    "feature_code_class_A_admin_level_1": 4,
    "feature_code_class_A_admin_level_0": 3,
    ...
}

Or the keys could be more dynamic than that, performing custom dataframe operations.

The first would likely be easier, and I haven't fully thought through the second one, but I think this sort of granular control would be really special.
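As a rough sketch of the second idea, a user-supplied mapping could be turned into a sort key. Everything here is an assumption made for illustration: the mapping is keyed by geonames feature code rather than by the library's priority groups, the priority values are invented, and a lower number is taken to mean higher priority.

```python
import pandas as pd

# Hypothetical user-supplied priorities (lower number = preferred).
priorities = {"PPLA": 1, "PPLA2": 2, "ADM3": 3}

# Rows taken from the geonames output earlier in this thread.
candidates = pd.DataFrame({
    "geoname_id": [2510409, 6361828, 5174035],
    "name": ["Toledo", "Toledo", "Toledo"],
    "feature_code": ["PPLA", "ADM3", "PPLA2"],
    "country_code": ["ES", "ES", "US"],
    "population": [84282, 84019, 279789],
})

# Rank by priority first, then by population within the same priority.
ranked = (
    candidates
    .assign(priority=candidates["feature_code"].map(priorities))
    .sort_values(["priority", "population"], ascending=[True, False])
)

print(ranked[["geoname_id", "feature_code", "country_code", "priority"]])
```

Under this made-up mapping the Spanish PPLA entry ranks ahead of the more populous US entry.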

nicostombros commented 7 months ago

Any thoughts on this @mar-muel?

mar-muel commented 7 months ago

Hi - The prioritization/filtering cannot be changed at decode time. This is because both prioritization and filtering affect the pickle files.

These arguments would have to be passed to __init__, and running gc.load() for the first time would then trigger recomputation of the pickle files.
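One generic way to get this kind of behaviour is to derive the cache file name from the constructor arguments, so that a new combination of arguments misses the cache and forces a rebuild. The sketch below shows that pattern only; it is not local-geocode's actual code, and `cache_path`, `country_code_in` and the file naming are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def cache_path(cache_dir, **params):
    """Build a pickle path whose name depends on the given parameters."""
    # Sort keys so the same parameters always produce the same digest.
    encoded = json.dumps(params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha1(encoded).hexdigest()[:12]
    return Path(cache_dir) / f"geonames_{digest}.pkl"

default = cache_path(
    "cache",
    min_population_cutoff=30000,
    large_city_population_cutoff=200000,
    location_types=None,
)

# Adding a (hypothetical) country filter changes the digest, so a loader
# that checks for this file would not find it and would recompute.
spain_only = cache_path(
    "cache",
    min_population_cutoff=30000,
    large_city_population_cutoff=200000,
    location_types=None,
    country_code_in=["ES"],
)

print(default != spain_only)
```

The trade-off is one pickle file per distinct argument combination, which costs disk space but avoids silently reusing data built with different filters.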

Currently we have these 3 arguments:

min_population_cutoff=30000, large_city_population_cutoff=200000, location_types=None

Happy to add country_code_in (or similar) and a way to provide priorities. Though not sure about changing priorities globally/arbitrarily. Currently this is how priorities are assigned:

        # Priorities
        # 1) Large cities (population size > large_city_population_cutoff)
        # 2) States/provinces (admin_level == 1)
        # 3) Countries (admin_level = 0)
        # 4) Places
        # 5) counties (admin_level > 1)
        # 6) continents
        # 7) regions
        # (within each group we will sort according to population size)
        # Assigning priorities
        df['priority'] = np.nan
        df.loc[(df.feature_code == 'RGN'), 'priority'] = 7
        df.loc[(df.feature_code == 'CONT'), 'priority'] = 6
        df.loc[(df.feature_code_class == 'A') & (df.admin_level > 1), 'priority'] = 5
        df.loc[df.feature_code_class == 'P', 'priority'] = 4
        df.loc[(df.feature_code_class == 'A') & (df.admin_level == 0), 'priority'] = 3
        df.loc[(df.feature_code_class == 'A') & (df.admin_level == 1), 'priority'] = 2
        df.loc[(df.population > self.large_city_population_cutoff) & (df.feature_code_class == 'P') & (~df.is_altname), 'priority'] = 1

nicostombros commented 7 months ago

That makes sense @mar-muel. I think country_code_in is a suitable solution, since a user may well be interested only in, say, North American countries.

nicostombros commented 7 months ago

I've created a PR for this now.