Open nicostombros opened 7 months ago
Hi - right yeah, I was often surprised by some results. Let's take a look at this specific case to see why this is.
Let's first pull in the raw data. Please note that if you do this it will pull several GBs of data from geonames, extract it and load it all into a pandas dataframe. Be sure to have enough RAM/disk space. Maybe best to run it on a Google Colab.
from geocode.geocode import Geocode
gc = Geocode()
df = gc.get_geonames_data()
Now let's look for all places named Toledo in the US and Spain with a population above 30k (which is the default threshold):
>>> df[(df.name == 'Toledo') & (df.country_code.isin(['US', 'ES'])) & (df.population > 30_000)]
geoname_id name alternatenames latitude longitude feature_code country_code population
3029006 2510409 Toledo Taleda,Toledas,Tolede,Toledo,Toledo i Spania,T... 39.85810 -4.02263 PPLA ES 84282
3076342 6361828 Toledo 45168 39.86765 -4.00988 ADM3 ES 84019
11015695 5174035 Toledo Fort Industry,Port Lawrence,TOL,Talida,Toledo,... 41.66394 -83.55521 PPLA2 US 279789
From this I can see that Toledo, Ohio actually has a much larger population (about 280k) than Toledo, Spain (about 84k).
Unfortunately, at this time the Spanish Toledo gets "swallowed" by the US one because of the population difference, and the library only outputs a single entry for "Toledo". I think the reason I decided to do this was because of the huge number of "collisions". There's a surprisingly large number of Toledos in the world (e.g. apparently there's an admin area Toledo in Brazil with 142k inhabitants):
geoname_id name alternatenames latitude longitude feature_code country_code population
231525 3834251 Toledo Toledo -31.55378 -64.00742 PPL AR 3046
810719 3446370 Toledo TOW,Toledo,Толедо -24.71361 -53.74306 PPL BR 119313
857867 6321886 Toledo NaN -22.70569 -46.39041 ADM2 BR 5761
859000 6323019 Toledo NaN -24.73518 -53.82059 ADM2 BR 142645
2442867 3666959 Toledo Toledo 7.30984 -72.48295 PPLA2 CO 5911
2442869 3666961 Toledo Municipio de Toledo,Toledo 7.23217 -72.31127 ADM2 CO 17272
2442870 3666962 Toledo Municipio de Toledo,Toledo 7.02349 -75.71487 ADM2 CO 5697
3029006 2510409 Toledo Taleda,Toledas,Tolede,Toledo,Toledo i Spania,T... 39.85810 -4.02263 PPLA ES 84282
3076342 6361828 Toledo 45168 39.86765 -4.00988 ADM3 ES 84019
6459331 3981414 Toledo Toledo 24.80061 -104.46510 PPL MX 247
6605347 8885405 Toledo Toledo 18.92681 -98.37497 PPL MX 73
6621485 8901548 Toledo Toledo 22.94361 -101.32222 PPL MX 25
6659939 8940027 Toledo Toledo 16.73667 -93.21861 PPL MX 5
8130440 1681602 Toledo Ciudad ti Toledo,Dakbayan sa Talisay,Dakbayan ... 10.37730 123.63860 PPL PH 207314
8545351 3372516 Toledo Toledo 38.70000 -28.15000 PPL PT 60
10105657 4251330 Toledo Majority Point,Prairie City,Toledo,Толедо 39.27365 -88.24365 PPLA2 US 1221
10721872 4878703 Toledo Toledo,Tolehdo,Tolido,twldw aywwa,twlydw,Толе... 41.99555 -92.57686 PPLA2 US 2202
11015695 5174035 Toledo Fort Industry,Port Lawrence,TOL,Talida,Toledo,... 41.66394 -83.55521 PPLA2 US 279789
11596146 5757007 Toledo Toledo,Толедо 44.62151 -123.93845 PPL US 3511
11652493 5813681 Toledo Cowlitz Landing,Plomondon's Landing,TDO,Toledo... 46.43983 -122.84678 PPL US 727
12141864 3439838 Toledo Toledo -34.73807 -56.09469 PPL UY 4397
So if I wanted to show some Toledos and not others, there would need to be a general rule for which ones to include and in which order. Let me know if you have any suggestions to overcome this issue.
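To make the "swallowing" concrete, here is a minimal sketch in plain Python. It is not the library's actual code: `pick_largest` is a hypothetical helper, and the rows are a subset copied from the table above. It simply keeps one entry per name, the most populous one.

```python
# Rows copied from the geonames output above:
# (name, country_code, feature_code, population)
candidates = [
    ("Toledo", "ES", "PPLA", 84282),
    ("Toledo", "BR", "ADM2", 142645),
    ("Toledo", "PH", "PPL", 207314),
    ("Toledo", "US", "PPLA2", 279789),
]


def pick_largest(rows):
    """Hypothetical helper: keep a single row per name, the most populous one."""
    best = {}
    for row in rows:
        name, population = row[0], row[3]
        if name not in best or population > best[name][3]:
            best[name] = row
    return best


winners = pick_largest(candidates)
print(winners["Toledo"])  # the US row, because it has the largest population
```

Any rule that keeps exactly one entry per name will drop the Spanish Toledo unless it takes extra context (such as a country) into account.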
Makes sense and thank you for the thorough explanation. I think there are two possible solutions here.
Kwargs for decode that can be used for filtering
I'd imagine having these on the decode function, so it's defined something like:
def decode(self, input_text, feature_code_in=None, country_code_in=None, population_between=(0, float("inf"))):
    matches = self.kp.extract_keywords(input_text)
    # Either doing this in the extract_keywords call or here:
    if feature_code_in:
        matches = matches[matches.feature_code.isin(feature_code_in)]
    if country_code_in:
        matches = matches[matches.country_code.isin(country_code_in)]
    ...
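A self-contained sketch of what such filtering could look like, written against plain dictionaries rather than the library's internals. `filter_matches` and the record layout are assumptions for illustration only; the real return type of `decode` may differ.

```python
def filter_matches(matches, feature_code_in=None, country_code_in=None,
                   population_between=None):
    """Hypothetical helper: filter candidate records by the proposed kwargs.

    `matches` is assumed to be a list of dicts with 'feature_code',
    'country_code' and 'population' keys.
    """
    result = list(matches)
    if feature_code_in:
        result = [m for m in result if m["feature_code"] in feature_code_in]
    if country_code_in:
        result = [m for m in result if m["country_code"] in country_code_in]
    if population_between is not None:
        low, high = population_between
        result = [m for m in result if low <= m["population"] <= high]
    return result


# Three of the Toledos from the table earlier in the thread.
toledos = [
    {"name": "Toledo", "country_code": "ES", "feature_code": "PPLA", "population": 84282},
    {"name": "Toledo", "country_code": "US", "feature_code": "PPLA2", "population": 279789},
    {"name": "Toledo", "country_code": "PH", "feature_code": "PPL", "population": 207314},
]

spanish = filter_matches(toledos, country_code_in=["ES"])
print(spanish)
```

Leaving every argument at its default returns the input unchanged, so existing callers would not be affected.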
Prioritisation order passed to the Geocode init
This would probably be similar to the above, but where a parameter is passed to control the prioritisation. I'd imagine this may be good to have both statically (for repickling) and dynamically (at decode time). This could be a dictionary where the keys are one of the 7 features mentioned in the code and their priority is an integer. Something like:
{
"feature_code_class_A_admin_level_1": 4,
"feature_code_class_A_admin_level_0": 3,
...
}
Or the keys could be more dynamic than that, performing custom dataframe operations.
The first would likely be easier, and I haven't fully thought through the second one, but I think this sort of granular control would be really special.
Any thoughts on this @mar-muel?
Hi - The prioritization/filtering cannot be changed at decode time. This is because both prioritization and filtering affect the pickle files.
These arguments would have to be passed to __init__ at runtime, and then running gc.load() for the first time would trigger recomputation of the pickle files.
Currently we have these 3 arguments:
min_population_cutoff=30000, large_city_population_cutoff=200000, location_types=None
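One way to picture why changing these arguments forces a recomputation: if the cached file were keyed on the arguments, any change would miss the cache. The sketch below is purely illustrative. The `cache_filename` helper and the naming scheme are assumptions, not how the library actually names or checks its pickle files.

```python
import hashlib
import json


def cache_filename(**init_args):
    """Hypothetical: derive a pickle filename from the init arguments."""
    # Sort keys so the same arguments always serialise to the same string.
    payload = json.dumps(init_args, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"geonames_{digest}.pkl"


default = cache_filename(min_population_cutoff=30000,
                         large_city_population_cutoff=200000,
                         location_types=None)
lowered = cache_filename(min_population_cutoff=10000,
                         large_city_population_cutoff=200000,
                         location_types=None)
print(default != lowered)  # different arguments map to different cache files
```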
Happy to add country_code_in (or similar) and a way to provide priorities, though I'm not sure about changing priorities globally/arbitrarily. Currently this is how priorities are assigned:
# Priorities
# 1) Large cities (population size > large_city_population_cutoff)
# 2) States/provinces (admin_level == 1)
# 3) Countries (admin_level == 0)
# 4) Places
# 5) Counties (admin_level > 1)
# 6) continents
# 7) regions
# (within each group we will sort according to population size)
# Assigning priorities
df['priority'] = np.nan
df.loc[(df.feature_code == 'RGN'), 'priority'] = 7
df.loc[(df.feature_code == 'CONT'), 'priority'] = 6
df.loc[(df.feature_code_class == 'A') & (df.admin_level > 1), 'priority'] = 5
df.loc[df.feature_code_class == 'P', 'priority'] = 4
df.loc[(df.feature_code_class == 'A') & (df.admin_level == 0), 'priority'] = 3
df.loc[(df.feature_code_class == 'A') & (df.admin_level == 1), 'priority'] = 2
df.loc[(df.population > self.large_city_population_cutoff) & (df.feature_code_class == 'P') & (~df.is_altname), 'priority'] = 1
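To see how these rules play out for the Toledo case, here is a simplified restatement in plain Python. It is a sketch under stated assumptions, not the library's code: it ignores `is_altname`, it assumes the PPLA/PPLA2 rows belong to feature class 'P', and it assumes the ADM3 row has `admin_level` 3. The `priority` function is hypothetical.

```python
LARGE_CITY_POPULATION_CUTOFF = 200_000  # default quoted earlier in the thread


def priority(feature_code_class, admin_level, population, feature_code=None):
    """Return the lowest matching priority number (1 = highest priority)."""
    if feature_code_class == "P" and population > LARGE_CITY_POPULATION_CUTOFF:
        return 1  # large cities
    if feature_code_class == "A" and admin_level == 1:
        return 2  # states/provinces
    if feature_code_class == "A" and admin_level == 0:
        return 3  # countries
    if feature_code_class == "P":
        return 4  # other places
    if feature_code_class == "A" and admin_level is not None and admin_level > 1:
        return 5  # counties
    if feature_code == "CONT":
        return 6  # continents
    if feature_code == "RGN":
        return 7  # regions
    return None


rows = [
    {"label": "Toledo, ES (PPLA)", "cls": "P", "admin_level": None, "population": 84282},
    {"label": "Toledo, ES (ADM3)", "cls": "A", "admin_level": 3, "population": 84019},
    {"label": "Toledo, US (PPLA2)", "cls": "P", "admin_level": None, "population": 279789},
]

# Sort by priority first, then by descending population within each group.
ranked = sorted(
    rows,
    key=lambda r: (priority(r["cls"], r["admin_level"], r["population"]), -r["population"]),
)
for r in ranked:
    print(priority(r["cls"], r["admin_level"], r["population"]), r["label"])
```

Under these assumptions Toledo, Ohio lands in the "large cities" group because it exceeds the 200k cutoff, so it outranks the Spanish city on priority alone, before population is even compared within a group.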
That makes sense @mar-muel. I think country_code_in is a suitable solution, since it may just be that a user is interested in only North American countries.
Created a PR for this now.
Hello, I'm having the following issue with a local-geocode search. My query is how to go about the following.
Input: Toledo, Spain
Expected Output: Toledo, Spain [documented on geonames here]
Actual Output: Toledo, Ohio [documented on geonames here]
I am expecting that this has something to do with the prioritisation, in that both Toledos were discovered but the Ohio one is being prioritised due to its larger population? If so, it would be great to have a mechanism to fall back to some other type of prioritisation, or even to filter on the country via a parameter in the decode function. Thanks!