somnathrakshit / geograpy3

Extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.
https://geograpy3.readthedocs.io
Apache License 2.0

geograpy.locateCity("Berlin") is returning US instead of DE #24

Closed (robertocommit closed this issue 3 years ago)

robertocommit commented 4 years ago

Hi Admin, first of all many thanks for this great library.

Here is my issue:

import geograpy

def extract_country(city_name):
    # Look up the city by name, then read the ISO code of its country.
    city = geograpy.locateCity(city_name)
    return city.country.iso

if __name__ == "__main__":
    print(extract_country("Berlin"))

As a result I get US.

Shouldn't it return DE instead?

Many thanks

WolfgangFahl commented 3 years ago

Indeed, our citiesWithPopulation table is broken. The query below returns no row for Berlin, Germany:

select country_iso_code,wikidataurl,cityPop,subdivision_1_iso_code from citiesWithPopulation where city_name='Berlin'

| country_iso_code | wikidataurl | cityPop | subdivision_1_iso_code |
|---|---|---|---|
| US | http://www.wikidata.org/entity/Q821199 | 19866.0 | CT |
| US | http://www.wikidata.org/entity/Q821244 | 10051.0 | NH |
| US | http://www.wikidata.org/entity/Q1569850 | 5524.0 | WI |
| US | http://www.wikidata.org/entity/Q614184 | 4485.0 | MD |
| US | http://www.wikidata.org/entity/Q524646 | 2866.0 | MA |
| SV | http://www.wikidata.org/entity/Q582242 | (null) | US |
| US | http://www.wikidata.org/entity/Q1086827 | (null) | NJ |
| US | http://www.wikidata.org/entity/Q1130950 | (null) | PA |

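
To make the gap concrete, here is a minimal sketch that loads the eight result rows above into an in-memory SQLite table and counts the Berlin entries per country. The table and column names are taken from the query; the column types, the `ROWS` list and the `berlin_countries` helper are assumptions made for this illustration and are not part of geograpy3.

```python
import sqlite3

# Rows copied from the query result above:
# (country_iso_code, wikidataurl, cityPop, subdivision_1_iso_code)
ROWS = [
    ("US", "http://www.wikidata.org/entity/Q821199", 19866.0, "CT"),
    ("US", "http://www.wikidata.org/entity/Q821244", 10051.0, "NH"),
    ("US", "http://www.wikidata.org/entity/Q1569850", 5524.0, "WI"),
    ("US", "http://www.wikidata.org/entity/Q614184", 4485.0, "MD"),
    ("US", "http://www.wikidata.org/entity/Q524646", 2866.0, "MA"),
    ("SV", "http://www.wikidata.org/entity/Q582242", None, "US"),
    ("US", "http://www.wikidata.org/entity/Q1086827", None, "NJ"),
    ("US", "http://www.wikidata.org/entity/Q1130950", None, "PA"),
]


def berlin_countries(rows):
    """Count the 'Berlin' rows per country code (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: only the columns used in the query above.
    conn.execute(
        "CREATE TABLE citiesWithPopulation ("
        "city_name TEXT, country_iso_code TEXT, wikidataurl TEXT, "
        "cityPop REAL, subdivision_1_iso_code TEXT)"
    )
    conn.executemany(
        "INSERT INTO citiesWithPopulation VALUES ('Berlin', ?, ?, ?, ?)", rows
    )
    result = conn.execute(
        "SELECT country_iso_code, COUNT(*) FROM citiesWithPopulation "
        "WHERE city_name = 'Berlin' "
        "GROUP BY country_iso_code ORDER BY country_iso_code"
    ).fetchall()
    conn.close()
    return result


counts = berlin_countries(ROWS)
print(counts)  # [('SV', 1), ('US', 7)]
```

With no DE row in the table, no ranking of these candidates can return Germany, whatever disambiguation rule is applied.
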
somnathrakshit commented 3 years ago

The same issue occurs with many other cities as well. The following code reproduces it. Note that Belgium is absent from the list of countries, whereas Brussels (probably Brussels, Wisconsin, USA) is present in the list of cities within the USA.

import geograpy
url = 'https://en.wikipedia.org/wiki/UEFA_Euro_2020'
places = geograpy.get_geoPlace_context(url=url)
print(places.country_cities)

Output:

{
    "Azerbaijan": ["Baku"],
    "Belarus": ["Dublin"],
    "Romania": ["Bucharest"],
    "Australia": ["Denmark", "Seville"],
    "Portugal": ["Portugal"],
    "Spain": ["Seville", "Bilbao"],
    "Denmark": ["Copenhagen"],
    "United Kingdom": ["March", "London", "Glasgow"],
    "Switzerland": ["Nyon"],
    "Netherlands": ["Vijfhuizen", "Rome", "Amsterdam"],
    "Belgium": ["Brussels"],
    "Germany": ["Munich"],
    "Ireland": ["Portmarnock", "Dublin"],
    "Hungary": ["Budapest"],
    "Italy": ["Rome"],
    "Colombia": ["Armenia"],
    "United States": [
        "England",
        "London",
        "Scotland",
        "Dublin",
        "Rome",
        "Brussels",
        "English",
        "Glasgow",
        "Amsterdam",
        "Turkey",
        "Denmark",
        "North",
        "Italy",
        "Ireland",
        "Finland",
        "Munich",
        "Copenhagen",
        "Russia",
        "Seville",
        "Belgium",
    ],
    "Canada": ["Brussels", "Dublin", "London", "Scotland"],
}
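
The ambiguity in this output can be made visible by inverting the mapping. The sketch below works on a hand-copied excerpt of the dictionary above (the "United States" list is shortened) and uses two hypothetical helpers, `ambiguous_cities` and `country_names_as_cities`, that are not part of geograpy3. It assumes nothing about the library beyond the dict shape shown in the output.

```python
from collections import defaultdict

# Excerpt of the output above, not the full result.
country_cities = {
    "Belgium": ["Brussels"],
    "Ireland": ["Portmarnock", "Dublin"],
    "Belarus": ["Dublin"],
    "United States": ["London", "Dublin", "Brussels", "Belgium"],
    "Canada": ["Brussels", "Dublin", "London", "Scotland"],
    "United Kingdom": ["March", "London", "Glasgow"],
}


def ambiguous_cities(mapping):
    """Return {city: sorted countries} for cities listed under several countries."""
    by_city = defaultdict(set)
    for country, cities in mapping.items():
        for city in cities:
            by_city[city].add(country)
    return {city: sorted(cs) for city, cs in by_city.items() if len(cs) > 1}


def country_names_as_cities(mapping):
    """Return 'city' entries that are also country keys of the same mapping."""
    return sorted(
        {city for cities in mapping.values() for city in cities if city in mapping}
    )


ambiguous = ambiguous_cities(country_cities)
for city in sorted(ambiguous):
    print(city, "->", ambiguous[city])
print(country_names_as_cities(country_cities))
```

On this excerpt, Brussels, Dublin and London each show up under three or more countries, and "Belgium" appears as a city of the United States.
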
WolfgangFahl commented 3 years ago

The legacy get_geoPlace_context approach will remain problematic. You might want to use the new Locator or LocationContext interfaces, which use the Wikidata information and the population as a disambiguator. See Singapore, Berlin and Athens in the example below: sorted by population, they are correctly prioritized as being located in SG, DE and GR.

select  distinct cl.wikidataid,label,name,pop,regionIso,regionName,countryIso,countryName,gndId,geoNameId,lat,lon from 
city_labels l 
join CityLookup cl on l.wikidataid=cl.wikidataid
where l.label in ('Berlin',',St. Petersburg','Singapore','Athens')
order by pop desc

| wikidataid | label | name | pop | regionIso | regionName | countryIso | countryName | gndId | geoNameId | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q334 | Singapore | Singapore | 5888926 | SG | Singapore | SG | Singapore | 4055089-8 | 1880251 | 104 | 1 |
| Q64 | Berlin | Berlin | 3644826 | DE-BE | Berlin | DE | Germany | 4005728-8 | 6547383 | 53 | 13 |
| Q64 | Berlin | Berlin | 3644826 | DE-BE | Berlin | DE | Germany | 4005728-8 | 6547539 | 53 | 13 |
| Q64 | Berlin | Berlin | 3644826 | DE-BE | Berlin | DE | Germany | 4005728-8 | 2950159 | 53 | 13 |
| Q64 | Berlin | Berlin | 3644826 | DE-BE | Berlin | DE | Germany | 4005728-8 | 2950157 | 53 | 13 |
| Q1524 | Athens | Athens | 664046 | GR-I | Attica Region | GR | Greece | 4003366-1 | 264371 | 38 | 24 |
| Q203263 | Athens | Athens | 115452 | US-GA | Georgia | US | United States of America | 4195479-8 | 4180386 | 34 | -83 |
| Q79439 | Athens | Athens | 24000 | US-AL | Alabama | US | United States of America | | 4830668 | 35 | -87 |
| Q755420 | Athens | Athens | 23832 | US-OH | Ohio | US | United States of America | 4207790-4 | 4505542 | 39 | -82 |
| Q755420 | Athens | Athens | 21342 | US-OH | Ohio | US | United States of America | 4207790-4 | 4505542 | 39 | -82 |
| Q821199 | Berlin | Berlin | 19866 | US-CT | Connecticut | US | United States of America | | 5282244 | 42 | -73 |
| Q3292481 | Athens | Athens | 13548 | US-TN | Tennessee | US | United States of America | | 4603284 | 35 | -85 |
| Q3292481 | Athens | Athens | 13458 | US-TN | Tennessee | US | United States of America | | 4603284 | 35 | -85 |
| Q755425 | Athens | Athens | 12710 | US-TX | Texas | US | United States of America | | 4671545 | 32 | -96 |
| Q821244 | Berlin | Berlin | 10051 | US-NH | New Hampshire | US | United States of America | | 5083330 | 44 | -71 |
| Q1086827 | Berlin | Berlin | 7588 | US-NJ | New Jersey | US | United States of America | | 4500771 | 40 | -75 |
| Q1569850 | Berlin | Berlin | 5524 | US-WI | Wisconsin | US | United States of America | | 5245497 | 44 | -89 |
| Q3720748 | Berlin | Berlin Township | 5357 | US-NJ | New Jersey | US | United States of America | | 4500777 | 40 | -75 |
| Q614184 | Berlin | Berlin | 4485 | US-MD | Maryland | US | United States of America | | 4348460 | 38 | -75 |
| Q3709146 | Athens | Athens | 4089 | US-NY | New York | US | United States of America | | 5107467 | 42 | -74 |
| Q570807 | Athens | Athens | 3013 | CA-ON | Ontario | CA | Canada | | 5887943 | 45 | -76 |
| Q524646 | Berlin | Berlin | 2866 | US-MA | Massachusetts | US | United States of America | | 4930431 | 42 | -72 |
| Q578681 | Athens | Athens | 1988 | US-IL | Illinois | US | United States of America | | 4232997 | 40 | -90 |
| Q821215 | Berlin | Berlin | 1880 | US-NY | New York | US | United States of America | | 5108863 | 43 | -73 |
| Q4892348 | Berlin | Berlin | 1145 | US-WI | Wisconsin | US | United States of America | | | 44 | -89 |
| Q977715 | Athens | Athens | 1105 | US-WI | Wisconsin | US | United States of America | | 5244312 | 45 | -90 |
| Q2218754 | Athens | Athens | 1048 | US-WV | West Virginia | US | United States of America | | 4797549 | 37 | -81 |
| Q2218754 | Athens | Athens | 1048 | US-WV | West Virginia | US | United States of America | | 4797549 | 37 | -81 |
| Q2791349 | Athens | Athens | 1024 | US-MI | Michigan | US | United States of America | | 4984489 | 42 | -85 |
| Q3477164 | Athens | Athens | 1019 | US-ME | Maine | US | United States of America | | 4956946 | 45 | -70 |
| Q4892350 | Berlin | Berlin | 945 | US-WI | Wisconsin | US | United States of America | | 5245512 | 45 | -90 |
| Q142659 | Berlin | Berlin | 898 | US-OH | Ohio | US | United States of America | | 5147132 | 41 | -82 |
| Q2345711 | Berlin | Berlin | 551 | US-GA | Georgia | US | United States of America | | 4182096 | 31 | -84 |
| Q3239684 | Athens | Athens | 249 | US-LA | Louisiana | US | United States of America | | 4315137 | 33 | -93 |
| Q1930098 | Berlin | Lincoln | 162 | US-IA | Iowa | US | United States of America | | 4864742 | 42 | -93 |
| Q821229 | Berlin | Berlin | 34 | US-ND | North Dakota | US | United States of America | | 5058265 | 46 | -98 |
| Q12303254 | Berlin | Berlin | | DK-83 | Southern Denmark | DK | Denmark | | | 55 | 10 |
| Q844930 | Athens | Classical Athens | | GR-I | Attica Region | GR | Greece | | 9962195 | 38 | 24 |
| Q20144232 | Berlin | Berlín | | MX-CHP | Chiapas | MX | Mexico | | 8922112 | 16 | -93 |
| Q582242 | Berlin | Berlín | | SV-US | Usulután Department | SV | El Salvador | | 3587266 | 14 | -89 |
| Q4892339 | Berlin | Berlin | | US-AL | Alabama | US | United States of America | | 4047787 | 34 | -87 |
| Q4813438 | Athens | Athens | | US-AR | Arkansas | US | United States of America | | 4099912 | 34 | -94 |
| Q4813442 | Athens | Athens | | US-CA | California | US | United States of America | | 5325144 | 34 | -118 |
| Q2504681 | Berlin | Berlin | | US-IL | Illinois | US | United States of America | | 4233627 | 40 | -90 |
| Q4813446 | Athens | Athens | | US-IN | Indiana | US | United States of America | | 4917703 | 41 | -86 |
| Q17509488 | Berlin | Berlin | | US-KS | Kansas | US | United States of America | | | 38 | -95 |
| Q42685464 | Athens | Athens | | US-KS | Kansas | US | United States of America | | | 38 | -96 |
| Q821190 | Berlin | Berlin | | US-KY | Kentucky | US | United States of America | | 4283975 | 39 | -84 |
| Q4813443 | Athens | Athens | | US-KY | Kentucky | US | United States of America | | 4282781 | 38 | -84 |
| Q33422655 | Athens | Athens | | US-ME | Maine | US | United States of America | | 4956946 | 45 | -70 |
| Q7522845 | Singapore | Singapore | | US-MI | Michigan | US | United States of America | | | 43 | -86 |
| Q4813447 | Athens | Athens | | US-MN | Minnesota | US | United States of America | | 5016799 | 45 | -93 |
| Q4813449 | Athens | Athens | | US-MS | Mississippi | US | United States of America | | 4416664 | 34 | -88 |
| Q4892354 | Berlin | Berlin | | US-OK | Oklahoma | US | United States of America | | 4530350 | 35 | -100 |
| Q1130437 | Athens | Athens | | US-PA | Pennsylvania | US | United States of America | | 5178651 | 42 | -77 |
| Q1130437 | Athens | Athens | | US-PA | Pennsylvania | US | United States of America | | 5178651 | 42 | -77 |
| Q1130950 | Berlin | Berlin | | US-PA | Pennsylvania | US | United States of America | | 4556518 | 40 | -79 |
| Q1130950 | Berlin | Berlin | | US-PA | Pennsylvania | US | United States of America | | 4556518 | 40 | -79 |
| Q21196360 | Berlin | Berlin | | US-TX | Texas | US | United States of America | | | 30 | -96 |
| Q14711981 | Athens | Athens | | US-VA | Virginia | US | United States of America | | 4829249 | 38 | -77 |
| Q755427 | Athens | Athens | | US-VT | Vermont | US | United States of America | | 5233317 | 43 | -73 |
| Q821245 | Berlin | Berlin | | US-VT | Vermont | US | United States of America | | | 44 | -73 |
| Q664745 | Berlin | Berlin | | US-WV | West Virginia | US | United States of America | | 4798708 | 39 | -80 |
| Q4892343 | Berlin | Berlin | | ZA-EC | Eastern Cape | ZA | South Africa | | | -33 | 28 |
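
The idea of using population as a disambiguator can be sketched in a few lines of plain Python. This is only an illustration of the ranking rule on a handful of rows copied from the result above. It is not the Locator or LocationContext implementation, and the `CANDIDATES` list and the `most_populous` helper are hypothetical names introduced here.

```python
# (label, wikidataid, population or None, countryIso), copied from the table above.
CANDIDATES = [
    ("Singapore", "Q334", 5888926, "SG"),
    ("Singapore", "Q7522845", None, "US"),
    ("Berlin", "Q64", 3644826, "DE"),
    ("Berlin", "Q821199", 19866, "US"),
    ("Berlin", "Q582242", None, "SV"),
    ("Athens", "Q1524", 664046, "GR"),
    ("Athens", "Q203263", 115452, "US"),
    ("Athens", "Q570807", 3013, "CA"),
]


def most_populous(label, candidates):
    """Return the candidate with the largest known population, or None."""
    matches = [c for c in candidates if c[0] == label]
    if not matches:
        return None
    # Treat a missing population as 0 so it never outranks a known one.
    return max(matches, key=lambda c: c[2] or 0)


for label in ("Singapore", "Berlin", "Athens"):
    best = most_populous(label, CANDIDATES)
    print(label, "->", best[3], best[1])
```

Under this rule the three labels resolve to SG, DE and GR, matching the order of the query result. A label with no candidates returns None rather than a guess.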