jpullmann closed this issue 3 years ago
@jpullmann - indeed this is an important need to be fulfilled, and I am working on it in the context of the ConfIDent project, which seeks to improve the Linked Open Data situation for scientific events. As in your case, I get thousands of references to places; see e.g. Sample of 3 Proceeding Titles
In the test cases at https://github.com/WolfgangFahl/ProceedingsTitleParser/blob/master/tests/test_GeoLookup.py you can see the current state of affairs. I'll keep improving geograpy to the point where our project is happy with the results.
The question is - what level of perfection would you seek? What would be your acceptance criteria for closing this ticket?
Version 0.2.0 now has: Issue #25 query
select * from countryLookup where label in ('France', 'Hungary', 'Poland', 'Spain', 'United Kingdom')
result

| label | wikidataid | name | iso | pop | lat | lon |
|---|---|---|---|---|---|---|
| Hungary | Q28 | Hungary | HU | 9769526 | 47 | 19 |
| Spain | Q29 | Spain | ES | 46733038 | 40 | -4 |
| Poland | Q36 | Poland | PL | 38382576 | 52 | 19 |
| United Kingdom | Q145 | United Kingdom | GB | 66022273 | 55 | -2 |
| France | Q142 | France | FR | 66628000 | 47 | 2 |
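To try the query above without a geograpy installation, here is a minimal sketch using Python's built-in `sqlite3` module. The table name and column names are taken from the result above; the column types, the in-memory database, and the helper names `build_country_lookup` and `lookup_countries` are assumptions made for illustration and are not geograpy's actual code.

```python
import sqlite3

# Rows copied from the result table above.
ROWS = [
    ("Hungary", "Q28", "Hungary", "HU", 9769526, 47, 19),
    ("Spain", "Q29", "Spain", "ES", 46733038, 40, -4),
    ("Poland", "Q36", "Poland", "PL", 38382576, 52, 19),
    ("United Kingdom", "Q145", "United Kingdom", "GB", 66022273, 55, -2),
    ("France", "Q142", "France", "FR", 66628000, 47, 2),
]


def build_country_lookup(conn):
    """Create and fill a countryLookup table (column types are assumed)."""
    conn.execute(
        "CREATE TABLE countryLookup ("
        "label TEXT, wikidataid TEXT, name TEXT, iso TEXT, "
        "pop INTEGER, lat REAL, lon REAL)"
    )
    conn.executemany("INSERT INTO countryLookup VALUES (?,?,?,?,?,?,?)", ROWS)


def lookup_countries(conn, labels):
    """Return (label, wikidataid, iso) for every label found in countryLookup."""
    # One "?" per label keeps the query parameterized instead of string-built.
    placeholders = ",".join("?" for _ in labels)
    sql = (
        "SELECT label, wikidataid, iso FROM countryLookup "
        f"WHERE label IN ({placeholders})"
    )
    return conn.execute(sql, list(labels)).fetchall()


conn = sqlite3.connect(":memory:")
build_country_lookup(conn)
found = {
    label: iso
    for label, _, iso in lookup_countries(
        conn, ["France", "Hungary", "Poland", "Spain", "United Kingdom"]
    )
}
print(found)
```

A label that is not in the table simply yields no row, so a missing country shows up as a missing key in `found`.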
    def testIssue25(self):
        '''
        https://github.com/somnathrakshit/geograpy3/issues/25
        '''
        pc=PlaceContext(place_names=["Bulgaria","Croatia","Czech Republic","Hungary"])
        if self.debug:
            print(pc.countries)
with the result:
['Hungary', 'Czech Republic', 'Bulgaria', 'Croatia', 'Italy', 'Romania']
The reason Italy and Romania show up here is that there are two cities named Bulgaria, one in each of those countries: Issue #25 Bulgaria
query
select * from cityLookup where label in ('Bulgaria','Croatia','Hungary','Czech Republic') order by pop desc,regionName
result

| label | level | locationKind | wikidataid | name | geoNameId | regionId | countryId | pop | lat | lon | partOfRegionId | gndId | regionName | regionIso | regionPop | regionLat | regionLon | countryName | countryIso | CountryLat | CountryLon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bulgaria | 5 | City | Q3656821 | Bulgaria | | Q182624 | Q218 | | 47 | 24 | Q100188 | | Cluj County | RO-CJ | 691106 | 47 | 24 | Romania | RO | 46 | 25 |
| Bulgaria | 5 | City | Q18439885 | Bulgaria | 3181423 | Q1263 | Q38 | | 44 | 12 | Q6662 | | Emilia-Romagna | IT-45 | 4459477 | 45 | 11 | Italy | IT | 42 | 12 |
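One possible way to suppress such spurious countries is to let an exact country match take precedence over city matches for the same name. The sketch below illustrates that heuristic only; it is not how geograpy resolves places, and the function name `resolve_countries` and the simplified `(city label, country name)` row shape are assumptions made for this example. The data is taken from the test and the city table above.

```python
def resolve_countries(place_names, country_labels, city_rows):
    """Map place names to countries, preferring exact country matches.

    country_labels: set of known country labels.
    city_rows: iterable of (city_label, country_name) pairs.
    Both shapes are simplifications assumed for this sketch.
    """
    countries = []
    for name in place_names:
        if name in country_labels:
            # The name is itself a country: ignore same-named cities.
            candidates = [name]
        else:
            # Otherwise fall back to the countries of matching cities.
            candidates = [country for label, country in city_rows if label == name]
        for country in candidates:
            if country not in countries:  # keep input order, drop duplicates
                countries.append(country)
    return countries


# Country labels from the test input; city rows from the cityLookup result.
COUNTRY_LABELS = {"Bulgaria", "Croatia", "Czech Republic", "Hungary"}
CITY_ROWS = [("Bulgaria", "Romania"), ("Bulgaria", "Italy")]

resolved = resolve_countries(
    ["Bulgaria", "Croatia", "Czech Republic", "Hungary"],
    COUNTRY_LABELS,
    CITY_ROWS,
)
print(resolved)  # ['Bulgaria', 'Croatia', 'Czech Republic', 'Hungary']
```

The trade-off is that a text which really does mean the city of Bulgaria in Cluj County would now be attributed to the country, so a rule like this suits inputs that are known to be country lists.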
Since no acceptance criteria have been specified, I am closing the ticket for now.
Hi, sometimes additional countries are being recognized by geograpy that were missing in the input:
or standard country mentions are not picked up at all:
I assume this relates to suboptimal NER results of the underlying nltk library. Could you please recommend a process to improve the precision/reliability?