somnathrakshit / geograpy3

Extract place names from a URL or text, and add context to those names -- for example distinguishing between a country, region or city.
https://geograpy3.readthedocs.io
Apache License 2.0

Tweaking country recognition #25

Closed jpullmann closed 3 years ago

jpullmann commented 4 years ago

Hi, geograpy sometimes recognizes additional countries that do not appear in the input:

   > text="France, Hungary, Poland, Spain, United Kingdom"
   > print(f"{text}  vs. {str(sorted(geograpy.get_geoPlace_context(text=text).countries))}")
   > France, Hungary, Poland, Spain, United Kingdom  vs. ['France', 'Hungary', 'Poland', 'Spain', 'United Kingdom', 'United States']

or standard country mentions are not picked up at all:

  > text="Bulgaria 3, Croatia 2, Czech Republic 1, Hungary 3"
  > Bulgaria 3, Croatia 2, Czech Republic 1, Hungary 3  vs. ['Bulgaria', 'Croatia']

I assume this relates to suboptimal NER results of the underlying nltk library. Could you please recommend a process to improve the precision/reliability?

WolfgangFahl commented 3 years ago

@jpullmann - indeed this is an important need to be fulfilled, and I am working on it in the context of the ConfIDent project, which seeks to improve the Linked Open Data situation for scientific events. As in your case, I get thousands of references to places; see e.g. Sample of 3 Proceeding Titles.

In the test cases at https://github.com/WolfgangFahl/ProceedingsTitleParser/blob/master/tests/test_GeoLookup.py you can see the current state of affairs. I'll keep improving geograpy to a point where our project will be happy with the results.

WolfgangFahl commented 3 years ago

The question is - what level of perfection would you seek? What would be your acceptance criteria for closing this ticket?
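One way to make such acceptance criteria concrete is to score each case by precision and recall against the countries actually named in the input. The sketch below is only an illustration: `precision_recall` is a hypothetical helper, not part of geograpy, and the lists are copied from the two examples in the opening comment.

```python
def precision_recall(expected, actual):
    """Return (precision, recall) of the recognized countries.

    Hypothetical helper for illustration; not part of geograpy.
    """
    expected, actual = set(expected), set(actual)
    hits = expected & actual
    # precision: share of recognized countries that were really in the input
    precision = len(hits) / len(actual) if actual else 1.0
    # recall: share of input countries that were recognized
    recall = len(hits) / len(expected) if expected else 1.0
    return precision, recall


# Case 1 from the report: one spurious country ('United States')
case1_expected = ["France", "Hungary", "Poland", "Spain", "United Kingdom"]
case1_actual = case1_expected + ["United States"]

# Case 2 from the report: two countries are missed
case2_expected = ["Bulgaria", "Croatia", "Czech Republic", "Hungary"]
case2_actual = ["Bulgaria", "Croatia"]

print(precision_recall(case1_expected, case1_actual))
print(precision_recall(case2_expected, case2_actual))
```

An acceptance criterion could then be phrased as a threshold, for example requiring both values to be 1.0 for plain comma-separated country lists.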

WolfgangFahl commented 3 years ago

Version 0.2.0 now has: Issue #25

query

   select * from countryLookup where label in ('France', 'Hungary', 'Poland', 'Spain', 'United Kingdom')

result

   label           wikidataid  name            iso  pop       lat  lon
   Hungary         Q28         Hungary         HU   9769526   47   19
   Spain           Q29         Spain           ES   46733038  40   -4
   Poland          Q36         Poland          PL   38382576  52   19
   United Kingdom  Q145        United Kingdom  GB   66022273  55   -2
   France          Q142        France          FR   66628000  47   2

WolfgangFahl commented 3 years ago

    def testIssue25(self):
        '''
        https://github.com/somnathrakshit/geograpy3/issues/25
        '''
        # look up the given place names directly
        pc = PlaceContext(place_names=["Bulgaria", "Croatia", "Czech Republic", "Hungary"])
        if self.debug:
            print(pc.countries)

with the result:

['Hungary', 'Czech Republic', 'Bulgaria', 'Croatia', 'Italy', 'Romania']

The reason for Italy and Romania showing up here is that there are two cities named Bulgaria in those countries: Issue #25 Bulgaria

query

   select * from cityLookup where label in ('Bulgaria','Croatia','Hungary','Czech Republic') order by pop desc,regionName

result

   column          row 1        row 2
   label           Bulgaria     Bulgaria
   level           5            5
   locationKind    City         City
   wikidataid      Q3656821     Q18439885
   name            Bulgaria     Bulgaria
   geoNameId       (empty)      3181423
   regionId        Q182624      Q1263
   countryId       Q218         Q38
   pop             (empty)      (empty)
   lat             47           44
   lon             24           12
   partOfRegionId  Q100188      Q6662
   gndId           (empty)      (empty)
   regionName      Cluj County  Emilia-Romagna
   regionIso       RO-CJ        IT-45
   regionPop       691106       4459477
   regionLat       47           45
   regionLon       24           11
   countryName     Romania      Italy
   countryIso      RO           IT
   CountryLat      46           42
   CountryLon      25           12
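One possible way to suppress such collisions is to consult the city table only when a label has no country match. The following is a minimal sketch of that idea and not how geograpy resolves names: the function `countries_for` is hypothetical, the two tables are reduced to the columns needed here, and the country rows for Bulgaria, Croatia, Czech Republic and Hungary are assumed to exist in countryLookup.

```python
import sqlite3


def countries_for(labels, conn):
    """Return sorted country names, preferring country matches over city matches.

    Hypothetical post-filter for illustration; not geograpy's own logic.
    """
    found = set()
    for label in labels:
        rows = conn.execute(
            "select name from countryLookup where label = ?", (label,)
        ).fetchall()
        if not rows:
            # only fall back to cities when no country carries this label
            rows = conn.execute(
                "select countryName from cityLookup where label = ?", (label,)
            ).fetchall()
        found.update(name for (name,) in rows)
    return sorted(found)


# In-memory stand-in for the lookup database (reduced, assumed schema)
conn = sqlite3.connect(":memory:")
conn.execute("create table countryLookup (label text, name text)")
conn.execute("create table cityLookup (label text, countryName text)")
countries = ["Bulgaria", "Croatia", "Czech Republic", "Hungary"]
conn.executemany(
    "insert into countryLookup values (?, ?)", [(c, c) for c in countries]
)
# the two cities named Bulgaria from the query result above
conn.executemany(
    "insert into cityLookup values (?, ?)",
    [("Bulgaria", "Romania"), ("Bulgaria", "Italy")],
)

result = countries_for(countries, conn)
print(result)
```

With this rule the two cities named Bulgaria no longer contribute Italy and Romania. The trade-off is that a text that really does mean one of those cities would be resolved to the country instead.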
WolfgangFahl commented 3 years ago

Since no acceptance criteria have been specified, I am closing the ticket for now.