openeventdata / mordecai

Full text geoparsing as a Python library
MIT License

Mordecai not finding place name even though it is present in gazetteer #75

Closed: parrotcar00 closed this issue 4 years ago

parrotcar00 commented 4 years ago

Hi, I just tried to do this: geo.geoparse("Beautiful Daedunsan, South Korea")

and I got: [{'word': 'South Korea', 'spans': [{'start': 21, 'end': 32}], 'country_predicted': 'KOR', 'country_conf': 0.9998105, 'geo': {'admin1': 'NA', 'lat': '36.5', 'lon': '127.75', 'country_code3': 'KOR', 'geonameid': '1835841', 'place_name': 'Republic of Korea', 'feature_class': 'A', 'feature_code': 'PCLI'}}]

So it looks like Mordecai is not recognizing "Daedunsan", which is a mountain in South Korea. I then looked up Daedunsan on http://www.geonames.org/, which is the default gazetteer Mordecai uses (as I learned from the README), and geonames.org returns the right search result for Daedunsan.

Is this a bug or do I need to download and use a newer version of the gazetteer from somewhere?

ahalterman commented 4 years ago

If Mordecai isn't detecting the place name at all, that's an issue with the named entity recognition (NER) model it uses, specifically spaCy's. spaCy's NER is trained on text that doesn't have great geographic coverage (it often misses place names in the Middle East as well), so a name can be in the gazetteer and still never get looked up. It would be possible to label text with more place names and train a better model, but I'm afraid I won't have the time to do that in the foreseeable future.
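
One rough way to confirm that the miss happens at the NER step rather than in the gazetteer lookup is to run spaCy on the sentence directly and see which place names it tags. The sketch below is only an illustration: `missing_places` and `spacy_entities` are hypothetical helper names, the model name `en_core_web_sm` is an assumption (Mordecai may load a different model), and the entity list in the example is a hand-written stand-in matching the output reported above, not real spaCy output.

```python
# Entity labels that spaCy's English models use for place-like entities.
PLACE_LABELS = {"GPE", "LOC", "FAC"}


def missing_places(entities, expected):
    """Return the expected place names that NER did not tag as a place.

    entities: iterable of (text, label) pairs, for example
              [(ent.text, ent.label_) for ent in doc.ents].
    expected: place names you believe appear in the text.
    Matching is exact (case-insensitive), so partial spans count as misses.
    """
    found = {text.lower() for text, label in entities if label in PLACE_LABELS}
    return [name for name in expected if name.lower() not in found]


def spacy_entities(text, model="en_core_web_sm"):
    """Run spaCy NER on text and return (text, label) pairs.

    Hypothetical helper; the model name is an assumption and the model
    must be installed separately.
    """
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


# Hand-written stand-in for the NER result implied by the issue:
# only "South Korea" was picked up.
example_entities = [("South Korea", "GPE")]
print(missing_places(example_entities, ["Daedunsan", "South Korea"]))
# ['Daedunsan']
```

With spaCy and a model installed, `missing_places(spacy_entities("Beautiful Daedunsan, South Korea"), ["Daedunsan", "South Korea"])` would run the same check against the real NER output. If "Daedunsan" comes back as missing there, updating the gazetteer won't help.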