Closed andreas-wolf closed 3 years ago
This is a bit weird too:
print(geo.geoparse( """ Wuppertal (onsite/remote) """ ))
[{'word': 'Wuppertal', 'spans': [{'start': 1, 'end': 10}], 'country_predicted': 'DEU', 'country_conf': 0.9048774, 'geo': {'admin1': 'North Rhine-Westphalia', 'lat': '51.25627', 'lon': '7.14816', 'country_code3': 'DEU', 'geonameid': '2805753', 'place_name': 'Wuppertal', 'feature_class': 'P', 'feature_code': 'PPLA3'}}, {'word': 'onsite', 'spans': [{'start': 12, 'end': 18}], 'country_predicted': 'NRU', 'country_conf': 0.2482127}, {'word': 'remote', 'spans': [{'start': 19, 'end': 25}], 'country_predicted': 'MMR', 'country_conf': 0.2482127}]
Nauru is onsite and Myanmar remote?
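As a stopgap on the caller's side, the spurious entries above can be dropped after the fact, since they carry no 'geo' record and have a low 'country_conf'. The sketch below is only an illustration: the helper name keep_resolved and the 0.6 cut-off are my own assumptions, not part of Mordecai, and the sample list is a shortened copy of the output above.

```python
def keep_resolved(results, min_conf=0.6):
    """Keep entries that have a gazetteer match ('geo') and a confident country.

    min_conf is an arbitrary, hypothetical cut-off chosen for this example.
    """
    return [
        r for r in results
        if "geo" in r and float(r.get("country_conf", 0.0)) >= min_conf
    ]


# Shortened copy of the geoparse output shown above.
sample = [
    {"word": "Wuppertal", "country_predicted": "DEU",
     "country_conf": 0.9048774, "geo": {"place_name": "Wuppertal"}},
    {"word": "onsite", "country_predicted": "NRU", "country_conf": 0.2482127},
    {"word": "remote", "country_predicted": "MMR", "country_conf": 0.2482127},
]

print([r["word"] for r in keep_resolved(sample)])  # ['Wuppertal']
```

This only hides the bad predictions; it does not explain why 'onsite' and 'remote' are tagged as place names in the first place.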
I found that this problem is related to using the German language model from spaCy:
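import os

import spacy
from mordecai import Geoparser

# German spaCy model: this is the configuration that produces the odd results
geo = Geoparser(
    es_hosts=[os.getenv('ELASTICSEARCH_HOST')],
    es_port=os.getenv('ELASTICSEARCH_PORT'),
    nlp=spacy.load('de_core_news_lg', disable=['parser', 'tagger'])
)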
If I use the English model it seems to work:
geo = Geoparser(
    es_hosts=[os.getenv('ELASTICSEARCH_HOST')],
    es_port=os.getenv('ELASTICSEARCH_PORT'),
    nlp=spacy.load('en_core_web_lg', disable=['parser', 'tagger'])
)
From the statement "Mordecai’s key technical innovations are in a language-agnostic architecture...[...]" (source: https://joss.theoj.org/papers/10.21105/joss.00091) I concluded that it would work with other languages too. That does not seem to be the case: for example, "Tansania" (German spelling) is not found, but "Tanzania" (English spelling) is. A note that Mordecai only works for English text would be helpful.
Thanks for bringing this up, and sorry for any frustration. I just updated the README to clarify that using Mordecai on other languages requires the models to be retrained, largely because the pretrained embeddings it uses aren't aligned across languages, so the models won't perform well on non-English, non-GloVe embeddings.
If I do another major re-write of Mordecai, it'll be after the new transformer-based version of spaCy comes out. Then I can switch to using something like XLNet, use contextual embeddings instead of the mess of features right now, and hopefully everything will work cross-lingually.
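Until then, a caller could guard against accidentally pairing the pretrained models with a non-English pipeline. The following is a minimal sketch, not something Mordecai provides: the function name check_model_language and the "supported" tuple are assumptions made for illustration. It takes the language code as a plain string; with spaCy, that code is available as the pipeline's lang attribute (for example 'de' for de_core_news_lg).

```python
import warnings


def check_model_language(lang_code, supported=("en",)):
    """Return True if lang_code is in `supported`; otherwise warn and return False.

    `supported` defaults to English only, which is an assumption based on the
    maintainer's note that other languages require retraining.
    """
    if lang_code not in supported:
        warnings.warn(
            f"spaCy pipeline language is '{lang_code}'; the pretrained models "
            "were trained on English embeddings and may give poor results."
        )
        return False
    return True
```

Usage, assuming spaCy and the German model are installed: call check_model_language(nlp.lang) on the loaded pipeline before passing it to Geoparser.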
Working example:
print(geo.geoparse( """ Wuppertal remote-option """ ))
Result:
[{'word': 'Wuppertal', 'spans': [{'start': 1, 'end': 10}], 'country_predicted': 'DEU', 'country_conf': 0.9048774, 'geo': {'admin1': 'North Rhine-Westphalia', 'lat': '51.25627', 'lon': '7.14816', 'country_code3': 'DEU', 'geonameid': '2805753', 'place_name': 'Wuppertal', 'feature_class': 'P', 'feature_code': 'PPLA3'}}]
Not working:
print(geo.geoparse( """ Wuppertal (remote-option) """ ))
Result is a stack trace from TF and a country prediction for 'remote-option'.
Expected: No stack trace, and preferably no country entry for 'remote-option'
Not working:
print(geo.geoparse( """ Wuppertal (remote option) """ ))
Result is empty:
[]
Expected: Wuppertal is found as a city
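Since the unbracketed input works and the bracketed ones do not, one possible workaround is to mask brackets before calling geoparse. This is only a sketch under that assumption: I have not verified that it avoids the failures above, and mask_brackets is a hypothetical helper, not part of Mordecai. It replaces each bracket with a space rather than deleting it, so the 'start'/'end' offsets in the result still line up with the original text.

```python
import re

# Round, square and curly brackets.
_BRACKETS = re.compile(r"[()\[\]{}]")


def mask_brackets(text):
    """Replace every bracket character with a single space (length is preserved)."""
    return _BRACKETS.sub(" ", text)


original = " Wuppertal (remote option) "
cleaned = mask_brackets(original)
print(repr(cleaned))
print(len(cleaned) == len(original))  # True
```

The cleaned string would then be passed on as geo.geoparse(cleaned), with geo being the Geoparser instance from earlier in this thread.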