openeventdata / mordecai

Full text geoparsing as a Python library
MIT License

Weird results when text contains brackets #82

Closed andreas-wolf closed 3 years ago

andreas-wolf commented 3 years ago

Working example:

print(geo.geoparse( """ Wuppertal remote-option """ ))

Result: [{'word': 'Wuppertal', 'spans': [{'start': 1, 'end': 10}], 'country_predicted': 'DEU', 'country_conf': 0.9048774, 'geo': {'admin1': 'North Rhine-Westphalia', 'lat': '51.25627', 'lon': '7.14816', 'country_code3': 'DEU', 'geonameid': '2805753', 'place_name': 'Wuppertal', 'feature_class': 'P', 'feature_code': 'PPLA3'}}]

Not working: print(geo.geoparse( """ Wuppertal (remote-option) """ ))

Result is a stack trace from TF and a country prediction for 'remote-option':

ValueError: Input 0 of layer sequential is incompatible with the layer: expected axis -1 of input shape to have value 12 but received input with shape [None, 0]

None
[{'word': 'Wuppertal', 'spans': [{'start': 1, 'end': 10}], 'country_predicted': 'DEU', 'country_conf': 0.9048774, 'geo': {'admin1': 'North Rhine-Westphalia', 'lat': '51.25627', 'lon': '7.14816', 'country_code3': 'DEU', 'geonameid': '2805753', 'place_name': 'Wuppertal', 'feature_class': 'P', 'feature_code': 'PPLA3'}}, {'word': 'remote-option', 'spans': [{'start': 12, 'end': 25}], 'country_predicted': '', 'country_conf': 0}]

Expected: No stack trace, and preferably no country entry for 'remote-option'.

Not working: print(geo.geoparse( """ Wuppertal (remote option) """ ))

Result is empty: []

Expected: Wuppertal is found as a city.
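A possible workaround to try while this is open is to blank out the brackets before calling geoparse, since the bracket-free input above parses cleanly. The following is only a sketch: mask_brackets is a hypothetical helper, not part of Mordecai, the set of bracket characters is a guess, and it has not been verified against the failing inputs. Each bracket is replaced by a single space so that the 'spans' offsets still line up with the original text.

```python
import re

# Hypothetical helper, not part of Mordecai. The bracket set is an assumption.
_BRACKETS = re.compile(r"[()\[\]{}]")


def mask_brackets(text: str) -> str:
    """Replace each bracket with one space, keeping the text length unchanged."""
    return _BRACKETS.sub(" ", text)


masked = mask_brackets(" Wuppertal (remote-option) ")
print(repr(masked))  # ' Wuppertal  remote-option  '

# 'geo' would be the Geoparser instance used in the examples above:
# print(geo.geoparse(masked))
```

Because the length is preserved, 'Wuppertal' keeps its span of 1 to 10 in the masked string.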

andreas-wolf commented 3 years ago

This is a bit weird too: print(geo.geoparse( """ Wuppertal (onsite/remote) """ ))

[{'word': 'Wuppertal', 'spans': [{'start': 1, 'end': 10}], 'country_predicted': 'DEU', 'country_conf': 0.9048774, 'geo': {'admin1': 'North Rhine-Westphalia', 'lat': '51.25627', 'lon': '7.14816', 'country_code3': 'DEU', 'geonameid': '2805753', 'place_name': 'Wuppertal', 'feature_class': 'P', 'feature_code': 'PPLA3'}}, {'word': 'onsite', 'spans': [{'start': 12, 'end': 18}], 'country_predicted': 'NRU', 'country_conf': 0.2482127}, {'word': 'remote', 'spans': [{'start': 19, 'end': 25}], 'country_predicted': 'MMR', 'country_conf': 0.2482127}]

Nauru is onsite and Myanmar remote?
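On the caller's side, entries like these could be dropped after the fact. In the outputs shown here, the spurious words carry no 'geo' key and a low 'country_conf', while the real match has both. A minimal sketch under that assumption follows; keep_resolved is a hypothetical helper and the 0.5 threshold is arbitrary, not a value recommended by Mordecai.

```python
def keep_resolved(results, min_conf=0.5):
    """Keep entries that have a 'geo' record and a country confidence >= min_conf.

    Hypothetical post-filter, not part of Mordecai. min_conf is an arbitrary cutoff.
    """
    kept = []
    for entry in results or []:  # tolerate None or an empty result
        if "geo" not in entry:
            continue  # word was tagged as a place but not resolved to a gazetteer entry
        if entry.get("country_conf", 0) < min_conf:
            continue  # country prediction too uncertain
        kept.append(entry)
    return kept


# Abbreviated copy of the output reported above
sample = [
    {"word": "Wuppertal", "country_predicted": "DEU", "country_conf": 0.9048774,
     "geo": {"place_name": "Wuppertal", "country_code3": "DEU"}},
    {"word": "onsite", "country_predicted": "NRU", "country_conf": 0.2482127},
    {"word": "remote", "country_predicted": "MMR", "country_conf": 0.2482127},
]

print([e["word"] for e in keep_resolved(sample)])  # ['Wuppertal']
```

This only hides the symptom; it does not address why the words are tagged as places in the first place.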

andreas-wolf commented 3 years ago

I found that this problem is related to using the German language model from spaCy:

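import os

import spacy
from mordecai import Geoparser

# German spaCy model: this is the configuration that produces the odd results
geo = Geoparser(
  es_hosts=[os.getenv('ELASTICSEARCH_HOST')],
  es_port=os.getenv('ELASTICSEARCH_PORT'),
  nlp=spacy.load('de_core_news_lg', disable=['parser', 'tagger'])
)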

If I use the English model it seems to work:

geo = Geoparser(
  es_hosts=[os.getenv('ELASTICSEARCH_HOST')],
  es_port=os.getenv('ELASTICSEARCH_PORT'),
  nlp=spacy.load('en_core_web_lg', disable=['parser', 'tagger'])
)

I concluded from "Mordecai’s key technical innovations are in a language-agnostic architecture...[...]" (source: https://joss.theoj.org/papers/10.21105/joss.00091) that it would work with other languages too. That does not seem to be the case: for example, Tansania (the German spelling) is not found, but Tanzania (the English spelling) is. A hint that Mordecai only works for English text would be nice.

ahalterman commented 3 years ago

Thanks for bringing this up, and sorry for any frustration. I just updated the README to clarify that using Mordecai on other languages requires the models to be retrained, largely because the pretrained embeddings it uses aren't aligned across languages, so the models won't perform well on non-English, non-GloVe embeddings.

If I do another major re-write of Mordecai, it'll be after the new transformer-based version of spaCy comes out. Then I can switch to using something like XLNet, use contextual embeddings instead of the mess of features right now, and hopefully everything will work cross-lingually.