kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0
249 stars, 24 forks

Problem of disambiguation of ENs according to the case or spelling of terms #132

Open aa303554 opened 3 years ago

aa303554 commented 3 years ago

In French, entity-fishing recognises Ireland inconsistently depending on case and spelling. "Irlande" is the correct French spelling, "Ireland" is the English spelling, and the other forms listed below are misspellings of "Irlande". The behaviour also depends on case: the same spelling may or may not be recognised depending on whether it starts with a capital letter. Other countries do not show this problem (I tested with Japan).

Correctly recognised: irlande, irland (incorrect), irelande (incorrect), Irelande (incorrect)

Recognised only as NER type LOCATION: Irlande, Ireland (incorrect)

Not recognised: Irland, ireland

See below for example.

[two screenshots]

kermitt2 commented 3 years ago

Thanks for the issue, @aa303554!

Currently, each spelling variant relies on its own statistics as an anchor in the French Wikidata, which explains these different behaviours. If we don't have enough context for a spelling variant, the term will never be linked to the Wikidata entity in the current state of entity-fishing.

To improve this, my idea is to use better smoothing and priors for variants and for Wikidata labels (currently not used); see #72.
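To make the smoothing idea concrete, here is a toy sketch, not the entity-fishing implementation. It assumes each surface form has a pair of counts (times used as a link anchor, total occurrences) and lets a rare or unseen case variant back off to the statistics pooled over all its case variants. The counts, the function names and the `alpha` pseudo-count are all hypothetical.

```python
from collections import defaultdict

# Hypothetical anchor statistics, made up for illustration only:
# surface form -> (times used as a link anchor, total occurrences).
ANCHOR_STATS = {
    "Irlande": (90, 100),
    "irlande": (1, 2),
}


def pooled_stats(stats):
    """Sum the counts of all surface forms that share a case-folded form."""
    pooled = defaultdict(lambda: [0, 0])
    for surface, (linked, total) in stats.items():
        key = surface.casefold()
        pooled[key][0] += linked
        pooled[key][1] += total
    return pooled


def smoothed_link_probability(surface, stats, alpha=5.0):
    """Link probability of a variant, smoothed towards the pooled prior.

    alpha is a pseudo-count (must be > 0): the larger it is, the more a
    sparsely observed variant relies on the pooled statistics.
    """
    pooled = pooled_stats(stats)
    p_linked, p_total = pooled.get(surface.casefold(), (0, 0))
    prior = p_linked / p_total if p_total else 0.0
    linked, total = stats.get(surface, (0, 0))
    return (linked + alpha * prior) / (total + alpha)


for form in ("Irlande", "irlande", "IRLANDE", "Japon"):
    print(form, round(smoothed_link_probability(form, ANCHOR_STATS), 3))
```

With this kind of back-off, an unseen case variant inherits the pooled estimate instead of having no statistics at all. Misspellings such as "Irland" would need a fuzzier normalisation than case folding, which this sketch does not attempt.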

Please note: the example you are using is too short for the normal text disambiguation field (the normal unit for the text field is closer to a paragraph), so you need to use the short text input. This might not solve the spelling error/variant problem in general, but it will work better. I think, however, that it solves your first example:

Screenshot from 2021-08-02 23-20-55
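For readers calling the REST API rather than the demo console, the difference between the two inputs can be sketched as below. This is a minimal sketch that assumes the query JSON uses a `shortText` field for short inputs and a `text` field for paragraph-length inputs, with the language given as `{"lang": ...}`; please check the field names against the entity-fishing documentation. The helper name, the local URL and the commented-out `requests` call are assumptions for illustration.

```python
import json


def build_disambiguation_query(text, lang="fr", short=True):
    """Build a query payload, choosing the short-text or full-text field.

    Field names ("shortText", "text", "language") are assumed from the
    entity-fishing REST documentation; verify them for your version.
    """
    field = "shortText" if short else "text"
    return {field: text, "language": {"lang": lang}}


query = build_disambiguation_query("irlande")
print(json.dumps(query, ensure_ascii=False))

# Sending the query (assumed local service URL, not verified here):
# import requests
# response = requests.post(
#     "http://localhost:8090/service/disambiguate",
#     files={"query": (None, json.dumps(query))},
# )
```

Passing `short=False` produces the same payload with the `text` field instead, which is the one intended for paragraph-length input.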