Open aa303554 opened 3 years ago
Thanks for the issue @aa303554 !
Currently each spelling variant relies on its own statistics as an anchor in the French Wikidata, which explains these different behaviours. If we don't have enough context for a spelling variant, the term will never be linked to the Wikidata entity in the current state of entity-fishing.
To improve this, my idea is to use better smoothing and priors for variants and for Wikidata labels (currently not used); see #72.
Please note: the example you are using is too short for the normal text disambiguation field (the normal unit for the text field is closer to a paragraph); you need to use the short text input. It might not solve the spelling error/variant problem in general, but it will work better. I think, however, that it solves your first example:
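As a rough illustration of the difference between the two input fields (a sketch, not the original example from this thread): the snippet below builds the JSON query for either field. It assumes the query uses `shortText` for short inputs, `text` for paragraph-sized inputs, and a `language.lang` code; check these field names against the entity-fishing documentation for your version. The helper `build_query` is hypothetical and only builds the payload, it does not call the service.

```python
import json


def build_query(content, lang="fr", short=True):
    """Build a disambiguation query as a dict.

    Assumption: 'shortText' is the short-input field and 'text' the
    paragraph-sized one; verify against your entity-fishing version.
    """
    field = "shortText" if short else "text"
    return {field: content, "language": {"lang": lang}}


if __name__ == "__main__":
    # Spelling and case variants taken from the report in this issue.
    for variant in ["Irlande", "irlande", "Irland", "ireland"]:
        print(json.dumps(build_query(variant), ensure_ascii=False))
```

Sending each payload to your own entity-fishing instance and comparing the returned entities per variant should show whether the short text input changes the behaviour reported here.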
In French, entity-fishing has difficulty recognising Ireland depending on case and spelling. "Irlande" is the correct French spelling, "Ireland" is the English spelling, and the others are misspellings of "Irlande". The behaviour is inconsistent: the same spelling is not recognised in the same way depending on whether it starts with a capital letter. Other countries do not show this problem (I tested with Japan).
Correctly recognised: irlande, irland (incorrect), irelande (incorrect), Irelande (incorrect)
Recognised only as NER type LOCATION: Irlande, Ireland (incorrect)
Not recognised: Irland, ireland
See below for example.