dddpt opened this issue 3 years ago
Hello @dddpt !
Sorry for the slow response :(
For non-English/French texts, no NER is used, which means "terms" are selected only via the Wikipedia anchors of that language. So if the German Wikipedia has the best coverage for a given domain, there is no NER problem, because the anchors will be very rich and the disambiguation more frequent.
NER is nice for general text (like journalism, history, ...), because the named entity classes are very general. NER does not help for more specialized domains, like scientific domains: the Wikipedia vocabulary brings reliable terms, while NER is actually often noisy.
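For illustration, a minimal client sketch for disambiguating a German text against a local entity-fishing instance could look like this (a hedged sketch only: it assumes the default deployment on port 8090 and the query/response field names described in the REST documentation, which may differ depending on the version):

```python
import json
import requests

# Sketch: query a local entity-fishing instance for a German text.
# Assumptions: default port 8090 and the query/response fields from the
# REST documentation; adjust names if your version differs.
query = {
    "text": "Die Universität Basel wurde 1460 gegründet.",
    "language": {"lang": "de"},
    # for German, only Wikipedia-anchor-based mention detection applies (no NER)
    "mentions": ["wikipedia"],
    "nbest": False,
}

resp = requests.post(
    "http://localhost:8090/service/disambiguate",
    files={"query": (None, json.dumps(query))},
)
resp.raise_for_status()
for entity in resp.json().get("entities", []):
    print(entity.get("rawName"), entity.get("wikidataId"), entity.get("confidence_score"))
```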
Hi @kermitt2,
Thanks for the answer ;-)
A Wikipedia anchor is the text of a link from one Wikipedia article to another, right?
So does it mean that each time a term (or a sequence of terms?) matches an anchor in the Wikipedia of the corresponding language, it is recognized as an entity? Doesn't it label almost every word as an entity?
(and while I'm at it, is there a technical report/article detailing entity-fishing in addition to the readthedocs?)
> So does it mean that each time a term (or a sequence of terms?) matches an anchor in the Wikipedia of the corresponding language, it is recognized as an entity?
It is recognized as an entity candidate; this is more or less how all entity linking tools work (although often not at full scale). In English for instance, there are 206 million "terms" (anchors, plus article titles and synonyms, single- or multiple-word terms) considered by entity-fishing for every input. Each of these 206 million terms is associated with one or several Wikidata entities.
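To make the anchor-vocabulary idea concrete, here is a deliberately tiny toy sketch (hypothetical anchors and placeholder Wikidata IDs, not the actual entity-fishing lexicon): every n-gram of the input that matches a known anchor or title becomes a mention with one or several candidate entities.

```python
# Toy sketch of anchor-based candidate generation. The anchors and Wikidata IDs
# below are placeholders for illustration, not the real entity-fishing lexicon.
ANCHOR_INDEX = {
    "basel": ["Q-city", "Q-football-club"],
    "universität basel": ["Q-university"],
}

def find_mentions(text, max_ngram=3):
    """Return (surface form, candidate entities) for every n-gram matching an anchor."""
    tokens = text.lower().split()
    mentions = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_ngram, len(tokens)) + 1):
            surface = " ".join(tokens[i:j])
            if surface in ANCHOR_INDEX:
                mentions.append((surface, ANCHOR_INDEX[surface]))
    return mentions

print(find_mentions("Die Universität Basel wurde 1460 gegründet"))
# [('universität basel', ['Q-university']), ('basel', ['Q-city', 'Q-football-club'])]
```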
> Doesn't it label almost every word as an entity?
Well, indeed plenty of words/multi-word terms might be considered (what I call a "mention"), leading to a massive amount of entity candidates. The challenge is to 1) select the most likely correct entity candidate for each mention and 2) decide if that most likely candidate is acceptable (so reject some "linkings", because the term is used as a common word, not as a reference to a particular entity). Only a few candidates are finally selected as labeled entities.
The "best" mentions and entities are selected by learning from the disambiguation made by the Wikipedia contributors when adding anchors in Wikipedia.
> (and while I'm at it, is there a technical report/article detailing entity-fishing in addition to the readthedocs?)
This presentation at WikiDataCon https://grobid.s3.amazonaws.com/presentations/29-10-2017.pdf
Great, thanks for the detailed reply :+1:
I am using entity-fishing on a corpus of ~35k documents with a French, an Italian and a German version.
In the entity-fishing documentation, there is this paragraph:
What does it mean for non-English/French texts? Is another named entity recognition system used? Should I expect worse results for entity recognition on German and Italian?
The German Wikipedia has the best coverage of the topics in my corpus, so I was thinking of focusing on the German version of the corpus. Now I'm wondering if I should instead focus on the French version, hoping for better recognition performance. Any hints?
Thanks for this great tool! :-)