kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Named Entity Recognition and Classification for languages other than EN/FR #136

Open dddpt opened 3 years ago

dddpt commented 3 years ago

I am using entity-fishing on a corpus of ~35k documents with a French, an Italian and a German version.

In the entity-fishing documentation, there is this paragraph:

The tool currently supports English, German, French, Spanish and Italian languages (more to come!). For English and French, a Named Entity Recognition based on CRF grobid-ner is used in combination with the disambiguation. For each recognized entity in one language, it is possible to complement the result with crosslingual information in the other languages. A nbest mode is available. Domain information is produced for a large amount of entities in the technical and scientific fields, together with Wikipedia categories and confidence scores.

What does it mean for non-English/French texts? Is another named entity recognition system used? Should I expect worse results for entity recognition on German and Italian?

The German Wikipedia has the best coverage of the topics in my corpus, so I was thinking of focusing on the German version of the corpus. Now I'm wondering if I should instead focus on the French version, hoping for better recognition performance. Any hints?

Thanks for this great tool! :-)

kermitt2 commented 2 years ago

Hello @dddpt !

Sorry for the slow response :(

For non-English/French texts, no NER is used, which means "terms" are selected only via the Wikipedia anchors of that language. So if the German Wikipedia has the best coverage for a given domain, there is no NER problem: the anchors will be very rich and the disambiguation more frequent.
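For reference, a minimal way to try this on a non-English/French text is to send it to the /service/disambiguate endpoint with an explicit language. A small sketch below, assuming a local entity-fishing instance on the default port 8090; the URL, port and response field names are assumptions to check against the readthedocs documentation:

```python
import json
import requests

# Assumed local entity-fishing instance; adjust host/port to your deployment.
NERD_URL = "http://localhost:8090/service/disambiguate"

query = {
    # German text: mentions come from German Wikipedia anchors/titles,
    # since no NER model is applied for languages other than English/French.
    "text": "Die Habsburger regierten lange Zeit von Wien aus.",
    "language": {"lang": "de"},
}

# The query is sent as a JSON string in a multipart form field named "query".
resp = requests.post(NERD_URL, files={"query": (None, json.dumps(query))})
resp.raise_for_status()

for entity in resp.json().get("entities", []):
    print(entity.get("rawName"), entity.get("wikidataId"), entity.get("nerd_score"))
```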

NER is nice for general text (like journalism, history, ...), because the named entity classes are very general. NER does not help for more specialized domains, like scientific domains: there, the Wikipedia vocabulary brings reliable terms, while NER is often noisy.

dddpt commented 2 years ago

Hi @kermitt2,

Thanks for the answer ;-)

A Wikipedia anchor is the text of a link from one Wikipedia article to another, right?

So it means that each time a term (or a sequence of terms?) corresponds to any anchor in the Wikipedia of the corresponding language, it is recognized as an entity? Doesn't it label almost every word as an entity?

(and while I'm at it, is there a technical report/article detailing entity-fishing in addition to the readthedocs?)

kermitt2 commented 2 years ago

So it means that each time a term (or a sequence of terms?) corresponds to any anchor in the Wikipedia of the corresponding language, it is recognized as an entity?

It is recognized as an entity candidate; this is more or less how all entity linking tools work (although often not at full scale). In English for instance, there are 206 million "terms" (anchors, plus article titles and synonyms, single- or multi-word terms) considered by entity-fishing for every input. Each of these 206 million terms is associated with one or several Wikidata entities.
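To picture it, the candidate generation step behaves like a huge dictionary from surface forms (anchors, titles, synonyms) to possible Wikidata entities. A toy sketch below; the dictionary content and QIDs are purely illustrative, not the actual entity-fishing data structures:

```python
# Toy candidate dictionary; entity-fishing compiles the real one from
# Wikipedia/Wikidata dumps (~206 million terms for English).
term_to_candidates = {
    "paris": {"Q90", "Q167646"},   # the French capital plus other senses (illustrative QIDs)
    "georgia": {"Q230", "Q1428"},  # the country vs. the U.S. state (illustrative QIDs)
}

def candidates(mention: str) -> set[str]:
    """Return all Wikidata entities a surface form may refer to."""
    return term_to_candidates.get(mention.lower(), set())

print(sorted(candidates("Paris")))  # ['Q167646', 'Q90']
```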

Doesn't it label almost every word as an entity?

Well, indeed plenty of words/multi-word terms might be considered (what I call "mentions"), leading to a massive number of entity candidates. The challenge is to 1) select the most likely correct entity candidate and 2) decide whether this most likely one is acceptable (so reject some "linkings", because the term is used as a common word, not as a reference to a particular entity). Only a few candidates are finally selected as labeled entities.
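A minimal sketch of that two-step decision: rank the candidates for a mention, then reject even the best one when its score is too low. The scores and the threshold are placeholders, not the actual entity-fishing model:

```python
from typing import Optional

def link_mention(candidate_scores: dict[str, float],
                 threshold: float = 0.5) -> Optional[str]:
    """Pick the highest-scored candidate entity, or return None when even
    the best score is too low (the mention is treated as a common word)."""
    if not candidate_scores:
        return None
    best_entity, best_score = max(candidate_scores.items(), key=lambda kv: kv[1])
    return best_entity if best_score >= threshold else None

# Illustrative scores: "paris" is confidently linked, "home" is rejected.
print(link_mention({"Q90": 0.93, "Q167646": 0.04}))  # Q90
print(link_mention({"Q170280": 0.12}))               # None
```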

The "best" mentions and entities are selected by learning the disambiguation made by the wikipedia contributors when adding anchors in Wikipedia.

(and while I'm at it, is there a technical report/article detailing entity-fishing in addition to the readthedocs?)

This presentation at WikiDataCon: https://grobid.s3.amazonaws.com/presentations/29-10-2017.pdf

dddpt commented 2 years ago

Great, thanks for the detailed reply :+1: