kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Different results when supplying entity spans #135

Open antonyscerri opened 3 years ago

antonyscerri commented 3 years ago

If I run the exact same passage of text a) with no preexisting entity spans and with the ner and wikipedia mentioners, b) with a full set of existing entity spans and no ner or wikipedia mentioners, and c) same as b but with only one or two of the entities, I end up with three different sets of output. In the case of a and b I'll get largely the same spans, but the linked concepts will differ, as will the scores. In the case of c I won't get any results, even though the same span will generate a result for a and b.

I would have expected that, for the same input text, where the ner or wikipedia mentioners find the same span as the one I pass in, the outputs should be the same. Contextually they would be identical, unless it is leveraging other surrounding entities (post linking) to additionally help resolve. Even if I pass all the entity spans found in scenario a into b, I get different results: not all of the same spans come out, and the score and selected concept can change too.
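To make the three scenarios concrete, here is a minimal sketch of how the query payloads might be built. The field names (`text`, `language`, `entities`, `mentions`, `rawName`, `offsetStart`, `offsetEnd`) are my understanding of the entity-fishing query format and should be checked against the documentation. The sample sentence, the helper names, and the assumption that an empty `mentions` list turns the mentioners off are illustrative only and are not taken from the attached files.

```python
import json

# Illustrative text only; not the passage from the attached files.
TEXT = "Austria invaded and fought the Serbian army at the Battle of Cer."


def find_span(text, surface):
    """Build an entity span dict for the first occurrence of `surface`."""
    start = text.index(surface)
    return {
        "rawName": surface,
        "offsetStart": start,
        "offsetEnd": start + len(surface),
    }


def build_query(text, entities=None, mentions=None):
    """Assemble a query payload (field names are assumptions, see above)."""
    return {
        "text": text,
        "language": {"lang": "en"},
        "entities": entities or [],
        # Assumption: an empty list disables the ner/wikipedia mentioners.
        "mentions": mentions or [],
    }


all_spans = [find_span(TEXT, s) for s in ("Austria", "Serbian army", "Battle of Cer")]

# a) no supplied spans, both mentioners enabled
query_a = build_query(TEXT, mentions=["ner", "wikipedia"])
# b) full set of supplied spans, no mentioners
query_b = build_query(TEXT, entities=all_spans)
# c) same as b, but only one of the spans
query_c = build_query(TEXT, entities=all_spans[:1])

for name, query in (("a", query_a), ("b", query_b), ("c", query_c)):
    print(name, json.dumps(query))
```

Each payload would then be sent to the disambiguation service as usual; the only differences between the three are the `entities` and `mentions` values.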

kermitt2 commented 3 years ago

Thank you very much @antonyscerri for the issue. It does indeed look like a problem for c), and it does not seem to be the expected behavior. However, to be sure I understand the problem and can reproduce it, would it be possible to have an example for a/b/c?

For a) and b), the origin of the identified span (the mention) has an impact on the classifier. A span identified by NER (with a NER class that restricts the sense, but also some unreliability) will contribute differently from a span corresponding to a Wikipedia mention (which is "certain" as a known expression, but more ambiguous semantically). But I might be misunderstanding case b), and an example would really help.

antonyscerri commented 3 years ago

Here is a set of files: the three (a, b, c) inputs (.json files) and the corresponding outputs (.resp files). intra.zip
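For comparing two of the outputs span by span, a small helper along these lines may be useful. It is only a sketch: it assumes each response, once parsed from JSON, has an `entities` list whose items carry `offsetStart`, `offsetEnd`, `wikidataId` and `confidence_score`; those key names are assumptions and may differ between versions. The function names and the sample data are hypothetical, with placeholder IDs, and are not results from the attached files.

```python
def index_by_span(entities):
    """Map (offsetStart, offsetEnd) -> entity dict."""
    return {(e["offsetStart"], e["offsetEnd"]): e for e in entities}


def diff_responses(resp_x, resp_y):
    """Report spans missing from either response and spans whose
    linked concept or score differs between the two."""
    x = index_by_span(resp_x.get("entities", []))
    y = index_by_span(resp_y.get("entities", []))
    report = {
        "only_in_x": sorted(set(x) - set(y)),
        "only_in_y": sorted(set(y) - set(x)),
        "changed": [],
    }
    for span in sorted(set(x) & set(y)):
        ex, ey = x[span], y[span]
        same_concept = ex.get("wikidataId") == ey.get("wikidataId")
        same_score = ex.get("confidence_score") == ey.get("confidence_score")
        if not (same_concept and same_score):
            report["changed"].append({
                "span": span,
                "concept": (ex.get("wikidataId"), ey.get("wikidataId")),
                "score": (ex.get("confidence_score"), ey.get("confidence_score")),
            })
    return report


# Made-up data with placeholder IDs, purely to show the shape of the report.
resp_a = {"entities": [
    {"offsetStart": 0, "offsetEnd": 7, "wikidataId": "ID-1", "confidence_score": 0.9},
    {"offsetStart": 31, "offsetEnd": 43, "wikidataId": "ID-2", "confidence_score": 0.5},
]}
resp_b = {"entities": [
    {"offsetStart": 0, "offsetEnd": 7, "wikidataId": "ID-3", "confidence_score": 0.4},
]}

report = diff_responses(resp_a, resp_b)
print(report)
```

Keying on the character offsets rather than on the surface string keeps repeated mentions of the same name apart.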