shon-otmazgin / fastcoref

MIT License

Matching clusters against the NER entities in spaCy pipeline #41

Closed by mattkallo 1 year ago

mattkallo commented 1 year ago

This is not an issue. There was no Discussions option, so I'm asking the question here.

Is there a way to link the coref clusters to the respective entities extracted by the NER component? A simple string match or search is not fully reliable. See the example below.

Cluster [{'text': 'Zappos, subsidiary of Amazon,', 'span': (913, 973)}, {'text': 'It', 'span': (2843, 2844)}] In this case, the cluster entry contains 2 entities ("Zappos" and "Amazon"), though it refers to "Zappos". I would like to link the entity "Zappos" to this cluster programmatically.

Thanks

shon-otmazgin commented 1 year ago

Hello, sorry for the late response; I was on vacation.

I'm not sure I understood the example you shared. Are these the NER outputs? If so, can you share the NER output, the coref output, and the desired output?

mattkallo commented 1 year ago

Hi @shon-otmazgin. Thanks for the response.

Those were coreference clusters. Below is a more detailed example.

from fastcoref import spacy_component
import spacy

nlp = spacy.load("./models/spaCy-en-large-model")
nlp.add_pipe("fastcoref")
text = "Zappos, a subsidiary of Amazon, started its online presence in 2011. It expanded outside of Americas in the year 2020"
doc = nlp(text)
print(doc._.coref_clusters)
# Output:
# [[(0, 31), (40, 43), (69, 71)]]

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
# Output:
# Zappos 0 6 ORG
# Amazon 24 30 ORG
# 2011 63 67 DATE
# Americas 92 100 GPE
# 2020 113 117 DATE

There is only one cluster in the example. The cluster head "Zappos, a subsidiary of Amazon," (0, 31) contains 2 entities, Zappos and Amazon. My question is: is there a way to identify "Zappos" as the entity the cluster refers to, even though the cluster head contains 2 entities?
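
One way to make the ambiguity concrete is to compare character offsets instead of strings: an entity belongs to a mention when its span lies inside the mention's span. Below is a minimal sketch that uses only the offsets printed above. The helper names `entities_in_mention` and `link_clusters` are hypothetical and are not part of fastcoref or spaCy.

```python
# Offsets copied from the outputs above.
clusters = [[(0, 31), (40, 43), (69, 71)]]
entities = [
    ("Zappos", 0, 6, "ORG"),
    ("Amazon", 24, 30, "ORG"),
    ("2011", 63, 67, "DATE"),
    ("Americas", 92, 100, "GPE"),
    ("2020", 113, 117, "DATE"),
]


def entities_in_mention(mention, entities):
    """Return the entities whose character span lies fully inside the mention."""
    start, end = mention
    return [ent for ent in entities if start <= ent[1] and ent[2] <= end]


def link_clusters(clusters, entities):
    """For every cluster, list the contained entities of each of its mentions."""
    return [
        [entities_in_mention(mention, entities) for mention in cluster]
        for cluster in clusters
    ]


linked = link_clusters(clusters, entities)
# Entity texts per mention of the first (and only) cluster.
print([[ent[0] for ent in mention] for mention in linked[0]])
# [['Zappos', 'Amazon'], [], []]
```

The first mention contains both ORG entities, which is exactly the case the question is about: offset containment alone cannot say which of the two the cluster refers to.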

shon-otmazgin commented 1 year ago

So this is like a nested entity? The model should also find nested entities. Can you try running the LINGMESS model to see whether it predicts the nested one as well?

Regarding the cluster head: the cluster head is not well defined. Some will consider one entity the head and others a different one. I usually take the shortest entity that is also a proper noun, but that's my interpretation.
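
A rough sketch of that "shortest proper-noun entity" heuristic follows. It is one possible reading, not fastcoref functionality: the function name `pick_cluster_head` is hypothetical, "proper noun" is approximated by the entity's root token being tagged `PROPN`, and ties in length are broken by the earliest start offset, which is an extra assumption. It expects a spaCy `Doc` and one cluster as a list of `(start_char, end_char)` tuples.

```python
def pick_cluster_head(doc, cluster):
    """Pick the shortest proper-noun entity found inside any mention of a cluster.

    doc     -- a spaCy Doc with named entities and POS tags
    cluster -- list of (start_char, end_char) mention offsets
    Returns a spaCy Span from doc.ents, or None if no candidate exists.
    """
    candidates = [
        ent
        for ent in doc.ents
        # Assumption: an entity counts as a proper noun if its root token is PROPN.
        if ent.root.pos_ == "PROPN"
        # Keep only entities that lie fully inside one of the cluster's mentions.
        and any(start <= ent.start_char and ent.end_char <= end
                for start, end in cluster)
    ]
    # Shortest text first; ties go to the entity that starts earliest (assumption).
    return min(
        candidates,
        key=lambda ent: (len(ent.text), ent.start_char),
        default=None,
    )
```

Note that in the example from this thread "Zappos" and "Amazon" are both six characters long, so length alone does not separate them; only the tie-break would favour "Zappos".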

mattkallo commented 1 year ago

Yes, this is a nested entity. It can appear in many similar cases, e.g. "Jane Doe, supervisor of John Doe". Let me try LINGMESS. I have found a workaround: when a mention contains more than one entity, I find the main entity using the dependency parse tree.
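
The workaround is not spelled out in the thread, so the following is only one plausible sketch of it: take the syntactic root of the mention (spaCy's `Span.root`) and return the entity that contains that token. The function name `main_entity` is hypothetical; `Doc.char_span`, `Span.root`, `Token.i`, and `Doc.ents` are regular spaCy attributes. It needs a pipeline with a dependency parser and an NER component.

```python
def main_entity(doc, start_char, end_char):
    """Return the entity that contains the syntactic root of a mention.

    doc                  -- a spaCy Doc with a dependency parse and entities
    start_char, end_char -- character offsets of the mention, as in
                            doc._.coref_clusters
    Returns a spaCy Span from doc.ents, or None if the root is not in an entity.
    """
    # "expand" snaps the offsets outward to token boundaries, so a mention
    # that ends in the middle of a token still yields a span.
    span = doc.char_span(start_char, end_char, alignment_mode="expand")
    if span is None:
        return None

    # Span.root is the token of the span closest to the sentence root.
    root = span.root

    # Entity boundaries are token indices: start inclusive, end exclusive.
    for ent in doc.ents:
        if ent.start <= root.i < ent.end:
            return ent
    return None
```

With the `doc` from the snippet above, `main_entity(doc, 0, 31)` should return the "Zappos" entity, provided the parser attaches "subsidiary" as an appositive of "Zappos"; that depends on the model's parse and is not guaranteed.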