huggingface / neuralcoref

✨Fast Coreference Resolution in spaCy with Neural Networks
https://huggingface.co/coref/
MIT License
2.86k stars · 477 forks

Doesn't work when span is merged. #110

Closed lahsuk closed 5 years ago

lahsuk commented 5 years ago
nlp = spacy.load('en_coref_sm')
text = nlp("Michelle Obama is the wife of former U.S. President Barack Obama. Prior to her role as first lady, she was a lawyer.")

spans = list(text.noun_chunks)
for span in spans:
    span.merge()

for word in text:
    print(word)
    if(word._.in_coref):
        print(text._.coref_clusters)

When the above code is run, it gives the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-98-4252d464f86d> in <module>()
      1 for word in text:
      2     print(word)
----> 3     if(word._.in_coref):
      4         print(text._.coref_clusters)

~\Anaconda3\lib\site-packages\spacy\tokens\underscore.py in __getattr__(self, name)
     29         default, method, getter, setter = self._extensions[name]
     30         if getter is not None:
---> 31             return getter(self._obj)
     32         elif method is not None:
     33             return functools.partial(method, self._obj)

neuralcoref.pyx in __iter__()

span.pyx in __iter__()

span.pyx in spacy.tokens.span.Span._recalculate_indices()

IndexError: [E037] Error calculating span: Can't find a token ending at character offset 78.
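The traceback suggests that neuralcoref's extension getters hold Span objects whose character offsets were computed before the merge; once `span.merge()` changes the tokenization, those cached offsets no longer land on token boundaries, which is what E037 reports. A rough illustration of the same staleness in plain Python (the whitespace tokenizer below is a hypothetical stand-in, not spaCy's):

```python
text = "Barack Obama was president."
tokens = text.split()  # ['Barack', 'Obama', 'was', 'president.']

# Cache the character offset where each token ends, as an extension might.
ends = []
pos = 0
for tok in tokens:
    pos += len(tok)
    ends.append(pos)
    pos += 1  # skip the following space

# Merge the first two tokens, as span.merge() would.
tokens = ["Barack Obama"] + tokens[2:]

# Recompute end offsets for the merged tokenization.
new_ends = []
pos = 0
for tok in tokens:
    pos += len(tok)
    new_ends.append(pos)
    pos += 1

# The cached end offset of the old token 'Barack' (6) no longer matches
# any token boundary in the merged tokenization.
assert 6 in ends and 6 not in new_ends
```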

tomkomt commented 5 years ago

Hello,

We are running into a very similar issue right now, if not the same one. Is there a solution for this?

lahsuk commented 5 years ago
nlp       = spacy.load('en_core_web_sm')
nlp_coref = spacy.load('en_coref_sm')

doc = nlp_coref(s.strip())
if doc._.has_coref:
    doc = nlp(doc._.coref_resolved)

This is one way of doing it, but if the coreference resolution makes a mistake, it is hard to recover the original text that was replaced.
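One way to make such replacements recoverable (a minimal sketch in plain Python, not part of the neuralcoref API; `resolve_with_map`, `undo`, and the character-span format are hypothetical) is to record each substitution's character offsets so the resolved text can be mapped back to the original:

```python
def resolve_with_map(text, replacements):
    """Apply (start, end, new_text) character-span substitutions and
    record enough information to undo them later.

    `replacements` are assumed non-overlapping. Returns
    (resolved_text, undo_log), where undo_log maps each replaced span
    in the resolved text back to the original substring.
    """
    pieces = []
    undo_log = []
    last = 0
    out_len = 0
    for start, end, new_text in sorted(replacements):
        pieces.append(text[last:start])
        out_len += start - last
        # Offsets of new_text within the resolved string, plus the
        # original substring it replaced.
        undo_log.append((out_len, out_len + len(new_text), text[start:end]))
        pieces.append(new_text)
        out_len += len(new_text)
        last = end
    pieces.append(text[last:])
    return "".join(pieces), undo_log


def undo(resolved, undo_log):
    """Reverse the substitutions recorded by resolve_with_map.

    Applied right-to-left so earlier offsets stay valid.
    """
    text = resolved
    for start, end, original in sorted(undo_log, reverse=True):
        text = text[:start] + original + text[end:]
    return text
```

For example, resolving the two pronouns in "Prior to her role, she was a lawyer." with spans `(9, 12, "Michelle Obama's")` and `(19, 22, "Michelle Obama")` produces the resolved sentence plus a log that `undo` can use to restore the original exactly.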

thomwolf commented 5 years ago

I see. This error is still there in version 4.0. I'll open an issue on spaCy's GitHub for this.

thomwolf commented 5 years ago

For now, the simplest solution is to re-run the neuralcoref pipeline component on the retokenized document after the merges (please also note that the recommended way to do merges has changed). Here is a fixed example:

import spacy
import neuralcoref
nlp = spacy.load('en_core_web_sm')  # plain spaCy model; neuralcoref is added below
neuralcoref.add_to_pipe(nlp)

text = nlp("Michelle Obama is the wife of former U.S. President Barack Obama. Prior to her role as first lady, she was a lawyer.")

spans = list(text.noun_chunks)
with text.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

# Re-run NeuralCoref after the merges
text = nlp.get_pipe('neuralcoref')(text)

for word in text:
    print(word)
    if word._.in_coref:
        print(text._.coref_clusters)

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.