richardpaulhudson / coreferee

Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages
MIT License

coreferee does not take into account merged tokens #24

Closed shmouelsamares closed 1 year ago

shmouelsamares commented 1 year ago

While trying to use Coreferee to replace proper nouns with their corresponding references, I found that Coreferee returns the wrong token indexes. This issue only occurs if a merge was done beforehand.

import spacy
import coreferee  # makes the 'coreferee' pipeline component available

nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

doc = nlp("the big bad wolf is small, he is also bad")
# Merge "big bad wolf" into one token after the pipeline has already run
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:4])

def coref(doc):
    resolved_text = ""
    for token in doc:
        print('token:', token)
        # resolve() returns the tokens this token refers to, or None
        repres = doc._.coref_chains.resolve(token)
        if repres:
            print("refer to: ", repres)
            resolved_text += " " + " and ".join([t.text for t in repres])
        else:
            resolved_text += " " + token.text
    return resolved_text

resolved_text = coref(doc)
print(resolved_text)

I expect "he" to refer to "big bad wolf", but I get "small" instead.

richardpaulhudson commented 1 year ago

The problem here is that the coreference annotation is occurring within the pipeline and thus before the retokenization. You can work around it by re-annotating after the retokenization:

from coreferee.manager import CorefereeManager
# Get the annotator for the loaded model and run it again on the retokenized doc
ann = CorefereeManager().get_annotator(nlp)
ann.annotate(doc)
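
To illustrate why annotations computed before a merge can end up pointing at the wrong tokens, here is a minimal plain-Python sketch. It is only a simplified model: it assumes a chain is a list of token indexes recorded before the merge, which is not necessarily how coreferee stores its data, and the `merge` and `remap` helpers are hypothetical names made up for this example. Re-annotating after retokenization, as described above, remains the actual workaround.

```python
# Simplified model: tokens are plain strings and a "chain" is a list of
# token indexes. This is NOT coreferee's internal representation.
tokens = ["the", "big", "bad", "wolf", "is", "small", ",", "he", "is", "also", "bad"]

# Hypothetical chain recorded before the merge: "wolf" (3) and "he" (7).
chain = [3, 7]


def merge(tokens, start, end):
    """Return a new token list with tokens[start:end] joined into one token."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


def remap(index, start, end):
    """Map a pre-merge token index to its post-merge position."""
    if index < start:
        return index
    if index < end:
        return start  # the index fell inside the merged span
    return index - (end - start - 1)


merged = merge(tokens, 1, 4)

# Indexes recorded before the merge are now stale.
stale = [merged[i] for i in chain]

# After remapping, the same chain points at the intended tokens again.
fixed = [merged[remap(i, 1, 4)] for i in chain]

print(stale)
print(fixed)
```

In this model the stale index 3 lands on "small" once "big bad wolf" has become a single token, which matches the symptom reported above.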