coreferee does not take into account merged tokens

richardpaulhudson / coreferee

Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

MIT License

102 stars 16 forks

While I was trying to use Coreferee to replace proper nouns with their corresponding references, Coreferee returned the wrong token indexes. This issue only occurs if a merge was done beforehand.

import spacy
import coreferee  # makes the 'coreferee' pipeline component available

nlp = spacy.load('en_core_web_trf')
nlp.add_pipe('coreferee')

doc = nlp("the big bad wolf is small, he is also bad")
# Merging after the pipeline has run is the step that triggers the problem.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[1:4])

def coref(doc):
    # Replace every token that Coreferee can resolve with the text of its referent(s).
    resolved_text = ""
    for token in doc:
        print('token:', token)
        repres = doc._.coref_chains.resolve(token)
        if repres:
            print("refer to: ", repres)
            resolved_text += " " + " and ".join([t.text for t in repres])
        else:
            resolved_text += " " + token.text
    return resolved_text

resolved_text = coref(doc)
print(resolved_text)

I expect "he" to refer to "big bad wolf", but I get "small" instead.
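
One possible reading of this result, assuming the coreference chains keep the token indexes that were computed before the merge: the stored index of "wolf" is 3, and after `doc[1:4]` is merged, position 3 holds "small". The sketch below reproduces that shift with plain Python lists only. It does not use spaCy or Coreferee objects, and `merge_span` and `remap_index` are hypothetical helpers written for illustration, not part of either library.

```python
# Plain-Python illustration; no spaCy or Coreferee objects are involved.
tokens = ["the", "big", "bad", "wolf", "is", "small", ",", "he", "is", "also", "bad"]

# Assumption: the chain stored the index of the head noun "wolf" before the merge.
stored_index = tokens.index("wolf")  # 3


def merge_span(tokens, start, end):
    """Merge tokens[start:end] into a single token, mimicking a retokenizer merge."""
    return tokens[:start] + [" ".join(tokens[start:end])] + tokens[end:]


def remap_index(index, start, end):
    """Translate a pre-merge token index into its post-merge index (hypothetical helper)."""
    if index < start:
        return index                      # before the merged span: unchanged
    if index < end:
        return start                      # inside the span: the merged token
    return index - (end - start - 1)      # after the span: shifted left


merged = merge_span(tokens, 1, 4)
stale = merged[stored_index]                     # the stale index lands on another token
fixed = merged[remap_index(stored_index, 1, 4)]  # the remapped index lands on the merged token
print(stale, "|", fixed)  # small | big bad wolf
```

If this reading is right, calling `resolve` before the merge rather than after it may avoid the mismatch, since the stored indexes would still match the document at that point.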

coreferee does not take into account merged tokens #24