richardpaulhudson / holmes-extractor

Information extraction from English and German texts based on predicate logic
MIT License

Guidance requested on using doc.retokenize #6

Closed: adelevie closed this issue 2 years ago

adelevie commented 2 years ago

I'm having a great time with this library, but I am running into the following issue:

As mentioned in https://github.com/explosion/holmes-extractor/issues/2#issuecomment-1195536077, I am manually setting entities using doc.set_ents. I found that some obvious / verbatim search phrases (e.g. something like ENTITYCUSTOM_ENT1 something ENTITYCUSTOM_ENT2) would not match properly until I made sure to retokenize the doc before registering it with the manager:

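from spacy.util import filter_spans

docs = []
# Drop overlapping spans: set_ents and retokenizer.merge both reject overlaps.
spans = filter_spans(spans)
doc.set_ents(spans, default='unmodified')
# Merge each entity span into a single token tagged as a noun.
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span, attrs={'POS': 'NOUN'})
docs.append(doc)
# ...
for doc in docs:
    manager.register_serialized_document(doc.to_bytes(), label=label)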

However, this also appears to have created a misalignment in the doc length, and I am not sure how. Some search phrases end up throwing this error when a match is found:

IndexError: [E026] Error accessing token at position 145: out of bounds in Doc of length 88.

I was able to debug that particular doc, and found that my retokenized doc has 88 (word) tokens. If I run the doc.text through manager.nlp, however, the resulting doc has 145 word tokens. This suggests to me that the merged span information is lost somewhere, perhaps after some internal call to manager.nlp?

richardpaulhudson commented 2 years ago

Yes, the problem here is that Holmes and Coreferee, on which it relies, are both applied to the document when it is first parsed, which is before retokenization. This means that the Coreferee and Holmes information stored on the document refers to the original token positions and is no longer valid after you retokenize it.
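To illustrate why stale token positions produce an out-of-bounds error, here is a toy sketch in plain Python. It does not use spaCy, Holmes or Coreferee, and the `merge_spans` helper and the sample tokens are purely hypothetical; it only models the general idea that an index recorded before a merge can point past the end of the shorter, merged sequence.

```python
def merge_spans(tokens, spans):
    """Merge each (start, end) span (end exclusive) into a single token.

    Hypothetical helper for illustration only; spans must not overlap.
    """
    ends_by_start = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in ends_by_start:
            end = ends_by_start[i]
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["Acme", "Corp", "hired", "Jane", "Q.", "Public", "yesterday"]

# An annotation recorded before merging: the position of the last token.
stored_index = len(tokens) - 1

# Merge the two multi-word entities into single tokens.
merged = merge_spans(tokens, [(0, 2), (3, 6)])

# The stored position was valid for the original tokens but not for the
# merged ones, so looking it up now fails.
stale = False
try:
    merged[stored_index]
except IndexError:
    stale = True
```

In the toy case the sequence shrinks from seven tokens to four, so position 6 no longer exists, which is the same shape of failure as position 145 in a doc of length 88.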

The problem should be solved if you run the document back through both extensions after retokenization:

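# Fetch the two pipeline components and re-apply them to the retokenized doc
# so that the information they store matches the merged tokens.
coreferee_ext = manager.nlp.get_pipe("coreferee")
holmes_ext = manager.nlp.get_pipe("holmes")
coreferee_ext(doc)
holmes_ext(doc)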
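If you have many documents, the two calls can be wrapped in a small helper. This is only a sketch: `rerun_pipes` is a hypothetical name, not part of Holmes or spaCy, and it assumes the components are registered under the names "coreferee" and "holmes" as above. The only library call it relies on is `get_pipe`, and it assumes each component returns the processed doc, as spaCy pipeline components do.

```python
def rerun_pipes(nlp, doc, pipe_names=("coreferee", "holmes")):
    """Re-apply the named pipeline components to a retokenized doc, in order.

    Hypothetical helper: `nlp` is expected to behave like manager.nlp,
    i.e. to offer get_pipe(name) returning a callable component.
    """
    for name in pipe_names:
        component = nlp.get_pipe(name)
        # Components take a doc and return the processed doc.
        doc = component(doc)
    return doc


# Intended usage, after the retokenize block and before registration:
#     doc = rerun_pipes(manager.nlp, doc)
#     manager.register_serialized_document(doc.to_bytes(), label=label)
```

The order matters here because Holmes relies on Coreferee, so Coreferee is listed first.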