Closed adelevie closed 2 years ago
Yes, the problem here is that Holmes (and Coreferee on which it relies) are both applied to the document when it is first parsed and before retokenization. This means that the Coreferee and Holmes information stored on the document will no longer be valid after you retokenize it.
The problem should be solved if you run the document back through both extensions after retokenization:
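# `manager` is the Holmes Manager; `doc` is the document after retokenization
coreferee_ext = manager.nlp.get_pipe("coreferee")
holmes_ext = manager.nlp.get_pipe("holmes")
# Re-run Coreferee first, since Holmes relies on it
coreferee_ext(doc)
holmes_ext(doc)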
I'm having a great time with this library, but running into the following issue:
As mentioned in https://github.com/explosion/holmes-extractor/issues/2#issuecomment-1195536077, I am manually setting entities using `doc.set_ents`. I found that some obvious / verbatim search phrases (e.g. something like `ENTITYCUSTOM_ENT1 something ENTITYCUSTOM_ENT2`) would not match properly until I made sure to retokenize the doc before registering it with the manager:
However, this also appears to have created a misalignment in doc length, and I am not sure how. Some search phrases end up throwing this error when a match is found:
I was able to debug that particular doc, and found that my retokenized doc is 88 (word) tokens. If I run the `doc.text` through `manager.nlp`, however, the resulting doc is 145 word tokens. This suggests to me that somewhere the merged span information is lost, perhaps after some internal call to `manager.nlp`?
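The token-count mismatch described above can be reproduced with plain spaCy, independent of Holmes. The following is a minimal sketch, assuming a blank English pipeline and a made-up sentence; the text and span boundaries are illustrative and not taken from the actual document. It shows that merges made with `doc.retokenize()` exist only on that `Doc` object, so running `doc.text` through the pipeline again produces the original, unmerged tokenization.

```python
import spacy

# Blank pipeline: tokenizer only, no trained model required (assumption for this sketch)
nlp = spacy.blank("en")

# Illustrative text, not from the original issue
text = "Acme Widget Corp acquired Globex Holdings Ltd last year"
doc = nlp(text)
n_before = len(doc)

# Merge two multi-token spans into single tokens.
# Indices refer to the tokenization as it was before the merges are applied.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:3])  # "Acme Widget Corp"
    retokenizer.merge(doc[4:7])  # "Globex Holdings Ltd"
n_after = len(doc)

# The text itself is unchanged, so re-parsing it discards the merges
reparsed = nlp(doc.text)
n_reparsed = len(reparsed)

print(n_before, n_after, n_reparsed)
```

If any code path re-parses the raw text rather than reusing the retokenized `Doc`, token indices computed on one version will not line up with the other, which would be consistent with the 88 versus 145 discrepancy.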