Coreference resolution should be added as a pre-tokenization step. This will improve the knowledge graph extraction recall, and also improve BM25 accuracy.
Coref poses several challenges, most importantly: accuracy can be low, and a performance hit will be incurred.
Candidates for the step include the neuralcoref library (https://github.com/huggingface/neuralcoref), and the BERT based coref library (https://github.com/mandarjoshi90/coref). The former offers easy integration with spaCy, but has a lower accuracy than the latter. The latter offers higher accuracy but probably needs a GPU for reasonable performance, and is finicky to get working (the example colab notebook doesn't work out of the box).
Coreference resolution should be added as a pre-tokenization step. This will improve the knowledge graph extraction recall, and also improve BM25 accuracy.
Coref poses several challenges, most importantly: accuracy can be low, and a performance hit will be incurred.
Candidates for the step include the neuralcoref library (https://github.com/huggingface/neuralcoref), and the BERT based coref library (https://github.com/mandarjoshi90/coref). The former offers easy integration with spaCy, but has a lower accuracy than the latter. The latter offers higher accuracy but probably needs a GPU for reasonable performance, and is finicky to get working (the example colab notebook doesn't work out of the box).