o19s / hello-nlp

A natural language search microservice

Coreference resolution plugin #12

Open binarymax opened 3 years ago

binarymax commented 3 years ago

Coreference resolution should be added as a pre-tokenization step. Replacing pronouns and other referring expressions with the entities they refer to should improve recall for knowledge graph extraction, and should also improve BM25 accuracy.
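A rough sketch of what that pre-tokenization rewrite could look like, independent of which coref library ends up being used. Everything here is an assumption for illustration: the `resolve_coreferences` name, the cluster shape (a representative string plus character-offset spans), and the hand-written cluster standing in for real model output. It is not hello-nlp code.

```python
from typing import List, Tuple

# Hypothetical data shape: (representative_text, [(start, end), ...]) with
# character offsets into the original text. A real coref model would
# produce these clusters; here one is written by hand.
Cluster = Tuple[str, List[Tuple[int, int]]]


def resolve_coreferences(text: str, clusters: List[Cluster]) -> str:
    """Replace each mention span with its cluster's representative text."""
    replacements = []
    for representative, spans in clusters:
        for start, end in spans:
            replacements.append((start, end, representative))
    # Apply from right to left so earlier offsets stay valid after each edit.
    for start, end, representative in sorted(replacements, reverse=True):
        text = text[:start] + representative + text[end:]
    return text


text = "Ada wrote the first program. She published it in 1843."
start = text.index("She")
clusters = [("Ada", [(start, start + len("She"))])]

# The resolved text is what would be handed to the tokenizer and indexer.
resolved = resolve_coreferences(text, clusters)
print(resolved)
```

After the rewrite, the second sentence contains the term "Ada" rather than "She", so it can match a query for "Ada" and yield a triple with the entity as its subject.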

Coref poses several challenges. The two most important: accuracy can be low, and it adds a performance cost.

Candidates for the step include the neuralcoref library (https://github.com/huggingface/neuralcoref) and the BERT-based coref library (https://github.com/mandarjoshi90/coref). The former integrates easily with spaCy but has lower accuracy than the latter. The latter is more accurate but probably needs a GPU for reasonable performance, and it is finicky to get working (the example Colab notebook doesn't work out of the box).

binarymax commented 3 years ago

This is on hold until the upgrade to spaCy 3.0.