richardpaulhudson / holmes-extractor

Information extraction from English and German texts based on predicate logic

Custom named entities #2

Closed: NixBiks closed this issue 2 years ago

NixBiks commented 2 years ago

Hi @richardpaulhudson

Congratulations on your work on Holmes. It's a very cool approach.

I'm wondering if it's possible to provide custom named entities, e.g. using the SpanRuler (or EntityRuler) component from spaCy. I'm thinking the answer is no at this point, but what would it take? I imagine it would require some training?

adelevie commented 2 years ago

I would also like to be able to do this. I am wondering if rather than passing a string representing a model name, one could pass an nlp instance (or factory) directly.

NixBiks commented 2 years ago

> I would also like to be able to do this. I am wondering if rather than passing a string representing a model name, one could pass an nlp instance (or factory) directly.

You can make your own custom models accessible by string representation using entry points. However, I'm pretty sure that Holmes requires a specifically trained model to work.

richardpaulhudson commented 2 years ago

This is possible and doesn't require any extra training. However, the best way to approach it is to load the language model (pipeline) into Holmes and add the SpanRuler / EntityRuler to it there. This ensures that the base model is one of the standard spaCy models, which is important because Holmes uses Coreferee for coreference resolution, and Coreferee is trained on the standard spaCy models. If you load a model that is not one of the standard spaCy models, one of two things will happen:

In the next version of Holmes, I think I shall add an option not to load Coreferee, so that people can run Holmes with their own models and without coreference resolution. But in the meantime you can add a SpanRuler or EntityRuler to one of the standard models like this:

import holmes_extractor as holmes

manager = holmes.Manager("en_core_web_trf")
# Add a SpanRuler to the pipeline loaded by Holmes; annotate_ents=True writes the
# matched spans to doc.ents (which is what Holmes looks at), spans_key=None skips
# doc.spans, and overwrite=False leaves the model's existing entities untouched.
config = {"spans_key": None, "annotate_ents": True, "overwrite": False}
ruler = manager.nlp.add_pipe("span_ruler", config=config)
patterns = [{"label": "FOOD", "pattern": "ice cream"}]
ruler.add_patterns(patterns)
manager.parse_and_register_document("We ate some ice cream")
# ENTITYFOOD in a search phrase matches any entity carrying the label FOOD
manager.register_search_phrase("Somebody eats ENTITYFOOD")
manager.match()
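
(A small addition of mine, not part of the original comment: if I remember correctly, manager.match() returns the matches as a list of dictionaries, so printing them is a quick way to check that "ice cream" was indeed picked up as ENTITYFOOD.)

# Not from the thread: print the match dictionaries to inspect the result
for match in manager.match():
    print(match)
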
adelevie commented 2 years ago

Thanks for the detailed write-up, @richardpaulhudson.

I am wondering what the best hook/entry point for modifying a doc instance might be. For example, https://spacy.io/api/doc#set_ents allows one to set named entities manually (using word-token or character offsets). In my particular use case, I have a knowledge base of such offsets, which thankfully means I can use these entities with virtually any model and no training. I am thinking of something along the lines of:

from spacy.tokens import Span
import holmes_extractor as holmes

manager = holmes.Manager("en_core_web_trf")
manager.register_search_phrase('ENTITYFELINE are better than ENTITYCANINE')

# Parse with the Manager's own pipeline, then set the custom entities manually
doc = manager.nlp("Cats are better than dogs.")
doc.set_ents([Span(doc, 0, 1, "FELINE")], [Span(doc, 4, 5, "CANINE")])

manager.register_serialized_document(doc.to_bytes(), label="doc1")
richardpaulhudson commented 2 years ago

@adelevie, that will work, although the penultimate line should read

doc.set_ents([Span(doc, 0, 1, "FELINE"), Span(doc, 4, 5, "CANINE")])
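
(Just to round this off, as my own addition rather than part of the original reply: Doc.set_ents takes a single list of spans as its first argument, with the remaining parameters keyword-only, which is why the two spans need to go into one list. The end of the example would then look something like this.)

doc.set_ents([Span(doc, 0, 1, "FELINE"), Span(doc, 4, 5, "CANINE")])
manager.register_serialized_document(doc.to_bytes(), label="doc1")
# My addition: retrieve matches for the previously registered search phrase
matches = manager.match()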
adelevie commented 2 years ago

Thanks, and good catch on my set_ents typo.