Closed NixBiks closed 2 years ago
I would also like to be able to do this. I am wondering whether, rather than passing a string representing a model name, one could pass an nlp instance (or factory) directly.
You can make your own custom models accessible by string name using entry points. However, I'm pretty sure that Holmes requires a specifically trained model to work.
This is possible and doesn't require any extra training. However, the best way to approach it is to load the language model (pipeline) into Holmes and add the SpanRuler / EntityRuler to it there. This ensures that the base model is one of the standard spaCy models, which is important because Holmes uses Coreferee for coreference resolution, and Coreferee is trained on the standard spaCy models. If you load a model that is not one of the standard spaCy models, one of two things will happen:
In the next version of Holmes, I think I shall add an option to skip loading Coreferee, so that people can run Holmes with their own models and without coreference resolution. In the meantime, you can add a SpanRuler or EntityRuler to one of the standard models like this:
import holmes_extractor as holmes
manager = holmes.Manager("en_core_web_trf")
config = {"spans_key": None, "annotate_ents": True, "overwrite": False}
ruler = manager.nlp.add_pipe("span_ruler", config=config)
patterns = [{"label": "FOOD", "pattern": "ice cream"}]
ruler.add_patterns(patterns)
manager.parse_and_register_document("We ate some ice cream")
manager.register_search_phrase("Somebody eats ENTITYFOOD")
manager.match()
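To see what the ruler contributes on its own, independently of Holmes, the same configuration can be tried on a blank spaCy pipeline. This is only a sketch: it assumes a spaCy release recent enough to ship the span_ruler component, and the blank English pipeline is a stand-in for manager.nlp, so no trained model or Coreferee is involved.

```python
import spacy

# Assumption: a blank English pipeline stands in for manager.nlp, which is
# enough to observe the ruler's effect on doc.ents without a trained model.
nlp = spacy.blank("en")

# Same configuration as in the comment above: write matches to doc.ents
# (not to doc.spans) and do not overwrite entities that are already set.
config = {"spans_key": None, "annotate_ents": True, "overwrite": False}
ruler = nlp.add_pipe("span_ruler", config=config)
ruler.add_patterns([{"label": "FOOD", "pattern": "ice cream"}])

doc = nlp("We ate some ice cream")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

The FOOD label produced here is what the ENTITYFOOD token in the search phrase refers to.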
Thanks for the detailed write-up, @richardpaulhudson.
I am wondering what the best hook/entry-point for modifying a doc instance might be. For example, https://spacy.io/api/doc#set_ents allows one to manually set named entities (using word token or character offsets). In my particular use-case, I have a knowledge base of such offsets which thankfully means I can use these entities with virtually any model and no training. I am thinking something along the lines of:
from spacy.tokens import Span
import holmes_extractor as holmes
manager = holmes.Manager("en_core_web_trf")
manager.register_search_phrase('ENTITYFELINE are better than ENTITYCANINE')
doc = manager.nlp("Cats are better than dogs.")
doc.set_ents([Span(doc, 0, 1, "FELINE")], [Span(doc, 4, 5, "CANINE")])
manager.register_serialized_document(doc.to_bytes(), label="doc1")
@adelevie, that will work, although the penultimate line should read
doc.set_ents([Span(doc, 0, 1, "FELINE"), Span(doc, 4, 5, "CANINE")])
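The corrected call can be exercised without Holmes. The sketch below uses a blank English pipeline in place of manager.nlp (an assumption, made so that no model download is needed). It also shows Doc.char_span for the character-offset case; the character offsets are worked out by hand for this one sentence and are purely illustrative.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline stands in for manager.nlp; only tokenization
# is needed to set entities manually.
nlp = spacy.blank("en")
text = "Cats are better than dogs."

# Token offsets: all entities are passed together in a single list.
doc = nlp(text)
doc.set_ents([Span(doc, 0, 1, "FELINE"), Span(doc, 4, 5, "CANINE")])
by_token = [(ent.text, ent.label_) for ent in doc.ents]

# Character offsets (hypothetical knowledge-base entries for this sentence).
# char_span returns None when offsets do not align with token boundaries,
# so such entries are skipped here.
char_offsets = [(0, 4, "FELINE"), (21, 25, "CANINE")]
doc2 = nlp(text)
spans = [doc2.char_span(start, end, label=label) for start, end, label in char_offsets]
doc2.set_ents([span for span in spans if span is not None])
by_char = [(ent.text, ent.label_) for ent in doc2.ents]

print(by_token, by_char)
```

Both routes should leave the same two entities on the document, which can then be serialized with to_bytes() as in the snippet above.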
Thanks, and good catch on my set_ents typo.
Hi @richardpaulhudson
Congratulations on your work on Holmes. It's a very cool approach.
I'm wondering if it's possible to provide custom named entities - e.g. using the SpanRuler (or EntityRuler) component from spaCy. I'm thinking the answer is no at this point but what would it require - I imagine it requires some training?