Open Hveemos opened 10 months ago
+1 for this
Maybe @tomaarsen can point to some tips.
Thanks! Great suggestion. Something along those lines? https://github.com/tomaarsen/SpanMarkerNER?tab=readme-ov-file#using-pretrained-spanmarker-models-with-spacy
In the transformers case, we used spacy-huggingface-pipelines to integrate a huggingface/transformers model into a spaCy pipeline, because Presidio relies on other spaCy components (tokenization, lemmatization, etc.) in order to run. See more here: https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/#how-ner-results-flow-within-presidio
@Hveemos would you be interested in adding this capability?
Yes, something along those lines (https://github.com/tomaarsen/SpanMarkerNER?tab=readme-ov-file#using-pretrained-spanmarker-models-with-spacy). But I'm sorry to say that I don't have the time or competency to contribute to this project. I solved this with duct tape instead, i.e. running SpanMarker on each line and using regex to redact the detected spans (it takes forever, though).
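For anyone taking the same duct-tape route, here is a minimal sketch of redacting by character offsets instead of regex. It assumes each prediction is a dict with `label`, `char_start_index` and `char_end_index` keys, which is the shape SpanMarker's `predict` output is documented to have, and that spans within a line do not overlap. The `redact` helper and the placeholder format are hypothetical names for illustration.

```python
from typing import Dict, List


def redact(text: str, predictions: List[Dict], placeholder: str = "<{label}>") -> str:
    """Replace each predicted span with a placeholder, using character offsets.

    Assumes the spans do not overlap (an assumption of this sketch).
    """
    # Process spans from the end of the string so that earlier offsets
    # remain valid after each replacement changes the text length.
    ordered = sorted(predictions, key=lambda p: p["char_start_index"], reverse=True)
    for pred in ordered:
        start, end = pred["char_start_index"], pred["char_end_index"]
        text = text[:start] + placeholder.format(label=pred["label"]) + text[end:]
    return text


# Hand-written predictions in the assumed SpanMarker output shape.
example_predictions = [
    {"span": "Joakim", "label": "PER", "score": 0.98,
     "char_start_index": 8, "char_end_index": 14},
    {"span": "Oslo", "label": "LOC", "score": 0.95,
     "char_start_index": 20, "char_end_index": 24},
]
redacted = redact("Regards Joakim from Oslo", example_predictions)
```

On speed: as far as I know, `predict` also accepts a list of sentences and returns one list of predictions per sentence, so passing all lines in a single call and zipping the results with `redact` may be noticeably faster than one call per line.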
@Hveemos the easiest solution would be to create a recognizer class and glue the SpanMarker output to RecognizerResult objects. See something similar that we did for the flair package: https://github.com/microsoft/presidio/blob/ca7772ce4919af9b2e08063dc5f0759aa3205fb6/docs/samples/python/flair_recognizer.py#L131
UniversalNER is another interesting generative candidate: https://universal-ner.github.io/
I have found SpanMarker models such as tomaarsen/span-marker-mbert-base-multinerd to be very useful for NER. But Presidio does not seem to support the class 'span_marker.configuration.SpanMarkerConfig'.
Can I resolve this myself or might this be added as an additional feature?
Regards Joakim