microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.8k stars, 571 forks

Support for SpanMarker #1237

Open Hveemos opened 10 months ago

Hveemos commented 10 months ago

I have found SpanMarker models such as tomaarsen/span-marker-mbert-base-multinerd to be very useful for NER. However, Presidio does not seem to support the class 'span_marker.configuration.SpanMarkerConfig'.

Can I resolve this myself or might this be added as an additional feature?

Regards, Joakim

ogencoglu commented 10 months ago

+1 for this

ogencoglu commented 10 months ago

Maybe @tomaarsen could share some tips.

omri374 commented 10 months ago

Thanks! Great suggestion. Something along those lines? https://github.com/tomaarsen/SpanMarkerNER?tab=readme-ov-file#using-pretrained-spanmarker-models-with-spacy

In the transformers case, we used spacy-huggingface-pipelines to integrate a huggingface/transformers model into a spaCy pipeline, because Presidio requires the other spaCy modules in order to run (tokenization, lemmatization, etc.). See more here: https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/#how-ner-results-flow-within-presidio
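A minimal sketch of what that could look like, following the SpanMarker README linked above: the spaCy model supplies tokenization and lemmatization, and its built-in NER is swapped for the `span_marker` component. The model names and the helper `build_span_marker_nlp` are assumptions for illustration, not an official Presidio NLP engine; plugging the resulting pipeline into Presidio would still need a custom NLP engine or recognizer.

```python
# Pipeline settings; both model names are illustrative assumptions.
SPACY_MODEL = "en_core_web_sm"
SPAN_MARKER_CONFIG = {"model": "tomaarsen/span-marker-mbert-base-multinerd"}


def build_span_marker_nlp(spacy_model=SPACY_MODEL, config=SPAN_MARKER_CONFIG):
    """Return a spaCy pipeline whose NER step is a SpanMarker model.

    Hypothetical helper: requires `spacy`, `span_marker` and the spaCy
    model to be installed. The import is deferred until the call.
    """
    # The "span_marker" factory is expected to be registered by the
    # installed span_marker package (per its README).
    import spacy

    # Keep the tokenizer, tagger, lemmatizer etc., but drop spaCy's own NER.
    nlp = spacy.load(spacy_model, exclude=["ner"])
    nlp.add_pipe("span_marker", config=config)
    return nlp


# Usage (downloads the SpanMarker model on first run):
# nlp = build_span_marker_nlp()
# doc = nlp("Amelia Earhart flew across the Atlantic to Paris.")
# print([(ent.text, ent.label_) for ent in doc.ents])
```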

omri374 commented 10 months ago

@Hveemos would you be interested in adding this capability?

Hveemos commented 10 months ago

Yes, something along those lines (https://github.com/tomaarsen/SpanMarkerNER?tab=readme-ov-file#using-pretrained-spanmarker-models-with-spacy). But I'm sorry to say that I don't have the time or competency to contribute to this project. I solved this with duct tape instead, i.e. running SpanMarker on each line and using regex to redact (it takes forever, though).
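A rough sketch of that kind of workaround, with one change: it redacts by the character offsets in the predictions rather than by regex. It assumes each prediction is a dict with `label`, `char_start_index` and `char_end_index` keys (the shape SpanMarker's `predict` is understood to return) and that spans do not overlap. The `redact` helper and the sample data are made up for illustration.

```python
def redact(text, predictions, placeholder="<{label}>"):
    """Replace each predicted span in `text` with a placeholder.

    `predictions` is assumed to be a list of dicts carrying
    "label", "char_start_index" and "char_end_index" keys.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    ordered = sorted(predictions, key=lambda p: p["char_start_index"], reverse=True)
    for pred in ordered:
        start, end = pred["char_start_index"], pred["char_end_index"]
        text = text[:start] + placeholder.format(label=pred["label"]) + text[end:]
    return text


# Hand-written predictions standing in for real model output.
sample = "Amelia lives in Paris."
fake_predictions = [
    {"span": "Amelia", "label": "PER", "score": 0.99,
     "char_start_index": 0, "char_end_index": 6},
    {"span": "Paris", "label": "LOC", "score": 0.98,
     "char_start_index": 16, "char_end_index": 21},
]
print(redact(sample, fake_predictions))  # <PER> lives in <LOC>.

# With a real model (assumption: predict accepts a list of strings and a
# batch_size argument, which should be faster than one call per line):
# from span_marker import SpanMarkerModel
# model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd")
# for line, preds in zip(lines, model.predict(lines, batch_size=16)):
#     print(redact(line, preds))
```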

omri374 commented 10 months ago

@Hveemos the easiest solution would be to create a recognizer class and glue the SpanMarker output to a RecognizerResult object. See something similar that we did for the flair package here: https://github.com/microsoft/presidio/blob/ca7772ce4919af9b2e08063dc5f0759aa3205fb6/docs/samples/python/flair_recognizer.py#L131
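A minimal sketch of that idea, not a tested implementation. The assumptions: SpanMarker predictions are dicts with `label`, `score`, `char_start_index` and `char_end_index` keys, and the label mapping below (MultiNERD labels to Presidio entity names) is only a guess at a sensible default. `to_presidio_spans` and `build_span_marker_recognizer` are hypothetical names.

```python
# Assumed mapping from MultiNERD labels to Presidio entity names.
LABEL_MAP = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def to_presidio_spans(predictions, entities=None, label_map=LABEL_MAP):
    """Convert SpanMarker-style dicts to (entity, start, end, score) tuples."""
    spans = []
    for pred in predictions:
        entity = label_map.get(pred["label"])
        # Skip unmapped labels and entities the caller did not ask for.
        if entity is None or (entities and entity not in entities):
            continue
        spans.append((entity, pred["char_start_index"],
                      pred["char_end_index"], float(pred["score"])))
    return spans


def build_span_marker_recognizer(model, language="en"):
    """Hypothetical factory returning a Presidio recognizer around `model`.

    `model` is anything with a SpanMarker-like `predict(text)` method.
    presidio_analyzer is imported lazily, so the helper above works without it.
    """
    from presidio_analyzer import EntityRecognizer, RecognizerResult

    class SpanMarkerRecognizer(EntityRecognizer):
        def load(self):
            pass  # the model is passed in already loaded

        def analyze(self, text, entities, nlp_artifacts=None):
            return [
                RecognizerResult(entity_type=e, start=s, end=t, score=score)
                for e, s, t, score in to_presidio_spans(model.predict(text), entities)
            ]

    return SpanMarkerRecognizer(
        supported_entities=sorted(set(LABEL_MAP.values())),
        supported_language=language,
        name="SpanMarkerRecognizer",
    )


# Usage (requires presidio_analyzer and span_marker):
# from presidio_analyzer import AnalyzerEngine
# from span_marker import SpanMarkerModel
# model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd")
# analyzer = AnalyzerEngine()
# analyzer.registry.add_recognizer(build_span_marker_recognizer(model))
# print(analyzer.analyze(text="Amelia lives in Paris.", language="en"))
```

Passing the loaded model in keeps the recognizer cheap to construct and makes it easy to swap in a stub when testing the mapping logic.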

ogencoglu commented 10 months ago

UniversalNER is another interesting generative candidate: https://universal-ner.github.io/