Closed lsmith77 closed 2 years ago
Hi @lsmith77, Good point! We don't have an official solution for this, but here's an idea which might work:
import spacy
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine


# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):
    def __init__(self, loaded_spacy_model):
        # Store the already-loaded model instead of loading one by name
        self.nlp = {"en": loaded_spacy_model}


# Load a model a-priori
nlp = spacy.load("en_core_web_sm")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine=loaded_nlp_engine)

# Analyze text
analyzer.analyze(text="My name is Bob", language="en")
This hack might work, but your suggestion is great for future improvement. If you'd like to create a PR, I'd be happy to help review it.
thank you .. will try it out and might create a PR.
I can confirm that this works.
I guess it might make more sense to just document this "hack", rather than add this class, wdyt?
Hi @lsmith77, sorry for the delayed response. Yes this could either be enhanced by documentation or extended functionality. Any contribution would be greatly appreciated!
I went the documentation route https://github.com/microsoft/presidio/pull/854
@lsmith77 sorry for necroposting. Do you have your presidio-powered scrubber for Sentry open-sourced? We're solving the same problem right now, and I wonder if we could reuse some of your work.
Hi @orsinium, what kind of integration are you looking for? For Presidio to run as part of Sentry and detect PII?
We ended up not using Presidio, as it seemed like it wasn't able to cover non-Western names. So we built a very simple scrubber in Python ourselves that just handles URLs, emails and numbers.
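For readers wondering what such a "very simple scrubber" could look like, here is a minimal sketch. It is not the code from this thread: the function name `scrub_text`, the placeholder strings and the regular expressions are all assumptions made for illustration, and the patterns are deliberately loose rather than standards-compliant.

```python
import re

# Hypothetical patterns and placeholders, chosen for illustration only.
# Order matters: URLs and emails are replaced before bare numbers so that
# digits inside them are not rewritten separately.
_PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\d+"), "[NUMBER]"),
]


def scrub_text(text: str) -> str:
    """Replace URLs, email addresses and digit runs with placeholders."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Note that the URL pattern swallows everything up to the next whitespace, including trailing punctuation, which is usually acceptable for scrubbing but would need tightening if the surrounding text has to stay intact.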
Thanks for the feedback @lsmith77. Have you looked into other NER models? or worked with the default spaCy one?
Yes, we are using spaCy, and we also use the NER of their LG models for English/German. It works ok-ish for detecting names based on sentence structure.
We also tried some NER models on Hugging Face and found them to be more accurate, but in the end we decided to stick with spaCy because we were already using it for other purposes, so it was "cheaper" to accept the spaCy limitations.
> what kind of integration are you looking for? For Presidio to run as part of Sentry and detect PII?
Sentry has a "scrubber", a middleware that removes sensitive data from all events before sending them to the server. The default one is very basic: it only removes top-level values (and doesn't check nested values) based on a pre-defined deny list of words like "password". We looked into our Sentry and found a lot of PII. Luckily, you can provide your own scrubber:
https://docs.sentry.io/platforms/python/data-management/sensitive-data/
I started to look for a solution and I'm considering making a custom scrubber based on Presidio. A search for "Sentry" in the issues led me here.
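To make the nested-values problem concrete, here is a rough sketch of a scrubber that walks the whole event instead of only the top level. The deny list, the mask string and the helper name `scrub` are assumptions for illustration; `before_send(event, hint)` is the hook signature the Sentry Python SDK calls before sending an event.

```python
from typing import Any

# Hypothetical deny list and mask; extend them to match your own data.
DENY_KEYS = {"password", "secret", "token", "email"}
MASK = "[Filtered]"


def scrub(value: Any) -> Any:
    """Return a copy of value with deny-listed keys masked at any depth."""
    if isinstance(value, dict):
        return {
            key: MASK if str(key).lower() in DENY_KEYS else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [scrub(item) for item in value]
    return value


def before_send(event: dict, hint: dict) -> dict:
    # Sentry sends whatever this hook returns, so return the scrubbed copy.
    return scrub(event)


# Wiring it up (requires the sentry-sdk package; the DSN is a placeholder):
# import sentry_sdk
# sentry_sdk.init(dsn="...", before_send=before_send)
```

This only masks by key name. Catching PII inside free-text values would still need something like regexes or an NER-based analyzer applied to each string.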
> so we built a very simple scrubber in Python ourselves that just handles URLs, emails and numbers.
Got it, thank you. I might also end up not overthinking it and just using a big hardcoded deny list of names and a bunch of regexes :)
We have built a custom NLP API using spaCy. We plan to use Presidio to remove PII from data we send to Sentry. Since our API is already using spaCy, we would like to reuse the same models and load them only once.
In this spirit, we are wondering if you would be open to supporting passing in a spaCy model instance via configuration, rather than only allowing a model name that is then loaded via spacy.load() in the Presidio code: https://github.com/microsoft/presidio/blob/de50157ade5e64595bda2f9cc2a92ab85f50898c/presidio-analyzer/presidio_analyzer/nlp_engine/spacy_nlp_engine.py#L37
I have to admit that I have not done any benchmarking on this yet, but I assume this should shave off the time spent loading the models and also reduce the memory footprint.