Closed: Gommorach closed this issue 1 year ago
Hi, latency depends heavily on the types of recognizers and NLP models you apply, and there's a latency-accuracy tradeoff.
The fastest setup I can think of is to use the small spaCy model (`en_core_web_sm`) as the NER model and to remove the recognizers you don't need (the `PhoneRecognizer` is the slowest one, I believe). If you expect phone numbers from only certain countries, you can also configure the `PhoneRecognizer` to look only for patterns belonging to those countries.
From there you can trade latency for accuracy by going with heavier spaCy models (`en_core_web_lg`), all the way up to transformer-based models (`en_core_web_trf`, Hugging Face models, or Flair models).
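Swapping models is a configuration change. A sketch of the YAML form that `NlpEngineProvider(conf_file=...)` accepts, pointing at the transformer model instead (the file path is an assumption; `en_core_web_trf` must be downloaded separately):

```yaml
# conf/nlp.yaml -- heavier, more accurate spaCy pipeline
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_trf
```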
Our use case for Presidio is to detect whether any PII is present in the analyzed text, without needing to know which entity it is. We use the built-in entities plus custom matchers. We're hitting latency problems because we rely on live feedback, yet the CPU and memory allocated to Presidio are not being maxed out.
We suspect that the underlying spaCy pipeline is too heavy for us, since we don't rely on the `ner` step in our output. Does this analysis make sense? If so, would it be possible to make that pipeline configurable?