microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Using Presidio with Huggingface support #1083

Open Matei9721 opened 1 year ago

Matei9721 commented 1 year ago

Hi, I am currently using Presidio with spaCy and Stanza by creating an nlp_engine with NlpEngineProvider and passing it the correct model in the config. I was planning to add support for HuggingFace transformer models, but I was a bit confused by the fact that there are two ways of doing this:

  1. Using a TransformersRecognizer
  2. Using a TransformersNlpEngine

As far as I understand, if you use the recognizer, it is applied on top of the usual (e.g. spaCy) NER pipeline, so you get results from both the spaCy and the HuggingFace model. Using the TransformersNlpEngine, on the other hand, substitutes the transformer for the spaCy NER component in the pipeline.

This example: https://microsoft.github.io/presidio/samples/python/transformers_recognizer/ shows how to use the TransformersRecognizer with a specific configuration, given as an example in configuration.py, where you can define the MODEL_TO_PRESIDIO_MAPPING. If you use the TransformersNlpEngine instead, how are you supposed to map model entity types to Presidio types, similar to what is done for the TransformersRecognizer?

Is my understanding above right, and if so, is there a way to create an AnalyzerEngine with a TransformersNlpEngine with the same configuration as a TransformersRecognizer?

Thanks for the help!
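The MODEL_TO_PRESIDIO_MAPPING mentioned above boils down to renaming raw model tags into Presidio entity names and dropping tags with no Presidio equivalent. A minimal sketch in plain Python (the helper name and tuple format are made up for illustration; the mapping values follow the sample configurations discussed in this thread):

```python
# Rename table from raw NER tags to Presidio entity names
# (values as in the sample configurations discussed in this thread).
MODEL_TO_PRESIDIO_MAPPING = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "DATE": "DATE_TIME",
    "MISC": "NRP",
}

def to_presidio_entities(raw_predictions, mapping=MODEL_TO_PRESIDIO_MAPPING):
    """Rename each raw tag; drop predictions with no Presidio equivalent."""
    return [
        (mapping[tag], start, end, score)
        for tag, start, end, score in raw_predictions
        if tag in mapping
    ]

raw = [("PER", 13, 26, 0.99), ("LOC", 41, 49, 0.97), ("O", 0, 2, 0.50)]
print(to_presidio_entities(raw))
# [('PERSON', 13, 26, 0.99), ('LOCATION', 41, 49, 0.97)]
```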

Matei9721 commented 1 year ago

Actually, after checking the source code further, it's not clear to me how one is supposed to use the TransformersNlpEngine. What is the TransformersComponent class used for in this case?

Using the TransformersRecognizer seems easier, as there are more code examples, but is it advised over the TransformersNlpEngine?

omri374 commented 1 year ago

Hi @Matei9721, thanks for your feedback! I can understand why this causes confusion. We initially wanted to support HuggingFace the same way we support Stanza, but ran into some issues. Going forward, the plan is to integrate the new spacy-huggingface-pipelines package for a more seamless experience.

The easiest path forward, IMHO, is to use the TransformersRecognizer in parallel to the default SpacyNlpEngine. In our demo website's code, you'll find a method that does this. It uses the small spaCy model to reduce overhead (while keeping capabilities such as lemmas) and removes the SpacyRecognizer to avoid getting results from both spaCy and the transformers model. I'll paste it here too:

```python
import spacy
from typing import Tuple

from presidio_analyzer import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, NlpEngineProvider


def create_nlp_engine_with_transformers(
    model_path: str,
) -> Tuple[NlpEngine, RecognizerRegistry]:
    """
    Instantiate an NlpEngine with a TransformersRecognizer and a small spaCy model.
    The TransformersRecognizer returns results from the transformers model; the spaCy
    model returns NlpArtifacts such as POS tags and lemmas.
    :param model_path: HuggingFace model path.
    """

    from transformers_rec import (
        STANFORD_COFIGURATION,  # (spelling as in the sample module)
        BERT_DEID_CONFIGURATION,
        TransformersRecognizer,
    )

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()

    if not spacy.util.is_package("en_core_web_sm"):
        spacy.cli.download("en_core_web_sm")

    # Using a small spaCy model + a HF NER model
    transformers_recognizer = TransformersRecognizer(model_path=model_path)

    if model_path == "StanfordAIMI/stanford-deidentifier-base":
        transformers_recognizer.load_transformer(**STANFORD_COFIGURATION)
    elif model_path == "obi/deid_roberta_i2b2":
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)
    else:
        print("Warning: model has no dedicated configuration, loading default.")
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)

    # Use the small spaCy model; no need for both a large spaCy and a HF model.
    # The transformers model is used here as a recognizer, not as an NlpEngine.
    nlp_configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }

    registry.add_recognizer(transformers_recognizer)
    registry.remove_recognizer("SpacyRecognizer")

    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

    return nlp_engine, registry
```

Hope this helps. We'll work on making this easier going forward.

Matei9721 commented 1 year ago

Thank you for your swift reply @omri374, that's exactly what I ended up following! I just wanted to make sure I was doing it in the "best" way possible and not re-inventing the wheel. :) Looking forward to the spacy-huggingface-pipelines addition, as it indeed seems to streamline the process.

I will close the issue as my questions were answered and it's clear how to approach the task now!

LSD-98 commented 1 year ago

Dear @omri374 & @Matei9721,

Sorry to re-open this issue. The answers are really helpful. After reviewing the demo website's code, I have the feeling that the TransformersRecognizer used there (from docs/samples/python/streamlit/transformers_rec/transformers_recognizer.py) is different from the one included in the package (in the predefined recognizers, presidio-analyzer/presidio_analyzer/predefined_recognizers/transformers_recognizer.py).

Am I wrong, or can I use the TransformersRecognizer from the package's predefined recognizers in a workflow very similar to the one presented in the demo website's code?

Thanks in advance!

omri374 commented 1 year ago

Hi @LSD-98, you are correct. There are essentially two flows here, and we're also about to improve the experience in the upcoming weeks, but in essence, the flows are:

  1. Use a NER model as part of the NlpEngine. This is how spaCy models are used by default: entities are extracted during the NlpEngine phase and passed to the recognizers, and the SpacyRecognizer collects them and returns a list of RecognizerResult. We extended this capability to support HuggingFace/transformers models as well, used as part of a spaCy pipeline (see #887). This is where the TransformersRecognizer in the package comes into the picture: all it does is collect the entities already extracted by the model during the NlpEngine phase.
  2. In parallel, it is always possible to create new recognizers that call any model. The TransformersRecognizer sample on the demo site and in docs/samples follows this approach: during the call to the .analyze method, it calls the model to get predictions. This allows the flexibility of calling five different models, or of having models serve different languages.

In essence:

Flow 1:

```mermaid
sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>NamedEntityRecognitionModel: call spaCy NER model
    NamedEntityRecognitionModel->>SpacyNlpEngine: return PII entities
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]<BR>based on entities
```

Flow 2:

```mermaid
sequenceDiagram
    Note over AnalyzerEngine: Call all recognizers, <br>including <br>MyNerModelRecognizer
    AnalyzerEngine->>MyNerModelRecognizer: call .analyze
    MyNerModelRecognizer->>transformers_model: Call transformers model
    transformers_model->>MyNerModelRecognizer: get NER/PII entities
    MyNerModelRecognizer->>AnalyzerEngine: Return List[RecognizerResult] <br>of PII entities
```

Where MyNerModelRecognizer is a wrapper over an NLP library, similar to the transformers example and flair example.
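The two flows in the diagrams can also be caricatured in plain Python. This is a toy sketch only; the class names below are made up and none of this is Presidio's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class RecognizerResult:
    entity_type: str
    start: int
    end: int

class ToyNlpEngine:
    """Flow 1: NER runs up front, during the shared text-processing pass."""
    def process_text(self, text):
        # Stand-in for a spaCy/transformers NER pass; returns "NlpArtifacts".
        return [RecognizerResult("PERSON", 0, 4)]

class CollectingRecognizer:
    """Flow 1: no model call here, just re-package precomputed entities."""
    def analyze(self, text, nlp_artifacts):
        return list(nlp_artifacts)

class ModelOwningRecognizer:
    """Flow 2: the recognizer owns its model and calls it inside .analyze."""
    def __init__(self, model):
        self.model = model
    def analyze(self, text, nlp_artifacts=None):
        return self.model(text)

text = "John went home"
artifacts = ToyNlpEngine().process_text(text)
flow1 = CollectingRecognizer().analyze(text, artifacts)
flow2 = ModelOwningRecognizer(lambda t: [RecognizerResult("PERSON", 0, 4)]).analyze(text)
print(flow1 == flow2)  # → True: same results, different division of labor
```

Flow 1 keeps a single NLP pass shared by all recognizers; flow 2 trades that for per-recognizer flexibility (multiple models, multiple languages).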

omri374 commented 1 year ago

Reopening to improve logic and docs. Will be fixed in #1159

farnazgh commented 1 year ago

```python
nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
```

@omri374 This seems to work for the current model and for English. However, when I want to use French HF models, I get the following error:

ValueError: No matching recognizers were found to serve the request.

These are the changes I made:

```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from transformers_recognizer import TransformersRecognizer
import spacy

FR_MODEL_CONF = {
    "PRESIDIO_SUPPORTED_ENTITIES": ["LOCATION", "PERSON", "ORGANIZATION", "DATE_TIME", "NRP"],
    "DEFAULT_MODEL_PATH": "Jean-Baptiste/camembert-ner-with-dates",
    "DATASET_TO_PRESIDIO_MAPPING": {"DATE": "DATE_TIME", "MISC": "NRP", "PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"},
    "MODEL_TO_PRESIDIO_MAPPING": {"DATE": "DATE_TIME", "MISC": "NRP", "PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION"},
    "CHUNK_OVERLAP_SIZE": 40,
    "CHUNK_SIZE": 600,
    "ID_SCORE_MULTIPLIER": 0.4,
    "ID_ENTITY_NAME": "ID",
}

registry = RecognizerRegistry()
registry.load_predefined_recognizers()

if not spacy.util.is_package("fr_core_news_sm"):
    spacy.cli.download("fr_core_news_sm")

supported_entities = FR_MODEL_CONF.get("PRESIDIO_SUPPORTED_ENTITIES")

model = "Jean-Baptiste/camembert-ner-with-dates"
transformers_recognizer = TransformersRecognizer(model_path=model, supported_entities=supported_entities)
transformers_recognizer.load_transformer(**FR_MODEL_CONF)

registry.add_recognizer(transformers_recognizer)
registry.remove_recognizer("SpacyRecognizer")

nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "fr", "model_name": "fr_core_news_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)

results = analyzer.analyze(
    text="Je m'appelle jean-baptiste et j'habite à montréal depuis fevr 2012",
    language="fr",
    entities=["LOCATION", "PERSON", "ORGANIZATION", "DATE_TIME", "NRP"],
    return_decision_process=True,
)
for result in results:
    print(result)
    print(result.analysis_explanation)
```

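An aside on the configuration above: the CHUNK_SIZE and CHUNK_OVERLAP_SIZE entries suggest the sample recognizer feeds long texts to the transformer in overlapping windows, so an entity spanning a window boundary still appears whole in at least one window. A stdlib sketch of that idea (the function name is made up; this is not the sample's actual chunking code):

```python
def chunk_text(text, chunk_size=600, overlap=40):
    """Split text into overlapping (start_offset, window) pairs so entities
    crossing a boundary still appear whole in at least one window."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 1500, chunk_size=600, overlap=40)
print([(s, len(c)) for s, c in chunks])
# [(0, 600), (560, 600), (1120, 380)]
```

The start offsets let per-window predictions be shifted back into positions in the original text before mapping them to Presidio results.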
LSD-98 commented 1 year ago

Many thanks @omri374 for the reply, very clear.


I tried the same thing last week and had the exact same issue. I did not manage to solve it and moved on to another project. I assume there will be an easier way to use HF models once #1159 is merged!

omri374 commented 1 year ago

Make sure you pass the language argument to the TransformersRecognizer, so that its supported_language matches the language of the analyze request.
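For context on the error above: the registry only serves recognizers whose supported language matches the language of the request, so a recognizer left at its default of "en" is invisible to a language="fr" call. A toy sketch of that filtering (class and function names are made up, not Presidio's implementation):

```python
class ToyRecognizer:
    def __init__(self, name, supported_language="en"):
        self.name = name
        self.supported_language = supported_language

def get_recognizers(registry, language):
    """Return recognizers matching the requested language, as a registry would."""
    matches = [r for r in registry if r.supported_language == language]
    if not matches:
        raise ValueError("No matching recognizers were found to serve the request.")
    return matches

# Recognizer left at its default "en": an "fr" request finds nothing.
try:
    get_recognizers([ToyRecognizer("TransformersRecognizer")], "fr")
except ValueError as err:
    print(err)  # No matching recognizers were found to serve the request.

# Declaring the recognizer's language as "fr" makes the lookup succeed.
registry = [ToyRecognizer("TransformersRecognizer", supported_language="fr")]
print([r.name for r in get_recognizers(registry, "fr")])
# ['TransformersRecognizer']
```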