microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Pass in custom trained spacy model #851

Closed: vajjasaikiran closed this issue 2 years ago

vajjasaikiran commented 2 years ago

We have trained a custom spaCy model with entities that spaCy's default models do not currently support. We plan to use that model as the default spaCy NLP engine.

I tried the code mentioned in #822, but I am not getting the required entities.

I tried the below code.

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy

# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):

    def __init__(self, loaded_spacy_model):
        self.nlp = {"en": loaded_spacy_model}

# Load a model a-priori
nlp = spacy.load("/path/to/custom_model")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine=loaded_nlp_engine)

# Analyze text
analyzer.analyze(text="My name is Bob. I work for Google as an ML engineer.", language="en")

Expected entities: [PERSON, ORG, CUSTOM] Predicted entities: [PERSON, ORG]

Can somebody explain whether there is a hack or workaround to achieve this?

omri374 commented 2 years ago

Hi @vajjasaikiran, Presidio uses spaCy in two ways: first as a general NLP engine, and second to extract named entities (NER). For the latter, there's a recognizer called SpacyRecognizer.

It has the en_core_web_lg entity types by default, but it's possible to pass others. For example:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

# Define the new entities supported by the custom model
spacy_entities = ["PERS", "LOC", "ORG", "TIME", "DATE", "MONEY", "PERCENT", "MISC__AFF", "MISC__ENT"]

# Translate the model's entity types to Presidio's (if needed; in this example we map them 1:1)
spacy_label_groups = [({ent}, {ent}) for ent in spacy_entities]

spacy_recognizer = SpacyRecognizer(supported_language="en", 
                                   supported_entities=spacy_entities, 
                                   check_label_groups=spacy_label_groups)

# Create Presidio Analyzer Engine
analyzer = AnalyzerEngine()

# List existing (predefined) recognizers
print([rec.name for rec in analyzer.registry.recognizers])

# Remove the previous SpacyRecognizer
analyzer.registry.recognizers = [rec for rec in analyzer.registry.recognizers if rec.name != "SpacyRecognizer"]

# Add the new custom SpacyRecognizer
analyzer.registry.add_recognizer(spacy_recognizer)

# Run Analyzer Engine
res = analyzer.analyze(text="text with custom entities", language="en")

Hope this helps!

vajjasaikiran commented 2 years ago

Hi @omri374 . Thank you for the response.

I can see that we are adding new custom entities and label groups to the Analyzer Engine, but where are we passing the custom model weight file into the engine? Could you please modify the above code or explain to me how the custom model weight file gets used in the Analyzer Engine pipeline?

omri374 commented 2 years ago

Hi @vajjasaikiran,

This is a good point. Here is the flow at a high level:

  1. The NlpEngine creates an object called NlpArtifacts, which contains the output of the spaCy pipeline (tokens, entities etc.)
  2. The SpacyRecognizer object simply extracts the requested entities out of those NlpArtifacts.

So the actual model weights are being used when the NlpEngine runs the input text through the model. Then, the outputs are propagated to all other recognizers, including the SpacyRecognizer.

This example shows this in more detail. It takes tokens out of the NlpArtifacts to extract token attributes, but similar logic is used to extract entities in the SpacyRecognizer class:

https://github.com/microsoft/presidio/blob/e52cf5f0b4ddfb298b408107ffcca01731fa7a3c/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py#L95
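
To make the flow a bit more concrete, here's a minimal, illustrative sketch (not taken from Presidio's source; names are made up) of a recognizer that only consumes the NlpArtifacts and never loads a model itself, so the model weights are only used by the NlpEngine:

from presidio_analyzer import EntityRecognizer, RecognizerResult

class NlpArtifactsOrgRecognizer(EntityRecognizer):
    """Illustrative recognizer: reads entities from NlpArtifacts instead of loading a model."""

    def __init__(self):
        super().__init__(supported_entities=["ORG"], supported_language="en")

    def load(self):
        # Nothing to load; the model weights live in the NlpEngine
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        if not nlp_artifacts:
            return results
        # nlp_artifacts.entities holds the spaCy spans produced by the NlpEngine's model
        for ent in nlp_artifacts.entities:
            if ent.label_ == "ORG" and "ORG" in entities:
                results.append(
                    RecognizerResult(
                        entity_type="ORG",
                        start=ent.start_char,
                        end=ent.end_char,
                        score=0.85,
                    )
                )
        return results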

vajjasaikiran commented 2 years ago

Hi @omri374 ,

I understood the flow, but one thing I am still not clear on is how I can pass my new model weights to the NlpEngine. Presidio by default has the en_core_web_lg model loaded during initialisation. I want to pass my new custom-trained model to the engine.

You can consider this as a double-NER kind of pipeline: predict (spaCy entities) + predict (custom entities), then combine them and return the result.

My custom model can predict 5 entities which the default spaCy model does not. I want the NlpEngine to predict (spaCy entities + custom entities). Could you please share sample code where I can pass my custom model object so that it simply adds to the default Presidio pipeline and we are done? If there is no such way to do it right now, can you help me with a hack around the available classes to get the work done?

omri374 commented 2 years ago

Hi @vajjasaikiran, so if I understand correctly, the goal is to have both en_core_web_lg and an additional custom model.

In this case I would suggest creating a new recognizer which loads the custom model. Here's an example implementation. It uses the same logic as the SpacyRecognizer, just with a model loaded inside the recognizer instead of relying on what's received in the NlpArtifacts:

from typing import Tuple, Set

from presidio_analyzer import AnalyzerEngine, RecognizerResult
from presidio_analyzer.predefined_recognizers import SpacyRecognizer

import spacy

class CustomSpacyRecognizer(SpacyRecognizer):

    def __init__(self, path_to_model: str):
        """
        SpacyRecognizer with a new/custom model, 
        to run in parallel with the model in NlpEngine.
        :param path_to_model: Path to the custom model's location
        """

        self.path_to_model = path_to_model
        self.model = None # Model will be loaded on .load()

        entities = ["ORG"] # TODO change to the custom model's entities
        spacy_label_groups = [({ent}, {ent}) for ent in entities]

        super().__init__(
                supported_language='en',
                supported_entities=entities,
                ner_strength=0.85,
                check_label_groups=spacy_label_groups
        )

    def load(self):
        self.model = spacy.load(self.path_to_model)

    def analyze(self, text, entities, nlp_artifacts=None):
        """
        Analyze using a spaCy model. Similar to SpacyRecognizer.analyze, 
        except it has an actual call to a spaCy model loaded as part of this recognizer.
        """
        results = []

        doc = self.model(text)

        ner_entities = doc.ents

        for entity in entities:
            if entity not in self.supported_entities:
                continue
            for ent in ner_entities:
                if not self.__check_label(entity, ent.label_, self.check_label_groups):
                    continue
                textual_explanation = f"Identified as {ent.label_} by the spaCy model: {self.path_to_model}"
                explanation = self.build_spacy_explanation(
                    self.ner_strength, textual_explanation
                )
                spacy_result = RecognizerResult(
                    entity_type=entity,
                    start=ent.start_char,
                    end=ent.end_char,
                    score=self.ner_strength,
                    analysis_explanation=explanation,
                    recognition_metadata={
                        RecognizerResult.RECOGNIZER_NAME_KEY: self.name
                    },
                )
                results.append(spacy_result)

        return results

    @staticmethod
    def __check_label(
        entity: str, label: str, check_label_groups: Tuple[Set, Set]
    ) -> bool:
        return any(
            [entity in egrp and label in lgrp for egrp, lgrp in check_label_groups]
        )

Adding the new recognizer (in this example it only detects the ORG entity):

custom_spacy = CustomSpacyRecognizer(path_to_model="en_core_web_sm")

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(custom_spacy)

results = analyzer.analyze(text="David Smith works at IBM", language="en", return_decision_process=True)

Results (with the decision process, so you can see which recognizer detected each entity: the built-in SpacyRecognizer using the default model, and the CustomSpacyRecognizer using the custom model, in this case en_core_web_sm, but it could be any other model):

[res.__dict__ for res in results]
[{'entity_type': 'PERSON',
  'start': 0,
  'end': 11,
  'score': 0.85,
  'analysis_explanation': {'recognizer': 'SpacyRecognizer', 'pattern_name': None, 'pattern': None, 'original_score': 0.85, 'score': 0.85, 'textual_explanation': "Identified as PERSON by Spacy's Named Entity Recognition", 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None},
  'recognition_metadata': {'recognizer_name': 'SpacyRecognizer'}},
 {'entity_type': 'ORG',
  'start': 21,
  'end': 24,
  'score': 0.85,
  'analysis_explanation': {'recognizer': 'CustomSpacyRecognizer', 'pattern_name': None, 'pattern': None, 'original_score': 0.85, 'score': 0.85, 'textual_explanation': 'Identified as ORG by the spaCy model: en_core_web_sm', 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None},
  'recognition_metadata': {'recognizer_name': 'CustomSpacyRecognizer'}}]

vajjasaikiran commented 2 years ago

Hi @omri374

Thank you so much for your quick responses. I tried this and it is working as expected.

efka84 commented 9 months ago

@omri374 @vajjasaikiran

I have installed my custom-trained NER model as a Python package. How can I use it with the final provided pieces of code (the accepted solution)?

omri374 commented 9 months ago

@efka84 is it a spaCy model? If yes, you can pass a model loaded by spaCy into Presidio. See spaCy's docs on saving and loading models: https://spacy.io/usage/saving-loading
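
For reference, pip-installed spaCy models can be loaded by their package name with spacy.load, so a hedged sketch that reuses the CustomSpacyRecognizer class from the accepted answer above could look like this (the package name my_custom_ner is a placeholder for your installed model package):

from presidio_analyzer import AnalyzerEngine

# "my_custom_ner" is a hypothetical package name; spacy.load() inside
# CustomSpacyRecognizer.load() accepts an installed package name as well as a path.
custom_spacy = CustomSpacyRecognizer(path_to_model="my_custom_ner")

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(custom_spacy)

results = analyzer.analyze(text="Some text with custom entities", language="en")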

YVMVN commented 3 weeks ago

Adding the new recognizer (in this example only detects the ORG entity):

custom_spacy = CustomSpacyRecognizer(path_to_model="en_core_web_sm")

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(custom_spacy)

Thanks for this explanation. I think I am trying to achieve a similar result. I have a question though: can I achieve the same results by using the add_nlp_recognizer method in the RecognizerRegistry class and passing in the loaded_nlp_engine, which we initialise like this:

# Load the model from the local path
nlp = spacy.load("./gliner_model")

class LoadedSpacyNlpEngine(SpacyNlpEngine):
    def __init__(self, loaded_spacy_model):
        super().__init__()
        self.nlp = {"en": loaded_spacy_model}

loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

Will the results be similar to what you have explained in the above example? Or am I missing something? I apologize, I am still trying to figure out the logic and the relationship between spaCy, the gliner_model, and the registry where we can pass our custom recognizers. I would appreciate any help.

omri374 commented 2 weeks ago

Hi @YVMVN. If your objective is to have a loaded spaCy model, then you'd have to use the NLP engine and not just the recognizer registry. Every NER model can be used in Presidio in two ways:

  1. As part of an NlpEngine, which flows through the entire system and provides NER, tokens, lemmas etc. to other modules in Presidio. Generally speaking, you can only have one NER model in an NlpEngine (see the sketch at the end of this comment).
  2. As a separate EntityRecognizer. This approach allows you to have multiple models in parallel. The CustomSpacyRecognizer above is an example of a recognizer which wraps a spaCy model and runs in parallel to the model running as part of the NlpEngine.

See this doc for more details.
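
For illustration, a minimal sketch of option 1, loading a custom spaCy model into the NlpEngine through NlpEngineProvider (the model name my_custom_ner is a placeholder for an installed spaCy pipeline or a path to one):

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# "my_custom_ner" is a placeholder; use the name of an installed spaCy pipeline or a path
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "my_custom_ner"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])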