microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Configure AnalyzerEngine from file #1338

Open omri374 opened 7 months ago

omri374 commented 7 months ago

Is your feature request related to a problem? Please describe.

In many use cases, especially with the Docker-based option, it is hard to configure the AnalyzerEngine for a specific scenario. For example, to have an API that supports multiple languages, the code in app.py has to be changed: https://github.com/microsoft/presidio/blob/4db5278bd1636416f0a450e2937236803d77e81c/presidio-analyzer/app.py#L40

A way to configure the initial parameters (languages, NLP engine, recognizers, default score, etc.) would allow code-free configuration in Docker-based use cases and a more configurable Python pipeline.

Describe the solution you'd like

Describe alternatives you've considered

An alternative would be to document how to change app.py, but the code would still have to be changed.

Additional context

Presidio already has several conf files, e.g.:

GautierT commented 7 months ago

Hey! Thanks for this PR.

Can I use it with another transformers model, like this one: https://huggingface.co/Jean-Baptiste/camembert-ner

I was thinking about using a YAML conf file like this:

nlp_engine_name: transformers
models:
  -
    lang_code: fr
    model_name:
      spacy: fr_core_news_lg
      transformers: Jean-Baptiste/camembert-ner

ner_model_configuration:
  labels_to_ignore:
  - O
  aggregation_strategy: simple # "simple", "first", "average", "max"
  stride: 16
  alignment_mode: strict # "strict", "contract", "expand"
  model_to_presidio_entity_mapping:
    PER: PERSON
    LOC: LOCATION
    ORG: ORGANIZATION
    AGE: AGE
    ID: ID
    EMAIL: EMAIL
    PATIENT: PERSON
    STAFF: PERSON
    HOSP: ORGANIZATION
    PATORG: ORGANIZATION
    DATE: DATE_TIME
    PHONE: PHONE_NUMBER
    HCW: PERSON
    HOSPITAL: ORGANIZATION

  low_confidence_score_multiplier: 0.4
  low_score_entity_names:
  - ID

Can this work? Without your PR, the fr language never seems to be available.

Thanks.

omri374 commented 7 months ago

Hi @GautierT, are you looking to run this through a REST API?

If not, you can configure your model using the standard NlpEngineProvider logic; for an example, see this documentation.

If yes, the only additional change needed is in app.py, to pass the NlpEngine into the AnalyzerEngine. Instead of this: https://github.com/microsoft/presidio/blob/5bc4b679608a8f65c10799f210bf2ee15c0434a7/presidio-analyzer/app.py#L40

Have this:


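import logging
import os
from logging.config import fileConfig
from pathlib import Path

from flask import Flask
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# LOGGING_CONF_FILE and WELCOME_MESSAGE are defined elsewhere in app.py.
# PATH_TO_CONF is a placeholder for the path to your NLP engine YAML file.


class Server:
    """HTTP Server for calling Presidio Analyzer."""

    def __init__(self):
        fileConfig(Path(Path(__file__).parent, LOGGING_CONF_FILE))
        self.logger = logging.getLogger("presidio-analyzer")
        self.logger.setLevel(os.environ.get("LOG_LEVEL", self.logger.level))
        self.app = Flask(__name__)
        self.logger.info("Starting analyzer engine")

        # Build the NLP engine from the conf file and hand it to the analyzer.
        provider = NlpEngineProvider(conf_file=PATH_TO_CONF)
        nlp_engine = provider.create_engine()
        self.engine = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["fr"])
        self.logger.info(WELCOME_MESSAGE)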
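
For the case without a REST API, a minimal sketch of the same wiring might look like the following. It assumes the configuration is passed as a dict through the `nlp_configuration` argument of NlpEngineProvider rather than as a conf file. `FR_CONF`, `languages_from_conf` and `build_analyzer` are hypothetical names used only for illustration, and the dict mirrors only part of the YAML from the earlier comment (`ner_model_configuration` is left out for brevity). Reading the language codes from the configuration keeps `supported_languages` from being hard-coded in a second place:

```python
from typing import Any, Dict, List

# Hypothetical configuration, mirroring part of the YAML discussed above.
FR_CONF: Dict[str, Any] = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "fr",
            "model_name": {
                "spacy": "fr_core_news_lg",
                "transformers": "Jean-Baptiste/camembert-ner",
            },
        }
    ],
}


def languages_from_conf(nlp_conf: Dict[str, Any]) -> List[str]:
    """Collect the lang_code of every model listed in the configuration."""
    return [model["lang_code"] for model in nlp_conf.get("models", [])]


def build_analyzer(nlp_conf: Dict[str, Any]):
    """Create an AnalyzerEngine whose languages match the NLP configuration."""
    # Imported lazily so languages_from_conf works without Presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=nlp_conf)
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=languages_from_conf(nlp_conf),
    )
```

With Presidio and both models installed, `build_analyzer(FR_CONF).analyze(text=..., language="fr")` should then accept French input. This is a sketch under the assumptions above, not the implementation proposed in this issue.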