microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Ability to combine spacy and transformers configs into a single one #1238

Closed ogencoglu closed 10 months ago

ogencoglu commented 10 months ago

Is it possible to combine spacy and transformers configs into a single one?

For example

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "fi", "model_name": "fi_core_news_sm"},
               {"lang_code": "en", "model_name": "en_core_web_sm"},
               {"lang_code": "ru", "model_name": "ru_core_news_sm"},
               ],
}

+

configuration_hf = {
    "nlp_engine_name": "transformers",
    "models": [{"lang_code": "et", "model_name": {"spacy": "en_core_web_sm", "transformers": "tartuNLP/EstBERT_NER_v2"}},
               ],
}

in a single config?

omri374 commented 10 months ago

Are you interested in running a spacy model in parallel to a Huggingface model?

If yes, then the best way to do this is to use one of them (say spaCy) as an NlpEngine, and the other as an additional recognizer. There's no reason to have both as NLP engines, since the other functionality the NLP engine provides (tokens, lemmas, keywords) isn't needed twice.

ogencoglu commented 10 months ago

Do I understand you correctly that I can do something like this?

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "fi", "model_name": "fi_core_news_sm"},
               {"lang_code": "en", "model_name": "en_core_web_sm"},
               {"lang_code": "ru", "model_name": "ru_core_news_sm"},
               {"lang_code": "et", "model_name": {"spacy": "en_core_web_sm", "transformers": "tartuNLP/EstBERT_NER_v2"}},
               ],
}

omri374 commented 10 months ago

You can either use a SpacyNlpEngine or a TransformersNlpEngine, but not both, so the provided configuration would not work. For the configuration you have here, the simplest way would be to:

  1. Configure fi, en, ru languages using SpacyNlpEngine with the configuration you provided, without the line for et.
  2. Create a new custom recognizer for the tartuNLP/EstBERT_NER_v2 model. See doc here: https://microsoft.github.io/presidio/samples/python/transformers_recognizer/

This would allow you to have all models running in parallel.
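
A minimal sketch of the two steps above, assuming presidio-analyzer and the listed spaCy models are installed. The presidio calls are shown commented because they require those packages at runtime; the `estbert_recognizer` name is a placeholder for a recognizer built as in the linked transformers_recognizer sample.

```python
# Step 1: spaCy NlpEngine configuration for fi/en/ru only, with the
# Estonian "et" line removed (it is handled in step 2 instead).
spacy_configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "fi", "model_name": "fi_core_news_sm"},
        {"lang_code": "en", "model_name": "en_core_web_sm"},
        {"lang_code": "ru", "model_name": "ru_core_news_sm"},
    ],
}

# from presidio_analyzer import AnalyzerEngine
# from presidio_analyzer.nlp_engine import NlpEngineProvider
#
# nlp_engine = NlpEngineProvider(
#     nlp_configuration=spacy_configuration
# ).create_engine()
# analyzer = AnalyzerEngine(
#     nlp_engine=nlp_engine,
#     supported_languages=["fi", "en", "ru", "et"],
# )
#
# # Step 2: register a custom recognizer wrapping tartuNLP/EstBERT_NER_v2,
# # built as shown in the transformers_recognizer sample linked above:
# # analyzer.registry.add_recognizer(estbert_recognizer)
```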

Note that small spaCy models (everything that ends with _sm) are not very accurate at identifying named entities.

ogencoglu commented 10 months ago

Thanks for the swift reply!

ogencoglu commented 10 months ago

Continuing the discussion, do I understand correctly that even if I have all models running in parallel (as you described above), I still need to tell AnalyzerEngine.analyze which specific language to work on, such as

results = analyzer.analyze(
    text=text,
    entities=analyzer.get_supported_entities(),
    language="en",
    return_decision_process=False,
)

and that I cannot do something like language=["en", "ru", "et"]?

Meaning that I still need to detect the language of the text or conversation with a language-detection tool in order to route the pipeline to the correct language.

Is my understanding correct?
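
Assuming the answer is yes (analyze takes a single language), the routing step could be sketched as below. The `detect_language` argument is a placeholder for whatever detector you choose (e.g. a library such as langdetect); `route_and_analyze` is a hypothetical helper, not part of Presidio.

```python
def route_and_analyze(analyzer, text, detect_language,
                      supported=("fi", "en", "ru", "et"), fallback="en"):
    """Detect the text's language, then analyze with that single language.

    `analyzer` is expected to expose an analyze(text=..., language=...) method,
    mirroring AnalyzerEngine.analyze discussed above.
    """
    lang = detect_language(text)
    if lang not in supported:
        # Unknown or unsupported language: fall back to a default model.
        lang = fallback
    return analyzer.analyze(text=text, language=lang)
```

The dispatch is trivial once the language is known; the real work is picking a detector that is reliable on short conversational text.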