Open omri374 opened 7 months ago
Hey! Thanks for this PR.
Can I use it with another transformer model? For example this one: https://huggingface.co/Jean-Baptiste/camembert-ner
I was thinking about using a YAML conf file like this:
    nlp_engine_name: transformers
    models:
      -
        lang_code: fr
        model_name:
          spacy: fr_core_news_lg
          transformers: Jean-Baptiste/camembert-ner
    ner_model_configuration:
      labels_to_ignore:
        - O
      aggregation_strategy: simple # "simple", "first", "average", "max"
      stride: 16
      alignment_mode: strict # "strict", "contract", "expand"
      model_to_presidio_entity_mapping:
        PER: PERSON
        LOC: LOCATION
        ORG: ORGANIZATION
        AGE: AGE
        ID: ID
        EMAIL: EMAIL
        PATIENT: PERSON
        STAFF: PERSON
        HOSP: ORGANIZATION
        PATORG: ORGANIZATION
        DATE: DATE_TIME
        PHONE: PHONE_NUMBER
        HCW: PERSON
        HOSPITAL: ORGANIZATION
      low_confidence_score_multiplier: 0.4
      low_score_entity_names:
        - ID
Can this work? Without your PR, the fr language never seems to be available.
Thanks.
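For readers unfamiliar with the ner_model_configuration fields in the config above, here is a rough plain-Python sketch of what they are meant to do to raw model predictions. It is not Presidio's implementation: the helper name `postprocess`, the dict shape of the predictions, the pass-through of unmapped labels, and the order of the steps are all assumptions made for illustration. Only the field names and values come from the config.

```python
from typing import Dict, List

# Values copied from the YAML config (only a subset of the mapping, for brevity).
LABELS_TO_IGNORE = {"O"}
ENTITY_MAPPING = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION", "ID": "ID"}
LOW_SCORE_ENTITY_NAMES = {"ID"}
LOW_CONFIDENCE_SCORE_MULTIPLIER = 0.4


def postprocess(predictions: List[Dict]) -> List[Dict]:
    """Hypothetical helper: apply the ignore list, label mapping and score penalty."""
    results = []
    for pred in predictions:
        label = pred["label"]
        if label in LABELS_TO_IGNORE:
            continue  # drop labels such as "O" (token outside any entity)
        # Translate the model's label into a Presidio entity name.
        # Assumption: labels missing from the mapping are passed through unchanged.
        entity = ENTITY_MAPPING.get(label, label)
        score = pred["score"]
        if entity in LOW_SCORE_ENTITY_NAMES:
            # Down-weight entity types that are considered less reliable.
            score *= LOW_CONFIDENCE_SCORE_MULTIPLIER
        results.append(
            {"entity_type": entity, "score": score,
             "start": pred["start"], "end": pred["end"]}
        )
    return results


if __name__ == "__main__":
    raw = [
        {"label": "PER", "score": 0.9, "start": 0, "end": 4},
        {"label": "O", "score": 0.99, "start": 5, "end": 7},
        {"label": "ID", "score": 0.5, "start": 8, "end": 12},
    ]
    for item in postprocess(raw):
        print(item)
```

In this sketch the "O" prediction is dropped, PER becomes PERSON, and the ID prediction keeps its label but has its score multiplied by 0.4.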
Hi @GautierT, are you looking to run this through a REST API?
If not, you can configure your model using the standard NlpEngineProvider logic; for an example, see this documentation.
If so, the only additional change needed is in app.py, to pass the NlpEngine into the AnalyzerEngine. Instead of this:
https://github.com/microsoft/presidio/blob/5bc4b679608a8f65c10799f210bf2ee15c0434a7/presidio-analyzer/app.py#L40
Have this:
    # Additional import needed in app.py for this change.
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    class Server:
        """HTTP Server for calling Presidio Analyzer."""

        def __init__(self):
            fileConfig(Path(Path(__file__).parent, LOGGING_CONF_FILE))
            self.logger = logging.getLogger("presidio-analyzer")
            self.logger.setLevel(os.environ.get("LOG_LEVEL", self.logger.level))
            self.app = Flask(__name__)
            self.logger.info("Starting analyzer engine")
            # PATH_TO_CONF is a placeholder for the path to your YAML conf file.
            provider = NlpEngineProvider(conf_file=PATH_TO_CONF)
            nlp_engine = provider.create_engine()
            # Pass the configured NLP engine and declare the languages it supports.
            self.engine = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["fr"])
            self.logger.info(WELCOME_MESSAGE)
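Once the server runs with a French-capable engine, a request to the analyzer's /analyze endpoint would carry the text and the language code in a JSON body. The following is a minimal sketch using only the standard library. The helper name `build_analyze_request` is made up for this example, and the base URL `http://localhost:5002` is an assumption that depends on how the container or server was started.

```python
import json
import urllib.request


def build_analyze_request(
    text: str,
    language: str = "fr",
    base_url: str = "http://localhost:5002",  # assumption: adjust to your deployment
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /analyze endpoint."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_analyze_request("Je m'appelle Jean et j'habite à Paris.")
    print(req.get_method(), req.full_url)
    # Sending the request needs a running server, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```

If "fr" is not among the engine's supported languages, the server is expected to reject the request, which matches the symptom described earlier in the thread.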
Is your feature request related to a problem? Please describe.
In many use cases, especially around the Docker-based option, it is challenging to configure the AnalyzerEngine for a specific scenario. For example, in order to have an API supporting multiple languages, it is required to change the code in app.py: https://github.com/microsoft/presidio/blob/4db5278bd1636416f0a450e2937236803d77e81c/presidio-analyzer/app.py#L40
Having a way to configure which initial parameters are used (languages, NLP engine, recognizers, default score, etc.) would allow code-free configuration both in Docker-based use cases and in a more configurable Python pipeline.
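To make the idea of code-free configuration concrete, here is one possible sketch in which the initial parameters are read from environment variables, which suits Docker deployments. This is only an illustration of the request, not an existing Presidio feature: the variable names `NLP_CONF_FILE`, `SUPPORTED_LANGUAGES` and `DEFAULT_SCORE_THRESHOLD`, the `AnalyzerSettings` class and the `load_settings` helper are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class AnalyzerSettings:
    """Hypothetical container for the analyzer's initial parameters."""

    nlp_conf_file: Optional[str]
    supported_languages: List[str]
    default_score_threshold: float


def load_settings(env: Mapping[str, str] = os.environ) -> AnalyzerSettings:
    """Read settings from a mapping of environment variables (names are assumptions)."""
    raw_languages = env.get("SUPPORTED_LANGUAGES", "en")
    # Accept a comma-separated list such as "en, fr" and ignore empty items.
    languages = [code.strip() for code in raw_languages.split(",") if code.strip()]
    return AnalyzerSettings(
        nlp_conf_file=env.get("NLP_CONF_FILE"),
        supported_languages=languages,
        default_score_threshold=float(env.get("DEFAULT_SCORE_THRESHOLD", "0")),
    )


if __name__ == "__main__":
    settings = load_settings({"SUPPORTED_LANGUAGES": "en, fr", "NLP_CONF_FILE": "conf/fr.yaml"})
    print(settings)
```

A server could then hand these values to NlpEngineProvider and AnalyzerEngine at startup, so that changing languages or the NLP engine would only require changing the container's environment.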
Describe the solution you'd like
AnalyzerEngine instance
Describe alternatives you've considered
An alternative would be documentation of how to change app.py, but code would still have to be changed.
Additional context
Presidio already has several conf files, e.g.: