microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

REST endpoint /supportedentities only supports English #1058

Open paulo-raca opened 1 year ago

paulo-raca commented 1 year ago

Describe the bug

I was looking at the /supportedentities REST API and tried adding ?language=es and ?language=it to get the Spain / Italy-specific entities I saw in the docs.

It turns out this doesn't work. Any language other than en returns {"error":"No matching recognizers were found to serve the request."} (HTTP 500).

To Reproduce

$ curl http://localhost:3001/supportedentities?language=en
["US_PASSPORT","DATE_TIME","MEDICAL_LICENSE","AU_ABN","IBAN_CODE","CREDIT_CARD","AU_TFN","AU_ACN","PHONE_NUMBER","AU_MEDICARE","IP_ADDRESS","URL","CRYPTO","US_SSN","PERSON","NRP","LOCATION","EMAIL_ADDRESS","US_DRIVER_LICENSE","US_BANK_NUMBER","SG_NRIC_FIN","US_ITIN","UK_NHS"]
$ curl http://localhost:3001/supportedentities?language=it
{"error":"No matching recognizers were found to serve the request."}
$ curl http://localhost:3001/supportedentities?language=es
{"error":"No matching recognizers were found to serve the request."}
$ curl http://localhost:3001/supportedentities?language=xy
{"error":"No matching recognizers were found to serve the request."}

Expected behavior

I'm not entirely sure if the global entities (email, phone number, URL, etc) should be returned too, since they have supported_language='en' in the code. But this is probably another issue :sweat_smile:

omri374 commented 1 year ago

Hi @paulo-raca, the issue is with how the AnalyzerEngine is defined when the Flask app is set up. Currently it is created with default parameters (i.e. English as the only supported language). If you update the app.py file to create the AnalyzerEngine differently, you will be able to get other recognizers as well. Instead of: https://github.com/microsoft/presidio/blob/60911edf166d216e14cbed6ba6a0ac2d42796fb4/presidio-analyzer/app.py#L40

You could pass:

self.engine = AnalyzerEngine(supported_languages=["en", "es"])

Thank you for the feedback. We will look for ways to make this easier, and would be happy to consider community contributions as well.

paulo-raca commented 1 year ago

Hello, @omri374, thanks for pointing this out.

I think this should be configurable via CLI arguments (which would also be accessible via the docker run command).

If you agree, I can create a PR for this.
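
A minimal sketch of what such a CLI option could look like, using only `argparse` from the standard library. The `--languages` flag, the `parse_supported_languages` helper and the default of `en` are hypothetical and not existing Presidio options; the resulting list would be passed to `AnalyzerEngine(supported_languages=...)` as suggested above.

```python
import argparse
from typing import List, Optional


def parse_supported_languages(argv: Optional[List[str]] = None) -> List[str]:
    """Parse a hypothetical --languages flag into a list of language codes."""
    parser = argparse.ArgumentParser(description="Presidio analyzer server (sketch)")
    parser.add_argument(
        "--languages",
        default="en",
        help="Comma-separated language codes, e.g. 'en,es,it'",
    )
    args = parser.parse_args(argv)

    # Normalize: strip whitespace, lowercase, drop empties and duplicates
    # while keeping the order in which the codes were given.
    languages: List[str] = []
    for code in args.languages.split(","):
        code = code.strip().lower()
        if code and code not in languages:
            languages.append(code)

    if not languages:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--languages must contain at least one language code")
    return languages
```

With Docker, this would only help if the image's entrypoint forwards its arguments to the script, which is an assumption here.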

omri374 commented 1 year ago

A contribution would be awesome! Thanks @paulo-raca

omri374 commented 1 year ago

There are other parameters we can take into account (for this PR or a future one), such as the NLP engine configuration.
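
As a rough illustration of why the NLP engine configuration matters here: each supported language also needs an NLP model behind it. The sketch below builds a configuration dictionary in the shape documented for Presidio's `NlpEngineProvider` (`nlp_engine_name` plus a list of `lang_code` / `model_name` entries). The `build_nlp_configuration` helper and the language-to-model mapping are illustrative assumptions, and the spaCy models named would have to be installed separately.

```python
from typing import Dict, List, Optional

# Illustrative mapping only; adjust to whichever spaCy models are installed.
DEFAULT_MODELS: Dict[str, str] = {
    "en": "en_core_web_lg",
    "es": "es_core_news_md",
    "it": "it_core_news_md",
}


def build_nlp_configuration(
    languages: List[str], models: Optional[Dict[str, str]] = None
) -> dict:
    """Build an NLP engine configuration dict for the given language codes."""
    models = DEFAULT_MODELS if models is None else models

    # Fail early if a requested language has no model configured.
    missing = [lang for lang in languages if lang not in models]
    if missing:
        raise ValueError(f"No NLP model configured for: {', '.join(missing)}")

    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": models[lang]} for lang in languages
        ],
    }
```

The resulting dictionary could then be handed to `NlpEngineProvider(nlp_configuration=...).create_engine()`, and the engine passed to `AnalyzerEngine` together with the same `supported_languages` list, so that the recognizers and the NLP models cover the same languages.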