microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

US Driving license recognizer doesn't work correctly #291

Closed sgsmittal226 closed 4 years ago

sgsmittal226 commented 4 years ago

US driving license numbers have a different format for each state, but the current recognizer also matches any random string as a driving license.

omri374 commented 4 years ago

Hi sgsmittal226, thanks for your input.

Many entities have very simple patterns (such as 7 digits), which makes it difficult to distinguish true positives from false positives. This is why many recognizers return results with a very low score (0.01). Additional context words (like "driver" and "license") increase the score. My suggestion is to put a threshold on the output of the analyzer in order to filter out false positives.
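
To illustrate the scoring idea described above, here is a minimal Python sketch. It is not Presidio's actual recognizer: the pattern, the `CONTEXT_BOOST` value, and the names `Match` and `find_candidates` are all hypothetical, chosen only to show how a weak pattern can start at a low score and be raised when context words appear.

```python
import re
from dataclasses import dataclass
from typing import List

# All values below are assumptions for illustration, not Presidio's real settings.
PATTERN = re.compile(r"\b\d{7}\b")      # a deliberately weak pattern: any 7 digits
BASE_SCORE = 0.01                       # low score for a pattern-only match
CONTEXT_BOOST = 0.35                    # hypothetical boost when context is present
CONTEXT_WORDS = {"driver", "license"}   # context words mentioned in this thread


@dataclass
class Match:
    text: str
    start: int
    end: int
    score: float


def find_candidates(text: str) -> List[Match]:
    """Return every 7-digit match, scored higher if a context word is in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_context = bool(words & CONTEXT_WORDS)
    score = BASE_SCORE + (CONTEXT_BOOST if has_context else 0.0)
    return [
        Match(m.group(), m.start(), m.end(), score)
        for m in PATTERN.finditer(text)
    ]
```

With this sketch, `find_candidates("Order number 1234567 shipped")` yields a match at the base score, while `find_candidates("Driver license: 1234567")` yields the same match with a higher score.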

You can do this by adding a resultsScoreThreshold field to the analyzer template. See swagger information here: https://github.com/microsoft/presidio/blob/431ac2cea27881878dbc16bdc112b80e827c75d2/presidio-api/cmd/presidio-api/docs/swagger.yaml#L64
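
The effect of such a threshold can be sketched in a few lines of Python. This is only an illustration of the filtering step: in the service the threshold is applied by the analyzer itself, and the result dictionaries, the `filter_by_threshold` helper, and the inclusive comparison are assumptions rather than the actual implementation.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]


def filter_by_threshold(results: List[Result], threshold: float) -> List[Result]:
    """Keep only results whose score is at least the threshold.

    Whether the real analyzer treats the threshold as inclusive is an
    assumption made here for the sake of the example.
    """
    return [r for r in results if r["score"] >= threshold]


# Hypothetical analyzer output: one pattern-only match, one boosted by context.
sample_results = [
    {"entity": "US_DRIVER_LICENSE", "text": "1234567", "score": 0.01},
    {"entity": "US_DRIVER_LICENSE", "text": "7654321", "score": 0.36},
]

kept = filter_by_threshold(sample_results, threshold=0.3)
```

Here `kept` contains only the second entry, so the low-scoring pattern-only match is dropped.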

Closing for now. Feel free to reopen if you would like to ask additional questions or wish to add additional information.