tokern / piicatcher

Scan databases and data warehouses for PII data. Tag tables and columns in data catalogs like Amundsen and Datahub
https://tokern.io/piicatcher/
Apache License 2.0
281 stars 96 forks

Column names and data are incorrectly identified as PII #216

Open denisbnet opened 1 year ago

denisbnet commented 1 year ago

DatumSpacyDetector:

  1. Code (salt) like '0c065d65-883a-4286-8284-9c2668ee7608' identified as Address
  2. Education code like 'HIGHER' or 'MASTER' identified as Address
  3. Employment code like 'FULL_TIME' identified as Person
  4. Source code like 'MANUAL' or 'CAREER SECTION' identified as Person
  5. Salary like '30000' identified as Birth Date
  6. Skills description identified as Address, etc.

Is it possible to fix this?

spaCy version: 3.5.2
Platform: Linux-5.15.0-70-generic-x86_64-with-glibc2.35
Python version: 3.10.6
Pipelines: en_core_web_md (3.5.0), en_core_web_sm (3.5.0)

ColumnNameRegexDetector:

  1. Passpord identified as Password

nicolepng commented 1 year ago

Hi @denisbnet! :)

For the DatumSpacyDetector, we currently use the commonregex-improved Python library to do the regex matching that detects PII types. To fix the problems you raised for DatumSpacyDetector, we could write different regular expressions or try a different detection method to improve accuracy. Feel free to open PRs or share suggestions :)
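As a rough illustration of the "different method" idea, one option might be to screen out values that look like machine codes before they reach any PII detector. The sketch below is only an assumption about how that could work: `looks_like_code` and `ENUM_CODE` are hypothetical names, not part of piicatcher, and the rules are based only on the examples reported in this issue (UUID-style salts, upper-case codes such as 'FULL_TIME', and plain integers such as '30000').

```python
import re
import uuid

# Hypothetical rule: upper-case codes such as FULL_TIME or CAREER SECTION.
ENUM_CODE = re.compile(r"^[A-Z][A-Z0-9_ ]*$")


def looks_like_code(value: str) -> bool:
    """Return True if the value looks like a code rather than free text."""
    text = value.strip()
    if not text:
        return True  # nothing to scan
    if text.isdigit():
        return True  # plain integer, e.g. a salary
    try:
        uuid.UUID(text)  # raises ValueError if the text is not a UUID
        return True
    except ValueError:
        pass
    return bool(ENUM_CODE.fullmatch(text))


samples = ["0c065d65-883a-4286-8284-9c2668ee7608", "FULL_TIME", "30000", "John Smith"]
# Only values that do not look like codes would be passed on to a detector.
to_scan = [s for s in samples if not looks_like_code(s)]
```

This trades recall for precision: names written in all capitals or digit-only identifiers would also be skipped, so the rules would need tuning against real data. It does not help with free-text fields such as the skills description.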

For ColumnNameRegexDetector, you can update the column-matching regex in scanner.py. If you would like to create a new detector, refer to the documentation in detectors.py.
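To show the kind of regex change that could help with the column-name case, here is a minimal sketch. The patterns and the `classify_column` helper are hypothetical and are not copied from scanner.py; the idea is simply to match whole underscore-separated tokens so that a shared prefix like "pass" does not trigger a Password match.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration; not piicatcher's actual regexes.
# Each keyword must be a whole token delimited by "_" or the string ends.
COLUMN_PATTERNS = {
    "PASSWORD": re.compile(r"(^|_)(pass|passwd|password|pwd)(_|$)", re.IGNORECASE),
    "PASSPORT": re.compile(r"(^|_)passport(_|$)", re.IGNORECASE),
}


def classify_column(name: str) -> Optional[str]:
    """Return the first label whose pattern matches the column name, else None."""
    for label, pattern in COLUMN_PATTERNS.items():
        if pattern.search(name):
            return label
    return None


labels = {name: classify_column(name)
          for name in ["user_password", "passport_number", "Passpord"]}
```

With these assumed patterns, a misspelled or unrelated name such as Passpord matches nothing instead of being tagged as a password. The token rule only understands snake_case, so camelCase names like userPassword would need an extra pattern.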