Write tests for recognizers migrated from Privacy API scanner library

pvcy / presidio

Write tests for recognizers migrated from Privacy API scanner library #7

Closed by willsthompson 3 years ago

willsthompson commented 4 years ago

Write tests for the new recognizers created in issue #6

Open question: Should the new recognizers live in the presidio repo or in the privacy-api repo? I'm leaning toward privacy-api, to keep the presidio repo from diverging further from the public master than it needs to. It would also be easier to contribute our updates back to the public repo without including all (or any) of the custom recognizers.

If recognizers are moved, also clean up presidio's history.

willsthompson commented 3 years ago

Recognizers have been moved from the presidio repo to the privacy_api repo.
Presidio repo history is cleaned up.
In anticipation of the JSON-based Presidio changes, it doesn't make sense to write tests directly against recognizers that will need major refactoring. Instead, the tests will be written against the pii_report output. That way they can also be used to validate correctness after the recognizers have been rewritten in the new JSON style:
- [x] Advertising ID
- [x] City
- [x] Coordinate
- [x] Partial Coordinate
- Device IDs
  - [x] ESN Decimal
  - [x] ESN Hex
  - [x] IMEI
  - [x] MAC
  - [x] MEID
- [x] Gender
- [ ] License plate
- [x] Nationality
- [x] Address
- [x] State
- [x] Time
- [x] Zipcode
- [x] Birth date
- [x] Death date
- [x] US Tax ID

willsthompson commented 3 years ago

I finished a very cursory pass at these tests. Most recognizers need more tests, especially expected non-matching cases and variations with titles. Multiple recognizers need improvement for more robust detection.
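To illustrate what "expected non-matching cases" could look like, here is a minimal table-driven sketch. The `MAC_PATTERN` regex and `find_macs` helper are hypothetical stand-ins written for this example; they are not the actual recognizer in privacy_api, and the real pattern may differ.

```python
import re

# Hypothetical stand-in for the MAC recognizer (assumption: six hex octets
# separated by ':' or '-'). The real recognizer is not shown in this thread.
MAC_PATTERN = re.compile(r"\b[0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5}\b")


def find_macs(text):
    """Return every substring of `text` that the stand-in pattern matches."""
    return [m.group(0) for m in MAC_PATTERN.finditer(text)]


# Each case pairs an input with the exact list of expected matches.
# An empty list marks an expected non-matching case.
CASES = [
    ("device mac 00:1A:2B:3C:4D:5E seen", ["00:1A:2B:3C:4D:5E"]),
    ("00-1a-2b-3c-4d-5e", ["00-1a-2b-3c-4d-5e"]),
    ("timestamp 12:34:56", []),   # looks similar, only three groups
    ("00:1A:2B:3C:4D", []),       # five octets, one short
    ("00:1A:2B:3C:4D:ZZ", []),    # last octet is not hex
]


def run_cases():
    """Return a list of (text, expected, got) for every failing case."""
    failures = []
    for text, expected in CASES:
        got = find_macs(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures
```

Keeping positive and negative inputs in one table makes it cheap to add a row whenever a false positive or a missed match turns up.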

In the next phase of testing, these should probably be pushed down a level and tested at the output of engine.analyze(), instead of pii_report. The pii_report testing should be separated/isolated to define its independent behavior, which may not be much, assuming there are tests covering filter_intersecting_results, is_categorical, and ordered type detection/handling. The pii_report tests may be better suited as a schema/packaging validation.
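As a rough sketch of testing one of those lower-level pieces in isolation, the example below exercises an overlap filter on its own. The `Result` type and this `filter_intersecting_results` body are assumptions made for illustration (keep the highest-scoring span among overlapping ones); the real function's signature and tie-breaking rules are not shown in this thread and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    # Hypothetical result shape: entity label, half-open span [start, end), score.
    entity_type: str
    start: int
    end: int
    score: float


def filter_intersecting_results(results):
    """Assumed behavior: among overlapping spans, keep the highest score."""
    kept = []
    # Visit candidates from highest to lowest score so winners are kept first.
    for r in sorted(results, key=lambda r: (-r.score, r.start)):
        # Keep r only if it does not overlap anything already kept.
        if all(r.end <= k.start or r.start >= k.end for k in kept):
            kept.append(r)
    return sorted(kept, key=lambda r: r.start)


def test_overlap_keeps_highest_score():
    zipcode = Result("ZIPCODE", 10, 15, 0.4)
    address = Result("ADDRESS", 0, 15, 0.8)
    city = Result("CITY", 20, 26, 0.6)
    # The zipcode span sits inside the higher-scoring address span.
    assert filter_intersecting_results([zipcode, address, city]) == [address, city]


def test_adjacent_spans_are_both_kept():
    state = Result("STATE", 0, 2, 0.5)
    zipcode = Result("ZIPCODE", 2, 7, 0.5)
    # Spans that only touch at a boundary do not intersect.
    assert filter_intersecting_results([state, zipcode]) == [state, zipcode]
```

With behavior like this pinned down separately, the pii_report tests would only need to check the shape of the packaged output.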