Add new recognizers for various industries

willsthompson commented 4 years ago

Copied from @john-craft's notes in the obsolete issue: pvcy/privacy-api#45

As part of the rewrite, we should expand the number of items we look for and categorize them into buckets that can be applied for customer projects. An example of identifier buckets is below.

Listed below are additional strings we need to identify while analyzing a data set.

Finance

[x] IBAN
[ ] SWIFT
[ ] CUSIP
[ ] Routing

Health

[ ] ICD (see this website for a comprehensive list)
[ ] FDA
[ ] NDC ("Drug products are identified and reported using a unique, three-segment number, called the National Drug Code (NDC), which serves as a universal product identifier for drugs")
[ ] DEA number (see this Wikipedia article for details)
[ ] NPI ("Ten-digit NPI numbers may be validated using the Luhn algorithm by prefixing '80840' to the 10-digit number.")
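
The NPI check quoted above can be sketched in a few lines of Python. This is only an illustration of the "prefix 80840, then Luhn" rule, not an existing recognizer; the function names `luhn_valid` and `is_valid_npi` are hypothetical.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk right to left; double every second digit, starting with the
    # digit immediately left of the check digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_valid_npi(value: str) -> bool:
    """Check a candidate NPI by prefixing '80840' and applying Luhn."""
    # Hypothetical helper: accepts only exactly ten ASCII digits.
    if len(value) != 10 or not (value.isascii() and value.isdigit()):
        return False
    return luhn_valid("80840" + value)
```

A checksum only narrows the candidates: roughly one in ten random 10-digit numbers will also pass, so a recognizer would probably still want column-name or context hints.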

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity.

john-craft commented 3 years ago

We need to ensure that we remove all direct identifiers (according to HIPAA) and are able to produce a "limited data set." Limited data sets are defined in 45 CFR §164.514(e)(2).

The list of identifiers to detect and remove is below. "Medical record numbers" and "health plan beneficiary numbers" are the least obvious, so parenthetical notes have been added to those entries.

Names;
Postal address information, other than town or city, State, and zip code;
Telephone numbers;
Fax numbers;
Electronic mail addresses;
Social security numbers;
Medical record numbers (this is usually a "random" number and could be used if one also knew the institution that assigned it);
Health plan beneficiary numbers (likely insurance card/member ID);
Account numbers;
Certificate/license numbers;
Vehicle identifiers and serial numbers, including license plate numbers;
Device identifiers and serial numbers;
Web Universal Resource Locators (URLs);
Internet Protocol (IP) address numbers;
Biometric identifiers, including finger and voice prints; and
Full face photographic images and any comparable images.

willsthompson commented 3 years ago

Most of these are implemented or should be easy to add. Some will be harder, like biometrics, face images, etc., unless we can rely entirely on column names.

Account and other IDs - The naive way to identify stuff like this in the data is to check whether all the values are unique. However, we should probably expect cases where the data has more than one row per person, which would break the naive implementation. We might need to noodle on @antisyzygy's idea for user ID detection.
URLs - Are there any details on this? What kind of URLs?
Biometrics - Not sure what to do here. I think we may need to formalize a workflow for any undetected or low confidence columns and force users to deal with them.
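
A minimal sketch of the naive uniqueness check for ID columns, and of the multi-row case that breaks it. Nothing here is implemented in the project; `unique_ratio`, `looks_like_id_column`, the `threshold` parameter, and the sample values are all assumptions made for illustration.

```python
def unique_ratio(values):
    """Fraction of non-null values that are distinct (0.0 for an empty column)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return len(set(present)) / len(present)


def looks_like_id_column(values, threshold=1.0):
    """Naive check: flag the column when the unique ratio reaches the threshold."""
    return unique_ratio(values) >= threshold


# One row per person: every value is unique, so the column is flagged.
accounts = ["A-1001", "A-1002", "A-1003", "A-1004"]

# Several rows per person (e.g. one per visit): the same IDs repeat,
# so the naive check misses the column.
visits = ["A-1001", "A-1001", "A-1002", "A-1003", "A-1003", "A-1003"]
```

Lowering `threshold` would catch the second case but would also start flagging ordinary categorical columns, which is why a smarter approach is probably needed.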

john-craft commented 3 years ago

> Biometrics - Not sure what to do here. I think we may need to formalize a workflow for any undetected or low confidence columns and force users to deal with them.

Agreed. For biometrics and photographic images, we could also just drop them wholesale.

antisyzygy commented 3 years ago

JSON is also often stored in tabular databases, so we might want to run an anonymizer on it. That could be a low-hanging-fruit way to add functionality for small anonymizers for special data types like biometrics or whatever else.

Biometrics

I'm not sure yet how to treat that kind of data. It doesn't fit well into the architecture of anything we have put together yet. It's not a document format, not tabular, etc.

To me it feels like image-treatment would mostly be looking for text or faces and censoring them or deep-faking them somehow.

For audio, it seems like a voice changer would help if we detect that it's a voice. Otherwise, perhaps we treat it as text anonymization (meaning we need to transcribe it) and censor.

All of these are harder to do, and I agree we should put them off.

john-craft commented 3 years ago

> I'm not sure yet how to treat that kind of data.

For now, all we need to do is remove the 16 different types of direct identifiers specified by the statute. We don't have to worry about anonymizing them. The important thing is to detect biometric data; once that is done, we can just drop the column(s). I have no idea how hard it is to detect.
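
As a rough sketch of "detect, then drop", assuming detection relies only on column names: the keyword list `BIOMETRIC_HINTS` and the helpers `biometric_columns` and `drop_columns` are hypothetical, and rows are modeled as plain dicts rather than any real project data structure.

```python
# Hypothetical keyword list; a real deployment would need to tune it.
BIOMETRIC_HINTS = (
    "fingerprint", "voiceprint", "retina", "iris", "face", "photo", "biometric",
)


def biometric_columns(column_names):
    """Return the column names containing any biometric hint (case-insensitive)."""
    return [
        name for name in column_names
        if any(hint in name.lower() for hint in BIOMETRIC_HINTS)
    ]


def drop_columns(rows, to_drop):
    """Return copies of the rows (dicts) without the flagged columns."""
    drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]
```

Substring matching is crude (for example, `face` also matches `interface`), so flagged columns may still deserve a manual review step before they are dropped.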

The main question for @antisyzygy is if there is a more accurate method for detecting medical record numbers and health plan beneficiary numbers. More accurate than the naive implementation we have now.

willsthompson commented 3 years ago

> More accurate than the naive implementation we have now.

To clarify, what we have now is an idea for a naive algorithm. Nothing is implemented yet.

john-craft commented 3 years ago

Just making a quick note here that ICD and NPI codes are something we should put an emphasis on. They are going to be present in almost any healthcare dataset we encounter.

antisyzygy commented 3 years ago

It would make sense perhaps to develop features for ICD/NPI codes in the generalizers/row-swapper. We may have to use a generalization hierarchy for these codes to avoid swapping two totally unrelated procedures or whatever.

It could be partial motivation, perhaps, for the feature we discussed where we treat city/state fields more intelligently so we're not moving Alaska to Florida and things like that.

Anyway, these two examples (NPI/ICD codes and city/state) seem like they could be treated with the same feature, perhaps.
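
One way to picture the shared feature is a swap that only shuffles values within the same group of a hierarchy. This is a sketch under assumptions, not the existing row-swapper: `swap_within_groups`, `icd_category`, and `state_of` are hypothetical names, and grouping ICD-10 codes by their first three characters or locations by state is just one possible level of the hierarchy.

```python
import random
from collections import defaultdict


def swap_within_groups(values, group_key, seed=None):
    """Shuffle values only among entries that share the same group key."""
    rng = random.Random(seed)
    positions = defaultdict(list)
    for i, value in enumerate(values):
        positions[group_key(value)].append(i)

    result = list(values)
    for idxs in positions.values():
        # Shuffle the values of this group and write them back
        # into the same positions the group originally occupied.
        shuffled = [values[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, value in zip(idxs, shuffled):
            result[i] = value
    return result


def icd_category(code):
    """Assumed grouping: the first three characters of an ICD-10 code."""
    return code[:3]


def state_of(location):
    """Assumed grouping: the state in a (city, state) tuple."""
    return location[1]
```

With `state_of` as the key, a city in Alaska can only be swapped with another Alaska row; with `icd_category`, a diabetes code can only be swapped with another code in the same three-character category.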

antisyzygy commented 3 years ago

The Datavant meeting gave us some feedback that also seems to be partial motivation for treating fields with certain entity types more intelligently:

We need a more sophisticated way of anonymizing locations in data (ex., swapping Indianapolis for Tallahassee).

John in Slack/product

NPI/ICD codes and city/state seem like they could be treated with some of the same scaffolding perhaps.

john-craft commented 3 years ago

ICD codes are hierarchical. I guess locations are, too. But I am not sure we need to anonymize ICD codes; we could start by just identifying them with Presidio and flagging them.
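
A minimal sketch of "identify and flag" using only a regular expression. The pattern is an assumed, simplified shape for ICD-10 codes, and `ICD10_PATTERN` and `flag_icd10` are hypothetical names; the same pattern could presumably be registered in Presidio as a pattern-based recognizer, but that wiring is not shown here.

```python
import re

# Assumed, simplified shape of an ICD-10 code: a letter, a digit,
# a digit or A/B, then optionally a dot and up to four alphanumerics.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9AB](?:\.?[0-9A-Z]{1,4})?\b")


def flag_icd10(text):
    """Return (start, end, match) tuples for substrings shaped like ICD-10 codes."""
    return [(m.start(), m.end(), m.group()) for m in ICD10_PATTERN.finditer(text)]
```

For example, `flag_icd10("Dx: E11.9 and I10 noted")` flags `E11.9` and `I10`. The shape alone also matches non-codes such as `B12`, so matches are better treated as low-confidence flags, ideally checked against a real code list.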

I pulled this PDF from one of the links above.

pvcy / presidio

Add new recognizers for various industries #2