Open willsthompson opened 4 years ago
We need to ensure that we remove all direct identifiers (according to HIPAA) and are able to produce a "limited data set." Limited data sets are defined in 45 CFR §164.514(e)(2).
The list of identifiers to detect and remove is below. "Medical record numbers" and "health plan beneficiary numbers" are the least obvious, so parenthetical notes have been added below.
Most of these are implemented or should be easy to add. Some will be harder, like biometrics, facial images, etc., unless we can rely entirely on the column name.
Biometrics - Not sure what to do here. I think we may need to formalize a workflow for any undetected or low confidence columns and force users to deal with them.
Agreed. For biometrics and photographic images, we could also just drop them wholesale.
JSON is also often stored in tabular databases, so we might want to run an anonymizer on it. It could serve as a lower-hanging-fruit way to add small anonymizers for special data types like biometrics or whatever else.
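As a rough illustration of the JSON idea, a sketch like the one below could recursively walk a parsed JSON value and apply small per-field anonymizers. The registry keys and redaction strings here are placeholders, not a proposed design:

```python
import json

# Hypothetical registry of per-field anonymizers; names are illustrative.
ANONYMIZERS = {
    "ssn": lambda v: "***-**-****",
    "email": lambda v: "[REDACTED EMAIL]",
}

def anonymize_json(value, key=None):
    """Recursively walk parsed JSON, redacting values whose key
    matches a registered anonymizer."""
    if isinstance(value, dict):
        return {k: anonymize_json(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [anonymize_json(v, key) for v in value]
    if key in ANONYMIZERS:
        return ANONYMIZERS[key](value)
    return value

raw = '{"name": "Jane", "ssn": "123-45-6789", "visits": [{"email": "j@x.org"}]}'
clean = anonymize_json(json.loads(raw))
```

The same registry pattern would let us bolt on special-case anonymizers (biometrics, etc.) without touching the tabular pipeline.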
Biometrics
I'm not sure yet how to treat that kind of data. It doesn't fit well into the architecture of anything we have put together yet: it's not a document format, not tabular, etc.
To me it feels like image-treatment would mostly be looking for text or faces and censoring them or deep-faking them somehow.
For audio, a voice-changer would help if we can detect that it's a voice. Otherwise we could treat it as text anonymization (meaning we need to transcribe it first) and censor.
All of these are harder to do, and I agree we should put them off.
> I'm not sure yet how to treat that kind of data.
For now, all we need to do is remove the 16 different types of direct identifiers specified by the statute. We don't have to worry about anonymizing them. The important thing is to detect biometric data; once that is done, we can just drop the column(s). I have no idea how hard it is to detect.
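If column-name detection is good enough for a first pass, the drop-the-column workflow could be as simple as the sketch below. The keyword list is a guess on my part, not a vetted vocabulary:

```python
# Illustrative column-name heuristic for flagging likely biometric or
# photographic columns so they can be dropped wholesale.
BIOMETRIC_KEYWORDS = (
    "fingerprint", "voiceprint", "retina", "iris",
    "photo", "face", "biometric",
)

def is_probably_biometric(column_name: str) -> bool:
    name = column_name.lower()
    return any(kw in name for kw in BIOMETRIC_KEYWORDS)

def drop_biometric_columns(rows):
    """rows: list of dicts, one per record. Returns rows with
    flagged columns removed."""
    return [
        {k: v for k, v in row.items() if not is_probably_biometric(k)}
        for row in rows
    ]
```

Columns that don't match but score low confidence elsewhere would fall into the forced-review workflow mentioned above.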
The main question for @antisyzygy is if there is a more accurate method for detecting medical record numbers and health plan beneficiary numbers. More accurate than the naive implementation we have now.
> More accurate than the naive implementation we have now.
To clarify, what we have now is an idea for a naive algorithm. Nothing is implemented yet.
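For concreteness, the naive idea could look something like this: a column-name hint combined with a loose value-shape check over a sample of values. The regexes and threshold here are assumptions to make the idea runnable, not the actual design:

```python
import re

# Name hints and value shape are illustrative guesses; MRNs and health
# plan beneficiary numbers have no single standard format.
NAME_HINTS = re.compile(r"(mrn|medical[_ ]?record|beneficiary|member[_ ]?id)", re.I)
VALUE_SHAPE = re.compile(r"^[A-Z]{0,3}\d{6,12}$")  # e.g. "MRN0012345"

def looks_like_record_number(column_name, sample_values, threshold=0.8):
    """Flag a column if its name hints at a record number, or if most
    sampled values fit a short alphanumeric-ID shape."""
    if NAME_HINTS.search(column_name):
        return True
    hits = sum(1 for v in sample_values if VALUE_SHAPE.match(str(v).strip()))
    return bool(sample_values) and hits / len(sample_values) >= threshold
```

A more accurate method would presumably need per-customer format configuration or a trained recognizer, which is the open question above.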
Just making a quick note here that ICD and NPI codes are something we should put an emphasis on. They are going to be present in almost any healthcare dataset we encounter.
It would make sense perhaps to develop features for ICD/NPI codes in the generalizers/row-swapper. We may have to use a generalization hierarchy for these codes to avoid swapping two totally unrelated procedures or whatever.
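ICD-10 lends itself to a generalization hierarchy because the codes are hierarchical by prefix: the first three characters name the category (e.g. E11 is type 2 diabetes mellitus) and later characters add specificity (E11.9 is "without complications"). So one cheap hierarchy is just truncation, which would keep swapped codes within the same category. The specific levels below are my assumption about how coarse we'd want to go:

```python
def generalize_icd10(code: str, level: int) -> str:
    """Generalize an ICD-10 code by truncating its prefix.
    level 0 = full code, 1 = category plus one subdigit,
    2 = 3-character category only."""
    code = code.replace(".", "").upper()
    if level >= 2:
        return code[:3]
    if level == 1:
        return code[:4]
    return code
```

With this, the row-swapper could require two codes to generalize to the same value at some level before swapping them, avoiding swaps between totally unrelated diagnoses.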
It could be partial motivation, perhaps, for the feature we discussed where we treat city/state fields more intelligently so we're not moving Alaska to Florida and things like that.
Anyway, these two examples: NPI/ICD codes and city/state seem like they could be treated with the same feature perhaps.
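One nice property for NPI detection specifically: NPIs carry a Luhn check digit computed over the 10 digits with the industry prefix "80840" prepended, so candidate values can be validated cheaply rather than matched on shape alone. A minimal sketch:

```python
def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum over a digit string."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def is_valid_npi(value: str) -> bool:
    """Validate a 10-digit NPI using the 80840-prefixed Luhn check."""
    return len(value) == 10 and value.isdigit() and luhn_ok("80840" + value)
```

That should cut false positives from generic 10-digit columns (phone numbers, account IDs) considerably.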
It seems the Datavant meeting gave us feedback that is also partial motivation for treating fields with certain entity types more intelligently:
> We need a more sophisticated way of anonymizing locations in data (ex., swapping Indianapolis for Tallahassee).
- John in Slack/product
NPI/ICD codes and city/state seem like they could be treated with some of the same scaffolding perhaps.
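To make the city/state half concrete: one simple version of "smarter" swapping is to shuffle city values only among rows that share a state, so a record never moves from Alaska to Florida. Field names here are illustrative, not our schema:

```python
import random

def swap_cities_within_state(rows, seed=None):
    """Shuffle 'city' values among rows sharing the same 'state',
    leaving cross-state structure intact."""
    rng = random.Random(seed)
    by_state = {}
    for row in rows:
        by_state.setdefault(row["state"], []).append(row)
    for group in by_state.values():
        cities = [r["city"] for r in group]
        rng.shuffle(cities)
        for r, city in zip(group, cities):
            r["city"] = city
    return rows
```

The same group-then-shuffle scaffolding would work for ICD/NPI codes grouped by their generalization level, which is why the two cases feel like one feature.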
ICD codes are hierarchical. I guess locations are, too. But I am not sure we need to anonymize ICD codes; we could start by just identifying them with Presidio and flagging them.
I pulled this PDF from one of the links above.
Copied from @john-craft's notes in the obsolete issue: pvcy/privacy-api#45
As part of the rewrite, we should expand the number of items we look for and categorize them into buckets that can be applied for customer projects. An example of identifier buckets is below.
Listed below are additional strings we need to identify while analyzing a data set.
Finance
Health