Add remaining elements of protected health information

higgi13425 commented 6 years ago

Many of these are included already, but the full list is here:

https://medschool.duke.edu/research/clinical-and-translational-research/duke-office-clinical-research/irb-and-institutional-14

would be nice to add:
- random street names
- random zip code
- random city name
- random email, perhaps related to name
- random county name
- random SSN

Full list of PHI elements:
- Name
- Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code)
- All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
- Telephone numbers
- Fax number
- Email address
- Social Security Number
- Medical record number
- Health plan beneficiary number
- Account number
- Certificate or licence number
- Any vehicle or other device serial number
- Web URL
- Internet Protocol (IP) Address
- Finger or voice print
- Photographic image - Photographic images are not limited to images of the face.
- Any other characteristic that could uniquely identify the individual

sckott commented 6 years ago

thanks @higgi13425

Is the idea that people managing data under HIPAA will replace real data with fake data?

higgi13425 commented 6 years ago

Exactly. To deidentify a clinical dataset:
- zipcode replaced with deid_zipcode
- name replaced with deid_name
- street with deid_street
- dob with deid_dob
- etc.

Ideally, the date of birth (dob) would be the index date, and could be assigned a random date in the year 1900. Then all other dates in the dataset could be adjusted relative to deid_dob, to preserve the sequence of events and relative time, while keeping the data deidentified. This would be really helpful for folks like me with HIPAA issues with PHI-containing datasets.

Even cooler: a function to
1) add a deid_x version of each PHI variable in the dataset, then
2) split the dataset into two: one with the PHI plus a unique key (stored securely), and a second with the unique key plus the deid_x versions of the PHI data (plus all the other data).

Then you could share the second data frame (on GitHub, etc.), but if you really needed to, you could merge to re-identify.

Thanks for considering it.
Peter
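
The split-and-merge idea above could be sketched roughly as follows in base R. This is only an illustration: the function name `split_phi`, the key column `deid_key`, and the toy columns are hypothetical, and nothing here is part of charlatan. Filling the deid_x columns with fake values is left out to keep the sketch short.

```r
# Hypothetical helper (not a charlatan function): split a data frame into a
# PHI table and a shareable table, linked by a random key.
split_phi <- function(df, phi_cols, key_col = "deid_key") {
  stopifnot(all(phi_cols %in% names(df)), !key_col %in% names(df))
  # Random, non-repeating integer keys (a permutation of 1..nrow)
  key <- sample.int(nrow(df))
  key_df <- setNames(data.frame(key), key_col)
  list(
    # Keep this table in secure storage
    phi = cbind(key_df, df[, phi_cols, drop = FALSE]),
    # This table carries no PHI columns and could be shared
    shareable = cbind(key_df, df[, setdiff(names(df), phi_cols), drop = FALSE])
  )
}

# Toy data, invented for the example
patients <- data.frame(
  name = c("A. Example", "B. Example", "C. Example"),
  zipcode = c("00001", "00002", "00003"),
  lab_value = c(1.2, 3.4, 5.6),
  stringsAsFactors = FALSE
)

parts <- split_phi(patients, phi_cols = c("name", "zipcode"))

# Re-identify only when really needed, by merging on the key
reidentified <- merge(parts$phi, parts$shareable, by = "deid_key")
```

The key is the only link between the two tables, so the protection is only as good as the storage of the PHI table.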

sckott commented 6 years ago

thanks @higgi13425

done already

[x] Telephone numbers - done, see PhoneNumberProvider/ch_phone_number
[x] Fax number (done I assume, or are there different fax number formats?)
[x] street names, done, see street_name in AddressProvider
[x] zip code, done, see postcode in AddressProvider
[x] city name, done, see city in AddressProvider

not done, questions

[ ] birthdate is just a date, see DateTimeProvider$new()$date("%Y-%m-%d"). We don't have a way to pick a date within a certain range of years; can look into that
[ ] county name - are you interested in US counties only?

For the below, I assume there's no standard format to this? is it just a string of letters and numbers? If so, we don't need specialized functions for each one

[ ] Medical record number
[ ] Health plan beneficiary number
[ ] Account number
[ ] Certificate or licence number
[ ] Any vehicle or other device serial number
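
If these identifiers really are just strings of letters and numbers, a generic generator along these lines might be enough. This is a base-R sketch under that assumption; `random_id` and its arguments are hypothetical names, not charlatan functions, and uniqueness of the generated values is not guaranteed.

```r
# Hypothetical helper (not a charlatan function): n random identifiers,
# each n_char characters long, drawn from uppercase letters and digits.
random_id <- function(n, n_char = 10, chars = c(LETTERS, 0:9)) {
  vapply(seq_len(n), function(i) {
    paste(sample(chars, n_char, replace = TRUE), collapse = "")
  }, character(1))
}

# e.g. five fake 8-character medical record numbers
deid_mrn <- random_id(5, n_char = 8)
```

If unique values are required, the output could be checked with `anyDuplicated()` and regenerated as needed.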

not done, can do

[x] email address - can do that, see InternetProvider$new()$email()
[ ] SSN - can do that
[x] Web URL, can do that, see InternetProvider$new()$url()
[x] Internet Protocol (IP) Address, can do that, see InternetProvider$new()$ipv4()
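
For the SSN item, one possible approach (a sketch only, not how charlatan does or will do it) is to generate SSN-shaped strings that can never collide with a real number. The assumption here is that area number 000 is never issued, in the same spirit as using the year 1900 for dates of birth. `fake_ssn` is a hypothetical name.

```r
# Hypothetical helper (not a charlatan function): SSN-shaped strings
# "AAA-GG-SSSS" with area number 000, which is never issued.
fake_ssn <- function(n) {
  group  <- sample(1:99, n, replace = TRUE)    # group number 01-99
  serial <- sample(1:9999, n, replace = TRUE)  # serial number 0001-9999
  sprintf("000-%02d-%04d", group, serial)
}

deid_ssn <- fake_ssn(4)
```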

your function idea is interesting. i'll open a new issue for that so this issue can focus on the data types

higgi13425 commented 6 years ago

birthdate - the idea was to randomly select a day/month, and place the date of birth in a year that clearly is not the real date of birth, so that there is no confusion later between true dob and deid_dob. 1900 is a reasonable year, in that there are no people born in 1900 still alive.

county name - for my purposes, US county only. I could imagine that if this becomes popular, the equivalent in other countries would be worthwhile.

I agree, most of the numbers can already be done.

fax number ~ phone number

This sounds promising! Peter

sckott commented 6 years ago

DOB: okay, i see now what you mean. can do it like

library(charlatan)
z <- DateTimeProvider$new()
# random date-time between the first and last day of 1900
z$date_time_between("1900-01-01", "1900-12-31")
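
Once each person has a fake date of birth in 1900, the other dates discussed earlier in the thread could be shifted by the same per-person offset so that intervals are preserved. A minimal base-R sketch of that step follows. The column names `dob` and `admit_date` and the `deid_` prefix are assumptions, and the fake DOB is drawn with `sample()` rather than charlatan so the block has no dependencies.

```r
# Toy data, invented for the example
visits <- data.frame(
  dob        = as.Date(c("1950-03-15", "1982-11-02")),
  admit_date = as.Date(c("2015-06-01", "2016-01-20"))
)

# Every day in 1900 (365 days; 1900 was not a leap year)
days_1900 <- seq(as.Date("1900-01-01"), as.Date("1900-12-31"), by = "day")

# One random fake birth date per row
visits$deid_dob <- sample(days_1900, nrow(visits), replace = TRUE)

# Per-person offset (in days) between the fake and the real DOB
offset <- visits$deid_dob - visits$dob

# Shift every other date by the same offset to keep relative timing
visits$deid_admit_date <- visits$admit_date + offset
```

Because every date for a person moves by the same number of days, age at admission and the order of events are unchanged.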

counties: thanks, my feeling is to only do US counties for now

ropensci / charlatan

Add remaining elements of protected health information #61