Closed higgi13425 closed 1 year ago
thanks @higgi13425
Is the idea that people managing data under HIPAA will replace real data with fake data?
Exactly. To deidentify a clinical dataset:
- zipcode replaced with deid_zipcode
- name replaced with deid_name
- street replaced with deid_street
- dob replaced with deid_dob
- etc.

Ideally, the date of birth (dob) would be the index date, and could be assigned a random date in the year 1900. Then all other dates in the dataset could be adjusted relative to deid_dob, to preserve the sequence of events and relative time while keeping the data deidentified. This would be really helpful for folks like me with HIPAA issues with PHI-containing datasets.

Even cooler: a function to
1) add a deid_x version of each PHI variable in the dataset, then
2) split the dataset into two: one with the PHI plus a unique key (stored securely), and a second with the unique key plus the deid_x versions of the PHI data (plus all the other data).

Then you could share the second data frame (on GitHub, etc.), but if you really needed to, you could merge to re-identify. Thanks for considering it. Peter
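To make the proposal concrete, here is a minimal base R sketch of the date shifting and the two-way split. The function name `deid_split`, its arguments, and the example data are all hypothetical and not part of any existing package. It only handles dates; other PHI columns are simply dropped from the shareable half, where a fake-data generator could fill in deid_x values.

```r
# Hypothetical helper (base R only): shift dates relative to a fake
# 1900 birth date and split the data into a secure and a shareable half.
deid_split <- function(df, phi_cols, dob_col, date_cols = character()) {
  n <- nrow(df)
  key <- seq_len(n)  # simple surrogate key; a random ID would also work
  # Dates are PHI too, so they never go into the shareable half
  phi_cols <- union(phi_cols, c(dob_col, date_cols))

  # Random day in 1900 (365 days, 1900 was not a leap year) as index date
  deid_dob <- as.Date("1900-01-01") + sample(0:364, n, replace = TRUE)
  # Per-row shift in days between the fake and the real birth date
  offset <- as.numeric(deid_dob - df[[dob_col]])

  shared <- df[, setdiff(names(df), phi_cols), drop = FALSE]
  shared[[paste0("deid_", dob_col)]] <- deid_dob
  for (col in date_cols) {
    # Same shift for every date in a row keeps intervals intact
    shared[[paste0("deid_", col)]] <- df[[col]] + offset
  }

  list(
    secure = cbind(key = key, df[, phi_cols, drop = FALSE]),
    shared = cbind(key = key, shared)
  )
}

# Made-up example data
set.seed(1)
visits <- data.frame(
  name  = c("Ann", "Bob"),
  dob   = as.Date(c("1980-03-15", "1975-11-02")),
  admit = as.Date(c("2020-01-10", "2019-06-01")),
  lab   = c(5.1, 6.3),
  stringsAsFactors = FALSE
)
parts <- deid_split(visits, phi_cols = "name", dob_col = "dob",
                    date_cols = "admit")

# Re-identify only when really needed, using the securely stored half
reidentified <- merge(parts$secure, parts$shared, by = "key")
```

`parts$shared` keeps the interval between birth and admission for each row, which is what preserves the sequence of events, while `parts$secure` holds the real values needed for the merge.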
thanks @higgi13425
done already:
- PhoneNumberProvider / ch_phone_number
- street_name in AddressProvider
- postcode in AddressProvider
- city in AddressProvider
not done, questions:
- DateTimeProvider$new()$date("%Y-%m-%d"): we don't have a way to pick a date within a certain range of years; can look into that.

For the below, I assume there's no standard format to this? Is it just a string of letters and numbers? If so, we don't need specialized functions for each one.
not done, can do:
- InternetProvider$new()$email()
- InternetProvider$new()$url()
- InternetProvider$new()$ipv4()
Your function idea is interesting. I'll open a new issue for that so this issue can focus on the data types.
birthdate - the idea was to randomly select a day/month, and place the date of birth in a year that clearly is not the real date of birth, so that there is no confusion later between true dob and deid_dob. 1900 is a reasonable year, in that there are no people born in 1900 still alive.
county name - for my purposes, US county only. I could imagine that if this becomes popular, the equivalent in other countries would be worthwhile.
I agree, most of the numbers can already be done.
fax number ~ phone number
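As a rough illustration of the birthdate idea, a random day/month in 1900 can be drawn with base R alone. The function name `random_1900_date` is hypothetical; this is only a sketch of the sampling step, not how any provider implements it.

```r
# Hypothetical helper: draw n random dates, all in the year 1900
random_1900_date <- function(n) {
  # Every calendar day of 1900 (365 days; 1900 was not a leap year)
  days_in_1900 <- seq(as.Date("1900-01-01"), as.Date("1900-12-31"),
                      by = "day")
  # Sampling with replacement, so two people may share a fake birthday
  sample(days_in_1900, n, replace = TRUE)
}

set.seed(42)
deid_dob <- random_1900_date(1000)
```

Sampling from the full sequence of days avoids building impossible day/month combinations such as 30 February.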
This sounds promising! Peter
library(charlatan)
z <- DateTimeProvider$new()
# random date-time within the year 1900
z$date_time_between("1900-01-01", "1900-12-31")
Many of these are included already, but the full list is here:
https://medschool.duke.edu/research/clinical-and-translational-research/duke-office-clinical-research/irb-and-institutional-14
- Name
- Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code)
- All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
- Telephone numbers
- Fax number
- Email address
- Social Security Number
- Medical record number
- Health plan beneficiary number
- Account number
- Certificate or licence number
- Any vehicle or other device serial number
- Web URL
- Internet Protocol (IP) Address
- Finger or voice print
- Photographic image (photographic images are not limited to images of the face)
- Any other characteristic that could uniquely identify the individual