ropensci / charlatan

Create fake data in R
https://docs.ropensci.org/charlatan/

Add remaining elements of protected health information #61

Closed higgi13425 closed 1 year ago

higgi13425 commented 6 years ago

Many of these are included already, but the full list is here:

https://medschool.duke.edu/research/clinical-and-translational-research/duke-office-clinical-research/irb-and-institutional-14

- Name
- Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code)
- All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
- Telephone numbers
- Fax number
- Email address
- Social Security Number
- Medical record number
- Health plan beneficiary number
- Account number
- Certificate or licence number
- Any vehicle or other device serial number
- Web URL
- Internet Protocol (IP) Address
- Finger or voice print
- Photographic image - Photographic images are not limited to images of the face.
- Any other characteristic that could uniquely identify the individual

sckott commented 6 years ago

thanks @higgi13425

Is the idea that people managing data under HIPAA will replace real data with fake data?

higgi13425 commented 6 years ago

Exactly. To deidentify a clinical dataset:

- zipcode replaced with deid_zipcode
- name replaced with deid_name
- street with deid_street
- dob with deid_dob
- etc.

Ideally, the date of birth (dob) would be the index date, and could be assigned a random date in the year 1900. Then all other dates in the dataset could be adjusted relative to deid_dob, to preserve the sequence of events and relative time, while keeping the data deidentified. This would be really helpful for folks like me with HIPAA issues with PHI-containing datasets.

Even cooler - a function to

1) add a deid_x version of each PHI variable in the dataset, then
2) split the dataset into two - one with PHI plus a unique key (stored securely) - and the 2nd with the unique key plus deid_x versions of PHI data (plus all the other data).

Then you could share the 2nd dataframe (on GitHub, etc.), but if you really needed to, you could merge to re-identify.

Thanks for considering it.
Peter
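
The two-step function described above could look roughly like the following base R sketch. Everything here is an assumption for illustration: `split_phi()`, its arguments, and the placeholder generators are hypothetical names, not part of charlatan. In practice each generator would presumably wrap a charlatan provider that returns `n` fake values.

```r
# Hypothetical helper (not a charlatan function): build deid_x replacements
# for each PHI column and split the data into a PHI table and a shareable table.
split_phi <- function(df, phi_cols, generators, key_col = "key") {
  stopifnot(all(phi_cols %in% names(df)),
            all(phi_cols %in% names(generators)))
  n <- nrow(df)

  # Unique keys in random order, so the key does not simply mirror row order
  key_df <- setNames(data.frame(sample.int(n)), key_col)

  # One fake column per PHI column; each generator takes n and returns n values
  deid <- lapply(phi_cols, function(col) generators[[col]](n))
  names(deid) <- paste0("deid_", phi_cols)
  deid <- as.data.frame(deid, stringsAsFactors = FALSE)

  # Everything that is not PHI travels with the shareable table
  other <- df[, setdiff(names(df), phi_cols), drop = FALSE]

  list(
    phi  = cbind(key_df, df[, phi_cols, drop = FALSE]),  # store securely
    deid = cbind(key_df, deid, other)                    # safe to share
  )
}

# Toy data; names and values are made up
patients <- data.frame(
  name    = c("Ann", "Bob", "Cy"),
  zipcode = c("27701", "27705", "27710"),
  hgb     = c(12.1, 13.4, 11.8),
  stringsAsFactors = FALSE
)

# Placeholder generators standing in for real fake-data functions
generators <- list(
  name    = function(n) paste0("person_", seq_len(n)),
  zipcode = function(n) sprintf("%05d", sample.int(99999, n))
)

set.seed(1)
parts <- split_phi(patients, c("name", "zipcode"), generators)

# Re-identify only when really needed, by merging on the key
reidentified <- merge(parts$deid, parts$phi, by = "key")
```

Only `parts$deid` would be shared; `parts$phi` is the lookup table that has to stay in secure storage, since the key alone links the two.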

sckott commented 6 years ago

thanks @higgi13425

done already

not done, questions

For the below, I assume there's no standard format to this? is it just a string of letters and numbers? If so, we don't need specialized functions for each one

not done, can do


your function idea is interesting. i'll open a new issue for that so this issue can focus on the data types

higgi13425 commented 6 years ago

birthdate - the idea was to randomly select a day/month, and place the date of birth in a year that clearly is not the real date of birth, so that there is no confusion later between true dob and deid_dob. 1900 is a reasonable year, in that there are no people born in 1900 still alive.

county name - for my purposes, US county only. I could imagine that if this becomes popular, the equivalent in other countries would be worthwhile.

I agree, most of the numbers can already be done.

fax number ~ phone number

This sounds promising! Peter
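
The birthdate idea (a random day in 1900 as the index date, with every other date moved by the same offset) can be sketched in base R as below. `deid_shift_dates()` and the column names are hypothetical, chosen only for this illustration; the sketch assumes all date columns are already of class `Date`.

```r
# Hypothetical helper (not a charlatan function): move each record's dob to a
# random day in a placeholder year and shift the other dates by the same offset.
deid_shift_dates <- function(df, dob_col, date_cols, year = 1900) {
  # All calendar days in the placeholder year
  days <- seq(as.Date(sprintf("%d-01-01", year)),
              as.Date(sprintf("%d-12-31", year)),
              by = "day")

  # One random fake birthdate per row
  deid_dob <- sample(days, nrow(df), replace = TRUE)

  # Per-row offset in days between the fake and the real dob
  offset <- as.numeric(deid_dob - df[[dob_col]])

  out <- df
  for (col in c(dob_col, date_cols)) {
    # Adding the same offset keeps intervals between events intact
    out[[paste0("deid_", col)]] <- df[[col]] + offset
  }
  out
}

# Toy data; the dates are made up
visits <- data.frame(
  dob       = as.Date(c("1950-03-14", "1987-11-02")),
  admit     = as.Date(c("2015-06-01", "2016-01-20")),
  discharge = as.Date(c("2015-06-09", "2016-01-22"))
)

set.seed(42)
shifted <- deid_shift_dates(visits, "dob", c("admit", "discharge"))
```

Because each row gets a single offset, the number of days between `deid_dob`, `deid_admit`, and `deid_discharge` matches the real record, which is what preserves the sequence of events. The original date columns are still present in `shifted` and would need to be dropped before sharing.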

sckott commented 6 years ago

library(charlatan)

# draw a random date-time within the year 1900
z <- DateTimeProvider$new()
z$date_time_between("1900-01-01", "1900-12-31")