public-salaries / public_salaries

Public sector employee salaries

Creating final de-identified datasets #11

Open soodoku opened 6 years ago

soodoku commented 6 years ago

The time has come to build the first draft of the final data.frames + dictionary that we will include in the R data package. It makes sense to pick the low-hanging fruit first, so let's start with California: it has the twin virtues of being relatively clean and big.

For CA, write a script that:

  a. replaces each name with a random 10-character string
  b. does data integrity checks and flags or fixes issues as needed
  c. unzips and rbinds years and tiers of government, and adds useful information, such as what level of government or what year the data are from, if it is missing (see the sketch after this list)
  d. produces tidy data as the final outcome
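A minimal sketch of what (c) could look like, assuming one CSV per year under a data/ca folder with the year embedded in the file name (the paths, file naming, and column structure here are all hypothetical):

```r
# Hypothetical sketch of step (c): stack per-year CSVs and record the year
# and level of government. File layout and column structure are assumptions.
files <- list.files("data/ca", pattern = "\\.csv$", full.names = TRUE)

ca <- do.call(rbind, lapply(files, function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  # e.g. "state-2016.csv" -> 2016; assumes the file name's only digits are the year
  df$year  <- as.integer(gsub("\\D", "", basename(f)))
  df$level <- "state"  # assumed constant within this folder
  df
}))
# rbind() needs identical columns across files; the integrity checks in (b)
# should verify that before stacking.
```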

After that, write an Rmd that presents some basic summaries of the data and a data dictionary.
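The summaries in that Rmd might be as simple as the following (total_pay is a made-up placeholder column name, not necessarily what the CA data uses):

```r
# The kind of summaries the Rmd could present (column names are placeholders):
summary(ca$total_pay)                    # pay distribution
table(ca$year, ca$level)                 # coverage by year and level of government
aggregate(total_pay ~ year, ca, median)  # median pay over time
```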

Note: if you think you can improve the description of the issue, please do. And don't let the description keep you from doing sensible things.

ChrisMuir commented 6 years ago

One question: is it important that the de-identified string be identical for identical raw string values? For example, if "cats" appears 12 times, should all 12 post-anonymization strings be identical, i.e., "sdlfijosd98fs" every time?

If so, we should consider getting hash values, maybe via the digest package.
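Something like this, just to illustrate (the hash algorithm here is arbitrary):

```r
# Deterministic hashing: identical inputs always yield identical digests,
# so every occurrence of "cats" maps to the same anonymized string.
library(digest)

anonymize <- function(x) {
  vapply(x, digest, character(1), algo = "md5", USE.NAMES = FALSE)
}

anonymize(c("cats", "dogs", "cats"))
# the first and third values are identical because the inputs are
```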

soodoku commented 6 years ago

Thanks, @ChrisMuir!

Was thinking about this particular point and hashing. Ya, we do want to map each specific string to a particular value. Basically, it lets us achieve our purpose (not making it too easy for people to look up specific individuals) without losing much info.

On to the point about losing info: names are pretty useful for imputing gender and ethnicity. So we probably want to enrich the data a bit, imputing race and gender via Lincoln Mullen's gender package and my ethnicolr package, to make up for losing this info.
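Roughly, the gender side could look like the sketch below (ethnicolr is Python, so that part would run outside R; the first names here are made up):

```r
# Rough sketch of gender imputation via Lincoln Mullen's gender package.
# Assumes first names have already been split out of the raw name field;
# method = "ssa" also needs the genderdata package installed.
library(gender)

first_names <- c("maria", "james", "taylor")  # hypothetical examples
preds <- gender(first_names, method = "ssa")
preds[, c("name", "gender", "proportion_female")]
```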

Do you think that's a reasonable way to go? I worry just a bit about having people's names, but perhaps we should just go with it. What are your thoughts?

ChrisMuir commented 6 years ago

I feel like removing proper names is a good idea. Even though all of this data is public, it feels weird to leave in people's names and make it all available in one central source.

If we want to hash the names prior to release, I assume we shouldn't impute gender and race prior to hashing and add them as two new variables? As in, that's not a good option, probably because it would be seen as not very transparent... is this correct?

Yeah, the more I think about it, the more I'm leaning towards hashing the names. I'm not an expert in data science ethics, though.

soodoku commented 6 years ago

We are on the same page. In the final 'clean' data we package in R, we won't have actual names. As is the norm, we will have two packages: one data package (downloadable from GH) and one that provides the API.

Proposed order for starting on our effort:

  1. Merge and 'clean' the data for a single state (California).
  2. Augment the data with ethnicity and gender.
  3. Build a data dictionary.
  4. Get ready to export.
  5. Find a way to 'hash' names to random strings. The problem with hashing is that there is a chance of a 'collision', and we want to guarantee uniqueness; I will think more about this. We also want to make reverse searches non-trivial. Once we have figured this out, we export an .rda file to the respective R data package folder (see the sketch below).
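For item 5, one possible approach (a sketch, not the final design): key the hash with a secret salt so precomputed reverse lookups fail, truncate to 10 characters, and check for collisions explicitly. The salt and column names below are placeholders:

```r
# Salted, truncated hashing with an explicit collision check.
library(digest)

salt <- "a-long-secret-string-kept-out-of-the-repo"  # placeholder

hash_names <- function(x, salt, n_chars = 10) {
  full <- vapply(x, function(nm) digest(paste0(salt, nm), algo = "sha256"),
                 character(1), USE.NAMES = FALSE)
  substr(full, 1, n_chars)
}

u_names <- unique(ca$name)
u_ids   <- hash_names(u_names, salt)
# Guarantee uniqueness: if truncation ever produces a collision,
# rerun with a larger n_chars (or a different salt).
stopifnot(!anyDuplicated(u_ids))
ca$name_id <- u_ids[match(ca$name, u_names)]
ca$name    <- NULL  # drop raw names before export
```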