Spike: use a data anonymizer

newswim commented 2 years ago

Estimate the lift of using a tool like https://github.com/ArtLabss/open-data-anonymizer

derac commented 2 years ago

Looks straightforward. I would add another script as a processing step on the collected and parsed json data.

lianilychee commented 3 months ago

Validate with Josh if this is still needed.

tpadmanabhan commented 1 month ago

Context: this was originally a request from the grant so that we don't have raw data stored in our db.

e.g., nick sawyer = 123, josh leibowitz = 456

Data masking may also be an option for this

nicolassaw commented 5 days ago

I'm thinking that anonymization may be handled differently based on the type of data. For example, I think that defendant information can be removed entirely until we identify a need to connect cases for individuals together.
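As a rough illustration of removing defendant information from a parsed case record, here is a minimal sketch. The field names in `DEFENDANT_FIELDS` and the function name `strip_defendant_info` are hypothetical; the actual keys in the parsed JSON may differ, and nested defendant data would need extra handling.

```python
import copy

# Hypothetical field names; the real parsed-case schema may differ.
DEFENDANT_FIELDS = {"defendant_name", "defendant_address", "defendant_dob"}


def strip_defendant_info(case: dict) -> dict:
    """Return a copy of a parsed case with top-level defendant fields removed."""
    return {
        key: copy.deepcopy(value)
        for key, value in case.items()
        if key not in DEFENDANT_FIELDS
    }
```

The original record is left untouched, so this could run as a separate processing step after parsing.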

Because we want to use defense attorney information but don't want to reveal it, we could give each defense attorney one scrambled identifier and use it consistently everywhere that attorney appears. How do we keep track of those unique identifiers? I started to create a database of defense attorney information, but since that information only includes the full name and phone number, we could build a unique string (like 'attorney name:phone number') and then hash it. Whenever the same defense attorney shows up (same name and phone number), the hash will be the same, so rows can still be linked, but folks will only see the resulting hash in the dataset. This avoids the need for a separate database.
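A minimal sketch of that idea, assuming SHA-256 from Python's standard `hashlib` (the thread does not name a hash function). The function name `attorney_id` and the normalization of whitespace, case, and phone formatting are assumptions added so that small formatting differences still produce the same identifier.

```python
import hashlib


def attorney_id(name: str, phone: str) -> str:
    """Derive a stable identifier from 'attorney name:phone number'."""
    # Assumed normalization: lowercase the name and collapse whitespace.
    normalized_name = " ".join(name.lower().split())
    # Assumed normalization: keep only the digits of the phone number.
    digits = "".join(ch for ch in phone if ch.isdigit())
    key = f"{normalized_name}:{digits}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

With this, `attorney_id("Jane Doe", "(512) 555-0100")` and `attorney_id("jane doe", "512-555-0100")` yield the same 64-character hex string, so no lookup table is needed to link rows.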

However, someone could technically reverse-engineer the hash: take a known defense attorney's name and phone number, compute the hash, and match it to rows with that hash in the dataset. Perhaps we would then need to scramble that hash and store it in the CosmosDB to make it more secure after all?
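One possible way to address this without a lookup table, which is not something proposed above, is a keyed hash (HMAC): the identifier then depends on a secret key as well as the name and phone number, so knowing an attorney's details alone is not enough to recompute it. This is only a sketch; `keyed_attorney_id` is a hypothetical name, and where the secret key would live (environment variable, secret store) is left open.

```python
import hashlib
import hmac


def keyed_attorney_id(name: str, phone: str, secret_key: bytes) -> str:
    """Derive an identifier that cannot be recomputed without secret_key."""
    message = f"{name}:{phone}".encode("utf-8")
    # HMAC-SHA256 mixes the secret key into the digest.
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

The trade-off is that the key must be kept private and stable: if it changes, all identifiers change, and if it leaks, the matching attack described above becomes possible again.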

Any thoughts?

open-austin / indigent-defense-stats

Spike: use a data anonymizer #75