Test-suite for data anonymization

louislva / OpenActionData

Building a diverse and clean dataset of humans using the web. Open source.

https://open-action-data.vercel.app


Test-suite for data anonymization #2

Open louislva opened 1 year ago

louislva commented 1 year ago

We need to develop automatic data anonymization, and to do that sanely, we should have a test suite that checks for false negatives, i.e. sensitive values that survive anonymization.

A simple way to do that: Record a number of sessions of humans typing in (fake) sensitive data, and save them as JSON files. Then make a test suite that puts each JSON file through the anonymize() function and checks whether the values to be anonymized are still present afterwards. It should also check for them inside the concatenated keystrokes, since a value typed one key at a time never appears whole in any single event. If they are still present, the test case should fail.
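A minimal sketch of that check, under assumptions: the `RecordedSession` shape (with `keystrokes` and `sensitiveValues` fields) and the `Anonymizer` signature are hypothetical, since the recording format and anonymize() are not defined yet in this issue. It walks every string in the anonymized events and also checks the concatenated keystrokes.

```typescript
// Hypothetical session shape; the real recording format is not fixed in this issue.
interface RecordedSession {
  events: unknown[];         // recorded events, as loaded from a JSON file
  keystrokes: string[];      // keys pressed, in order (assumed field)
  sensitiveValues: string[]; // the fake sensitive values typed during the session
}

// Assumed signature for the anonymize() function mentioned above.
type Anonymizer = (session: RecordedSession) => RecordedSession;

const contains = (haystack: string, needle: string): boolean =>
  haystack.indexOf(needle) !== -1;

// Collect every string found anywhere inside a JSON-like value.
function collectStrings(node: unknown, out: string[] = []): string[] {
  if (typeof node === "string") {
    out.push(node);
  } else if (Array.isArray(node)) {
    for (const child of node) collectStrings(child, out);
  } else if (node !== null && typeof node === "object") {
    const record = node as Record<string, unknown>;
    for (const key of Object.keys(record)) collectStrings(record[key], out);
  }
  return out;
}

// Returns the sensitive values that survive anonymization (the false negatives).
function findLeaks(session: RecordedSession, anonymize: Anonymizer): string[] {
  const cleaned = anonymize(session);
  const strings = collectStrings(cleaned.events);
  const typed = cleaned.keystrokes.join("");
  return session.sensitiveValues.filter(
    (value) =>
      contains(typed, value) || strings.some((text) => contains(text, value)),
  );
}
```

A test case would then load one JSON file, call `findLeaks(session, anonymize)`, and fail if the returned array is not empty.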

The kind of sensitive data we should test for:

Email address
Password
Name
Home address
Phone number
Bank account / credit card details
Crypto seed phrases
API keys
Social security / VAT number / passport number
... anything else you can think of! Please throw a comment!

JohannesHa commented 1 year ago

Some of these seem to be covered by the maskInputOptions parameter of the rrweb.record function. The other cases could be handled in the maskInputFn and maskTextFn parameters of rrweb.record. https://github.com/rrweb-io/rrweb/blob/master/guide.md#options

MaskInputOptions: https://github.com/rrweb-io/rrweb/blob/588164aa12f1d94576f89ae0210b98f6e971c895/packages/rrweb-snapshot/src/types.ts#L77-L95

Probably still makes sense to build some kind of test-suite with mock events for rrweb.record to ensure that all edge cases are covered.
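For illustration, a rough sketch of what those options might look like. The `MaskingOptions` interface is a local stand-in declared here so the snippet runs without rrweb installed; only a few of the input-type keys are shown, and the exact maskInputFn signature may differ between rrweb versions (newer ones also pass the element).

```typescript
// Local stand-in for the subset of rrweb.record options discussed above.
// With rrweb installed, these fields would be passed alongside `emit`
// in the options object given to rrweb.record.
interface MaskingOptions {
  maskInputOptions: {
    password?: boolean;
    email?: boolean;
    tel?: boolean;
    text?: boolean;
  };
  maskInputFn: (text: string) => string;
}

const maskingOptions: MaskingOptions = {
  // Keys are input types, so this only covers fields that declare their type,
  // e.g. <input type="email">, not an email typed into a plain text field.
  maskInputOptions: { password: true, email: true, tel: true },
  // Replace every character of a masked input with an asterisk.
  maskInputFn: (text) => text.replace(/[\s\S]/g, "*"),
};
```

Because maskInputOptions is keyed by input type, a test suite is still needed to catch sensitive values entered into generic text fields.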

louislva commented 1 year ago

That actually looks pretty suitable! I'm curious whether maskInputFn & maskTextFn can also replace the masked value with a placeholder? Even if they can't, for shipping V1 we just need to censor personal data, not necessarily do the placeholders (although they'd be really useful to train with). I think we'll just put an "anonymization_scheme_version" column in the database, so you can see what's what.

Also, how do you think we'll go about censoring data we don't know is personally identifiable? For example, if I'm logged into Google, it'll display my full name in certain places.

One idea I had was to automatically scrape it (or simply ask the user for all their personal details), save it locally, and then use maskTextFn to look for the data which we know to be personal.
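A sketch of that idea, with assumptions: `KnownDetail` and `makeKnownDataMasker` are hypothetical names, and how the details get collected and stored locally is left out. The returned function has the text-in, text-out shape that could be passed as maskTextFn.

```typescript
// Hypothetical entry in the locally stored list of personal details.
interface KnownDetail {
  label: string; // used to build the placeholder, e.g. "full_name"
  value: string; // the personal value to look for
}

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a text-masking function from the known details.
function makeKnownDataMasker(details: KnownDetail[]): (text: string) => string {
  // Longest values first, so "Jane Doe" is replaced before "Jane".
  const ordered = details
    .filter((detail) => detail.value.length > 0)
    .slice()
    .sort((a, b) => b.value.length - a.value.length);

  return (text) =>
    ordered.reduce(
      (masked, detail) =>
        masked.replace(
          new RegExp(escapeRegExp(detail.value), "gi"), // case-insensitive match
          () => `[${detail.label.toUpperCase()}]`,
        ),
      text,
    );
}

const maskKnown = makeKnownDataMasker([
  { label: "first_name", value: "Jane" },
  { label: "full_name", value: "Jane Doe" },
]);
const maskedGreeting = maskKnown("Signed in as Jane Doe");
```

This only catches exact (case-insensitive) occurrences of values the user has supplied; variants such as initials or reformatted phone numbers would need extra handling.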

louislva commented 1 year ago

Looked into it: you set a maskTextSelector (could probably be *), and then maskTextFn gets triggered, which basically maps from old text to new text. So yes, we can do placeholders 🥳
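Roughly, that could look like the sketch below. `TextMaskingOptions` is a local stand-in type so the snippet runs without rrweb, the email pattern is deliberately simplistic, and the `[EMAIL]` placeholder format is just an assumption.

```typescript
// Local stand-in for the two rrweb.record options mentioned above.
interface TextMaskingOptions {
  maskTextSelector: string;
  maskTextFn: (text: string) => string;
}

// Deliberately simple email pattern, for illustration only.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

const textMaskingOptions: TextMaskingOptions = {
  // Match every element, so maskTextFn sees all recorded text.
  maskTextSelector: "*",
  // Map old text to new text, swapping emails for a placeholder.
  maskTextFn: (text) => text.replace(EMAIL_PATTERN, "[EMAIL]"),
};
```

Text without a match passes through unchanged, so the recording stays readable apart from the placeholders.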

louislva commented 1 year ago

Another important test case: profile picture anonymization! (In the top right of GitHub, for example; it's pretty easy to recover someone's identity from a picture of their face.)