De-identification is no good

martinthomson commented 1 year ago

I'm sure that this is likely to stimulate discussion, but I don't believe that de-identification is good enough.

It might be good for certain classes of information (access logs on a site, for instance) but the document presents it as a generally good practice. It isn't.

This view of de-identification is dated and relies more on trust than alternative approaches such as differential privacy (which also has limitations, but not of the same magnitude). The core problem is that de-identification techniques are generally inadequate if your goal is to defend against a motivated attacker. Erasure that focuses on direct identifiers tends not to obscure the secondary identifiers that remain in a dataset.

See https://arxiv.org/abs/2202.13470 for some basic discussion, but you can look to many examples of re-identification that are out there.
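
To make the point about secondary identifiers concrete, here is a minimal sketch of a linkage attack. Everything in it is an assumption for illustration: the records, the field names, the choice of `zip`, `birth_year`, and `sex` as quasi-identifiers, and the helper names `quasi_key` and `reidentify` are invented, and none of it comes from the document under discussion or from the linked paper. It only shows the general shape of the problem: names are removed from the released data, yet a join against an auxiliary dataset restores them wherever the remaining attributes are unique.

```python
from collections import Counter

# Hypothetical "de-identified" release: direct identifiers (names) were removed,
# but secondary identifiers (zip, birth_year, sex) were left in place.
deidentified_records = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
]

# Hypothetical auxiliary data an attacker might hold (for example a public register).
public_register = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_year": 1972, "sex": "F"},
]

# Assumed set of quasi-identifiers shared by both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(released, auxiliary):
    """Link names to released rows whose quasi-identifiers match uniquely."""
    released_by_key = {}
    for row in released:
        released_by_key.setdefault(quasi_key(row), []).append(row)

    # How many people in the auxiliary data share each quasi-identifier key.
    aux_counts = Counter(quasi_key(person) for person in auxiliary)

    matches = {}
    for person in auxiliary:
        key = quasi_key(person)
        candidates = released_by_key.get(key, [])
        # Only claim a match when it is unambiguous on both sides.
        if len(candidates) == 1 and aux_counts[key] == 1:
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


matches = reidentify(deidentified_records, public_register)
print(matches)  # {'Alice Example': 'asthma'}
```

In this toy data only one person is re-identified, because the other released rows share a quasi-identifier combination. Real datasets tend to have many more attributes per row, which makes unique combinations more likely, not less.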

What is missing from this discussion is the general approach touched on briefly in Section 2.2.1 (on ancillary data), which looks for ways to use data without being able to collect it.

(I will point out that "ancillary" is possibly a bad framing. It's accurate, but only from certain perspectives. It's a word that dodges making a judgment about whether a particular usage is moral or not.)

The last sentence of this section also touches on an important part of that concept, but fails to expand on it.

> Note that controlled de-identified data, on its own, is not sufficient to make data processing appropriate.

Or, put differently, the use of the data might in itself constitute a privacy violation. That is where we need to come back to trust, or - my preference - governance structures that allow agreement to be reached on what is - or is not - moral usage.

npdoty commented 1 year ago

The current text goes into some detail about the caveats for de-identification, including a reference to collective governance and a last sentence noting that de-identification doesn't necessarily make processing appropriate.

Could you expand on what would better address your concern?

Maybe we should cite research, or be explicit about how attacks on de-identification are possible? Or is there a different framing of the principle (itself a little vague and exhortative) about when it is useful to work with controlled de-identified data?

pes10k commented 1 year ago

I've created a PR to implement the fix we discussed on the call: https://github.com/w3ctag/privacy-principles/pull/337

w3ctag / privacy-principles

De-identification is no good #285