theodi / open-data-certificate

The mark of quality and trust for open data
https://certificates.theodi.org/
MIT License

change 'aggregated' to 'anonymised' in question about people being identifiable #388

Closed JeniT closed 11 years ago

JeniT commented 11 years ago

based on feedback during training delivered by @statshero

JeniT commented 11 years ago

@statshero I don't think this is right: the option that talks about 'aggregated' is talking about summarising the data, with appropriate statistical disclosure controls, which is subtly different from anonymisation, which is more about removing personal details from individual-level data. I've made an attempt to add some clarification around the questions.

statzhero commented 11 years ago

@JeniT, anonymisation includes techniques such as aggregation. I believe the focus here is on "can people be identified" and not the nature of the data processing. Indeed, the following question is

Has your anonymisation process been independently audited?

Aggregation does not include methods such as suppression, sampling or perturbation. Also, even with aggregated data there is a risk that people can be identified.

JeniT commented 11 years ago

@statshero If you look at the change that I made (https://github.com/theodi/open-data-certificate/commit/d1aff1b214e30a9f1c319b211611ce8b424af5fd), you'll see that I changed the following question to "Have your statistical disclosure controls been independently audited?" Does this work? If not, can you suggest an alternative? Does there need to be a distinction between the answers:

or do you think the two answers should be combined and every dataset that is about people or their activities require a PIA for Pilot level? The discussions we had previously indicated that wasn't required.

statzhero commented 11 years ago

@JeniT Perhaps it is worth discussing whether you want to distinguish between statistical disclosure control (SDC) and anonymisation. However, I propose not to do so because they achieve the same thing. See also the (shortened) ICO definitions:

Disclosure Control: A technique used to control the risk of individuals being identified from statistical data

Anonymisation: The process of rendering data into a form which does not identify individuals

The difference between the two answers, in my understanding after the privacy workshop, is qualitative. We assume that the person filling out the questionnaire is not an expert on anonymisation. "no" gives the user the option to be confident in their anonymisation process (e.g. through aggregation). "yes" exists if

I would not suggest a PIA for pilot level. Keeping the three options with the change in the wording ("anonymised" instead of "aggregated") and more clarification should achieve this.

JeniT commented 11 years ago

My concern is that if people have an option that is "no, the data has been anonymised so individuals can't be identified" then they will think that they can select this option if they have attempted anonymisation, whether it's any good or not (e.g. if they've just removed names & addresses). The point is that aggregated/summarised data has less risk of disclosure than individual-level data.

statzhero commented 11 years ago

@JeniT you raise a valid concern. Thus, I suggest the following:

(I know you are aware that less is not zero. A conservative expert would have to choose "yes" if the data is derived from individuals because virtually all aggregation carries a non-zero risk of re-identification.)

JeniT commented 11 years ago

Done, thanks.