w3c / dpv

Data Privacy Vocabularies and Controls CG (DPVCG)
https://w3id.org/dpv

Re-Evaluate Anonymisation and Security Measure names for Correctness #15

Closed coolharsh55 closed 6 months ago

coolharsh55 commented 3 years ago

Migrated ISSUE-33: The categorisation of Pseudoanonymisation and Encryption is not (semantically) correct

State: RAISED
Raised by: Harshvardhan J. Pandit
Opened on: 2019-11-26
Description: (from presentation to Kantara CISWG) Anonymisation is a subclass of Pseudoanonymisation, which is semantically conflicting: it specifies that anonymisation is a type of pseudoanonymisation, which might not be intended. Also, Pseudoanonymisation and Encryption should not be grouped together (as a concept).
Reporter: Harsh
Notes: suggested to start a discussion on this issue.

mayaborges commented 2 years ago

I agree that Anonymisation should not be a subclass of Pseudoanonymisation, given that data cannot be both anonymised and pseudoanonymised. It could be argued that Anonymisation could be either Full (or True) Anonymisation or Pseudoanonymisation, in which case Pseudoanonymisation would be a subclass of Anonymisation, but that may introduce confusion between Anonymisation and Full Anonymisation and therefore be undesirable. So having Anonymisation and Pseudoanonymisation as parallels may be the best solution.

A possible name for a superclass for both types of anonymisation as well as encryption might be Data Obfuscation.

coolharsh55 commented 2 years ago

Hi Maya, thanks for the input, I agree with your arguments. I tried looking up EDPB and ISO definitions for these terms and how they are used, and it is similar to what you propose. But other uses (e.g. industry, technical) consider 'anonymisation' as a broad range of techniques which also includes pseudo-anonymisation.

Then there is further confusion as to what data is produced as an outcome of these processes. An anonymisation process may still produce personal data (non-anonymous) if it is associated with an identifier. For example, consider the case where an identifier is associated with an exact location. The anonymisation technique replaces this with the country. Now the data is anonymised through an anonymisation process but is still personal data. So there is a distinction between anonymisation as a technical term and anonymisation as applied under GDPR.
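The location example can be sketched in a few lines of Python. This is only an illustration: the `Record` type, its field names, and the `CITY_TO_COUNTRY` lookup are hypothetical and are not part of DPV or of any standard. It shows that generalising the location leaves the identifier in place, so the output is still linked to a person.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    user_id: str   # hypothetical direct identifier
    location: str  # exact location, or a generalised value


# Hypothetical lookup table, used only for this illustration.
CITY_TO_COUNTRY = {"Dublin": "Ireland", "Galway": "Ireland"}


def generalise_location(record: Record) -> Record:
    """Replace the exact location with its country; the identifier is untouched."""
    return replace(record, location=CITY_TO_COUNTRY[record.location])


def is_linked_to_identifier(record: Record) -> bool:
    """True while the record still carries an identifier for a person."""
    return bool(record.user_id)


original = Record(user_id="user-42", location="Dublin")
generalised = generalise_location(original)
```

Here `generalised` carries a coarser location but the same `user_id`, which is the sense in which a technique can be applied without the result being anonymous data.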

To support your proposal, maybe we can have Anonymisation as the general class of anonymisation-related techniques, and specifically PseudoAnonymisation and CompleteAnonymisation as subclasses. Data Obfuscation involves other techniques in addition to anonymisation, so it can be the parent class of Anonymisation once those other concepts have been identified.
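As a rough sketch of the hierarchy described in this comment (assuming a simple child-to-parent mapping, which is not how DPV itself is serialised), the proposed structure could be written as follows. The `PARENT`, `ancestors`, and `is_a` names are hypothetical helpers for illustration only.

```python
# Child -> parent mapping for the hierarchy proposed in this comment.
PARENT = {
    "Anonymisation": "DataObfuscation",
    "PseudoAnonymisation": "Anonymisation",
    "CompleteAnonymisation": "Anonymisation",
}


def ancestors(concept: str) -> list[str]:
    """Return the chain of parents, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept: str, candidate: str) -> bool:
    """True if `concept` is `candidate` or one of its descendants."""
    return concept == candidate or candidate in ancestors(concept)
```

With this layout, `is_a("PseudoAnonymisation", "Anonymisation")` holds while `is_a("CompleteAnonymisation", "PseudoAnonymisation")` does not, so the two subclasses sit in parallel rather than one under the other.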

coolharsh55 commented 2 years ago

Recording conversation at PEPR'22 about Anonymisation, where Damien pointed out this problem. The potential operation is changing "Anonymisation" to "AnonymisationMeasure" and "CompleteAnonymisation" to "Anonymisation" so as to bring these concepts in line with what is defined legally and in standards (e.g. ISO 29100) while keeping the 'taxonomy' of anonymisation approaches in tech/org measures.
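One practical detail of this rename is that the old name "Anonymisation" is reused as a new name, so the two renames have to be applied in a single pass rather than one after the other. A minimal sketch, assuming the hierarchy is held as a child-to-parent dictionary (the `rename_concepts` helper and the `before` data are hypothetical, not DPV tooling):

```python
RENAMES = {
    "Anonymisation": "AnonymisationMeasure",
    "CompleteAnonymisation": "Anonymisation",
}


def rename_concepts(parent_of: dict[str, str], renames: dict[str, str]) -> dict[str, str]:
    """Apply all renames in one pass, looking up each old name exactly once."""
    def new_name(name: str) -> str:
        return renames.get(name, name)

    return {new_name(child): new_name(parent) for child, parent in parent_of.items()}


before = {
    "PseudoAnonymisation": "Anonymisation",
    "CompleteAnonymisation": "Anonymisation",
}
after = rename_concepts(before, RENAMES)
```

Because every name is mapped once against the original names, "CompleteAnonymisation" becomes "Anonymisation" and is not renamed a second time to "AnonymisationMeasure".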

TedTed commented 2 years ago

Thanks Harshvardhan! To add a bit more explanation to this, I see a fairly serious risk with calling "Anonymization" the concept that corresponds to "The class of measures/processes that are used in order to make data less identifiable": we end up in a situation where people might use "Anonymization" on their data, and end up with data that is not "anonymized" according to ISO standards & EU regulation. This confusion happens frequently in the media, due to the use of the word "anonymization" to mean "de-identification" in the US. I've seen this create problems in my previous role in a big tech company, which is partly why we decided to only call something "anonymization" if it reached that high bar of making it impossible to re-identify people.

I strongly support changing "CompleteAnonymization" to simply "Anonymization", so that something is called "Anonymization" if and only if it leads to anonymized data, and the confusion disappears. Changing "Anonymization" to "AnonymizationMeasure" helps people understand that this might not be enough, so this definitely seems much better to me. It might not be enough, though. An alternative would be to call this "DeidentificationMeasure", and rename the process of removing identifiers something like "IdentifierRedaction" to avoid confusion. Yet another alternative, clearer but verbose, would be something like "ReidentificationRiskMitigation", to better capture this idea of "measure towards making it harder to identify people".

coolharsh55 commented 2 years ago

Thanks @TedTed; I have updated the title of this issue to (re-)evaluate all names in tech/org measures from this perspective, and to make changes where necessary.

coolharsh55 commented 2 years ago

Hi All, thanks for the feedback. The structure is now as follows:

derhagen commented 1 year ago

I fail to see the added value of introducing Deidentification over DataAnonymisationTechnique, which are defined as

DataAnonymisationTechnique: Use of anonymisation techniques that reduce the identifiability in data
Deidentification: Removal of identity or information to reduce identifiability

By definition, any measure that reduces identifiability of data needs to "remove information", in some sense. Therefore, Deidentification does not narrow down the space of techniques, and should either be further specified or omitted. Was Deidentification included with a reference to HIPAA? Even in that case, we should consider replacing Deidentification with the "Expert Determination" and "Safe Harbor" methods as mentioned here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

Otherwise, even though you renamed Anonymization to DataAnonymizationTechnique, I discovered this issue because I thought "Wait, Pseudonymization is not an Anonymization technique!". What about something along the lines of DataObfuscationTechnique?

derhagen commented 1 year ago

This discussion should probably be held in parallel with NonPersonalData and its subclasses, where some tidying up might be necessary. The Note of AnonymisedData refers to AnonymisedDataWithinScope, which does not seem to exist yet (ContextuallyAnonymisedData is a proposed term), and according to the ENISA source, SyntheticData "can be personal data, which are manipulated in a way to limit the potentials for individuals’ re-identification", which is not entirely aligned with DPV's definition.

derhagen commented 1 year ago

The GDPR (Recital 26) approach to anonymity is based on a rather risk-based "reasonable likeliness", based on

  • the costs of and
  • the amount of time required for identification, taking into consideration
  • the available technology
  • at the time of the processing and
  • [future] technological developments

Hence, these factors should be represented more precisely in the respective Class descriptions. As all of this is an active area of research and (in my opinion) not conclusively addressed by courts, it might make sense to mark these Classes as unstable or proposed, if that is possible?

coolharsh55 commented 1 year ago

Hi.

> I fail to see the added value of introducing Deidentification over DataAnonymisationTechnique, which are defined as...

Deidentification is a specific category of anonymisation techniques that focus on reducing identifiability. Anonymisation is broader than identifier removals because it also relates to potential re-combinations with other datasets to create identifiability.

> Was Deidentification included with a reference to HIPAA? Even in that case, we should consider to replace Deidentification with the "Expert Determination" and "Safe Harbor" methods as mentioned here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

Deidentification is a common term in this domain. E.g. there's even an ISO standard (20889:2018) about it https://www.iso.org/standard/69373.html. For HIPAA - the title explicitly states de-identification which is a strong argument to represent that concept. Further types of de-identification processes should be modelled as subclasses/sub-types of Deidentification, and not replace it. I would prefer the ISO terminology over HIPAA here as it is broader in scope and represents greater technical consensus, with HIPAA concepts added later within the resulting hierarchy (if needed). Pseudonymisation is declared as a DataAnonymisationTechnique (and not as a type of Anonymisation) for the sake of grouping anonymisation-related concepts together under an umbrella term.

> The Note of AnonymisedData refers to AnonymisedDataWithinScope, which does not seem to exist yet (ContextuallyAnonymisedData is a proposed term), and according to the ENISA source, SyntheticData "can be personal data, which are manipulated in a way to limit the potentials for individuals’ re-identification", which is not entirely aligned with DPV's definition.

AnonymisedDataWithinScope has been changed to ContextuallyAnonymisedData, and the note has been updated. Where SyntheticData is also personal data, the data should also be declared as a subclass/type of PersonalData. The note states it can be personal or non-personal. The description is taken from the ENISA guide on Data Protection Engineering, https://www.enisa.europa.eu/publications/data-protection-engineering

> The GDPR (Recital 26) approach to anonymity is based on a rather risk-based "reasonable likeliness", based on
>
>   • the costs of and
>   • the amount of time required for identification, taking into consideration
>   • the available technology
>   • at the time of the processing and
>   • [future] technological developments
>
> Hence, these factors should be represented more precisely in the respective Class descriptions. As all of this is an active area of research and (in my opinion) not conclusively addressed by courts, it might make sense to mark these Classes as unstable or proposed, if that is possible?

I see the value in representing this as a concept, but am unsure as to how it should be associated with processing information. My guess is to provide it as an organisational measure, similar to policies and assessments. So IdentifiabilityAssessment as an OrganisationalMeasure with the stated recital-26 concepts as descriptions. I do not think we should represent each of those factors individually as concepts and properties for only the scope of identifiability. Costs, time for technical processes, technology availability (e.g. TRL in SotA), and future predictions are far too broad and relevant for a lot of other concepts - so should be modelled with a greater scope (and careful consideration). I can add these as proposed concepts if you or someone else is willing to take on the task of investigating these.

coolharsh55 commented 1 year ago

We discussed this in today's meeting and are okay with the current list. We're keeping this open in case there are further discussions. Otherwise we will close this in the coming weeks as completed.

TedTed commented 1 year ago

For context, does the "current list" refer to this comment or to the state of the world prior to this issue?

coolharsh55 commented 1 year ago

Current list as in the concepts that are in DPV as of now, after the comments.

derhagen commented 1 year ago

Sorry for the late response, but I continue to raise the argument that Pseudonymization is not an anonymisation technique.

Thank you for your clarification of Deidentification. I think the fact that it refers to a term from an ISO standard should be mentioned in the Class description. Strictly following the Class descriptions as they are right now, Deidentification and DataAnonymisationTechnique describe equivalent things, without the additional knowledge of the mentioned ISO standard.

With respect to the Recital 26 criteria for anonymised data, I didn't propose to add these as organizational measures - even though that's a good idea - but simply to add a reference to Recital 26 and the mentioned criteria to the Class description or note, as they define what anonymised data is in the first place.

coolharsh55 commented 1 year ago

Hi. Thanks for your comment, I understand your point, and the need to change this.

> I continue to raise the argument that Pseudonymization is not an anonymisation technique.

Yes, strictly speaking this is correct, though the concept DataAnonymisationTechnique was intended to group related concepts together, as noted by the Irish Data Protection Commission in their Guidance on Anonymisation and Pseudonymisation pg.12. Still, as you state, it would be better to avoid this confusion. So based on the rationale laid out in NIST NISTIR 8053 De-Identification of Personal Information, these concepts are organised as follows:

> to add a reference to Recital 26 and the mentioned criteria to the Class description or note, as they define what anonymised data is in the first place

Instead of GDPR's recitals, the techniques have been linked to ISO 29100:2011 Security Techniques -- Privacy Framework definitions which are more broadly used.

coolharsh55 commented 6 months ago

Reviewed and closed based on implementation in https://w3id.org/dpv#vocab-TOM-technical which contains the described structure.