snowplow / snowplow

The leader in Next-Generation Customer Data Infrastructure
http://snowplowanalytics.com
Apache License 2.0

Scala Common Enrich: support "pii" annotations in schemas for PII Enrichment #860

Closed (alexanderdean closed 4 years ago)

alexanderdean commented 10 years ago

PII = Personally Identifiable Information

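The basic idea:

  • Any JSON Schema (ue or context) can be annotated with "pii": true on a per-property basis
  • If this PII Scrubber is turned on, then we encrypt any given PII field using AES - so you end up with a unique but non-PII value, e.g. "Fred Blundun" always -> "1de6e53cb23"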

This would be of potential interest to users in healthcare or finance, where the ability for analysts to drill down to individual users could be a privacy concern

/cc @yalisassoon @fblundun

yalisassoon commented 10 years ago

This is an awesome idea. Would be fantastic if we could push this functionality upstream into the individual trackers...

On Thu, Jun 26, 2014 at 9:38 AM, Alexander Dean notifications@github.com wrote:

PII = Personally Identifiable Information

The basic idea:

  • Any JSON Schema (ue or context) can be annotated with "pii": true on a per-property basis
  • If this PII Scrubber is turned on, then we encrypt any given PII field using AES - so you end up with a unique but non-PII value, e.g. "Fred Blundun" always -> "1de6e53cb23"

This would be of potential interest to users in healthcare or finance, where the ability for analysts to drill down to individual users could be a privacy concern

/cc @yalisassoon @fblundun

fblundun commented 10 years ago

Is the idea that the schema would look something like:

{
 ...
  "type": "object",
  "properties": {
    "publicProperty": {
      "type": "string"
    },
    "privateProperty": {
      "type": "string",
      "pii": true
    }
  }
}
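
For illustration, here is a minimal Python sketch of how an enrichment step might consume the per-property annotation shown above. It is not Snowplow's implementation: the names `pii_fields`, `stand_in_transform` and `scrub` are hypothetical, only top-level properties are handled, and an unkeyed SHA-256 digest stands in for the AES transformation proposed in this issue.

```python
import hashlib


def pii_fields(schema):
    """Return names of top-level properties annotated with "pii": true."""
    properties = schema.get("properties", {})
    return {name for name, spec in properties.items() if spec.get("pii") is True}


def stand_in_transform(value):
    """Placeholder for the proposed AES step (assumption: SHA-256 hex digest)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def scrub(event, schema, transform=stand_in_transform):
    """Return a copy of event with each PII field passed through transform."""
    fields = pii_fields(schema)
    return {
        name: transform(value) if name in fields else value
        for name, value in event.items()
    }


# Schema mirroring the example above; "pii" is the proposed custom annotation.
schema = {
    "type": "object",
    "properties": {
        "publicProperty": {"type": "string"},
        "privateProperty": {"type": "string", "pii": True},
    },
}
event = {"publicProperty": "hello", "privateProperty": "Fred Blundun"}
scrubbed = scrub(event, schema)
```

Because the transform is deterministic, the same input always maps to the same replacement value, so distinct users can still be counted without exposing the original field.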

alexanderdean commented 10 years ago

What are publicProperty and privateProperty - is it a JSON Schema thing?

alexanderdean commented 10 years ago

Oh, being dense. Yes, that's what I mean!

alexanderdean commented 10 years ago

Can you add random annotations like pii in JSON Schema?

fblundun commented 10 years ago

Just tested it - it's allowed and doesn't change the results of any tests. I remember that the meta-schema for JSON Schema didn't forbid extra properties.

fblundun commented 10 years ago

An alternative format would be something like:

{
 ...
  "properties": {
    "public": {
      "type": "string"
    },
    "private": {
      "type": "string"
    }
  },
  "pii": ["private"]
}

Similar to how the required keyword works. I'm not sure which is better.
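
A short Python sketch of reading the list-style keyword above, under stated assumptions: `pii_fields_from_list` is a hypothetical helper, and rejecting names that are missing from "properties" is a design choice made here for illustration, not something JSON Schema or this issue requires.

```python
def pii_fields_from_list(schema):
    """Read PII property names from a top-level "pii" array."""
    declared = schema.get("pii", [])
    known = schema.get("properties", {})
    # Assumption: treat a listed name with no matching property as an error.
    unknown = [name for name in declared if name not in known]
    if unknown:
        raise ValueError(f"pii lists undeclared properties: {unknown}")
    return set(declared)


# Schema mirroring the alternative format above.
schema = {
    "properties": {
        "public": {"type": "string"},
        "private": {"type": "string"},
    },
    "pii": ["private"],
}
fields = pii_fields_from_list(schema)
```

One trade-off this makes visible: with a separate list, a misspelled name can drift out of sync with "properties", whereas a per-property annotation is always attached to the property it describes.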

alexanderdean commented 10 years ago

Interesting alternative!

alexanderdean commented 10 years ago

Trouble with doing it in the trackers:

  1. Implementation x10
  2. Cannot safely have an AES key in the trackers, so would have to erase PII rather than transform it. Defeats the point of adding the fields into the JSON Schemas in the first place
  3. Would have to retrieve the JSON Schemas to know which fields to erase

So as an alternative, we could consider moving this enrichment forwards into the Kinesis sinks - i.e. the raw S3 sink and the enrichment app would both apply this scrubber, so no PII ever touches disk.

alexanderdean commented 10 years ago

https://github.com/bcgit/bc-java/blob/master/prov/src/test/java/org/bouncycastle/jce/provider/test/AESTest.java

christoph-buente commented 8 years ago

I generally like the idea of getting rid of PII, because it might be a legal requirement. However, if it does not happen in the tracker (specifically the JS tracker), users could still see that PII is transferred to the collector. Using the Clojure collector ensures the PII ends up on disk in the log files.

So I created https://github.com/snowplow/snowplow-javascript-tracker/issues/465 to allow data to be encrypted before it is sent off. I think a key is not necessarily needed to encrypt data like names or email addresses. MD5, SHA1, SHA256 are the industry standards and are widely used. Wouldn't it be sufficient if we could apply a scrubber/anonymizer as a callback function to the formTracking?
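
A hedged Python sketch of the trade-off between the unkeyed hashes suggested above and a keyed transformation. All names, addresses and the key are hypothetical. HMAC-SHA256 is used as the keyed example only because it ships with the Python standard library; it is not the AES scheme proposed earlier in this issue.

```python
import hashlib
import hmac


def plain_hash(value):
    """Unkeyed SHA-256 hex digest, as a tracker-side hash might produce."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value, key):
    """HMAC-SHA256 hex digest; requires a secret key to reproduce."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# A guess list that someone without any secret could assemble.
candidates = ["alice@example.com", "bob@example.com"]

# Unkeyed: anyone can recompute digests for guessed inputs and compare.
observed = plain_hash("bob@example.com")
recovered = [c for c in candidates if plain_hash(c) == observed]

# Keyed: the same guessing attempt fails without the secret.
key = b"hypothetical-server-side-secret"
observed_keyed = keyed_hash("bob@example.com", key)
recovered_keyed = [c for c in candidates if plain_hash(c) == observed_keyed]
```

This only illustrates why a secret matters for guessable values such as email addresses; where that secret can safely live is the question raised earlier in the thread.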

alexanderdean commented 8 years ago

I agree that PII scrubbing (or hashing or encryption) in the trackers is preferable, as anything we do in the enrichment process is basically "too late" from the raw logs perspective.

But I'll keep this open because it would be nice to have this available, especially for applying retrospective scrubbing in the case that something sensitive slipped through a tracker...

alexanderdean commented 7 years ago

Renamed so that it follows on from https://github.com/snowplow/snowplow/issues/3472

alexanderdean commented 6 years ago

One of the nice things about this idea is that the pii: true hint would be enough for Iglu, when generating Redshift etc. tables, to make sure these columns are wide enough to take the hashed value.

It also just means that the work to identify that e.g. com.acme.email/send_email's email_recipient property is PII is just done in one place (at the time of schema authorship), rather than every user having to configure their own PII Enrichment.
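
The column-width point could look roughly like the Python sketch below. This is not Iglu's DDL generator: `column_width`, the default of 255 and the choice of a 64-character SHA-256 hex digest as the hashed form are all assumptions made for illustration.

```python
# Assumption: PII values are stored as SHA-256 hex digests (64 characters).
HASHED_WIDTH = 64


def column_width(spec, default=255):
    """Pick a VARCHAR width for a string property, widening PII columns."""
    declared = spec.get("maxLength", default)
    if spec.get("pii") is True:
        # The column must hold the hashed value, not just the raw one.
        return max(declared, HASHED_WIDTH)
    return declared


# Hypothetical properties for a schema such as com.acme.email/send_email.
properties = {
    "email_recipient": {"type": "string", "maxLength": 32, "pii": True},
    "subject": {"type": "string", "maxLength": 32},
}
widths = {name: column_width(spec) for name, spec in properties.items()}
```

Here the annotated email_recipient column is widened to fit the digest, while subject keeps its declared length.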

chuwy commented 4 years ago

Migrated to https://github.com/snowplow/enrich/issues/212