Closed by alexanderdean 4 years ago
This is an awesome idea. It would be fantastic if we could push this functionality upstream into the individual trackers...
On Thu, Jun 26, 2014 at 9:38 AM, Alexander Dean notifications@github.com wrote:
PII = Personally Identifiable Information
The basic idea:
- Any JSON Schema (unstructured event or context) can be annotated with "pii": true on a per-property basis
- If this PII Scrubber is turned on, then we encrypt any given PII field using AES - so you end up with a unique but non-PII value, e.g. "Fred Blundun" always -> "1de6e53cb23"
This would be of potential interest to users in healthcare or finance, where the ability for analysts to drill down to individual users could be a privacy concern
/cc @yalisassoon @fblundun
Is the idea that the schema would look something like:
{
  ...
  "type": "object",
  "properties": {
    "publicProperty": {
      "type": "string"
    },
    "privateProperty": {
      "type": "string",
      "pii": true
    }
  }
}
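As a rough illustration of how a scrubber might consume the per-property annotation, here is a minimal Python sketch. It is not the Snowplow implementation: `scrub_pii`, the sample event, and the key are hypothetical, and keyed HMAC-SHA-256 from the standard library stands in for the AES step proposed in the issue. Unlike AES, an HMAC digest cannot be decrypted, but it is still deterministic, so the same input and key always give the same replacement value.

```python
import hashlib
import hmac


def scrub_pii(schema: dict, event: dict, key: bytes) -> dict:
    """Return a copy of `event` where properties flagged "pii": true are replaced by a keyed digest."""
    scrubbed = dict(event)
    for name, spec in schema.get("properties", {}).items():
        if spec.get("pii") is True and name in scrubbed:
            value = str(scrubbed[name]).encode("utf-8")
            # Assumption: HMAC-SHA-256 stands in for the AES step from the issue.
            scrubbed[name] = hmac.new(key, value, hashlib.sha256).hexdigest()
    return scrubbed


# Schema mirroring the example above (the elided "..." parts are left out).
schema = {
    "type": "object",
    "properties": {
        "publicProperty": {"type": "string"},
        "privateProperty": {"type": "string", "pii": True},
    },
}

# Hypothetical event and key, for illustration only.
event = {"publicProperty": "homepage", "privateProperty": "Fred Blundun"}
scrubbed = scrub_pii(schema, event, key=b"hypothetical-secret")
```

Only top-level properties are handled here; nested objects would need a recursive walk.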
What are publicProperty and privateProperty - are they a JSON Schema thing?
Oh, I'm being dense. Yes, that's what I mean!
Can you add arbitrary annotations like pii in JSON Schema?
Just tested it - it's allowed and doesn't change the results of any tests. I remember that the meta-schema for JSON Schema didn't forbid extra properties.
An alternative format would be something like:
{
  ...
  "properties": {
    "public": {
      "type": "string"
    },
    "private": {
      "type": "string"
    }
  },
  "pii": ["private"]
}
Similar to how the required keyword works. I'm not sure which is better.
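For comparison, here is a small Python sketch of how a consumer could read either annotation style. The helper name `pii_properties` is hypothetical, and it only looks at top-level properties; it is meant to show that both formats carry the same information, not to settle which is better.

```python
def pii_properties(schema: dict) -> set:
    """Collect the names of top-level properties marked as PII in either format."""
    # Format 1: "pii": true on the individual property.
    flagged = {
        name
        for name, spec in schema.get("properties", {}).items()
        if isinstance(spec, dict) and spec.get("pii") is True
    }
    # Format 2: a schema-level "pii" array, similar to "required".
    listed = set(schema.get("pii", []))
    return flagged | listed


per_property = {
    "properties": {
        "publicProperty": {"type": "string"},
        "privateProperty": {"type": "string", "pii": True},
    }
}

schema_level = {
    "properties": {
        "public": {"type": "string"},
        "private": {"type": "string"},
    },
    "pii": ["private"],
}

per_property_names = pii_properties(per_property)
schema_level_names = pii_properties(schema_level)
```

The per-property flag keeps the annotation next to the field it describes, while the array form gives a single place to scan, as with required.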
Interesting alternative!
Trouble with doing it in the trackers:
So as an alternative, we could consider moving this enrichment forwards into the Kinesis sinks - i.e. the raw S3 sink and the enrichment app would both apply this scrubber, so no PII ever touches disk.
I generally like the idea of getting rid of PII, because it might be a legal requirement. However, if it does not happen in the tracker (specifically the JS tracker), users could still see that PII is transferred to the collector. Using the Clojure collector also means the PII ends up on disk in the log files.
So I created https://github.com/snowplow/snowplow-javascript-tracker/issues/465 to allow data to be encrypted before it is sent off. I don't think a key is necessarily needed to encrypt data like names or email addresses. MD5, SHA1 and SHA256 are the industry standards and are widely used. Wouldn't it be sufficient if we could apply a scrubber/anonymizer as a callback function to the formTracking?
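To make the keyless variant concrete, here is a minimal sketch of such an anonymizer callback. The tracker itself is JavaScript; Python is used here only for illustration. The names `anonymize`, `form_fields` and `pii_fields` are hypothetical, and trimming and lower-casing the value before hashing is an assumption added so that trivially different spellings of the same email map to the same digest.

```python
import hashlib


def anonymize(value: str) -> str:
    """Replace a value with its unkeyed SHA-256 hex digest."""
    # Assumption: normalise whitespace and case before hashing.
    normalized = value.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical form submission and the set of fields treated as PII.
form_fields = {"email": "Fred@Example.com ", "comment": "hello"}
pii_fields = {"email"}

# Apply the callback to PII fields only, before anything is sent to the collector.
outgoing = {
    name: anonymize(value) if name in pii_fields else value
    for name, value in form_fields.items()
}
```

Because no key is involved, anyone who hashes a candidate email with the same function gets the same digest.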
I agree that PII scrubbing (or hashing or encryption) in the trackers is preferable, as anything we do in the enrichment process is basically "too late" from the raw logs perspective.
But I'll keep this open because it would be nice to have this available, especially for applying retrospective scrubbing in the case that something sensitive slipped through a tracker...
Renamed so that it follows on from https://github.com/snowplow/snowplow/issues/3472
One of the nice things about this idea is that the pii: true hint would be enough for Iglu, when generating tables for Redshift etc., to make sure these columns are wide enough to take the hashed value.
It also just means that the work to identify that e.g. com.acme.email/send_email's email_recipient property is PII is just done in one place (at the time of schema authorship), rather than every user having to configure their own PII Enrichment.
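As a rough guide to what "wide enough" could mean, the sketch below computes the hex-encoded digest length for the hash functions mentioned earlier in the thread. How Iglu would actually size columns is not specified here, so treat `hex_width` as a hypothetical helper rather than part of any Snowplow tool.

```python
import hashlib


def hex_width(algorithm: str) -> int:
    """Number of characters in the hex-encoded digest for the given algorithm."""
    # Each digest byte becomes two hex characters.
    return hashlib.new(algorithm).digest_size * 2


widths = {name: hex_width(name) for name in ("md5", "sha1", "sha256")}
```

That gives 32, 40 and 64 characters for MD5, SHA1 and SHA256 respectively, regardless of how long the original value was. An AES ciphertext, by contrast, grows with the input length, so its column width would depend on the maximum length of the source field.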
Migrated to https://github.com/snowplow/enrich/issues/212