igrigorik / gharchive.org

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
https://www.gharchive.org
MIT License

Broken email field #144

Closed notslang closed 8 years ago

notslang commented 8 years ago

As of fd53d3a80fd07289581541cc99446d2dce36c770, the email field is dropped entirely, and more recently it's obfuscated with SHA1. Since you've decided to preemptively block conversation on c9ae11426e5bcc30fe15617d009dfc602697ecde, I guess I'll reply here...

> GitHub's activity API reports committer emails, which are logged in the archive and are trivial to extract and aggregate at scale via the provided tools (both for good, and sadly, less-than-good use cases) [...] If you know the email you're looking for, you can compute the hash and lookup commits by email.

What prevents someone from providing a "SHA -> email" table for every email on GitHub? Unless there's some implementation detail I'm missing, this sounds like a case of security through obscurity that is only going to make this dataset harder to use, while not preventing spam. Furthermore, email is a public identifier - it's written on every single commit and is designed to let people find and contact you. It's not a secret that's meant to be hidden.
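To make the "SHA -> email" concern concrete, here is a minimal sketch. It assumes the archive stores a plain, unsalted SHA-1 hex digest of the email string; the thread only says "obfuscated with SHA1", so the exact encoding is an assumption. The helper names and the example addresses are hypothetical.

```python
import hashlib


def sha1_hex(email: str) -> str:
    """Hash an email the way the archive is ASSUMED to: unsalted SHA-1, hex-encoded."""
    return hashlib.sha1(email.encode("utf-8")).hexdigest()


def build_lookup(candidate_emails):
    """Build a 'SHA -> email' table from any list of known or guessed addresses."""
    return {sha1_hex(email): email for email in candidate_emails}


# Hypothetical candidate list; in practice it could come from any public source.
candidates = ["alice@example.com", "bob@example.org"]
lookup = build_lookup(candidates)

# Stand-in for a hash value read from an archived event.
observed_hash = sha1_hex("alice@example.com")

# Prints the matching address if it is in the table, otherwise None.
print(lookup.get(observed_hash))
```

Because the hash is deterministic and unsalted under this assumption, the same table works in both directions: it recovers an address from a hash, and it lets someone who already knows an address find that person's commits.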

> After discussing this with the GitHub folks, the conclusion was that we should obfuscate emails (and their API is likely to do so as well; GH is simply ahead of the API).

Unless GitHub is willing to break git entirely, people will just ignore GitHub's API & read the email from the patch generated from every commit:

From fd53d3a80fd07289581541cc99446d2dce36c770 Mon Sep 17 00:00:00 2001
From: Ilya Grigorik <ilya@igvita.com>
Date: Thu, 26 May 2016 15:38:41 -0700
Subject: [PATCH] drop emails from activity events
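As a rough illustration of how little work that takes, the sketch below pulls the author out of a patch header like the one above. It assumes the patch text has already been obtained by some means (fetching is left out), and `author_from_patch` is a hypothetical helper name. Only the standard-library `email.utils.parseaddr` is used.

```python
from email.utils import parseaddr

# The patch header quoted above, used as sample input.
PATCH_HEADER = """\
From fd53d3a80fd07289581541cc99446d2dce36c770 Mon Sep 17 00:00:00 2001
From: Ilya Grigorik <ilya@igvita.com>
Date: Thu, 26 May 2016 15:38:41 -0700
Subject: [PATCH] drop emails from activity events
"""


def author_from_patch(patch_text: str):
    """Return (name, email) from the 'From:' header, or ('', '') if absent."""
    for line in patch_text.splitlines():
        if not line.strip():
            break  # headers end at the first blank line
        if line.startswith("From: "):
            return parseaddr(line[len("From: "):])
    return ("", "")


name, address = author_from_patch(PATCH_HEADER)
print(name, address)
```

The first line of the patch starts with `From ` followed by the commit hash, not `From: `, so it is skipped and only the author header matches.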

igrigorik commented 8 years ago

> What prevents someone from providing a "SHA -> email" table for every email on GitHub?

You're right, if someone is willing to compile and provide a "rainbow table" on the side, we can't stop them. However, that by itself is not an overriding argument against obfuscating said data.

I understand and am sympathetic to all of your points. In fact, I've raised all the same questions and points in the past. However, this is a sensitive area with wildly different opinions (e.g. some of the current and past discussions on https://github.com/ghtorrent/ghtorrent.org) and we have to find a balance that is acceptable to all sides. After talking about these issues with the GitHub folks, we arrived at the current strategy -- you may disagree with it, I understand that.

igrigorik commented 8 years ago

Closing, feel free to reopen if needed.