Closed by rimutaka 2 years ago
Using just the contributor's commits should be a reliable backup option. It has minimal privacy implications, since it gives little scope for matching reports across contributors to figure out who works with whom. Contributors A and B may be working on the same project, but their commit SHA1 lists should have no overlap, except perhaps for a 3-way merge or something else unusual. Even then I'm not sure they would overlap.
On the other hand, a project from GitHub can be matched to a private contributor report using the contributor's commits, because they are a subset of the full commit list, which we have access to for public projects.
It turns out that having just the commit hash is not very useful for matching. If there is a conflict we'd need to look for other matches or mismatches in the project, which is expensive. Adding the commit timestamp as a second piece of the ID makes collisions less likely. The timestamp can also be encoded as base58.
e29d17e6
e29d17e69100b9f0b68b41aa4deb9721d1723dec
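A minimal sketch of the hash-plus-timestamp idea, using the commit hash quoted above as sample input. The function names, the 8-character prefix, the `_` separator and the choice of the Bitcoin base58 alphabet are all assumptions made for illustration, not something decided in this thread.

```python
from datetime import datetime, timezone

# Bitcoin-style base58 alphabet (assumed here): no 0, O, I or l.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58encode_int(n: int) -> str:
    """Encode a non-negative integer as a base58 string."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    if n == 0:
        return B58_ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 58)
        digits.append(B58_ALPHABET[rem])
    return "".join(reversed(digits))


def commit_id(sha1_hex: str, committed_at: datetime, prefix_len: int = 8) -> str:
    """Combine a hash prefix with the commit time (Unix seconds, base58).

    `committed_at` should be timezone-aware; a naive datetime would be
    interpreted in the local timezone by `timestamp()`.
    """
    ts = int(committed_at.timestamp())
    return f"{sha1_hex[:prefix_len]}_{b58encode_int(ts)}"


if __name__ == "__main__":
    # Hypothetical commit time, chosen only for the demo.
    when = datetime(2021, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(commit_id("e29d17e69100b9f0b68b41aa4deb9721d1723dec", when))
```

Two commits would then only be confused if both the hash prefix and the commit second coincide, which should make the expensive "look for other matches" step rare.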
They should be dropped. Contributions can be linked to public projects by commit hashes. Look for remote_url_hashes and git_remote_url_regex.
The current method of uniquely identifying a project is the list of hashes of its remote locations. That list is dynamic and can change at any time. It can even alternate between commits made on different machines. E.g. I may commit to GitHub and BitBucket from machine A and just to GitHub from machine B.
In most cases it should still be OK to use it as the project ID.
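To illustrate why the remote hash list still works as an ID in the machine A / machine B case, here is a rough sketch. It assumes SHA-256 over the trimmed URL and treats two reports as the same project when their hash sets share at least one entry. The URLs, the function names and the matching rule are hypothetical, not the actual implementation.

```python
import hashlib


def remote_url_hash(url: str) -> str:
    """Hash a single remote URL (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def remote_hashes(urls) -> set:
    """Build the set of hashes for all remotes configured on one machine."""
    return {remote_url_hash(u) for u in urls}


def same_project(a: set, b: set) -> bool:
    """Treat two reports as the same project if any remote hash is shared."""
    return not a.isdisjoint(b)


# Hypothetical remotes: machine A pushes to two hosts, machine B to one.
machine_a = remote_hashes([
    "https://github.com/example/project.git",
    "https://bitbucket.org/example/project.git",
])
machine_b = remote_hashes(["https://github.com/example/project.git"])
unrelated = remote_hashes(["https://github.com/example/other.git"])

if __name__ == "__main__":
    print(same_project(machine_a, machine_b))  # the lists differ but overlap
    print(same_project(machine_a, unrelated))
```

The two lists are not equal, so a strict equality check would split the project in two. An overlap check avoids that, but it fails once the machines share no remote at all, which is where the backup options come in.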
Backup Option 1: list of contributor commit IDs, last 100.
Backup Option 2: list of project commit IDs, last 100.
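A rough sketch of how the backup options could be used for matching, assuming commit IDs are kept oldest to newest and that a project is matched when most of a contributor's recent commit IDs appear in the project's commit list. The function names and the idea of an overlap ratio are my assumptions, the 100 limit is from the options above.

```python
def last_commits(commit_ids, limit=100):
    """Keep only the most recent `limit` IDs (input ordered oldest to newest)."""
    return list(commit_ids[-limit:])


def overlap_ratio(contributor_ids, project_ids) -> float:
    """Share of the contributor's commit IDs that appear in the project list."""
    contributor = set(contributor_ids)
    if not contributor:
        return 0.0
    return len(contributor & set(project_ids)) / len(contributor)


if __name__ == "__main__":
    # Made-up IDs: the contributor authored every third commit of the project.
    project = [f"c{i}" for i in range(300)]
    contributor = last_commits(project[::3])
    print(len(contributor), overlap_ratio(contributor, project))
```

Note that a contributor's last 100 commits are not necessarily among the project's last 100, so Option 1 is best compared against the full commit list of a public project, while Option 2 only helps when the two truncated lists cover roughly the same period.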