stackmuncher / stm_app

This software engineer profile builder turns your code into a detailed list of skills for an online directory of software developers.

https://stackmuncher.com

GNU Affero General Public License v3.0

22 stars, 1 fork

Add a list of all contributor commits to contributor report #17

Closed by rimutaka 2 years ago

rimutaka commented 2 years ago

The current method of uniquely identifying a project is the list of hashes of its remote URLs. That list is dynamic and can change at any time. It can even differ from one commit to the next when the commits are made on different machines. E.g. I may commit to GitHub and BitBucket from machine A and just to GitHub from machine B.

In most cases it should still be OK to use it as the project ID.

Backup Option 1: list of contributor commit IDs, last 100.
Backup Option 2: list of project commit IDs, last 100.
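A minimal sketch of both backup options, assuming commits are already available as simple records. The `Commit` type, its field names and the `last_commit_ids` helper are hypothetical and only illustrate the idea; they are not taken from stm_app.

```python
from typing import List, NamedTuple, Optional


class Commit(NamedTuple):
    # Hypothetical record; field names are assumptions for this sketch.
    sha1: str
    author_email: str
    timestamp: int  # seconds since the Unix epoch


def last_commit_ids(
    commits: List[Commit],
    author_email: Optional[str] = None,
    limit: int = 100,
) -> List[str]:
    """Return up to `limit` most recent commit IDs, newest first.

    With `author_email` set this is Option 1 (contributor commits),
    without it this is Option 2 (all project commits).
    """
    selected = [
        c for c in commits
        if author_email is None or c.author_email == author_email
    ]
    # Newest commits first, then cut to the requested length.
    selected.sort(key=lambda c: c.timestamp, reverse=True)
    return [c.sha1 for c in selected[:limit]]
```

Calling `last_commit_ids(commits, "me@example.com")` gives the contributor list, and `last_commit_ids(commits)` gives the project list.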

[x] Add contributor commits
[x] Add timestamps to every commit
[x] Drop remote URL hashes

rimutaka commented 2 years ago

Privacy implications

Using just the contributor commits should be a reliable backup option. It carries little privacy risk of someone cross-matching reports from different contributors to figure out who works with whom. Contributors A and B may be working on the same project, but their commit SHA1 lists should have no overlap unless it is some 3-way merge or something unusual. Even then I'm not sure they would overlap.

On the other hand, a project from GitHub can be matched to a private contributor report using contributor commits because they are a subset of the full commit list we have access to in public projects.
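The subset relationship described above can be sketched as a set comparison. The function name `matches_public_project` and the `min_overlap` parameter are assumptions made for illustration, not part of stm_app.

```python
from typing import Iterable


def matches_public_project(
    contributor_commits: Iterable[str],
    project_commits: Iterable[str],
    min_overlap: float = 1.0,
) -> bool:
    """Check whether a contributor report plausibly belongs to a project.

    `min_overlap` is the share of contributor commits that must appear
    in the project's full commit list. 1.0 means a strict subset test.
    """
    contributor = set(contributor_commits)
    if not contributor:
        # An empty report cannot be matched to anything.
        return False
    shared = contributor & set(project_commits)
    return len(shared) / len(contributor) >= min_overlap
```

A threshold below 1.0 would tolerate a few commits that never reached the public repository, at the cost of more false matches.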

rimutaka commented 2 years ago

Turns out just having the commit hash is not very useful for matching. If there is a conflict we'd need to look for other matches/mismatches in the project, which is expensive. Having the commit timestamp as a 2nd piece of the ID will make collisions much less likely. The timestamp can also be encoded as base58.

Example:

8char SHA-1: e29d17e6
full SHA-1: ... e29d17e69100b9f0b68b41aa4deb9721d1723dec
timestamp int: 1627380297
timestamp base58: 3UokTz
combined A: e29d17e6_3UokTz
combined B: e29d17e6_1627380297 <<< looks better
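The example above can be reproduced with a short sketch. It assumes the Bitcoin-style base58 alphabet, which does yield `3UokTz` for this timestamp; the function names `base58_encode_int` and `commit_id` are hypothetical.

```python
# Bitcoin-style base58 alphabet (no 0, O, I or l). This choice is an
# assumption; it reproduces the encoded timestamp from the example.
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode_int(value: int) -> str:
    """Encode a non-negative integer as a base58 string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return BASE58_ALPHABET[0]
    digits = []
    while value:
        value, remainder = divmod(value, 58)
        digits.append(BASE58_ALPHABET[remainder])
    # Digits were collected least significant first.
    return "".join(reversed(digits))


def commit_id(full_sha1: str, timestamp: int, use_base58: bool = False) -> str:
    """Combine the first 8 chars of the SHA-1 with the commit timestamp."""
    suffix = base58_encode_int(timestamp) if use_base58 else str(timestamp)
    return f"{full_sha1[:8]}_{suffix}"


sha = "e29d17e69100b9f0b68b41aa4deb9721d1723dec"
print(commit_id(sha, 1627380297, use_base58=True))  # combined A
print(commit_id(sha, 1627380297))                   # combined B
```

Option B is four characters longer per commit but stays readable and sortable without decoding.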

rimutaka commented 2 years ago

Remote URL hashes should be dropped

They allow tracking of private projects between different members.
They are redundant if we have contributor commits.

Contributions can still be linked to public projects by commit hashes. Look for remote_url_hashes and git_remote_url_regex.