ghuser-io / ghuser.io

:octocat: Better GitHub profiles
https://ghuser.io
MIT License
809 stars, 47 forks

Count committers and authors differently? // formerly: Phantom commits created when editing on GitHub.com #181

Open ocdtrekkie opened 5 years ago

ocdtrekkie commented 5 years ago

Last week I created the https://github.com/ocdtrekkie/xrf_books repo. It has 13 commits, 10 submitted via the GitHub.com website.

It did add this repo to my ghuser.io profile but, bizarrely, it says I made only 57% of the commits in the repo, which makes very little sense. Looking closely, ghuser.io seems to believe the repo has 23 commits (which it doesn't), of which 13 (my actual 13 commits) are mine. So I posit that ghuser.io might be detecting some sort of additional phantom commit for each edit made directly in the GitHub.com web UI.
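The reported figures fit that hypothesis: 13 real commits plus one extra per web edit gives 23, and 13 out of 23 is about 57%. A quick check of the arithmetic (the variable names are only illustrative, and the "one extra per web edit" rule is the hypothesis being tested, not confirmed ghuser.io behaviour):

```python
real_commits = 13     # commits actually present in the repo
web_ui_commits = 10   # of those, submitted via the GitHub.com website

# Hypothesis: every web UI commit is counted one extra time.
counted_total = real_commits + web_ui_commits
share_percent = round(100 * real_commits / counted_total)

print(counted_total, share_percent)
```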

lourot commented 5 years ago

A known issue in which we count too many commits is when people force-push: when crawling daily, if the last commit crawled yesterday has since been replaced, it's hard to know where we left off. I need to check whether editing via GitHub's website also replaces or edits commits.

I'll have a closer look, hopefully this weekend. Thanks for reporting!

JPBotelho commented 5 years ago

I have this problem too.

lourot commented 5 years ago

OK I see. Sometimes, commits have a different author A and committer B, and we count them as if both A and B made a commit. It leads to a higher overall amount of commits, but:

So I like this "feature", but what I didn't think about is that when you do some work on GitHub's website, the committer is web-flow ( https://github.com/web-flow ) and you are the author, so we count twice as many commits and it looks like you did only half of the work. This is where the feature becomes a problem.
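A minimal sketch of the counting rule described above, assuming each commit is reduced to an author and a committer login. The function name and data layout are hypothetical and are not taken from the ghuser.io code base:

```python
from collections import Counter

def count_contributions(commits):
    """Credit the author of each commit, plus the committer when it is someone else."""
    counts = Counter()
    for commit in commits:
        counts[commit["author"]] += 1
        if commit["committer"] != commit["author"]:
            counts[commit["committer"]] += 1
    return counts

# The repo from the report: 10 commits made on the website, 3 pushed normally.
commits = (
    [{"author": "ocdtrekkie", "committer": "web-flow"}] * 10
    + [{"author": "ocdtrekkie", "committer": "ocdtrekkie"}] * 3
)

counts = count_contributions(commits)
total = sum(counts.values())
print(counts["ocdtrekkie"], counts["web-flow"], total)
```

With this rule the 13 real commits are reported as 23, and the author appears to own only 13 of them.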

I'm now preparing special handling for that user so that we don't get these phantom commits. It should be quite easy.

ocdtrekkie commented 5 years ago

IMHO, a commit should be attributed solely to its author. A lot of hijinks can happen in the process of committing and merging code, and arguably the use of web-flow seems to indicate GitHub doesn't treat the committer identity as noteworthy from an authorship standpoint. I certainly don't think that someone who cherry-picks my work has done work of equal weight to mine in writing it.

ocdtrekkie commented 5 years ago

Also, is the committer clearly revealed anywhere in the GitHub UI? If the goal is to have a clearly understood data source and a fairly predictable metric calculation, then I think a system which doubles the count for 5-10% of the commits, and attributes those extra commits to someone who is not the author and does not appear in the GitHub UI, is an inconsistent magic that is likely to confuse and confound.

lourot commented 5 years ago

TL;DR true (and this is very useful feedback, see more below) but that's too painful to improve right now and it can cause other regressions, sorry 🙁

Details:

> Also, is the committer clearly revealed anywhere in the GitHub UI?

Yes, they appear with two avatars; you can see many of them here: https://github.com/brandon-rhodes/uncommitted/commits/master

FYI, the reasons why we crawl commits are:

For that we have now crawled all the commits of 156000+ repositories and stored the number of commits per user per day, but we haven't stored whether the user was author or committer. (Stupid, right? 😄 Not entirely, as we try to keep the DB small.)

What you're saying makes sense and it's really useful to get this opinion (thanks!), but if we now want to keep only the authors, we need to re-crawl everything, and with our current API rate limit that will take weeks. It will disturb the daily crawling (the API rate limit is our bottleneck) and we'll have to merge the output of this long crawl with the daily one. I'd rather go through this painful process only if there is at least one other issue we're trying to solve at the same time, or if this imprecision turns out to be very problematic.

Also, I'd need to think about this more, because I know other users who think that any contribution (documentation, user support, marketing, design, code review, etc.) should be visible on ghuser. Depending on how you merge a PR, you can end up being the committer, and since you reviewed that PR, this is a contribution. If you cherry-pick a commit to an older branch (i.e. you backport a feature) and even need to resolve a conflict, you will be the committer, and this is a contribution. Users with this in mind will consider the current mechanism "better".

So here is what I'll do: I'll implement the special web-flow case I talked about for now, and I'll keep this issue open for the more general question of counting committers.
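One way such a special case could look, as a rough sketch only: committers on an ignore list are never credited, while genuine committers (for example someone backporting a commit) still are. The `IGNORED_COMMITTERS` set, the function name and the sample logins `alice` and `maintainer` are assumptions made for illustration, not the actual ghuser.io implementation:

```python
from collections import Counter

# Assumption: logins whose committer role should never earn a contribution.
IGNORED_COMMITTERS = {"web-flow"}

def count_contributions(commits, ignored_committers=IGNORED_COMMITTERS):
    """Credit authors always; credit committers only if they differ and are not ignored."""
    counts = Counter()
    for commit in commits:
        author, committer = commit["author"], commit["committer"]
        counts[author] += 1
        if committer != author and committer not in ignored_committers:
            counts[committer] += 1
    return counts

web_edits = [{"author": "ocdtrekkie", "committer": "web-flow"}] * 10
pushed = [{"author": "ocdtrekkie", "committer": "ocdtrekkie"}] * 3
backport = [{"author": "alice", "committer": "maintainer"}]  # hypothetical cherry-pick

repo_counts = count_contributions(web_edits + pushed)
backport_counts = count_contributions(backport)
print(dict(repo_counts), dict(backport_counts))
```

The web edits no longer produce phantom commits, while the cherry-pick case still credits both people, which leaves the more general committer question open.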

ocdtrekkie commented 5 years ago

@AurelienLourot That all sounds quite reasonable, and fixing for web-flow will definitely remove the most visible case of confoundment. :)

Do note that when I express my "IMHOs", there's a big capital letter on the H part. I'm not expecting you to reinvent the wheel based on one person's opinion.

lourot commented 5 years ago

The web-flow phantom commits are now gone :)

xrf_books - XRF Library Module
this repo has 29 commits
ocdtrekkie wrote 29 commits (100% of all 29 commits)

(https://github.com/ocdtrekkie/xrf_books has 30 commits right now, but when it was crawled the last commit hadn't been pushed yet)

Keeping this issue open for the more general issue of counting committers.