As @ccerv1 discovered, there are pockets of dates that are simply missing data (this applies to all events). On top of that, the commit data that we derive from PushEvents seems to be off for some projects when we compare it with the data collected by the old-style "Collector". We need to decide how we want to move forward with this information. I have some thoughts, but in this issue I'm trying to capture the scope of the problem.
Background information:
Commits don't appear as events.
PushEvents are batches of commits that are pushed by users of GitHub.
A PushEvent has two possible fields for a user: author and actor.
The author only has a name and an email.
The actor is the GitHub user login that pushed these commits to GitHub.
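To make the author/actor distinction concrete, here is a minimal sketch that flattens one PushEvent into one row per commit. It assumes the event is a plain dict shaped like a GH Archive record (`type`, `created_at`, `actor.login`, `payload.commits[].author`); the function name and the output column names are hypothetical and only for illustration.

```python
from typing import Any


def commits_from_push_event(event: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one PushEvent into one row per commit (illustrative helper)."""
    if event.get("type") != "PushEvent":
        return []
    # The actor is whoever pushed, which is not necessarily who authored the commits.
    pusher_login = (event.get("actor") or {}).get("login")
    rows = []
    for commit in (event.get("payload") or {}).get("commits") or []:
        author = commit.get("author") or {}
        rows.append(
            {
                "sha": commit.get("sha"),
                "author_name": author.get("name"),    # no login is available here
                "author_email": author.get("email"),
                "pusher_login": pusher_login,
                "pushed_at": event.get("created_at"),  # push time, not authoring time
            }
        )
    return rows


# Made-up sample event, assuming the payload shape described above.
sample_event = {
    "type": "PushEvent",
    "created_at": "2023-01-01T00:00:00Z",
    "actor": {"login": "github-merge-bot"},
    "payload": {
        "commits": [
            {"sha": "abc123", "author": {"name": "Ada", "email": "ada@example.com"}},
        ]
    },
}
rows = commits_from_push_event(sample_event)
```

Note that the only login in each row is the pusher's; the author is identified by name and email alone.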
Issues with PushEvent
We only get up to 20 commits per PushEvent. If a committer pushes more than 20 commits at once, the extra commits do not show up in GH Archive.
PushEvents don't reflect the actual authoring date. Commits can be authored at one time and pushed to GitHub at a later time. I'm also assuming that some fast-forwardable commits on a PR behave the same way.
Because of the way authors are reported on commits, we don't get the correct login for the users who are making commits. We get the login of the user making the PushEvent, which can sometimes be something like the github-merge-bot. In some cases this makes it impossible, without some out-of-band solution, to relate the author of a commit inside a PushEvent to a GitHub login.
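One way to size these issues is to flag the affected events. The sketch below assumes the PushEvent payload carries a `size` field with the total number of commits in the push, alongside the capped `commits` list; that field name is an assumption to verify against the archived data. The helper name and the bot list are hypothetical.

```python
from typing import Any

# Illustrative list only; real bot logins would need to be collected separately.
KNOWN_BOT_LOGINS = frozenset({"github-merge-bot"})


def push_event_flags(
    event: dict[str, Any], bot_logins: frozenset = KNOWN_BOT_LOGINS
) -> dict[str, Any]:
    """Report how many commits a PushEvent is missing and whether a bot pushed it."""
    payload = event.get("payload") or {}
    listed = len(payload.get("commits") or [])
    declared = payload.get("size")  # assumed: total commits in the push
    if declared is None:
        declared = listed
    login = (event.get("actor") or {}).get("login")
    return {
        "listed_commits": listed,
        "missing_commits": max(declared - listed, 0),
        "pushed_by_bot": login in bot_logins,
    }


# Made-up push of 25 commits where only the first 20 are listed.
big_push = {
    "type": "PushEvent",
    "actor": {"login": "github-merge-bot"},
    "payload": {"size": 25, "commits": [{"sha": f"{i:040x}"} for i in range(20)]},
}
flags = push_event_flags(big_push)
```

Summing `missing_commits` per project would give a rough lower bound on how many commits the archive cannot account for.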
Options
@ccerv1 and I discussed the options that we have for moving forward. I'd also like to try some things to ensure we've looked at all of the possible commit data in GitHub.
Option 1:
Continue to use GH Archive without additional tooling through the API, and accept some loss of data resolution for now.
Option 2:
Hybrid: use GH Archive and create a way to conditionally retrieve data that we may be missing.
Option 3:
Continue to use the Collectors for Commit collection.
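As a rough illustration of what the conditional retrieval in Option 2 could look like, the sketch below builds a list of GitHub REST API "compare two commits" URLs for pushes that appear truncated. It assumes the payload exposes `size`, `before`, and `head`, and that `repo.name` has the form `owner/repo`. The function name is hypothetical, and nothing here actually calls the API.

```python
from typing import Any

ZERO_SHA = "0" * 40  # a "before" of all zeros typically marks a newly created ref


def compare_urls_for_truncated_pushes(events: list[dict[str, Any]]) -> list[str]:
    """Return compare-API URLs for PushEvents that list fewer commits than they declare."""
    urls = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        payload = event.get("payload") or {}
        listed = len(payload.get("commits") or [])
        declared = payload.get("size")  # assumed: total commits in the push
        if declared is None or declared <= listed:
            continue  # nothing appears to be missing
        repo = (event.get("repo") or {}).get("name")
        before, head = payload.get("before"), payload.get("head")
        if not (repo and before and head) or before == ZERO_SHA:
            continue  # no usable base commit; would need a different strategy
        urls.append(f"https://api.github.com/repos/{repo}/compare/{before}...{head}")
    return urls


# Made-up events: the first is truncated (3 declared, 1 listed), the second is complete.
events = [
    {
        "type": "PushEvent",
        "repo": {"name": "octo/demo"},
        "payload": {"size": 3, "before": "aaa", "head": "bbb", "commits": [{"sha": "bbb"}]},
    },
    {
        "type": "PushEvent",
        "repo": {"name": "octo/demo"},
        "payload": {"size": 1, "before": "bbb", "head": "ccc", "commits": [{"sha": "ccc"}]},
    },
]
urls = compare_urls_for_truncated_pushes(events)
```

Only the truncated pushes would trigger API calls, which keeps the extra request volume proportional to the amount of missing data.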
Things to explore
I think some of the useful commit data may be lurking behind pull requests. We should see what it would look like to include all the commits from a merged pull request, and whether that at least brings the totals closer to what we expect (it may also push the totals over the expected numbers).
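The overcounting risk can be sketched as follows, assuming we already have the commit SHAs seen in PushEvents and the SHAs listed on merged pull requests (both inputs are made up here, and the helper name is hypothetical). Deduplicating by SHA removes commits that show up in both sources, but it would not catch squash or rebase merges, which create new SHAs.

```python
def combined_commit_count(push_shas: list[str], pr_shas: list[str]) -> int:
    """Count distinct commits across PushEvents and merged pull requests."""
    return len(set(push_shas) | set(pr_shas))


push_shas = ["a1", "b2", "c3"]  # SHAs seen in PushEvents
pr_shas = ["c3", "d4"]          # SHAs listed on merged pull requests

naive_total = len(push_shas) + len(pr_shas)                # counts "c3" twice
deduped_total = combined_commit_count(push_shas, pr_shas)  # counts "c3" once
```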