igrigorik / gharchive.org

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
https://www.gharchive.org
MIT License
2.67k stars 207 forks

Only about 1/7 of all Linux commits recorded in GHA #213

Open lukaszgryglicki opened 5 years ago

lukaszgryglicki commented 5 years ago

Hi, I have a question: should all commits stored in a git repository be present in the GH Archive? I have a local clone of the torvalds/linux repository. Counting all distinct commit SHAs with git log --pretty=format:"%H" --since="2018-04-01" --until="2019-04-01" | uniq | wc -l gives 75141 commits for the one-year period 2018-04-01 to 2019-04-01. When analyzing all torvalds/linux commits stored in the GH Archive, I can only find 10542 distinct SHAs. Could the problem be that the Linux GitHub repository is only a mirror, so most commits never appear via the GitHub API?

lukaszgryglicki commented 5 years ago

I've saved all individual event JSONs that refer to torvalds/linux into one big JSON file (about 600M in size). Many SHAs are not present in this JSON. Google BigQuery doesn't return them either - but it uses the githubarchive dataset, so I would expect the root cause to be here. I know there were problems with capturing all events in the past (GitHub API limits) - is it possible that only 1/7 of Linux commits are recorded in GHA because of this?
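For reference, extracting distinct PushEvent SHAs from raw GH Archive event JSON (one event per line in the hourly *.json.gz files) can be sketched roughly like this - a minimal, hypothetical example, not the exact DevStats code:

```python
import json

def distinct_push_shas(lines, repo="torvalds/linux"):
    """Collect distinct commit SHAs from PushEvents for one repo.

    `lines` is an iterable of raw JSON event strings, one event per
    line, as found in GH Archive hourly dump files.
    """
    shas = set()
    for line in lines:
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue
        if event.get("repo", {}).get("name") != repo:
            continue
        # Note: the commits array in a push payload may be capped
        # by GitHub, so this undercounts large pushes.
        for commit in event.get("payload", {}).get("commits", []):
            shas.add(commit["sha"])
    return shas

# Tiny inline sample standing in for a real hourly archive file
sample = [
    json.dumps({"type": "PushEvent",
                "repo": {"name": "torvalds/linux"},
                "payload": {"commits": [{"sha": "a1"}, {"sha": "b2"}]}}),
    json.dumps({"type": "WatchEvent",
                "repo": {"name": "torvalds/linux"}}),
]
print(sorted(distinct_push_shas(sample)))  # ['a1', 'b2']
```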

igrigorik commented 5 years ago

Yes, I would expect some events to be missing, but not 6/7 of them.

Can you share the query that you're running? Are you accounting for multiple commits in a single push?

lukaszgryglicki commented 5 years ago

This is the query to count distinct SHAs. (I'm taking all commits from push events; otherwise Linux only has about 700 pushes/year, versus 75k commits for the same range - this is essential.)

#standardSQL
create temp function get_shas(json string)
returns array<string>
language js as """
  try {
    return JSON.parse(json).commits.map(x=>x.sha);
  }
  catch(error) {
    return []
  }
""";
with pushes as (
  select
    get_shas(payload) as shas
  from
    `githubarchive.{{table}}`
  where
    type = 'PushEvent'
    and repo.name = 'torvalds/linux'
)
select
  count(distinct sha) as commits
from (
  select
    sha
  from
    pushes
  cross join
    unnest(pushes.shas) as sha
  where
    sha is not null
)

-- Returns 9841 for a year range, {{table}} -> year.2018.

And this returns the list of distinct SHAs for a given table and condition (which can, for example, filter by repo.name = 'torvalds/linux'):

#standardSQL
create temp function get_shas(json string)
returns array<string>
language js as """
  try {
    return JSON.parse(json).commits.map(x=>x.sha);
  }
  catch(error) {
    return []
  }
""";
with pushes as (
  select
    get_shas(payload) as shas
  from
    `githubarchive.{{table}}`
  where
    type = 'PushEvent'
    and {{cond}}
)
select
  distinct sha
from
  pushes
cross join
  unnest(pushes.shas) as sha
order by
  sha

But I was also grepping all the JSON files (all that refer to torvalds/linux), and many SHAs were not found anywhere in the full JSON bodies.

igrigorik commented 5 years ago

Hmm, this is an interesting puzzle.

I don't see any obvious issues with the query. Curious, have you tried capturing the error instead of just returning an empty array? Wondering if we're swallowing any errors there that we ought to pay attention to.

Re: grepping JSON - do you see a difference in counts between the JSON archives and BigQuery?

lukaszgryglicki commented 5 years ago

The [] was returned for event types other than 'PushEvent' (which have no commits field). The BQ commit count and the count from DevStats (which parses all GHA JSON files) are almost equal - the difference is minute, less than 0.5%, and probably due to the slightly different date ranges I checked recently. If needed I can verify the exact data later (it takes more than 1TB to process), but roughly: DevStats returned about 10250 for 4/2018-4/2019 and BQ returned 9840 for the year 2018, AFAIK. git returns 70-80k, so the difference is huge.

igrigorik commented 5 years ago

Gotcha, thanks for the context.

Have you tried looking at some of the missing commits: are there any common patterns that distinguish them from what's in the archives? For example, are there time gaps?
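One way to look for such gaps, sketched under the assumption that you can dump (sha, date) pairs from git (e.g. via git log --pretty=format:"%H %cs") and have the set of SHAs recovered from GHA - all names and sample data below are hypothetical:

```python
from collections import Counter

def missing_by_month(git_commits, gha_shas):
    """Count commits present in git but absent from GH Archive,
    grouped by month, to reveal time gaps in event coverage.

    `git_commits` is an iterable of (sha, "YYYY-MM-DD") pairs;
    `gha_shas` is a set of SHAs seen in the archives.
    """
    gaps = Counter()
    for sha, date in git_commits:
        if sha not in gha_shas:
            gaps[date[:7]] += 1  # bucket by YYYY-MM
    return gaps

# Hypothetical sample data
git_commits = [("a1", "2018-05-01"), ("b2", "2018-05-02"), ("c3", "2018-06-10")]
gha_shas = {"a1"}
print(missing_by_month(git_commits, gha_shas))
# Counter({'2018-05': 1, '2018-06': 1})
```

A month with a disproportionate share of missing commits would point at an archiving outage; a uniform spread would point at a systematic cause like payload truncation.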

lukaszgryglicki commented 5 years ago

No, I haven't done such an analysis, sorry.

chillu commented 4 years ago

The push event docs state:

The array includes a maximum of 20 commits

If devs on this repo have a habit of pushing larger batches of commits at once, this might explain the discrepancy. You could do some spot checks through the commits API, comparing push event payloads against actual commit counts.
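If the 20-commit cap is the cause, the push payloads themselves should reveal it: as far as I can tell the PushEvent payload also carries a size field with the true commit count of the push, so comparing it against the length of the commits array flags truncated pushes. A sketch, assuming that field is present in the archived events:

```python
def truncated_pushes(events):
    """Yield (size, captured) pairs for PushEvents whose commits
    array holds fewer entries than payload.size reports, i.e.
    pushes that were truncated by the per-event commit cap."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        payload = event.get("payload", {})
        commits = payload.get("commits", [])
        size = payload.get("size", len(commits))
        if size > len(commits):
            yield size, len(commits)

# Hypothetical push of 75 commits: only 20 make it into the array
events = [{"type": "PushEvent",
           "payload": {"size": 75,
                       "commits": [{"sha": f"{i:02x}"} for i in range(20)]}}]
print(list(truncated_pushes(events)))  # [(75, 20)]
```

Summing size - len(commits) over a year of PushEvents would show how much of the 7x gap this one mechanism accounts for.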

xloem commented 1 year ago

The precise date, time, and hash of any missing commit would likely help diagnose this.