igrigorik / gharchive.org

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
https://www.gharchive.org
MIT License
2.67k stars 207 forks

Missing Close Events on Some Issues #189

Open sehtia opened 6 years ago

sehtia commented 6 years ago

Hello! I'm trying to understand why some issues show up with an 'open' state in GH Archive even though they are closed on GitHub, as in the image below.

As far as I understand, when certain issues are closed by a pull request merge using the keywords in the body, the issue is closed on the GitHub side but no 'closed' IssuesEvent is created, so to GH Archive that issue remains in its open state. However, there does seem to be a close record for some issues that were closed by a pull request, as shown below, so that wouldn't entirely make sense.

Would you be able to shed any light on this? I'm quite confused. Thank you!

igrigorik commented 6 years ago

Thanks for the detailed detective work!

Since some issues do get a close event recorded, I'm inclined to think that we may be missing or dropping some activities through our polling mechanism. That said, @annafil, are you aware of any gotchas with close events, or can you think of any other reasons?

annafil commented 6 years ago

@sehtia Thanks for the sleuthing you've done on this so far!

Looking at the API output for this particular issue (https://api.github.com/repos/GoogleContainerTools/jib/issues/268/events), there seems to be a closed event that should have registered. That lends support to @igrigorik's theory that this might be due to dropping events because of polling issues.

Depending on what you're using the data for, and how much you need, there may be a workaround. Even if GHArchive is dropping some events from the /events API endpoint, the API gives you all the events for a given issue in a given repo if you fetch them directly (as in the example link above). If you know the project, or set of projects, you are looking for, you can go through the issues for a project via the API and grab all associated issue events (including some that are not sent by default via /events, like 'assigned' and 'labeled'). There is no limit on the history of those events, unlike the /events endpoint, which only goes back 300 events per repo. If you need to do this for a large number of projects you'll probably run into rate limit issues, but you can use the built-in support mechanisms to throttle your requests and stay within the rate limit.
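A minimal Python sketch of the workaround described above, assuming unauthenticated access and the standard library only. The function names (`http_get_json`, `fetch_issue_events`, `current_state`) are hypothetical helpers made up for illustration; the endpoint is the same one as in the example link, paged with `per_page` and `page`.

```python
import json
import urllib.request

API = "https://api.github.com"


def http_get_json(url):
    """Fetch a URL and decode its JSON body (unauthenticated, so a low rate limit applies)."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fetch_issue_events(owner, repo, number, get_json=http_get_json, per_page=100):
    """Collect every event for one issue by walking the paginated endpoint."""
    events, page = [], 1
    while True:
        url = (f"{API}/repos/{owner}/{repo}/issues/{number}/events"
               f"?per_page={per_page}&page={page}")
        batch = get_json(url)
        events.extend(batch)
        if len(batch) < per_page:  # a short (or empty) page is the last one
            return events
        page += 1


def current_state(events):
    """Derive open/closed from the latest 'closed' or 'reopened' event."""
    state = "open"
    # ISO 8601 timestamps sort correctly as plain strings.
    for ev in sorted(events, key=lambda e: e.get("created_at", "")):
        if ev.get("event") == "closed":
            state = "closed"
        elif ev.get("event") == "reopened":
            state = "open"
    return state
```

For the issue discussed here, something like `current_state(fetch_issue_events("GoogleContainerTools", "jib", 268))` would report the state implied by the issue's own event history, independent of what GH Archive captured.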

sehtia commented 6 years ago

@igrigorik @annafil No problem and thanks for getting back to me promptly!

Ah yes, that makes a lot more sense. @annafil My goal was to get all the currently open issues for a specific set of repositories I was interested in for tracking/analysis purposes. I'm now trying to achieve this by mostly following your advice, specifically by querying the API for each repo (api.github.com/repos/user/repoName/issues?state=open) and working from there. However, I'm open to suggestions if you recommend a different approach for my need, and/or guides on throttling requests to stay within the rate limit, as it doesn't seem this will scale to a large number of repos.
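A rough sketch of that per-repo query in Python, assuming unauthenticated access. The helper names (`http_get_json`, `issues_url`, `open_issues`) are made up for illustration. Note that the issues endpoint also returns pull requests, which carry a `pull_request` key, so the sketch filters them out.

```python
import json
import urllib.parse
import urllib.request


def http_get_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def issues_url(owner, repo, page, per_page=100):
    """Build the list-issues URL; the filter parameter is `state`, not `status`."""
    query = urllib.parse.urlencode({"state": "open", "per_page": per_page, "page": page})
    return f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"


def open_issues(owner, repo, get_json=http_get_json, per_page=100):
    """Return all open issues for one repo, excluding pull requests."""
    found, page = [], 1
    while True:
        batch = get_json(issues_url(owner, repo, page, per_page))
        # Pull requests appear in this listing too; they have a "pull_request" key.
        found.extend(item for item in batch if "pull_request" not in item)
        if len(batch) < per_page:  # a short (or empty) page is the last one
            return found
        page += 1
```

Looping `open_issues` over a list of `(owner, repo)` pairs covers a set of repositories, at a cost of at least one request per repo, which is where the rate limit starts to matter.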

Also, @igrigorik is there any documentation where I can read about the polling mechanism/dropped events used in GHArchive to further understand the issue (out of curiosity)?

annafil commented 6 years ago

@sehtia You can check out https://developer.github.com/v3/#rate-limiting for advice on working with REST API rate limits. If you can say a little more about what you're tracking/analyzing, and perhaps how many repos you estimate to poll, I can point you in a more specific direction :)
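As one possible way to throttle, here is a small Python sketch that reads the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers (the reset value is a UTC epoch timestamp in seconds). The function name `seconds_until_reset` is hypothetical, and the sketch assumes the headers are passed as a mapping keyed with exactly this capitalization.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to sleep before the next request, or 0.0 if quota remains."""
    now = time.time() if now is None else now
    # Assumption: if the header is missing, treat the quota as not exhausted.
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    # Never return a negative delay if the reset time has already passed.
    return float(max(0, reset_at - now))
```

A polling loop could call `time.sleep(seconds_until_reset(resp.headers))` after each response, so it pauses only once the quota is exhausted.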

lucianoviola commented 3 years ago

@annafil @igrigorik I'm also experiencing issues with PRs that are missing events for "action=closed". For example:

In my investigation, I found this to be the case for 2509 PRs. Here are some examples:

But, unlike "issues", I can't get the past events for PRs.

This seems to be a bug that is happening recurrently and to this day. Are there any plans to resolve it? :)

bored-engineer commented 2 years ago

FYI #275 may at least explain why missed events haven't been identified/logged by the crawler, even if it doesn't actually solve the problem