Closed · adunkman closed this 3 years ago
Hey @adunkman, I see a couple of things we can do here:
- Right now we're iterating over the issues and making an additional request per issue to fetch the comments, which burns through the rate limit very quickly. It looks like it's possible to instead fetch all of the comments for the repository in one paginated listing and then correlate them with their issues in the program (https://docs.github.com/en/rest/reference/issues#list-issue-comments-for-a-repository), which would reduce the number of requests significantly.
- We could also add support for the `since` parameter, so that rather than doing a full sync every run, you can incrementally store the issues that have changed since the last run, or just grab issues that have changed in the last N days. (There's a rough sketch of the first two ideas below this list.)
- I can also look into the GitHub GraphQL API - curious whether it provides a way to fetch the data more efficiently.
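A rough, untested sketch of the first two ideas using Octokit (the `listForRepo` / `listCommentsForRepo` endpoint methods and their `since` parameter are real; the owner/repo/since values and variable names below are placeholders):

```typescript
import { Octokit } from "octokit";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "some-owner"; // placeholder
const repo = "some-repo"; // placeholder
const since = "2021-01-01T00:00:00Z"; // optional: only items updated after this

// One paginated listing for all issues (~1 request per 100 issues).
const issues = await octokit.paginate(octokit.rest.issues.listForRepo, {
  owner, repo, state: "all", per_page: 100, since,
});

// One paginated listing for every comment in the repository (~1 request per
// 100 comments), instead of one request per issue.
const comments = await octokit.paginate(octokit.rest.issues.listCommentsForRepo, {
  owner, repo, per_page: 100, since,
});

// Correlate comments to their issues locally, using the issue number at the
// end of each comment's issue_url.
const commentsByIssue = new Map<number, typeof comments>();
for (const comment of comments) {
  const issueNumber = Number(comment.issue_url.split("/").pop());
  const bucket = commentsByIssue.get(issueNumber) ?? [];
  bucket.push(comment);
  commentsByIssue.set(issueNumber, bucket);
}
```

Both endpoints accept `since`, so incremental syncs would fall out of the same code path.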
Those changes might be enough. If you have 8,000 issues each with ~10 comments and we're limited to 100 items per page, that's 80 requests for the issues and 800 for the comments, about 880 in total. If the GITHUB_TOKEN rate limit is 1,000 requests/hour, you might just be able to do it, and it should definitely be possible when authenticating via OAuth (5,000/hour).
Regardless, we could catch rate limit errors and wait in the program. It looks like GitHub provides a reset timestamp, so we could sleep until then before making the next request. This should probably be an opt-in feature so that the program doesn't unexpectedly take hours to run. If the number of issues is particularly large there could still be problems running via GitHub Actions - a quick search suggests the maximum job timeout is 6 hours.
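A minimal sketch of what that opt-in waiting could look like (the `withRateLimitRetry` wrapper is a made-up name; the `x-ratelimit-remaining` / `x-ratelimit-reset` headers are the ones GitHub documents for its REST API):

```typescript
import { Octokit } from "octokit";
import { RequestError } from "@octokit/request-error";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Hypothetical opt-in wrapper: if a request fails because the rate limit is
// exhausted, sleep until the reset timestamp and then try again.
async function withRateLimitRetry<T>(run: () => Promise<T>): Promise<T> {
  while (true) {
    try {
      return await run();
    } catch (error) {
      const rateLimited =
        error instanceof RequestError &&
        (error.status === 403 || error.status === 429) &&
        error.response?.headers["x-ratelimit-remaining"] === "0";
      if (!rateLimited) throw error;

      // x-ratelimit-reset is a Unix timestamp (seconds) for when the window resets.
      const resetMs = Number(error.response?.headers["x-ratelimit-reset"]) * 1000;
      const waitMs = Math.max(resetMs - Date.now(), 0) + 1000;
      console.warn(`Rate limited; waiting ${Math.round(waitMs / 1000)}s for the limit to reset.`);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}

// Example use:
// const issues = await withRateLimitRetry(() =>
//   octokit.rest.issues.listForRepo({ owner: "some-owner", repo: "some-repo" })
// );
```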
I won't be able to look into this further this coming week, but I have some time off work next week so I should be able to get a few hours to work on it. Let me know if you think these changes sound viable. I've been meaning to look into the one-request-per-issue problem for a while so will at least fix that.
I know things are stressful in the world these days, and if you’re taking time off, I hope you can use it to relax and recharge. If that’s this project, great! Otherwise, anyone can write up a PR; it’s not your responsibility. 😄
I ended up writing a quick app to handle this because the GraphQL endpoint was significantly less expensive — for my needs, each query returns 100 issues with all of their attached metadata (comments, authors, labels, etc.) and consumes only 2 rate limit points (the GraphQL API uses a point-based cost calculation to enforce resource limits).
For GitHub Actions, the rate limit is 1,000 REST requests or 1,000 GraphQL points per hour, so switching to GraphQL put me well within the limits: at 2 points per query and 100 issues per query, 1,000 points covers 500 queries, or repositories with up to 50,000 issues.
Here’s the GraphQL query I used:
```graphql
query ($owner: String!, $repo: String!, $nextPageCursor: String) {
  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
  repository(owner: $owner, name: $repo) {
    issues(first: 100, after: $nextPageCursor, orderBy: { field: CREATED_AT, direction: ASC }) {
      totalCount
      pageInfo {
        endCursor
        hasNextPage
      }
      nodes {
        number
        url
        title
        body
        closed
        closedAt
        createdAt
        author {
          login
          url
        }
        labels(first: 100) {
          totalCount
          nodes {
            name
            url
          }
        }
        comments(first: 100) {
          totalCount
          nodes {
            body
            createdAt
            author {
              login
              url
            }
          }
        }
      }
    }
  }
}
```
… and I called it iteratively from TypeScript:
```typescript
// One page per call; ISSUE_BATCH_QUERY is the query above.
const { repository, rateLimit } = await octokit.graphql(ISSUE_BATCH_QUERY, {
  owner,
  repo,
  nextPageCursor,
}) as IssueQueryResponse;

hasNextPage = repository.issues.pageInfo.hasNextPage;
nextPageCursor = repository.issues.pageInfo.endCursor;
```
Oh — and I’m a GraphQL newbie, so I don’t really know if I wrote that query "the right way" — if anyone has a better suggestion, I’m all ears!
Hey @adunkman, thanks for that GraphQL example! I finally worked on this today and ported the API over - as you say, it's way faster and significantly more efficient on rate limits than before.
My approach was very similar to yours, except that I paginated both issues and PRs in the same query. And then at the end I looked for any issues/PRs that still had additional pages of comments to fetch, and retrieved those separately - which is the only approach I'm aware of for paginating with nested cursors.
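For anyone reading along, that second pass looks roughly like this. It's a sketch rather than the exact gh2md code, and it assumes the first-pass query also selects `pageInfo` on each `comments` connection; `EXTRA_COMMENTS_QUERY`, the `CommentPage` type, and the `items` array are illustrative names.

```typescript
// Hypothetical follow-up query: fetch additional comment pages for a single
// issue or pull request, starting from the cursor left by the first pass.
const EXTRA_COMMENTS_QUERY = `
  query ($owner: String!, $repo: String!, $number: Int!, $commentCursor: String) {
    repository(owner: $owner, name: $repo) {
      issueOrPullRequest(number: $number) {
        ... on Issue {
          comments(first: 100, after: $commentCursor) {
            pageInfo { endCursor hasNextPage }
            nodes { body createdAt author { login url } }
          }
        }
        ... on PullRequest {
          comments(first: 100, after: $commentCursor) {
            pageInfo { endCursor hasNextPage }
            nodes { body createdAt author { login url } }
          }
        }
      }
    }
  }
`;

type CommentPage = {
  pageInfo: { endCursor: string | null; hasNextPage: boolean };
  nodes: { body: string; createdAt: string; author: { login: string; url: string } | null }[];
};

// `items` is the array of issues/PRs collected during the first-pass pagination.
for (const item of items) {
  let { hasNextPage, endCursor } = item.comments.pageInfo;
  while (hasNextPage) {
    const { repository } = (await octokit.graphql(EXTRA_COMMENTS_QUERY, {
      owner,
      repo,
      number: item.number,
      commentCursor: endCursor,
    })) as { repository: { issueOrPullRequest: { comments: CommentPage } } };

    const page = repository.issueOrPullRequest.comments;
    item.comments.nodes.push(...page.nodes);
    ({ hasNextPage, endCursor } = page.pageInfo);
  }
}
```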
There are definitely a few things we could do to make this better but this is a massive improvement and will make it usable for a lot more medium/large repos. Thanks for the report + thoughts.
Gonna close this but feel free to reopen
I’m attempting to use `gh2md` to export 8,000+ issues in a GitHub Action, and as you can imagine, I’m hitting GitHub’s rate limits. I don’t think there’s an easy fix here, but I wanted to open an issue to discuss whether anyone sees a potential path forward.
The only potential solution I see is to monitor API rate limiting and slow down requests, but I can imagine that would get pretty tricky.