mattduck / gh2md

Export Github repository issues, pull requests and comments to markdown.
MIT License

Handle rate limiting #21

Closed: adunkman closed this 3 years ago

adunkman commented 3 years ago

I’m attempting to use gh2md to export 8,000+ issues in a GitHub Action, and as you can imagine, I’m hitting GitHub’s rate limits.

I don’t think there’s an easy fix here, but wanted to open an issue to discuss if anyone sees a potential path forward.

The only potential solution I see is to monitor API rate limiting and slow down requests, but I can imagine that would get pretty tricky.
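
Monitoring the rate limit and slowing down is at least mechanically simple: GitHub's REST responses carry `x-ratelimit-remaining` and `x-ratelimit-reset` headers. A minimal sketch of the pause calculation (this is an illustration, not gh2md's actual code; the function name is made up):

```typescript
// Hypothetical helper: given GitHub's rate-limit headers, decide how long to
// pause before the next request. `remaining` is x-ratelimit-remaining,
// `resetEpochSeconds` is x-ratelimit-reset, `nowMs` is the current time.
function backoffMs(remaining: number, resetEpochSeconds: number, nowMs: number): number {
  if (remaining > 0) return 0;                          // budget left: no pause needed
  return Math.max(0, resetEpochSeconds * 1000 - nowMs); // wait until the window resets
}

// Example: no requests left, window resets 10s after the epoch, "now" is 5s in.
const wait = backoffMs(0, 10, 5_000); // 5,000 ms
```

The tricky part is less the arithmetic than wiring a sleep like this into every request path without making an 8,000-issue export take hours.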

mattduck commented 3 years ago

Hey @adunkman, I see a couple of things we can do here:

I won't be able to look into this further this coming week, but I have some time off work next week so I should be able to get a few hours to work on it. Let me know if you think these changes sound viable. I've been meaning to look into the one-request-per-issue problem for a while so will at least fix that.

adunkman commented 3 years ago

I know things are stressful in the world these days, and if you’re taking time off, I hope you can use it to relax and recharge. If that’s this project, great! Otherwise, it’s open to anyone to write up a PR; it’s not your responsibility. 😄

I ended up writing a quick app to handle this because the GraphQL endpoint was significantly less expensive — for my needs, each query returns 100 issues with all of their attached metadata (comments, authors, labels, etc), and consumed 2 request tokens (the GraphQL API uses a token calculation to enforce resource limits).

For GitHub Actions, the rate limit is 1,000 REST requests or 1,000 GraphQL tokens, so that meant I was well within the resource limits by switching to GraphQL — 2 tokens per GraphQL query handles repositories with up to 50,000 issues.
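
The arithmetic behind that claim, using the numbers from the thread (the observed ~2-token cost per page is an assumption taken from the comment above, not a documented constant):

```typescript
// Back-of-envelope capacity check for a GitHub Actions run.
const tokensPerQuery = 2;      // observed GraphQL cost of one 100-issue page (from the thread)
const issuesPerQuery = 100;    // issues(first: 100) per query
const hourlyTokenBudget = 1000; // GraphQL token limit for GITHUB_TOKEN in Actions

const maxQueries = Math.floor(hourlyTokenBudget / tokensPerQuery); // 500 pages
const maxIssues = maxQueries * issuesPerQuery;                     // 50,000 issues per hour
```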

Here’s the GraphQL query I used:

query ($owner: String!, $repo: String!, $nextPageCursor: String) {
  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
  repository(owner: $owner, name: $repo) {
    issues(first: 100, after: $nextPageCursor, orderBy: { field: CREATED_AT, direction: ASC }) {
      totalCount
      pageInfo {
        endCursor
        hasNextPage
      }
      nodes {
        number
        url
        title
        body
        closed
        closedAt
        createdAt
        author {
          login
          url
        }
        labels(first: 100) {
          totalCount
          nodes {
            name
            url
          }
        }
        comments(first: 100) {
          totalCount
          nodes {
            body
            createdAt
            author {
              login
              url
            }
          }
        }
      }
    }
  }
}

… and I called it iteratively from TypeScript:

let hasNextPage = true;
let nextPageCursor: string | undefined;

while (hasNextPage) {
  const { repository, rateLimit } = (await octokit.graphql(ISSUE_BATCH_QUERY, {
    owner,
    repo,
    nextPageCursor,
  })) as IssueQueryResponse;

  // ...process repository.issues.nodes here...

  hasNextPage = repository.issues.pageInfo.hasNextPage;
  nextPageCursor = repository.issues.pageInfo.endCursor;
}

adunkman commented 3 years ago

Oh — and I’m a GraphQL newbie, so I don’t really know if I wrote that query "the right way" — if anyone has a better suggestion, I’m all ears!

mattduck commented 3 years ago

Hey @adunkman, thanks for that graphql example! I finally worked on this today and ported the API calls over to GraphQL - as you say, it's way faster and significantly easier on rate limits than before.

My approach was very similar to yours, except that I paginated both issues and PRs in the same query. And then at the end I looked for any issues/PRs that still had additional pages of comments to fetch, and retrieved those separately - which is the only approach I'm aware of for paginating with nested cursors.
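
For those follow-up fetches, a query along these lines would retrieve one issue's remaining comment pages, reusing the field names from the query above (this is a sketch of the approach described, not gh2md's actual query; the variable names are illustrative):

```graphql
query ($owner: String!, $repo: String!, $issueNumber: Int!, $commentCursor: String) {
  repository(owner: $owner, name: $repo) {
    issue(number: $issueNumber) {
      comments(first: 100, after: $commentCursor) {
        pageInfo {
          endCursor
          hasNextPage
        }
        nodes {
          body
          createdAt
          author {
            login
            url
          }
        }
      }
    }
  }
}
```

Because each such query targets a single issue, only the handful of issues with more than 100 comments cost extra requests.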

There are definitely a few things we could do to make this better but this is a massive improvement and will make it usable for a lot more medium/large repos. Thanks for the report + thoughts.

Gonna close this but feel free to reopen