hub4j / github-api

Java API for GitHub
https://github-api.kohsuke.org/
MIT License

Github Repository Search resulting in Secondary Rate Limit #1842

Open vikas-maersk opened 2 months ago

vikas-maersk commented 2 months ago

Describe the bug
Getting the error below when searching for repositories with a keyword that matches a large number of repositories. My use case only needs, say, the top 50 results, but I see no way to limit the result count on the server side; I can only truncate after retrieving the results.

org.kohsuke.github.HttpException: {
  "documentation_url": "https://docs.github.com/free-pro-team@latest/rest/overview/rate-limits-for-the-rest-api#about-secondary-rate-limits",
  "message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again. If you reach out to GitHub Support for help, please include the request ID ...."
}

I was searching for repositories within my org, and even keywords that matched only around 850 repositories hit this rate limit. The GitHub documentation quotes a much larger limit.

I also tried adding in:name and in:displayName to the query, but that did not seem to help.

To Reproduce
Steps to reproduce the behavior:

new GitHubBuilder().withAppInstallationToken(token).build()
                .searchRepositories()
                .org(orgName)
                .q(repoName)
                .list();

Pass a repo name like "test", or anything else that matches a large number of repositories, into the code above.

Expected Behaviour
It should either return the repositories or give a clear reason for the error. Alternatively, there should be a way to limit the number of search results on the GitHub side itself.

Additional Info
Version: org.kohsuke:github-api:1.316

vikas-maersk commented 2 months ago

I was able to achieve this using:

new GitHubBuilder().withAppInstallationToken(token).build()
                .searchRepositories()
                .org(orgName)
                .q(query)
                .list()
                .withPageSize(appConfig.getGithubRepositorySearchLimit());

And then, instead of calling toList on the PagedSearchIterable, I iterated the result one page at a time using nextPage:

repositoriesByNameQuery.iterator().nextPage();

bitwiseman commented 1 month ago

Reopening, as this is an interesting case for secondary rate limits.

holly-cummins commented 2 weeks ago

See also https://github.com/hub4j/github-api/issues/1805

bitwiseman commented 2 weeks ago

This docs page describes secondary rate limit behavior: https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits

As of this writing, it says:

You may encounter a secondary rate limit if you:

  • Make too many concurrent requests. No more than 100 concurrent requests are allowed. This limit is shared across the REST API and GraphQL API.
  • Make too many requests to a single endpoint per minute. No more than 900 points per minute are allowed for REST API endpoints, and no more than 2,000 points per minute are allowed for the GraphQL API endpoint. For more information about points, see "Calculating points for the secondary rate limit."
  • Make too many requests per minute. No more than 90 seconds of CPU time per 60 seconds of real time is allowed. No more than 60 seconds of this CPU time may be for the GraphQL API. You can roughly estimate the CPU time by measuring the total response time for your API requests.
  • Create too much content on GitHub in a short amount of time. In general, no more than 80 content-generating requests per minute and no more than 500 content-generating requests per hour are allowed. Some endpoints have lower content creation limits. Content creation limits include actions taken on the GitHub web interface as well as via the REST API and GraphQL API.

These secondary rate limits are subject to change without notice. You may also encounter a secondary rate limit for undisclosed reasons.

These are incredibly loosely defined guidelines, and you cannot query for them ahead of time. 👎 It looks like we need to take the path some users have suggested and make rate limiting much more resilient, potentially allowing users to write their own strategies for handling secondary rate limits.
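One possible shape for such a user-supplied strategy is sketched below. This is purely hypothetical: the interface, method names, and default backoff values are invented for illustration and do not exist in github-api today.

```java
// Hypothetical sketch of a pluggable secondary-rate-limit strategy, as
// floated above. All names here are invented for illustration; github-api
// does not currently expose anything like this.
public interface SecondaryRateLimitStrategy {

    /**
     * Called when a response has been identified as a secondary rate limit.
     *
     * @param attempt the 1-based count of attempts so far for this request
     * @return milliseconds to sleep before retrying, or -1 to give up
     */
    long onSecondaryLimit(int attempt);

    /** One sensible default: exponential backoff from one minute, five tries. */
    static SecondaryRateLimitStrategy exponentialBackoff() {
        return attempt -> attempt > 5 ? -1L : 60_000L << (attempt - 1);
    }
}
```

A library hook like this would let the default stay conservative ("sleep a while, then retry") while letting clients with tighter latency budgets fail fast instead.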

holly-cummins commented 2 weeks ago

From my perspective, I'd be happy with a default strategy that just detects the error and sleeps. In the worst case, detection could be based on pattern-matching the "You have exceeded a secondary rate limit. Please wait a few minutes before you try again." message in the response. For the sleep, a minute, perhaps with exponential backoff, seems reasonable.

That's more or less what I'll have to implement in the client, but it would be convenient if it were implemented centrally. A customisable strategy might be a nice-to-have on top of that, but I think "go slow, but make the request succeed" is probably what most clients would want. Or maybe that's an assumption I'm making?
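Client-side, the "detect the error and sleep" idea described above can be sketched as a small retry wrapper. The message substring being matched and the backoff parameters are assumptions here, not anything the library provides:

```java
import java.util.concurrent.Callable;

// Minimal sketch of "detect the secondary-rate-limit error and sleep",
// with exponential backoff. The substring check is an assumption based on
// the error message quoted in this thread.
public class SecondaryLimitRetry {

    /** Runs the call, sleeping and retrying when the error looks like a secondary limit. */
    public static <T> T withBackoff(Callable<T> call, int maxAttempts,
                                    long baseDelayMillis) throws Exception {
        long delay = baseDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                boolean secondary = e.getMessage() != null
                        && e.getMessage().contains("secondary rate limit");
                if (!secondary || attempt >= maxAttempts) {
                    throw e; // not our error, or out of retries
                }
                Thread.sleep(delay);
                delay *= 2; // exponential backoff: 1x, 2x, 4x, ...
            }
        }
    }
}
```

Something like `withBackoff(() -> searchIterable.iterator().nextPage(), 5, 60_000L)` would then keep retrying a page fetch with growing waits, which is roughly what a centralized default could do.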

bitwiseman commented 2 weeks ago

@holly-cummins The waiting should work as of the latest release v1.322 via the fix to AbuseLimit. Please give it a try.

Unfortunately, GitHub has previously made it clear that it is the responsibility of clients to be well-behaved. Triggering the secondary rate limit too often may eventually get accounts or apps flagged for "abuse", resulting in even longer waits.

The problem is that "good behavior" is poorly defined and subject to change without notice. "Go slow"? Sure, but how slow? They have a metric, but they expect the client to track it rather than reporting it, as they do with the primary rate limit.
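For reference, the waiting behavior mentioned above is wired in through the builder's abuse-limit handler. A minimal configuration sketch, assuming the classic org.kohsuke.github.AbuseLimitHandler API (newer releases also offer a GitHubAbuseLimitHandler variant):

```java
import java.io.IOException;
import org.kohsuke.github.AbuseLimitHandler;
import org.kohsuke.github.GitHub;
import org.kohsuke.github.GitHubBuilder;

public class WaitOnAbuseLimit {
    public static GitHub connect(String token) throws IOException {
        // AbuseLimitHandler.WAIT blocks the calling thread until the
        // secondary ("abuse") limit clears, instead of throwing.
        return new GitHubBuilder()
                .withAppInstallationToken(token)
                .withAbuseLimitHandler(AbuseLimitHandler.WAIT)
                .build();
    }
}
```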

holly-cummins commented 2 weeks ago

> @holly-cummins The waiting should work as of the latest release v1.322 via the fix to AbuseLimit. Please give it a try.

I was really excited to try 1.322, but I am still seeing fairly regular rate-limit failures even with it.

I have some Node.js code that uses the GitHub API, and I did manage to work around the secondary-limit errors with promise-retry and an exponential backoff. It doesn't translate directly, because it's JavaScript, but it gives me confidence in the principle. I agree that exponential backoff is pretty crude compared to the clear "retry after this time" guidance you get when the primary rate limiter is triggered, and it carries the risk of waiting too long, or of not waiting long enough and then having to wait even longer.