gruntwork-io / git-xargs

git-xargs is a command-line tool (CLI) for making updates across multiple Github repositories with a single command.
https://blog.gruntwork.io/introducing-git-xargs-an-open-source-tool-to-update-multiple-github-repos-753f9f3675ec
Apache License 2.0

Implement rate-limit-aware http transport #62

Closed zackproser closed 2 years ago

zackproser commented 2 years ago

Description

In the excellent bug report #59, @mkowaliszyn-coursera points out that git-xargs should respect the Github API's rate-limit headers, and I agree. This approach was very heavily "inspired" by how terraform-provider-github handles the exact same issue.

These changes implement a rate-limit-aware HTTP transport that:

  1. Sleeps 1 second between API calls made on behalf of the same client, following the guidelines in Github's best practices for integrators doc
  2. Inspects the Retry-After header on Github API responses when rate-limited, sleeping for the number of seconds it specifies

These changes also configure git-xargs's Github API client to use the rate-limit-aware HTTP transport.
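
For readers who want the gist without reading the diff, here is a minimal sketch of what a transport like this can look like. It is illustrative only; the type and field names are assumptions, not the code in this PR:

```go
package ratelimiting

import (
	"net/http"
	"strconv"
	"sync"
	"time"
)

// rateLimitedTransport wraps another RoundTripper, pausing between write
// requests and honoring Retry-After when Github rate-limits a response.
// Sketch only: names and values here are assumptions, not this PR's code.
type rateLimitedTransport struct {
	base       http.RoundTripper
	writeDelay time.Duration // pause between write (non-GET) operations
	mu         sync.Mutex
	lastWrite  time.Time
}

func (t *rateLimitedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if req.Method != http.MethodGet {
		// Space out write operations made on behalf of a single client.
		t.mu.Lock()
		if wait := t.writeDelay - time.Since(t.lastWrite); wait > 0 {
			time.Sleep(wait)
		}
		t.lastWrite = time.Now()
		t.mu.Unlock()
	}

	resp, err := t.base.RoundTrip(req)
	if err != nil {
		return resp, err
	}

	// When Github answers 403/429 with a Retry-After header, sleep for the
	// requested number of seconds before handing the response back, so that
	// any higher-level retry is less likely to be limited again.
	if resp.StatusCode == http.StatusForbidden || resp.StatusCode == http.StatusTooManyRequests {
		if secs, convErr := strconv.Atoi(resp.Header.Get("Retry-After")); convErr == nil {
			time.Sleep(time.Duration(secs) * time.Second)
		}
	}
	return resp, nil
}
```

The Github client would then be built on an http.Client whose Transport points at this type, e.g. with a 1-second writeDelay.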

Fixes #59

jphuynh commented 2 years ago

Looking forward to seeing that merged 😄

pacoard commented 2 years ago

Hi! Really looking forward to this feature.

I'm running this PR's build with a simple script (a few sed commands) against ~30 repos and opening Draft PRs on each one, but I'm still bumping into the secondary rate limits:

...
[git-xargs] DEBU[2021-12-28T14:48:58-08:00] git-xargs HTTP RoundTrip sleeping between write operations  Write delay=1s
[git-xargs] DEBU[2021-12-28T14:48:58-08:00] Successfully opened pull request              Pull Request URL="https://github.com/REDACTED_ORG/REDACTED1/pull/41"
[git-xargs] INFO[2021-12-28T14:48:58-08:00] Repository successfully processed             Repo name=REDACTED1
[git-xargs] DEBU[2021-12-28T14:49:00-08:00] git-xargs HTTP RoundTrip sleeping between write operations  Write delay=1s
[git-xargs] DEBU[2021-12-28T14:49:00-08:00] Successfully opened pull request              Pull Request URL="https://github.com/REDACTED_ORG/REDACTED2/pull/48"
[git-xargs] INFO[2021-12-28T14:49:00-08:00] Repository successfully processed             Repo name=REDACTED2
[git-xargs] DEBU[2021-12-28T14:49:01-08:00] git-xargs HTTP RoundTrip sleeping between write operations  Write delay=1s
[git-xargs] DEBU[2021-12-28T14:49:01-08:00] Error opening pull request                    Base=main Body="Update deprecated namespace circleci/ to cimg/" Error="POST https://api.github.com/repos/REDACTED_ORG/REDACTED3/pulls: 403 You have exceeded a secondary rate limit and have been temporarily blocked from content creation. Please retry your request again later. []" Head=refs/heads/update-circleci-namespace-3
[git-xargs] DEBU[2021-12-28T14:49:01-08:00] Error encountered while processing repo       Error="POST https://api.github.com/repos/REDACTED_ORG/REDACTED3/pulls: 403 You have exceeded a secondary rate limit and have been temporarily blocked from content creation. Please retry your request again later. []" Repo name=REDACTED3
[git-xargs] DEBU[2021-12-28T14:49:03-08:00] git-xargs HTTP RoundTrip sleeping between write operations  Write delay=1s
[git-xargs] DEBU[2021-12-28T14:49:03-08:00] Error opening pull request                    Base=main Body="Update deprecated namespace circleci/ to cimg/" Error="POST https://api.github.com/repos/REDACTED_ORG/REDACTED4/pulls: 403 You have exceeded a secondary rate limit and have been temporarily blocked from content creation. Please retry your request again later. []" Head=refs/heads/update-circleci-namespace-3
[git-xargs] DEBU[2021-12-28T14:49:03-08:00] Error encountered while processing repo       Error="POST https://api.github.com/repos/REDACTED_ORG/REDACTED4/pulls: 403 You have exceeded a secondary rate limit and have been temporarily blocked from content creation. Please retry your request again later. []" Repo name=REDACTED4
[git-xargs] DEBU[2021-12-28T14:49:04-08:00] git-xargs HTTP RoundTrip sleeping between write operations  Write delay=1s
[git-xargs] DEBU[2021-12-28T14:49:04-08:00] Error opening pull request                    Base=main Body="Update deprecated namespace circleci/ to cimg/" Error="POST https://api.github.com/repos/REDACTED_ORG/REDACTED5/pulls: 403 You have exceeded a secondary rate limit and have been temporarily blocked from content creation. Please retry your request again later. []" Head=refs/heads/update-circleci-namespace-3
...

I wonder if that round-trip sleep delay needs to be adjusted? 🤔 In any case, thank you so much for your work, @zackproser!

zackproser commented 2 years ago

@pacoard Thanks so much for your helpful feedback! Seems there's some opportunity for improvement here - perhaps the delay should be configurable at the very least. I'll try to find some time this week to experiment a bit more.

mkowaliszyn-coursera commented 2 years ago

Thanks for testing @pacoard. I'll need to create 100+ PRs early this week.

@zackproser is the wait between ops a fixed value or is it taken from the GitHub response headers? The log seems to indicate the former. The latter is preferred. A configurable wait time could also work, though the user would need to be aware of what that delay should be, and it may change. If the headers don't contain the required wait time, perhaps a step-back retry strategy could be viable.
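
For what it's worth, the "step-back" idea could be as simple as the following sketch. It's a generic helper with assumed names and values, not anything that exists in git-xargs today:

```go
package retry

import "time"

// withBackoff retries op, waiting progressively longer between attempts.
// Useful when Github omits Retry-After and we can only guess a safe delay.
// The attempt count, initial delay, and doubling schedule are all assumptions.
func withBackoff(attempts int, initial time.Duration, op func() error) error {
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2 // step back further after each failure
	}
	return err
}
```

A caller would wrap whatever call opens the pull request in the op closure, so only that one operation pays the waiting cost.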

zackproser commented 2 years ago

After spending some more time with this, there were a couple of issues at play:

  1. git-xargs vendors github.com/google/go-github/v32. That version's CheckResponse function looks for a Github API response message that has since been updated, which was one of the issues preventing the dynamic Retry-After values from being used. I've updated to go-github v41 locally, so the string used to determine whether a response contains a rate-limiting error is now current. Once this is merged in, if git-xargs receives either (a) an abuse rate-limiting error or (b) a regular rate-limit error, as determined by go-github, it will sleep using the Retry-After value returned by Github or the rate limit object, respectively.
  2. Unfortunately, this doesn't help much in our case, because we're tripping the secondary rate limits (since we're creating lots of pull requests in rapid succession), and in this case Github intentionally omits the Retry-After header and instead asks that we wait a reasonable amount of time between requests:

> Requests that create content which triggers notifications, such as issues, comments and pull requests, may be further limited and will not include a Retry-After header in the response. Please create this content at a reasonable pace to avoid further limiting.

I now have the detection of secondary rate limiting working locally - and I'm experimenting with semi-random sleep durations (since we won't have the Retry-After header to key off of).

Needs a bit more testing and polish, but I hope to push this up soon.
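
For reference, here is roughly the shape of the detection logic described above. It's a sketch under the stated assumptions (go-github v41 error types; the 30-60 second random fallback is arbitrary), not the code on this branch:

```go
package ratelimiting

import (
	"errors"
	"math/rand"
	"time"

	"github.com/google/go-github/v41/github"
)

// sleepFor decides how long to pause after a failed Github API call.
func sleepFor(err error) time.Duration {
	var abuseErr *github.AbuseRateLimitError
	if errors.As(err, &abuseErr) {
		// Secondary ("abuse") rate limit: honor Retry-After when Github sends it...
		if abuseErr.RetryAfter != nil {
			return *abuseErr.RetryAfter
		}
		// ...otherwise fall back to a semi-random delay, since the header is
		// intentionally omitted for content-creation limits.
		return time.Duration(30+rand.Intn(30)) * time.Second
	}

	var rateErr *github.RateLimitError
	if errors.As(err, &rateErr) {
		// Regular rate limit: wait until the current window resets.
		return time.Until(rateErr.Rate.Reset.Time)
	}

	return 0 // not a rate-limit error; no extra sleep needed
}
```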

zackproser commented 2 years ago

I've increased the default write delay to 2 seconds and fixed a crash I was aware of. In local testing against 30+ test repos of various configurations, the secondary rate limits are now being recognized and a random sleep delay is being added when they are tripped. The 2 second delay also seems to be reducing the frequency of the secondary rate limits being tripped.

The latest is pushed up, so if anyone has additional bandwidth to assist with testing this out, that would be most appreciated!

mkowaliszyn-coursera commented 2 years ago

I haven't been able to try the new build, mainly because of time constraints here. Though, I seem to have discovered another layer of GitHub's secondary limits. It seems that you can only create so many PRs in 1 hour; the number may be 69, or close to that, but I'm not 100% sure. The only solution there is to wait 1 hour for it to reset. I'll be breaking up my input files into 60 repos each. This is a bit tedious, but I understand why GitHub has the limits. Just an FYI.

zackproser commented 2 years ago

Understood, and thanks @mkowaliszyn-coursera for the helpful additional context!

mkowaliszyn-coursera commented 2 years ago

@zackproser is there a way in git-xargs to show the headers returned from GitHub requests for debugging purposes? Wondering for the next time I need to deal with this.

zackproser commented 2 years ago

@mkowaliszyn-coursera Not yet but that's a good idea - I filed https://github.com/gruntwork-io/git-xargs/issues/65 to track it

zackproser commented 2 years ago

@mkowaliszyn-coursera In the meantime you can fmt.Printf("%#v\n", ghErr) after this line: https://github.com/gruntwork-io/git-xargs/blob/feature/rate-limiting-aware-backoff-59/ratelimiting/transport.go#L87 if you are building from source.
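
If the response headers themselves are what you're after, the same spot could also dump them explicitly. This assumes ghErr (the variable on the linked line) is one of go-github's error types, which carry the *http.Response they were parsed from:

```go
// Hypothetical debugging addition at the same point in transport.go:
fmt.Printf("%#v\n", ghErr)
if ghErr.Response != nil {
	for name, values := range ghErr.Response.Header {
		fmt.Printf("%s: %v\n", name, values)
	}
}
```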

zackproser commented 2 years ago

@pacoard Do you have any additional free time to test out the latest changes on this branch? Thanks either way!

rhoboat commented 2 years ago

I've got this queued up to review this week!

rhoboat commented 2 years ago

I finally got around to testing this. I've attached the output from running the particular commands. I did three runs but failed to save the output from the first run. According to the output, not all repos got PRs opened, but maybe that's due to rate limits?

  1. https://gist.github.com/rhoboat/c3218e5250d1f6e31f1d6716495b972e
  2. https://gist.github.com/rhoboat/64fcfeffe332de185687b369981c555f

I'm going to test again on the trunk (without the PR) to see the failing behavior! I pulled the trunk, rebuilt the git-xargs binary and ran again, and the output looks similar, so I'm guessing I didn't hit rate limits?

  1. https://gist.github.com/rhoboat/985a04e46403209e817aae8f1652df85

zackproser commented 2 years ago

Having spent a good bit more time with this, here are my current thoughts:

  1. The rate limits are being tripped because we're currently doing exactly what GitHub's API docs ask us not to: call APIs concurrently on behalf of a single user or client ID.
  2. In local testing, refactoring out the goroutines in favor of serial processing slows things down, but not to such a degree that the tool is unusable. In fact, doing the processing serially has so far been much more stable, and the same test data sets that were reliably tripping rate limits previously now run to completion without issue. I've pushed these changes up now for anyone who wants to experiment; a sketch of the serial approach follows this list.
  3. I still think we want some semi-intelligent detection of rate limit status and exponential backoff (the retry-after headers are not reliably returned by GitHub when you're actively being rate limited). I'm mulling over implementation options for this.
  4. My previous forays into addressing this with a rate-limit-aware HTTP transport were very informative, but I'm thinking that is not the way forward now.
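
To make point 2 concrete, the serial version is essentially a plain loop. The Repo type and processRepo helper below are hypothetical stand-ins for git-xargs' own types, not the actual refactor:

```go
package repoprocessing

import "log"

// Repo stands in for git-xargs' repository type; processRepo stands in for the
// clone / run-command / push / open-PR work done for a single repository.
type Repo struct{ Name string }

func processRepo(r Repo) error { /* clone, apply command, push, open PR */ return nil }

// processSerially walks the repos one at a time instead of fanning each one
// out to its own goroutine, so only one request is ever in flight per token.
func processSerially(repos []Repo) {
	for _, r := range repos {
		if err := processRepo(r); err != nil {
			log.Printf("error processing %s: %v", r.Name, err)
		}
	}
}
```
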
zackproser commented 2 years ago

Thanks for testing!

You can pass --loglevel debug in your command, which will dump everything so you can see whether or not you're rate limited.

Here's the exact command I've been running to reliably trip secondary rate limits:

/git-xargs --github-org zack-test-org --branch-name t37 --loglevel debug touch test.txt

I think much of the rate limiting issue is resolved by the changes I'm proposing here and that I've pushed up on this branch.

mkowaliszyn-coursera commented 2 years ago

> In local testing, refactoring out the goroutines for serial processing slows things down, but not to such a degree that the tool is unusable. In fact, doing the processing serially is so far much more stable, and the same test data sets that were reliably tripping rate limits previously now run to completion without issue. I've pushed these changes up now for anyone who wants to experiment.

Sorry, I've been unable to help with testing this. My previous tests showed that you can create ~60 PRs in 1 hour, with a sufficient sleep 30 in the processing of each repo and a concurrency of 4; after that you have to wait an hour before doing the next 60, and so on. It does take a while, though. The test set above may be too small to hit all the limits GitHub has. It would be so much simpler if they just documented them. :/

The issue, or at least a portion of it, sounds like it is related to the concurrent creation of PRs. Would it make sense to create PRs serially while still processing repos concurrently? The process could wait until all repos are processed before starting to create PRs for them. The first rate-limit response could then be observed before proceeding, without multiple threads all tripping the same limit.
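
A rough sketch of that split, with hypothetical helper names (nothing here is git-xargs' actual code), could look like:

```go
package twophase

import (
	"log"
	"sync"
)

// Repo, prepareRepo, and openPullRequest are stand-ins for git-xargs' own
// types and helpers; the point is the shape of the pipeline, not the details.
type Repo struct{ Name string }

func prepareRepo(r Repo) error     { /* clone, run the command, push the branch */ return nil }
func openPullRequest(r Repo) error { /* single POST to the pulls API */ return nil }

// run processes repos concurrently (mostly local git work), then opens the
// pull requests one at a time, so only one goroutine ever hits the
// content-creation endpoint and the first rate-limit response can be observed.
func run(repos []Repo) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		prepared []Repo
	)

	for _, r := range repos {
		wg.Add(1)
		go func(r Repo) {
			defer wg.Done()
			if err := prepareRepo(r); err != nil {
				log.Printf("skipping %s: %v", r.Name, err)
				return
			}
			mu.Lock()
			prepared = append(prepared, r)
			mu.Unlock()
		}(r)
	}
	wg.Wait()

	for _, r := range prepared {
		if err := openPullRequest(r); err != nil {
			log.Printf("error opening PR for %s: %v", r.Name, err)
		}
	}
}
```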

zackproser commented 2 years ago

> In local testing, refactoring out the goroutines for serial processing slows things down, but not to such a degree that the tool is unusable. In fact, doing the processing serially is so far much more stable, and the same test data sets that were reliably tripping rate limits previously now run to completion without issue. I've pushed these changes up now for anyone who wants to experiment.

> Sorry, I've been unable to help with testing this.

No problem at all @mkowaliszyn-coursera

> My previous tests showed that you can create ~60 PRs in 1 hour, with a sufficient sleep 30 in the processing of each repo and a concurrency of 4; after that you have to wait an hour before doing the next 60, and so on. It does take a while, though. The test set above may be too small to hit all the limits GitHub has. It would be so much simpler if they just documented them. :/

Indeed. I've recently been facing GitHub API limits in another project. The inconsistent nature of the errors and headers that are returned, and the need to piece things together from various docs and GitHub issues, do not make for a great time. I would guess that GitHub is reluctant to be any more forthcoming about the internals of their anti-abuse system for fear of any such information being useful to potential attackers.

> The issue, or at least a portion of it, sounds like it is related to the concurrent creation of PRs. Would it make sense to create PRs serially while still processing repos concurrently? The process could wait until all repos are processed before starting to create PRs for them. The first rate-limit response could then be observed before proceeding, without multiple threads all tripping the same limit.

I agree that part of the issue is the concurrent creation of PRs, since the docs explicitly state that certain actions, specifically creating PRs, generate notifications internal to GitHub and are therefore expensive (and hence very likely to get rate limited).

I was walking down this same path you suggest yesterday, and over the past week or so whenever I could find time to poke at this. I'm now wondering what actual benefit concurrent processing with goroutines confers on us in this use case, and the more I think about it, the more I'm inclined to go with serial processing of repos.

Here are a couple of reasons that I'm currently leaning toward serial processing:

zackproser commented 2 years ago

Thanks all for the helpful feedback on this issue. I'm closing this PR because I'm working on an alternate solution that I hope to be able to share soon.