Terraform GitHub provider
https://www.terraform.io/docs/providers/github/
MIT License

Github Provider: Support API Rate Limiting #5

Closed. hashibot closed this issue 6 years ago.

hashibot commented 7 years ago

This issue was originally opened by @johnrengelman as hashicorp/terraform#6016. It was migrated here as part of the provider split. The original body of the issue is below.


The GitHub API has a rate limit that can easily be exceeded when managing an organization. This ticket is simply a spot to start a conversation about how to properly handle the rate limit.

Details: https://developer.github.com/v3/#rate-limiting

DWSR commented 7 years ago

Checking the X-RateLimit-Remaining and X-RateLimit-Reset headers seems straightforward enough. However, this would result in activity spikes, as the provider would run at full speed until the limit is reached.
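
For illustration, a minimal sketch of what that header check could look like in Go against the raw HTTP response, using only the standard library (this is illustrative, not the provider's actual code):

package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// rateInfo pulls the remaining-call count and reset time out of
// GitHub's X-RateLimit-* response headers.
func rateInfo(resp *http.Response) (remaining int, reset time.Time, err error) {
	remaining, err = strconv.Atoi(resp.Header.Get("X-RateLimit-Remaining"))
	if err != nil {
		return 0, time.Time{}, err
	}
	unix, err := strconv.ParseInt(resp.Header.Get("X-RateLimit-Reset"), 10, 64)
	if err != nil {
		return 0, time.Time{}, err
	}
	return remaining, time.Unix(unix, 0), nil
}

func main() {
	resp, err := http.Get("https://api.github.com/rate_limit")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	remaining, reset, err := rateInfo(resp)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d calls remaining; window resets at %s\n", remaining, reset.Local())
}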

joestump commented 7 years ago

I'd assume behavior would look similar to how TF behaves when it's doing a large AWS modification?

majormoses commented 7 years ago

I thought about this more, and GitHub's rate limits are still quite different in the sense that they are enforced hourly. So I am not sure how much value there is in slowing down for an hour... most rate limits cover a much shorter window.

joestump commented 7 years ago

Our rate limit is like 5000 calls. I doubt this will be an issue for anyone anytime soon. Our GitHub TF runs are very low volume.

majormoses commented 7 years ago

I hit this probably a half dozen times while importing one of the organizations I maintain, https://github.com/sensu-plugins. I was just finishing the import last night, and it took at least several passes to get all the way through, importing some repos that were missed in the initial attempts.

majormoses commented 7 years ago

That's one of the problems with hourly rate limits: there is no opportunity to back off and slow the request rate. You just wait for an hour and then retry.

joestump commented 7 years ago

@majormoses agreed on the hourly bucket size. That's really only about 1.4 calls per second (5000 calls / 3600 seconds) when the bucket is that big. Though I do think, on an ongoing basis, 5000 calls an hour is probably fine. Not everyone is quite as popular as Sensu! 😄

DWSR commented 7 years ago

I suspect just having some kind of simple logic to detect that we are out of calls, pause, and display feedback is perfectly sufficient. It would feel bad to have TF fail because of this and then succeed when you re-run it a bit later.

majormoses commented 7 years ago

SGTM. I'm not sure about pausing for ~60 minutes, but it would definitely be nice to have some feedback and then exit with a "try again later". I would feel much better about pausing if the bucket were even a half hour (IMHO it should be something like 5m); waiting 60 minutes is a lot of wasted resources to keep everything it needs in RAM. What I ended up doing in my case, which I know is probably much larger than the vast majority of projects on GitHub, was just setting a timer and walking away. Since this was something I was doing at home for some open source projects, it was not a big deal for me to wait a few times and go do something else, but I could see that being maddening at work.

DWSR commented 7 years ago

I'm making that suggestion more to go from a state where TF just breaks when running into the limit to a place where it doesn't break. An MVP, if you will.

A secondary evolution could be to pre-count the number of transactions against the API and subdivide that into buckets.

majormoses commented 7 years ago

I now have so many github objects managed by terraform that I can no longer get through a terraform plan. :cry: / :joy:

Any chance of getting someone to implement rate limiting per https://github.com/google/go-github#rate-limiting (see the sketch after the list below)? If no one thinks they can get to it soon, I can try learning enough Go to hack something together (I will probably need some help). If I can't get it working soon, I see a couple of options going forward:

  1. break things up into multiple states so that each plan run covers only a subset of the repositories.
  2. look into whether anything can be optimized to reduce the number of requests made, as it is truly insane that with ~200 repos and a few dozen labels you can chew through 5000 requests in a few minutes.
  3. Look for another tool to manage github via code. This would make me very sad.
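
On the go-github suggestion above: the linked docs say rate-limit errors surface as *github.RateLimitError, so a wait-and-retry wrapper might look roughly like the sketch below. The retryAfterReset helper is hypothetical, not existing provider code:

package main

import (
	"context"
	"log"
	"time"

	"github.com/google/go-github/github"
)

// retryAfterReset runs call once; if it fails because the hourly quota
// is exhausted, it tells the user how long the wait is, sleeps until
// GitHub's advertised reset time, and then retries.
func retryAfterReset(call func() error) error {
	err := call()
	if rateErr, ok := err.(*github.RateLimitError); ok {
		wait := time.Until(rateErr.Rate.Reset.Time)
		log.Printf("GitHub rate limit exceeded; sleeping %s until the window resets", wait)
		time.Sleep(wait)
		err = call()
	}
	return err
}

func main() {
	client := github.NewClient(nil) // unauthenticated, for illustration only
	err := retryAfterReset(func() error {
		_, _, err := client.Repositories.Get(context.Background(), "sensu-plugins", "sensu-plugins-splunk")
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
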
DWSR commented 7 years ago

Have you tried disabling TF's state refresh on plan run via -refresh=false?

Also, have you tried limiting your plan runs to only certain repos via -target=?

mbbroberg commented 7 years ago

👋 been working with @majormoses on this one.

Have you tried disabling TF's state refresh on plan run via -refresh=false?

I haven't, I'll give it a go.

Also, have you tried limiting your plan runs to only certain repos via -target=?

Could use that sometimes, but not often. The overall goal is to manage the 200+ repos so sometimes we have to hit them all.

Thanks all 🙏

majormoses commented 7 years ago

I mean, those are temporary solutions. Like Matt said, running it on a subset of repos with -target kinda defeats the purpose. Using -refresh=false permanently sounds dangerous (it would be fine for an initial import) and defeats the main benefit of terraform in the first place, which is catching configuration drift.

DWSR commented 7 years ago

I understand that those are temporary solutions. I'm simply suggesting them because they are recommended in the official documentation as a way to work around rate limits. This can be done by breaking each step into its own action: for example, calling terraform refresh, then terraform plan -refresh=false -out=terraform.plan, then terraform apply terraform.plan.

Additionally, have you tried playing around with the parallelism setting? This may also have the effect that you're looking for.

majormoses commented 7 years ago

True. I can try to see if breaking the refresh into its own step helps, but that essentially means each iteration will likely take 3 hours to run through.

Hmm, I was not aware of parallelism... that might be interesting. Looking at the docs, it does not state what the default is, so I will look through the code to find it. I think that might be our most viable path forward at the moment. I'd like to get through a plan in a single go consistently, even if it takes a ridiculously long time. Some other options I have considered:

DWSR commented 7 years ago

parallelism affects the number of concurrent operations Terraform will use to make changes; the default value is 10. For GitHub, with such a large rate-limit window, it might not help, but it's another potential option for you to try.

I like the idea of allowing the GitHub provider to be an application rather than acting as a user, because it means that you don't have to worry about managing a separate user only for Terraform, and it nearly solves the rate limiting problem.

majormoses commented 7 years ago

Cool, thanks for the info on the default. I will try playing with that and see if I can get a short term path forward to allow us time to implement one of the real solutions.

majormoses commented 7 years ago

Unfortunately, I just tried, and even with -parallelism=1 I still cannot get through a refresh. It looks like I need to do some state surgery just to get things back to a point where a refresh can complete.

majormoses commented 7 years ago

Even after trying to prune some stuff out I hit 5000 requests in about a minute and 40 seconds...

DWSR commented 7 years ago

@majormoses Can you get through a state refresh without hitting the API rate limit?

majormoses commented 7 years ago

Nope, I almost did. With a bit more pruned I think I can, as there were only 5 that failed. Command:

$ time terraform refresh
TRUNCATED REFRESH
Error refreshing state: 5 error(s) occurred:

* module.sensu-plugins-splunk.github_repository.plugin: 1 error(s) occurred:

* module.sensu-plugins-splunk.github_repository.plugin: github_repository.plugin: GET https://api.github.com/repos/sensu-plugins/sensu-plugins-splunk: 403 API rate limit exceeded for majormoses.; rate reset in 58m52.923753888s
* module.sensu-plugins-unicorn.github_repository.plugin: 1 error(s) occurred:

* module.sensu-plugins-unicorn.github_repository.plugin: github_repository.plugin: GET https://api.github.com/repos/sensu-plugins/sensu-plugins-unicorn: 403 API rate limit of 5000 still exceeded until 2017-10-01 15:23:59 -0700 PDT, not making remote request.; rate reset in 58m52.919876599s
* module.sensu-plugins-feature-requests.github_repository.plugin: 1 error(s) occurred:

* module.sensu-plugins-feature-requests.github_repository.plugin: github_repository.plugin: GET https://api.github.com/repos/sensu-plugins/sensu-plugins-feature-requests: 403 API rate limit of 5000 still exceeded until 2017-10-01 15:23:59 -0700 PDT, not making remote request.; rate reset in 58m52.915608461s
* module.sensu-plugins-gearman.github_repository.plugin: 1 error(s) occurred:

* module.sensu-plugins-gearman.github_repository.plugin: github_repository.plugin: GET https://api.github.com/repos/sensu-plugins/sensu-plugins-gearman: 403 API rate limit exceeded for majormoses.; rate reset in 58m52.78593165s
* module.sensu-plugins-gpg.github_repository.plugin: 1 error(s) occurred:

* module.sensu-plugins-gpg.github_repository.plugin: github_repository.plugin: GET https://api.github.com/repos/sensu-plugins/sensu-plugins-gpg: 403 API rate limit of 5000 still exceeded until 2017-10-01 15:23:59 -0700 PDT, not making remote request.; rate reset in 58m52.930738518s

real    1m40.071s
user    1m9.196s
sys 0m4.108s

I will let you know in an hour or so if I can make it through it.

majormoses commented 7 years ago

I also need to add another half dozen repos, and I only see this problem getting worse, not better...

DWSR commented 7 years ago

@majormoses I know that this is probably not ideal, but have you tried using Terragrunt and its pattern? If all of your repos stem from the same few configurations, you could easily define those configurations as modules, then iterate through them using Terragrunt plus a shell script to space out Terragrunt executions. Obviously these are workarounds to the actual problem.

majormoses commented 7 years ago

We are using modules, and it's possible I might have to create a utility and maintain 200+ states. But this is ugly... If I'm gonna do that, I am considering just writing a custom utility with Octokit instead, if there is no reasonable path forward with terraform.

DWSR commented 7 years ago

Genuinely curious: why does the number of states matter? To me, it's better to have individual states for each repo, because I can change a repository's configuration individually and update it without disturbing my entire GH account. Additionally, I can just run a recursive search-and-replace to update all modules to their latest versions, should I need to. Everything's handled by the backend and the configuration is templated out anyway, so it's not like there's a ton of management overhead to worry about.

majormoses commented 7 years ago

Laziness: defining the defaults for a bunch of things once, backup and restore, etc. I guess it might make sense to have each repo carry its own terraform config, but my biggest issue is setting the same defaults 200 times. It also creates a chicken-and-egg scenario: while repos are often imported and managed in terraform after creation, where possible it's much easier to create a resource from terraform than to import it. It also means that if someone accidentally deleted some repos, we would have a harder time re-creating them. A single state also shows whether maintainers are making out-of-band changes across all the repos in the org. Basically, it comes down to not wanting the exact same overhead 200 times.

majormoses commented 7 years ago

Yay I got it to refresh and with a whole 98 requests to spare :cry:

real    1m36.256s
user    1m6.320s
sys 0m3.032s

{
  "resources": {
    "core": {
      "limit": 5000,
      "remaining": 98,
      "reset": 1506903500
    },
    "search": {
      "limit": 30,
      "remaining": 30,
      "reset": 1506900081
    },
    "graphql": {
      "limit": 5000,
      "remaining": 5000,
      "reset": 1506903621
    }
  },
  "rate": {
    "limit": 5000,
    "remaining": 98,
    "reset": 1506903500
  }
}

DWSR commented 7 years ago

@majormoses If it's a default, then setting the default value for the variable in the module should suffice for propagating the changes to all repositories using said default. No need to declare it in ~200 places. If someone deletes a repository, simply refreshing the state and recreating it as a single repo is much easier and lighter than refreshing your entire account and picking up the changes.

Regarding OOB changes, that's a workflow problem. The easiest way to deal with that is revoke everyone's administrative ability and then set aside an account for Terraform to use in combination with a CI/CD pipeline. Use PRs as a method of change control (team membership updates, repo owner wants to update the config, etc.) and you're off to the races.

I don't see many downsides to breaking out each repository into its own module (reliance on another utility is one, certainly), but see tremendous upside when allowing fully automated changes.

And YAY for getting it to refresh.

majormoses commented 7 years ago

Sorry, not defaults, I misspoke. I meant having to define the same variables.tf. It's just a lot of extra duplication.

Regarding out-of-band changes: yes, I do that at my job, but there I have a slew of Jenkins machines that run it on our behalf, with very complex flows. This is an open source project, and the closest we have is Travis. I am not sure how much effort it would take to build something like this with Travis, but it is a lot more time/effort than I have right now; I do much of this contribution on nights and weekends. Also, we would still need some maintainers to have admin access to accept new repos into the org, as I don't see terraform supporting a feature like transferring a repo to/from a user/org.

If someone deletes a repository, simply refreshing the state and recreating it as a single repo

I think I was not clear: I was referring to my contemplation of actually sticking each repo's terraform into the repo itself, and offering a counter-argument for why a mono repo with more restrictive permissions/access is better than a bunch of repos that are looser with access. The same thought process applies to the chicken-and-egg problem: having to create the GitHub repo and then import it into its own terraform state seems odd.

I don't see many downsides to breaking out each repository into its own module (reliance on another utility is one, certainly), but see tremendous upside when allowing fully automated changes.

Like I said it comes down to the time, effort, and lack of budget to split all that out and duplicate a lot of work.

DWSR commented 7 years ago

Avoiding defining the same variables over and over again is something Terragrunt is very strong at. You can even template out your CLI flags so that you don't need to remember them.

Importing a repository with Terragrunt vs. Terraform is almost identical. Transferring in a repository as an admin should be a very easy process. I suspect you could even register a webhook from Travis so that all newly imported repositories are automatically imported to Terraform/Terragrunt.

Sticking each repository's definition into itself (a la Dockerfile or Jenkinsfile) is an interesting idea...not sure how to work it all out, but that's an idea worth exploring. :)

majormoses commented 7 years ago

I took a closer look at Terragrunt, and while it does solve redefining variables, it does not solve the base problem: we want to ensure that all config drift is caught in a single command, without hitting API limits, even if it sits there for a really long time.

abridgett commented 6 years ago

FYI, there was https://github.com/hashicorp/terraform/pull/11670, which, if it's suitable, might be a quick win here. It seems to have been lost in the provider split. It's rather frustrating having to wait for the next set of tokens to become available, that's for sure :)

majormoses commented 6 years ago

Looking at the upstream library docs, https://github.com/google/go-github/blob/v15.0.0/github/doc.go#L124-L149, it looks like it supports conditional requests (not directly, but via a caching http.Transport).
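
Those docs recommend pairing the client with a caching transport so that repeated GETs become conditional requests; 304 Not Modified responses are served from cache and don't count against the rate limit. A minimal sketch of that pattern, assuming the gregjones/httpcache package the docs point to:

package main

import (
	"context"
	"fmt"

	"github.com/google/go-github/github"
	"github.com/gregjones/httpcache"
)

func main() {
	// The caching transport stores ETags and replays them as
	// If-None-Match headers on later requests, so unchanged
	// resources come back as 304s served from cache.
	client := github.NewClient(httpcache.NewMemoryCacheTransport().Client())

	// The second, identical request should be answered via a
	// conditional request and leave the remaining count untouched.
	for i := 0; i < 2; i++ {
		_, resp, err := client.Repositories.Get(context.Background(), "sensu-plugins", "sensu-plugins-splunk")
		if err != nil {
			panic(err)
		}
		fmt.Println("remaining:", resp.Rate.Remaining)
	}
}

Authenticated clients need a bit more transport wiring, but the idea is the same.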

ghost commented 6 years ago

Working with an internal Enterprise version of GitHub, we regularly hit the rate limit. When the limit is hit while running tf plan, it assumes the plan succeeded and then lists everything it failed to read as new resources to be added. It would be nice if the TF provider checked the rate limit value and at least aborted or failed when it hit the limit.

majormoses commented 6 years ago

@hwhopemaIBM since you are on GitHub Enterprise, you could ask your GitHub admin to change your rate limits: https://help.github.com/enterprise/2.13/admin/guides/installation/configuring-rate-limits. That being said, the issue is still a big one for both Enterprise and hosted solutions, and I think there are several things we can do (not all scoped to this issue):

ss-remygreinhofer commented 6 years ago

To add some feedback to the conversation, I am trying to manage a large GitHub organization with this provider and I am constantly blocked by the rate limit: 403 You have triggered an abuse detection mechanism. Please wait a few minutes before you try again..

docwhat commented 6 years ago

If we switched to making the changes via GraphQL (the v4 API) then it would prevent the case where only a partial change happens (in theory).
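
GraphQL could also relieve the rate-limit pressure discussed above, since a single query can fetch what would otherwise be hundreds of REST calls. A hypothetical sketch using the third-party shurcooL/githubv4 client (nothing the provider does today):

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/shurcooL/githubv4"
	"golang.org/x/oauth2"
)

func main() {
	// The v4 API requires authentication; the token comes from the environment.
	src := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := githubv4.NewClient(oauth2.NewClient(context.Background(), src))

	// One GraphQL query fetches up to 100 repositories; the same data
	// over REST would cost one call per repository.
	var q struct {
		Organization struct {
			Repositories struct {
				Nodes []struct {
					Name        githubv4.String
					Description githubv4.String
				}
			} `graphql:"repositories(first: 100)"`
		} `graphql:"organization(login: $org)"`
	}
	vars := map[string]interface{}{"org": githubv4.String("sensu-plugins")}
	if err := client.Query(context.Background(), &q, vars); err != nil {
		panic(err)
	}
	for _, repo := range q.Organization.Repositories.Nodes {
		fmt.Println(repo.Name)
	}
}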

docwhat commented 6 years ago

@ss-remygreinhofer

And don't forget:

nicksantamaria commented 5 years ago

To add some feedback to the conversation, I am trying to manage a large GitHub organization with this provider and I am constantly blocked by the rate limit: 403 You have triggered an abuse detection mechanism. Please wait a few minutes before you try again..

@ss-remygreinhofer This issue appears to be separate from the rate limit. Try adding -parallelism=1 to your plan/apply.