AWSCluster can trigger re-enqueue of all machines when ingress rules fail to update, hitting AWS API limits

randomvariable commented 3 years ago

/kind bug

What steps did you take and what happened: [A clear and concise description of what the bug is.]

In a scenario where CAPA is trying to reconcile security groups and fails, this triggers all machines to be requeued, and then causes API rate limiting against the EC2 API.

What did you expect to happen:

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

We deliberately re-enqueue all machines when a AWSCluster is updated in unpaused state in: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/v0.6.4/controllers/awsmachine_controller.go#L239-L240

We wanted to do that so that we have fast reconciliation, however, it's definitely causing the API rate limit to hit quickly. There's a simple reproduction too:

Create a cluster with CAPA, and then change the description of one of the rules, but keep the rules otherwise the same. Or do the following:

Environment:

Cluster-api-provider-aws version: v0.6.4
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

/priority important-soon /area networking

randomvariable commented 3 years ago

We probably should discuss in next office hours.

shivi28 commented 3 years ago

/assign @shivi28

sedefsavas commented 3 years ago

@shivi28 is there any update on this bug?

shivi28 commented 3 years ago

Hey @sedefsavas, I am going to add debouncing logic and introduce a debouncing window in AWSCluster and AWSMachine reconcilers. Will raise a PR in this week

sedefsavas commented 2 years ago

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2794#issuecomment-1160374446): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues and PRs according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue or PR with `/reopen` >- Mark this issue or PR as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

richardcase commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@richardcase: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2794#issuecomment-1160531573): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2794#issuecomment-1190445087): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues and PRs according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue or PR with `/reopen` >- Mark this issue or PR as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Ankitasw commented 1 year ago

/reopen /triage accepted /priority important-soon

k8s-ci-robot commented 1 year ago

@Ankitasw: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2794#issuecomment-1350907952): >/reopen >/triage accepted >/priority important-soon Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 1 year ago

The issue has been marked as an important bug and triaged. Such issues are automatically marked as frozen when hitting the rotten state to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

k8s-triage-robot commented 1 year ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

richardcase commented 1 year ago

From office hours 2023-04-03:

Do we care about re-reconciling if we don't hit api rate limits?
We have exponential backoffs
What about caching?
Also related is #4192

/triage accepted /priority important-soon

k8s-triage-robot commented 1 year ago

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

kubernetes-sigs / cluster-api-provider-aws

AWSCluster can trigger re-enqueue of all machines when ingress rules fail to update, hitting AWS API limits #2794