kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0
395 stars 302 forks source link

Drop Node event when EC2 instance does not exist #753

Closed cartermckinnon closed 11 months ago

cartermckinnon commented 11 months ago

What type of PR is this?

/kind bug

What this PR does / why we need it:

This avoids unnecessary retries when the ec2:CreateTags call fails with an InvalidInstanceId.NotFound error. Excessive retries for each event can lead to a growing work queue that may increase dequeue latency dramatically.

If the Node is newly-created, we requeue the event and retry, to handle eventual-consistency of this API. If the Node is not newly-created, we drop the event.

This PR is only concerned with retries for a single event; all nodes still have the implicit "retry" that results from each update event (every ~5m).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE
k8s-ci-robot commented 11 months ago

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
cartermckinnon commented 11 months ago

Something seems wonky with the CI, I'll look into it.

cartermckinnon commented 11 months ago

/retest

All failures are related to persistent volumes, doesn't seem related to this change.

cartermckinnon commented 11 months ago

/retest

cartermckinnon commented 11 months ago

The CI is definitely hosed, same cases have been red in k/k since ~11/24: https://testgrid.k8s.io/presubmits-ec2#pull-kubernetes-e2e-ec2

cartermckinnon commented 11 months ago

@tzneal added a couple unit test cases 👍

tzneal commented 11 months ago

Seems ok, any idea on the CI problem?

cartermckinnon commented 11 months ago

Haven't had a chance to go down the rabbit hole. Looks like things broke when the kubekins image was bumped in the test spec, I assume a change in cloud-provider-aws-test-infra is the issue but I don't see anything obvious in the commit log.

@dims do you have a guess?

dims commented 11 months ago

@cartermckinnon no, i have not looked at this yet ..

cartermckinnon commented 11 months ago

@dims I'll try to get a fix up 👍

cartermckinnon commented 11 months ago

CI should be fixed by this: https://github.com/kubernetes-sigs/provider-aws-test-infra/pull/232

cartermckinnon commented 11 months ago

/retest

k8s-ci-robot commented 11 months ago

@ndbaker1: changing LGTM is restricted to collaborators

In response to [this](https://github.com/kubernetes/cloud-provider-aws/pull/753#pullrequestreview-1760396657): >Thanks for picking this up! did we also want to make the queue size visible with some logging around [here](https://github.com/kubernetes/cloud-provider-aws/blob/11088788f267d02cf32ad177467312a11c5692ec/pkg/controllers/tagging/tagging_controller.go#L205)? (since dequeuing latency isn't the most direct metric) Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
cartermckinnon commented 11 months ago

did we also want to make the queue size visible

I plan to add a metric for this in a separate PR, because it'd be helpful to debug in the future; but I think dequeue latency is still the more important metric to track and alarm on. There can be many events in the queue that are no-ops, and that doesn't necessarily have an impact e.g. how quickly a new Node is tagged.

k8s-ci-robot commented 11 months ago

@ndbaker1: changing LGTM is restricted to collaborators

In response to [this](https://github.com/kubernetes/cloud-provider-aws/pull/753#pullrequestreview-1760448495): > Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dims commented 11 months ago

/approve /lgtm

k8s-ci-robot commented 11 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, mmerkes, ndbaker1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes/cloud-provider-aws/blob/master/OWNERS)~~ [dims] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment