cartermckinnon closed this 11 months ago
This issue is currently awaiting triage.
If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Something seems wonky with the CI; I'll look into it.
/retest
All failures are related to persistent volumes; they don't seem related to this change.
/retest
The CI is definitely hosed, same cases have been red in k/k since ~11/24: https://testgrid.k8s.io/presubmits-ec2#pull-kubernetes-e2e-ec2
@tzneal added a couple unit test cases 👍
Seems ok, any idea on the CI problem?
Haven't had a chance to go down the rabbit hole. Looks like things broke when the kubekins image was bumped in the test spec. I assume a change in cloud-provider-aws-test-infra is the issue, but I don't see anything obvious in the commit log.
@dims do you have a guess?
@cartermckinnon no, i have not looked at this yet ..
@dims I'll try to get a fix up 👍
CI should be fixed by this: https://github.com/kubernetes-sigs/provider-aws-test-infra/pull/232
/retest
@ndbaker1: changing LGTM is restricted to collaborators
Did we also want to make the queue size visible?
I plan to add a metric for this in a separate PR, because it'd be helpful for debugging in the future; but I think dequeue latency is still the more important metric to track and alarm on. There can be many events in the queue that are no-ops, and that doesn't necessarily have an impact on, e.g., how quickly a new Node is tagged.
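To make the distinction concrete, here is a minimal sketch of the two metrics being discussed, using the Prometheus client library. The metric and function names (`tagging_controller_workqueue_depth`, `ObserveDequeue`, etc.) are illustrative assumptions, not the identifiers used by cloud-provider-aws.

```go
// Illustrative sketch only: names and bucket choices are assumptions,
// not the actual metrics added by this controller.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Queue size: how many items are currently waiting.
	workQueueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "tagging_controller_workqueue_depth",
		Help: "Current number of items in the tagging controller work queue.",
	})

	// Dequeue latency: how long an item waits before being processed.
	// This is the metric the comment above suggests alarming on.
	dequeueLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "tagging_controller_dequeue_latency_seconds",
		Help:    "Time between enqueueing an item and starting to process it.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
	})
)

func init() {
	prometheus.MustRegister(workQueueDepth, dequeueLatency)
}

// SetQueueDepth would be called whenever the queue length changes.
func SetQueueDepth(n int) {
	workQueueDepth.Set(float64(n))
}

// ObserveDequeue would be called when an item is popped from the queue.
func ObserveDequeue(enqueuedAt time.Time) {
	dequeueLatency.Observe(time.Since(enqueuedAt).Seconds())
}
```

A deep queue full of no-op events can be harmless, which is why the latency histogram, rather than the depth gauge, is the better signal for alarming.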
@ndbaker1: changing LGTM is restricted to collaborators
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dims, mmerkes, ndbaker1
The full list of commands accepted by this bot can be found here.
The pull request process is described here
What type of PR is this?
/kind bug
What this PR does / why we need it:
This avoids unnecessary retries when the `ec2:CreateTags` call fails with an `InvalidInstanceId.NotFound` error. Excessive retries for each event can lead to a growing work queue that may increase dequeue latency dramatically.
If the Node is newly-created, we requeue the event and retry, to handle the eventual consistency of this API. If the Node is not newly-created, we drop the event.
This PR is only concerned with retries for a single event; all nodes still have the implicit "retry" that results from each update event (every ~5m).
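As a rough illustration of that decision, here is a hedged Go sketch. The type and function names (`taggingController`, `handleCreateTagsError`, `isInstanceNotFound`) and the grace-period threshold are assumptions for illustration, not necessarily the identifiers or values used in this PR.

```go
// Sketch only: names, threshold, and error detection are illustrative
// assumptions, not the PR's actual implementation.
package tagging

import (
	"strings"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// newNodeGracePeriod is an assumed window during which an
// InvalidInstanceId.NotFound error is treated as eventual consistency.
const newNodeGracePeriod = 5 * time.Minute

type taggingController struct {
	workqueue workqueue.RateLimitingInterface
}

// handleCreateTagsError decides whether a failed ec2:CreateTags call for a
// Node should be retried or dropped.
func (tc *taggingController) handleCreateTagsError(node *v1.Node, err error) {
	if !isInstanceNotFound(err) {
		// Unrelated failure: keep the usual rate-limited retry behavior.
		tc.workqueue.AddRateLimited(node.Name)
		return
	}
	if time.Since(node.CreationTimestamp.Time) < newNodeGracePeriod {
		// Newly-created Node: the instance may not be visible to the EC2
		// API yet, so requeue the event and retry.
		tc.workqueue.AddRateLimited(node.Name)
		return
	}
	// Older Node: drop the event rather than retrying. The Node will be
	// reconsidered on its next update event (roughly every 5 minutes).
	tc.workqueue.Forget(node.Name)
}

// isInstanceNotFound reports whether err carries the
// InvalidInstanceId.NotFound error code (string match shown for brevity).
func isInstanceNotFound(err error) bool {
	return err != nil && strings.Contains(err.Error(), "InvalidInstanceId.NotFound")
}
```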
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: