kubernetes / cloud-provider-aws

Cloud provider for AWS
https://cloud-provider-aws.sigs.k8s.io/
Apache License 2.0

EBS CSI Driver issue causing kubetest2 failures - IMDS metadata and Kubernetes metadata are both unavailable #1061

Open mmerkes opened 7 hours ago

mmerkes commented 7 hours ago

Which jobs are failing:

pull-cloud-provider-aws-e2e-kubetest2-quick
pull-cloud-provider-aws-e2e-kubetest2

Which test(s) are failing: BeforeSuite is failing because CPI nodes aren't stabilizing.

Since when has it been failing: This run passed on 10/31 and this one failed on 11/6, so the regression landed sometime between those two dates.

Testgrid link:

  1. First seen failure
  2. Failed 11/25

Reason for failure:

EBS CSI pod is not stabilizing:

2024-11-25T18:30:42.52251214Z stderr F I1125 18:30:42.522404       1 main.go:157] "Initializing metadata"
2024-11-25T18:30:47.523520821Z stderr F E1125 18:30:47.523424       1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, canceled, context deadline exceeded"
2024-11-25T18:30:47.530862069Z stderr F E1125 18:30:47.530760       1 metadata.go:58] "Retrieving Kubernetes metadata failed" err="could not retrieve instance type from topology label"
2024-11-25T18:30:47.530928736Z stderr F E1125 18:30:47.530882       1 main.go:162] "Failed to initialize metadata when it is required" err="IMDS metadata and Kubernetes metadata are both unavailable"

Anything else we need to know:

/kind failing-test

mmerkes commented 7 hours ago

/triage accepted

dims commented 7 hours ago

cc @ConnorJC3 @torredil

mmerkes commented 7 hours ago

Not sure if they're related to each other, but also see this error in kubelet:

Nov 25 18:34:03 ip-172-31-24-156 kubelet[6298]: E1125 18:34:03.425509 6298 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"aws-cloud-controller-manager\" with ImagePullBackOff: \"Back-off pulling image \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": failed to resolve reference \\\"209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea\\\": 209411653980.dkr.ecr.us-east-1.amazonaws.com/provider-aws/cloud-controller-manager:v1.30.0-beta.0-110-gac63fea: not found\"" pod="kube-system/aws-cloud-controller-manager-cq6m2" podUID="b6d43d27-1967-414e-86f8-72b3e9375664"

ConnorJC3 commented 7 hours ago

> Not sure if they're related to each other, but also see this error in kubelet:

Very likely related - as I believe it is the AWS CCM that adds the labels we rely on for metadata to the node.
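
A quick way to confirm this on an affected node is to check whether the expected labels are present at all. Below is a rough sketch that takes a node's label map and reports which labels are missing. The three keys are the well-known Kubernetes instance-type, region, and zone labels; that the driver needs exactly these three is an assumption based on the error message, and `missingLabels` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredLabels lists well-known node labels that a cloud controller manager
// normally populates. Treating exactly these as the driver's requirements is
// an assumption for this sketch.
var requiredLabels = []string{
	"node.kubernetes.io/instance-type",
	"topology.kubernetes.io/region",
	"topology.kubernetes.io/zone",
}

// missingLabels returns the required label keys that are absent or empty
// in nodeLabels, sorted for stable output.
func missingLabels(nodeLabels map[string]string) []string {
	var missing []string
	for _, key := range requiredLabels {
		if nodeLabels[key] == "" {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Illustrative labels for a node that the CCM never initialized.
	uninitialized := map[string]string{
		"kubernetes.io/hostname": "ip-172-31-24-156",
		"kubernetes.io/os":       "linux",
	}
	fmt.Println(missingLabels(uninitialized))
}
```

If the CCM pod never starts because of the image pull failure, a node would be expected to look like the `uninitialized` map here, which would line up with the "could not retrieve instance type from topology label" error.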

mmerkes commented 7 hours ago

> Very likely related - as I believe it is the AWS CCM that adds the labels we rely on for metadata to the node.

Sounds right. Looks like that's a red herring.