`ebs-csi-driver-test` pod errors but runs indefinitely

nikki-quant commented 4 weeks ago

/kind bug

What happened?

I upgraded a test Kubernetes cluster from 1.28 to 1.29 this afternoon and noticed that a node failed to drain because an ebs-csi-driver-test pod was still running. On inspection, the test pod was using a 1.31 client and expected a set of APIs not available in our Kubernetes versions, so it had errored out but continued to run. This is the image:

image: us-central1-docker.pkg.dev/k8s-staging-test-infra/images/kubekins-e2e:v20241011-e8871c079d-master

In our live clusters, the previous ebs-csi-driver-test pod had also failed in all of them, but there it moved into a failed state instead of continuing to run. That image was gcr.io/k8s-staging-test-infra/kubekins-e2e:v20240311-b09cdeb92c-master

The EBS CSI controller continued working normally but the newer image blocked kubectl drain:

% kubectl drain -l eks.amazonaws.com/nodegroup=<my-ng> --ignore-daemonsets --delete-emptydir-data
node/<node> cordoned
error: unable to drain node "<node>" due to error:cannot delete Pods declare no controller (use --force to override): kube-system/ebs-csi-driver-test, continuing command...
There are pending nodes to be drained:
<node>
cannot delete Pods declare no controller (use --force to override): kube-system/ebs-csi-driver-test

What you expected to happen?

The readme for the EBS CSI Driver project says that it is "compatible with all Kubernetes versions supported by the Kubernetes project and/or Amazon EKS (including extended support versions)".

As such, I'd expect that either (in order of preference):

1. Automated tests could run successfully against the full range of supported Kubernetes versions (1.28 - 1.31) using the default chart values.
2. There was a set of test container images for previous Kubernetes versions available to provide as chart values.
3. It is documented that tests will fail against versions prior to 1.30.

How to reproduce it (as minimally and precisely as possible)?

Run the current default us-central1-docker.pkg.dev/k8s-staging-test-infra/images/kubekins-e2e:v20241011-e8871c079d-master against a Kubernetes API server on version 1.29 or earlier.

Anything else we need to know?:

Environment

Kubernetes version (use kubectl version): 1.28, 1.29.
Chart version: 2.31.0
Driver version: 1.31.0

torredil commented 3 weeks ago

Hi @nikki-quant, thank you for reporting this.

I was able to reproduce this behavior using the latest version of the driver in EKS v1.29.

In my experiments, the ebs-csi-driver-test pod hangs because the endpoint used to retrieve the test package version is not constructed correctly. For example, as can be seen in the logs below, KUBE_VERSION is set to 1.29+ instead of 1.29, which results in the following invalid URL: https://dl.k8s.io/release/stable-1.29+.txt.

kubectl logs ebs-csi-driver-test -n kube-system -f  

Cluster "cluster" set.
Context "kubetest2" created.
User "sa" set.
Context "kubetest2" modified.
Switched to context "kubetest2".
NAME                                      CREATED AT
volumesnapshots.snapshot.storage.k8s.io   2024-10-30T20:16:02Z
Detecting Kubernetes server version
WARNING: version difference between client (1.31) and server (1.29) exceeds the supported minor version skew of +/-1
Detected KUBE_VERSION=1.29+
Fetching the stable test package version for KUBE_VERSION=1.29+
Fetched test package version <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: 767373bbdcb8270361b96548387bf2a9ad0d48758c35/release/stable-1.29 .txt</Details></Error>
Starting kubetest2 with ginkgo tests...

Ideally, we should:

a) trim any invalid trailing characters at the end of the version string.
b) stop test execution if the test package version retrieval fails.
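
As a rough illustration of (a) and (b), here is a minimal POSIX shell sketch. It is not the actual test script: the function names `normalize_kube_version`, `validate_test_package_version`, and `fetch_test_package_version` are hypothetical, and it assumes the stable-version endpoint returns a single release tag of the form `vMAJOR.MINOR.PATCH`.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the driver's test script.

# (a) Reduce a reported version such as "1.29+" to "MAJOR.MINOR".
normalize_kube_version() {
  # Drop an optional leading "v", then keep only the leading MAJOR.MINOR digits.
  minor_version=$(printf '%s\n' "${1#v}" |
    sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p')
  if [ -z "$minor_version" ]; then
    echo "unrecognized Kubernetes version: $1" >&2
    return 1
  fi
  printf '%s\n' "$minor_version"
}

# (b) Accept only a value that looks like a release tag (assumed format: v1.29.10).
validate_test_package_version() {
  if printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'; then
    printf '%s\n' "$1"
  else
    echo "invalid test package version: $1" >&2
    return 1
  fi
}

# Combine both steps; returns non-zero instead of continuing with a bad value.
fetch_test_package_version() {
  kube_minor=$(normalize_kube_version "$1") || return 1
  # curl -f turns HTTP errors such as 404 into a non-zero exit status.
  package_version=$(curl -fsSL "https://dl.k8s.io/release/stable-${kube_minor}.txt") || return 1
  validate_test_package_version "$package_version"
}
```

A caller could then run `TEST_PACKAGE_VERSION=$(fetch_test_package_version "$KUBE_VERSION") || exit 1` before starting kubetest2, so the pod exits with an error rather than continuing with an XML error body as the version.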

torredil commented 3 weeks ago

/assign

nikki-quant commented 3 weeks ago

Nice one, thank you @torredil

kubernetes-sigs / aws-ebs-csi-driver

`ebs-csi-driver-test` pod errors but runs indefinitely #2198