scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0
57 stars 95 forks source link

[K8S] AWS gp2 CSI driver may fail to provision a PV causing minio helm installation also to fail #6808

Open vponomaryov opened 12 months ago

vponomaryov commented 12 months ago

Prerequisites

Versions

Logs

Description

In our K8S deployments we install MinIO service for serving the scylla-manager as a S3-compatible backend. For the minio pod we create PVC using default storage class. In case of the EKS it is CSI driver using the gp2 AWS volumes. And it may occasionally fail like below:

97s       Normal    Provisioning              persistentvolumeclaim/minio    ebs.csi.aws.com_ebs-csi-controller-5f66766bc-flzfq_262554e3-7239-41a6-9c0e-0af1e905b1f1   External provisioner is provisioning volume for claim "minio/minio"                                                                                                                                                                                                                                                                                                                                                                 5m58s        9       minio.1796c3c8652b1783
26s       Normal    ExternalProvisioning      persistentvolumeclaim/minio    persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator                                                                                                                                                                                                                                                                                                    5m58s        25      minio.1796c3c8656bd904
5m54s     Warning   ProvisioningFailed        persistentvolumeclaim/minio    ebs.csi.aws.com_ebs-csi-controller-5f66766bc-flzfq_262554e3-7239-41a6-9c0e-0af1e905b1f1   failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-0ec6de4b-c91d-4be7-89c5-00e16b5ff876": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-0d55510121f47a969' does not exist....                                                                                                                                                   5m54s        1       minio.1796c3c936802d5e
97s       Warning   ProvisioningFailed        persistentvolumeclaim/minio    ebs.csi.aws.com_ebs-csi-controller-5f66766bc-flzfq_262554e3-7239-41a6-9c0e-0af1e905b1f1   failed to provision volume with StorageClass "gp2": rpc error: code = AlreadyExists desc = Could not create volume "pvc-0ec6de4b-c91d-4be7-89c5-00e16b5ff876": Parameters on this idempotent request are inconsistent with parameters used in previous request(s)                                                                                                                                                                   5m53s        8       minio.1796c3c980a90ed4

It causes following consequences to the MinIO pods:

minio    minio-557d94b687-fpg4r            0/1     Pending            0               5m11s   <none>       <none>
minio    minio-make-bucket-job-688m4       0/1     CrashLoopBackOff   3 (22s ago)     5m11s   10.0.2.68    ip-10-0-2-134.eu-north-1.compute.internal

And in Jenkins we see it as the Helm crash:

2023-11-12 03:52:53,735 f:__init__.py     l:453  c:sdcm.cluster_k8s     p:INFO  > eu-north-1: Create 'minio' namespace
2023-11-12 03:57:54,859 f:base.py         l:146  c:LocalCmdRunner       p:ERROR > Error executing command: "HELM_CONFIG_HOME=/home/ubuntu/sct-results/20231112-033228-042769/longevity-scylla-operator-3h-multit-334e4808/.helm helm --kubeconfig=/home/ubuntu/sct-results/20231112-033228-042769/longevity-scylla-operator-3h-multit-334e4808/.kube/config --namespace minio install minio /home/ubuntu/scylla-cluster-tests/sdcm/k8s_configs/minio --debug -f /tmp/tmpiasyrju0"; Exit status: 1
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Exception in setUp. Will call tearDown < t:2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Exception in setUp. Will call tearDown
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Traceback (most recent call last):
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 164, in wrapper
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return method(*args, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 119, in inner
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     res = func(*args, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 849, in setUp
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     self.init_resources()
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1884, in init_resources
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     self.get_cluster_k8s_eks()
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1774, in get_cluster_k8s_eks
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     k8s_cluster.deploy_scylla_manager(pool_name=k8s_cluster.AUXILIARY_POOL_NAME)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_k8s/eks.py", line 448, in deploy_scylla_manager
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     self.deploy_minio_s3_backend()
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 119, in inner
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     res = func(*args, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster_k8s/__init__.py", line 882, in deploy_minio_s3_backend
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     self.log.debug(self.helm_install(
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/k8s/__init__.py", line 718, in _helm_install_or_upgrade
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return self.helm(
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/k8s/__init__.py", line 681, in helm
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = LOCALRUNNER.run(cmd)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/local_cmd_runner.py", line 87, in run
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = _run()
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 65, in inner
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return func(*args, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/local_cmd_runner.py", line 77, in _run
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     result = self.connection.local(**command_kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 750, in local
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return super(Connection, self).run(*args, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/invoke/context.py", line 95, in run
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return self._run(runner, command, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/invoke/context.py", line 102, in _run
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return runner.run(command, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/invoke/runners.py", line 380, in run
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return self._run_body(command, **kwargs)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/invoke/runners.py", line 442, in _run_body
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     return self.make_promise() if self._asynchronous else self._finish()
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   File "/usr/local/lib/python3.10/site-packages/invoke/runners.py", line 509, in _finish
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >     raise UnexpectedExit(result)
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Command: 'HELM_CONFIG_HOME=/home/ubuntu/sct-results/20231112-033228-042769/longevity-scylla-operator-3h-multit-334e4808/.helm helm --kubeconfig=/home/ubuntu/sct-results/20231112-033228-042769/longevity-scylla-operator-3h-multit-334e4808/.kube/config --namespace minio install minio /home/ubuntu/scylla-cluster-tests/sdcm/k8s_configs/minio --debug -f /tmp/tmpiasyrju0'
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Exit code: 1
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Stdout:
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > Stderr:
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > github.com/spf13/cobra.(*Command).ExecuteC
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   github.com/spf13/cobra@v1.6.1/command.go:1044
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > github.com/spf13/cobra.(*Command).Execute
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   github.com/spf13/cobra@v1.6.1/command.go:968
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > main.main
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   helm.sh/helm/v3/cmd/helm/helm.go:83
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > runtime.main
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   runtime/proc.go:250
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > runtime.goexit
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR >   runtime/asm_amd64.s:1598
2023-11-12 03:57:54,860 f:tester.py       l:171  c:sdcm.tester          p:ERROR > 

Steps to Reproduce

  1. Run any EKS CI jobs until this failure happens
  2. See error

Expected behavior: MinIO service must start up correctly

Actual behavior: MinIO service happens to fail the start up

fruch commented 12 months ago

we are using:

< t:2023-11-12 03:49:09,392 f:eks.py          l:372  c:sdcm.cluster_k8s     p:DEBUG >            'addonName': 'aws-ebs-csi-driver',
< t:2023-11-12 03:49:09,392 f:eks.py          l:372  c:sdcm.cluster_k8s     p:DEBUG >            'addonVersion': 'v1.24.1-eksbuild.1',
fruch commented 12 months ago

someone is deleting the volume from under our feet, and I've found the perpetrator:

image

I'm guessing we are not labeling it in any fashion, or the cleanup script only find one which aren't attached

but now we know the cause

fruch commented 12 months ago

the cleanup script is looking for "keep: alive" label and if it's in-use, otherwise once in an hour it would get deleted

so we have a race condition that a volume can be deleted before it's in use we probably should look at the creation time, and give it an hour or so...