Open vponomaryov opened 12 months ago
we are using:
< t:2023-11-12 03:49:09,392 f:eks.py l:372 c:sdcm.cluster_k8s p:DEBUG > 'addonName': 'aws-ebs-csi-driver',
< t:2023-11-12 03:49:09,392 f:eks.py l:372 c:sdcm.cluster_k8s p:DEBUG > 'addonVersion': 'v1.24.1-eksbuild.1',
someone is deleting the volume from under our feet, and I've found the perpetrator:
I'm guessing we are not labeling it in any fashion, or the cleanup script only finds the ones which aren't attached
but now we know the cause
the cleanup script checks for the "keep: alive" label and whether the volume is in-use; if neither applies, the volume gets deleted by the run that happens once an hour
so we have a race condition: a volume can be deleted before it's in use. we probably should look at the creation time, and give it an hour or so...
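A minimal sketch of the creation-time check suggested above. The real cleanup script is not shown in this thread, so the function name `should_delete_volume`, the `GRACE_PERIOD` constant, and the mapping of the "keep: alive" label to an EC2 tag with key `keep` and value `alive` are all assumptions for illustration. The volume dict is assumed to have the shape of an entry returned by boto3's `describe_volumes` (`State`, `CreateTime`, `Tags`).

```python
from datetime import datetime, timedelta, timezone

# Assumption: "an hour or so" from the comment above.
GRACE_PERIOD = timedelta(hours=1)


def should_delete_volume(volume, now=None, grace_period=GRACE_PERIOD):
    """Decide whether a volume is safe to clean up.

    `volume` is assumed to look like one entry of the EC2
    describe_volumes response: 'State', 'CreateTime' (tz-aware
    datetime) and an optional 'Tags' list of {'Key', 'Value'} dicts.
    """
    now = now or datetime.now(timezone.utc)
    tags = {tag["Key"]: tag["Value"] for tag in volume.get("Tags", [])}

    # Explicitly protected volumes are never deleted.
    if tags.get("keep") == "alive":
        return False
    # Attached volumes are never deleted.
    if volume["State"] == "in-use":
        return False
    # New check: a freshly created volume may simply not be attached yet,
    # so give it a grace period before treating it as orphaned.
    if now - volume["CreateTime"] < grace_period:
        return False
    return True
```

With this check, a volume created a few minutes before the cleanup run is skipped and only gets picked up by a later run if it is still unattached and unlabeled.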
Prerequisites
Versions
Logs
Description
In our K8S deployments we install the MinIO service, which serves scylla-manager as an S3-compatible backend. For the MinIO pod we create a PVC using the default storage class. On EKS, that is the CSI driver using gp2 AWS volumes. It may occasionally fail like below:
It causes the following consequences for the MinIO pods:
And in Jenkins we see it as a Helm crash:
Steps to Reproduce
Expected behavior: MinIO service must start up correctly
Actual behavior: MinIO service occasionally fails to start up