Closed vponomaryov closed 1 year ago
@zimnx I've hit this locally as well, but what struck me is that, because the socket is a file mounted from the host (`/var/lib/kubelet/plugins/local.csi.scylladb.com/csi.sock`), this won't fix itself no matter how many restarts happen (in contrast with TIME_WAIT). It won't even get fixed by a node reboot, so someone has to delete that file manually, which sucks.
We need to prioritize a fix.
Also, this seems to reproduce pretty consistently after ungraceful VM shutdowns.
Issue description
One of the CSI driver instances failed to start with the following error:
Its liveness probe results:
Impact
Creation of a Scylla member fails.
How frequently does it reproduce?
~5-10%
Installation details
Kernel Version: 5.10.179-168.710.amzn2.x86_64
Scylla version (or git commit hash): `2022.2.9-20230618.843304f9f734` with build-id `a34753ee38bccbaf461e04ae0e63e17afe45e048`
K8S local-volume-provisioner image: docker.io/scylladb/k8s-local-volume-provisioner:0.1.0-rc.0
Operator Image: scylladb/scylla-operator:1.9.0-rc.1
Operator Helm Version: 1.9.0-rc.1
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 nodes (i3.2xlarge)
OS / Image: `` (k8s-eks: undefined_region)
Test: `perf-regression-throughput-eks`
Test id: `657ce2c4-2bbf-4941-8d23-8be7f0a2487d`
Test name: `scylla-operator/operator-1.9/performance/perf-regression-throughput-eks`
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 657ce2c4-2bbf-4941-8d23-8be7f0a2487d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=657ce2c4-2bbf-4941-8d23-8be7f0a2487d)
- Show all stored logs command: `$ hydra investigate show-logs 657ce2c4-2bbf-4941-8d23-8be7f0a2487d`

## Logs:

- **db-cluster-657ce2c4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/db-cluster-657ce2c4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/db-cluster-657ce2c4.tar.gz)
- **sct-runner-events-657ce2c4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/sct-runner-events-657ce2c4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/sct-runner-events-657ce2c4.tar.gz)
- **sct-657ce2c4.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/sct-657ce2c4.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/sct-657ce2c4.log.tar.gz)
- **monitor-set-657ce2c4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/monitor-set-657ce2c4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/monitor-set-657ce2c4.tar.gz)
- **loader-set-657ce2c4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/loader-set-657ce2c4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/loader-set-657ce2c4.tar.gz)
- **kubernetes-657ce2c4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/kubernetes-657ce2c4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/kubernetes-657ce2c4.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.9/job/performance/job/perf-regression-throughput-eks/8/)
[Argus](https://argus.scylladb.com/test/e8c0209f-72d9-48d0-bd13-a45fa9e23e34/runs?additionalRuns[]=657ce2c4-2bbf-4941-8d23-8be7f0a2487d)