scylladb / local-csi-driver

ScyllaDB local volume provisioner for Kubernetes based on CSI
Apache License 2.0

Error: can't listen on "/csi/csi.sock" using unix protocol: listen unix /csi/csi.sock: bind: address already in use #21

Closed · vponomaryov closed this issue 1 year ago

vponomaryov commented 1 year ago

Issue description

One of the CSI driver instances failed to start with the following error:

I0623 17:30:37.071240       1 local-csi-driver/driver.go:119] "Driver started" command="local-csi-driver" version="\"v0.1.0-beta.1-0-ga59b0f8\""
I0623 17:30:37.071291       1 flag/flags.go:64] FLAG: --driver-name="local.csi.scylladb.com"
I0623 17:30:37.071298       1 flag/flags.go:64] FLAG: --help="false"
I0623 17:30:37.071303       1 flag/flags.go:64] FLAG: --listen="/csi/csi.sock"
I0623 17:30:37.071306       1 flag/flags.go:64] FLAG: --loglevel="2"
I0623 17:30:37.071310       1 flag/flags.go:64] FLAG: --node-name="ip-10-12-7-103.ec2.internal"
I0623 17:30:37.071314       1 flag/flags.go:64] FLAG: --v="2"
I0623 17:30:37.071317       1 flag/flags.go:64] FLAG: --volumes-dir="/mnt/persistent-volumes"
Error: can't listen on "/csi/csi.sock" using unix protocol: listen unix /csi/csi.sock: bind: address already in use
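For context, binding an AF_UNIX socket fails with EADDRINUSE whenever the socket file already exists on disk, whether or not any process is still listening on it. A minimal Go sketch (using a hypothetical path /tmp/csi-demo.sock, not the driver's actual code) reproduces the same error:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	const sock = "/tmp/csi-demo.sock" // hypothetical path, stands in for /csi/csi.sock

	// First bind creates the socket file and succeeds.
	l, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}

	// Simulate an ungraceful exit: close the listener without unlinking the
	// socket file, which is roughly what a crash or SIGKILL leaves behind.
	l.(*net.UnixListener).SetUnlinkOnClose(false)
	l.Close()

	// A fresh bind against the leftover file fails exactly like the driver:
	// listen unix /tmp/csi-demo.sock: bind: address already in use
	if _, err := net.Listen("unix", sock); err != nil {
		fmt.Println(err)
	}
}
```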

Its liveness probe logs:

I0623 16:26:34.852622       1 main.go:149] calling CSI driver to discover driver name
I0623 16:26:34.853747       1 main.go:155] CSI driver name: "local.csi.scylladb.com"
I0623 16:26:34.853773       1 main.go:183] ServeMux listening at "0.0.0.0:9809"
W0623 16:46:30.509744       1 connection.go:173] Still connecting to unix:///csi/csi.sock
E0623 16:47:08.559050       1 main.go:64] failed to establish connection to CSI driver: context deadline exceeded
E0623 16:47:15.126560       1 main.go:64] failed to establish connection to CSI driver: context canceled
W0623 16:47:15.345008       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:15.424984       1 connection.go:173] Still connecting to unix:///csi/csi.sock
E0623 16:47:15.430405       1 main.go:64] failed to establish connection to CSI driver: context canceled
E0623 16:47:15.474940       1 main.go:64] failed to establish connection to CSI driver: context canceled
E0623 16:47:15.561968       1 connection.go:132] Lost connection to unix:///csi/csi.sock.
E0623 16:47:15.593518       1 main.go:64] failed to establish connection to CSI driver: context canceled
W0623 16:47:18.054903       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:25.236508       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:25.349541       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:25.500791       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:25.660422       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:30.794906       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:35.503076       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:35.595067       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:35.803256       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:36.158894       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:38.055628       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:50.060415       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:48:25.593549       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:47:53.150225       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:48:07.698577       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:48:22.719607       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:48:24.376925       1 connection.go:173] Still connecting to unix:///csi/csi.sock
E0623 16:48:25.422562       1 main.go:64] failed to establish connection to CSI driver: context canceled
E0623 16:48:25.422706       1 main.go:64] failed to establish connection to CSI driver: context canceled
E0623 16:48:25.423525       1 main.go:64] failed to establish connection to CSI driver: context canceled
E0623 16:48:25.485253       1 main.go:64] failed to establish connection to CSI driver: context canceled
E0623 16:48:25.424495       1 main.go:64] failed to establish connection to CSI driver: context canceled
W0623 16:48:25.714499       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0623 16:48:25.726677       1 connection.go:173] Still connecting to unix:///csi/csi.sock
...
< ~2600 more 'Still connecting to unix:///csi/csi.sock' messages>
...
W0623 17:32:08.055199       1 connection.go:173] Still connecting to unix:///csi/csi.sock

Impact

Breaks creation of a Scylla cluster member.

How frequently does it reproduce?

~5-10%

Installation details

Kernel Version: 5.10.179-168.710.amzn2.x86_64
Scylla version (or git commit hash): 2022.2.9-20230618.843304f9f734 with build-id a34753ee38bccbaf461e04ae0e63e17afe45e048

K8S local-volume-provisioner image: docker.io/scylladb/k8s-local-volume-provisioner:0.1.0-rc.0

Operator Image: scylladb/scylla-operator:1.9.0-rc.1
Operator Helm Version: 1.9.0-rc.1
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 3 nodes (i3.2xlarge)

OS / Image: `` (k8s-eks: undefined_region)

Test: perf-regression-throughput-eks
Test id: 657ce2c4-2bbf-4941-8d23-8be7f0a2487d
Test name: scylla-operator/operator-1.9/performance/perf-regression-throughput-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 657ce2c4-2bbf-4941-8d23-8be7f0a2487d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=657ce2c4-2bbf-4941-8d23-8be7f0a2487d)
- Show all stored logs command: `$ hydra investigate show-logs 657ce2c4-2bbf-4941-8d23-8be7f0a2487d`

Logs:

- **db-cluster-657ce2c4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/db-cluster-657ce2c4.tar.gz
- **sct-runner-events-657ce2c4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/sct-runner-events-657ce2c4.tar.gz
- **sct-657ce2c4.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/sct-657ce2c4.log.tar.gz
- **monitor-set-657ce2c4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/monitor-set-657ce2c4.tar.gz
- **loader-set-657ce2c4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/loader-set-657ce2c4.tar.gz
- **kubernetes-657ce2c4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/657ce2c4-2bbf-4941-8d23-8be7f0a2487d/20230623_174133/kubernetes-657ce2c4.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-1.9/job/performance/job/perf-regression-throughput-eks/8/)
[Argus](https://argus.scylladb.com/test/e8c0209f-72d9-48d0-bd13-a45fa9e23e34/runs?additionalRuns[]=657ce2c4-2bbf-4941-8d23-8be7f0a2487d)
tnozicka commented 1 year ago

@zimnx I've hit this locally as well, but what struck me is that, because the socket is a file mounted from the host (/var/lib/kubelet/plugins/local.csi.scylladb.com/csi.sock), this won't fix itself no matter how many times the container restarts (in contrast with a TIME_WAIT situation on a TCP port). It won't even get fixed by a node reboot, so someone has to delete that file manually, which sucks.

We need to prioritize a fix.
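For reference, a common mitigation in other CSI drivers is to unlink any leftover socket file before binding; whether that is the right fix here is for the maintainers to decide, but a sketch (with a hypothetical listenUnix helper, not the driver's actual code) would look roughly like:

```go
package driver

import (
	"fmt"
	"net"
	"os"
)

// listenUnix sketches the usual workaround: remove any stale socket file
// before binding, since the file lives on a host path that survives both
// container restarts and node reboots.
func listenUnix(sockPath string) (net.Listener, error) {
	// A previous instance that died ungracefully leaves the socket file
	// behind; ENOENT is fine, anything else is a real error.
	if err := os.Remove(sockPath); err != nil && !os.IsNotExist(err) {
		return nil, fmt.Errorf("can't remove stale socket %q: %w", sockPath, err)
	}
	return net.Listen("unix", sockPath)
}
```

One caveat: blindly unlinking would also remove the socket of a healthy instance if two drivers ever shared the path, so a more careful variant first tries to dial the socket and only removes the file when the connection is refused.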

tnozicka commented 1 year ago

Also, this seems to reproduce pretty consistently after ungraceful VM shutdowns.