shangdibufashi closed this issue 1 year ago
Hi @shangdibufashi, please check your cluster against this doc. If there is no gain, please upload the kubelet logs from the node and we will take a look.
ps -ef | grep kubelet | grep root-dir
kubectl -n kube-system get po -owide | grep juicefs
juicefs-csi-controller-0                                          3/3   Running   0   18h   10.244.0.135   gpu4030
juicefs-csi-node-2pfjp                                            3/3   Running   0   18h   10.244.0.134   gpu4030
juicefs-gpu4030-juicefs-static-pv-rmzbmp                          1/1   Running   0   18h   10.244.0.136   gpu4030
juicefs-gpu4030-pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a-abpiyj   1/1   Running   0   18h   10.244.0.138   gpu4030
kubectl get po -owide -A | grep 'pod-ui-56d7b79fc8-t2wdm'
default   pod-ui-56d7b79fc8-t2wdm   0/1   ContainerCreating   0   18h   gpu4030
Hi @shangdibufashi , can you provide:
Sure, thanks for your reply. Once it occurs again, all of the data/logs you mentioned above will be collected.
the version of csi (which image): juicedata/juicefs-csi-driver:v0.16.1
the complete event of pod:
FailedMount Unable to attach or mount volumes: unmounted volumes=[photo-volume], unattached volumes=[photo-volume kube-api-access-8kwrj]: timed out waiting for the condition
the log of juicefs csi node:
I1120 17:25:21.097358 7 main.go:126] Pod Reconciler Started
I1120 17:25:21.097456 7 driver.go:31] Driver: csi.juicefs.com version v0.16.1 commit be7b8c2fa8612b028a6279a09a72b150b8c237c3 date 2022-08-10T02:34:04Z
I1120 17:25:22.273856 7 driver.go:96] Listening for connection on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I1120 17:33:28.559939 7 kubelet_client.go:170] GetNodeRunningPods err: Unauthorized
E1120 17:33:28.559999 7 reconciler.go:70] doReconcile GetNodeRunningPods: invalid character 'U' looking for beginning of value
[the same "GetNodeRunningPods err: Unauthorized" / "doReconcile GetNodeRunningPods: invalid character 'U' looking for beginning of value" pair repeats every 5 seconds, 17:33:33 through 17:35:28]
the log of kubelet:
Nov 21 15:25:06 gpu4030 kubelet: W1121 15:25:06.881003 1878 container.go:586] Failed to update stats for container "/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod76fc4367_cf0e_4b01_91ab_a375a4733a5b.slice/docker-be0c75184cbbd39ede0de84850c59774d1b690f16514ef2d41ec7a7bcf7375cb.scope": unable to determine device info for dir: /home/docker/overlay2/731ebe38918c9cc5441877191de59821df225879a1ca54695a0e5f5b8f40dea9/diff: stat failed on /home/docker/overlay2/731ebe38918c9cc5441877191de59821df225879a1ca54695a0e5f5b8f40dea9/diff with error: no such file or directory, continuing to push stats
[the same "Failed to update stats for container" warning repeats for the same three containers throughout the log]
Nov 21 15:26:29 gpu4030 kubelet: E1121 15:26:29.473846 1878 kubelet.go:1720] "Unable to attach or mount volumes for pod; skipping pod" err="unmounted volumes=[photo-volume], unattached volumes=[photo-volume kube-api-access-8kwrj]: timed out waiting for the condition" pod="default/ai-serve-gradio-ui-6bffc8b5f6-v952s"
Nov 21 15:26:29 gpu4030 kubelet: E1121 15:26:29.473904 1878 pod_workers.go:918] "Error syncing pod, skipping" err="unmounted volumes=[photo-volume], unattached volumes=[photo-volume kube-api-access-8kwrj]: timed out waiting for the condition" pod="default/ai-serve-gradio-ui-6bffc8b5f6-v952s" podUID=10769a93-1aed-4c27-b0c1-31bc4c3d1751
Nov 21 15:26:35 gpu4030 kubelet: E1121 15:26:35.899059 1878 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/csi.juicefs.com^pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a podName: nodeName:}" failed. No retries permitted until 2022-11-21 15:28:37.899015533 +0800 CST m=+1199547.471954789 (durationBeforeRetry 2m2s). Error: MountVolume.SetUp failed for volume "pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a" (UniqueName: "kubernetes.io/csi/csi.juicefs.com^pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a") pod "ai-serve-gradio-ui-6bffc8b5f6-v952s" (UID: "10769a93-1aed-4c27-b0c1-31bc4c3d1751") : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.juicefs.com not found in the list of registered CSI drivers
Nov 21 15:28:37 gpu4030 kubelet: E1121 15:28:37.971844 1878 nestedpendingoperations.go:335] [the same MountVolume.SetUp failure with "driver name csi.juicefs.com not found in the list of registered CSI drivers"; no retries permitted until 15:30:39]
Nov 21 15:28:43 gpu4030 kubelet: [the same "Unable to attach or mount volumes for pod; skipping pod" and "Error syncing pod, skipping" errors repeat for the same pod]
the log of csi controller:
I1120 17:25:10.779521 7 driver.go:31] Driver: csi.juicefs.com version v0.16.1 commit be7b8c2fa8612b028a6279a09a72b150b8c237c3 date 2022-08-10T02:34:04Z
I1120 17:25:12.379472 7 request.go:665] Waited for 1.047745362s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/direct.csi.min.io/v1beta2?timeout=32s
I1120 17:25:13.186066 7 main.go:137] Mount Manager Started
I1120 17:25:13.855365 7 driver.go:96] Listening for connection on address: &net.UnixAddr{Name:"/var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
It seems the CSI node pod started after the application pod's mount was attempted when the node was restarted, so the CSI driver was not yet registered with the kubelet.
At the same time, the CSI node pod logs doReconcile GetNodeRunningPods: invalid character 'U' looking for beginning of value, i.e. it cannot connect to the kubelet.
Hi @shangdibufashi , it seems the request from your csi-node is denied by kubelet. Could you provide the kubelet configuration?
For reference https://stackoverflow.com/questions/52268367/how-to-check-kubelet-configurations-currently-in-use
thanks @Hexilee
here it is:
{
  "kubeletconfig": {
    "enableServer": true,
    "staticPodPath": "/etc/kubernetes/manifests",
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
    "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
    "rotateCertificates": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/etc/kubernetes/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 5,
    "eventBurst": 10,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": ["10.96.0.10"],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 5,
    "kubeAPIBurst": 10,
    "serializeImagePulls": true,
    "evictionHard": {
      "imagefs.available": "15%",
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "enforceNodeAllocatable": ["pods"],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "logging": {
      "format": "text"
    },
    "enableSystemLogHandler": true,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.8
  }
}
@shangdibufashi Fine, it seems your kubelet has disabled anonymous access. Could you describe the csi-node Pods with kubectl describe -n kube-system pods -l app=juicefs-csi-node? And if you can open a shell in a csi-node Pod with kubectl exec, could you execute the following command?

curl https://<hostIP>:10250/pods/ --insecure -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
describe csi-node Pods:
Name: juicefs-csi-node-q9n4v
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: gpu4030/192.168.67.179
Start Time: Mon, 07 Nov 2022 18:05:35 +0800
Labels: app=juicefs-csi-node
app.kubernetes.io/instance=juicefs-csi-driver
app.kubernetes.io/name=juicefs-csi-driver
app.kubernetes.io/version=master
controller-revision-hash=d9476846
pod-template-generation=1
Annotations: <none>
Status: Running
IP: 10.244.0.174
IPs:
IP: 10.244.0.174
Controlled By: DaemonSet/juicefs-csi-node
Containers:
juicefs-plugin:
Container ID: docker://5de96793ef20a8515b257a6113642c32b7ebc94d16407dff77bbf7f42289dc84
Image: juicedata/juicefs-csi-driver:v0.16.1
Image ID: docker-pullable://juicedata/juicefs-csi-driver@sha256:f6e438db11db8ae17bc6865f9fa96cae89a27a41acf3dd863fb51693d4334338
Port: 9909/TCP
Host Port: 0/TCP
Args:
--endpoint=$(CSI_ENDPOINT)
--logtostderr
--nodeid=$(NODE_NAME)
--v=5
--enable-manager=true
State: Running
Started: Mon, 21 Nov 2022 01:25:21 +0800
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 21 Nov 2022 01:24:57 +0800
Finished: Mon, 21 Nov 2022 01:24:57 +0800
Ready: True
Restart Count: 4
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 100m
memory: 512Mi
Liveness: http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
Environment:
CSI_ENDPOINT: unix:/csi/csi.sock
NODE_NAME: (v1:spec.nodeName)
JUICEFS_MOUNT_NAMESPACE: kube-system (v1:metadata.namespace)
POD_NAME: juicefs-csi-node-q9n4v (v1:metadata.name)
HOST_IP: (v1:status.hostIP)
KUBELET_PORT: 10250
JUICEFS_MOUNT_PATH: /var/lib/juicefs/volume
JUICEFS_CONFIG_PATH: /var/lib/juicefs/config
Mounts:
/csi from plugin-dir (rw)
/dev from device-dir (rw)
/jfs from jfs-dir (rw)
/registration from registration-dir (rw)
/root/.juicefs from jfs-root-dir (rw)
/var/lib/kubelet from kubelet-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvhn6 (ro)
node-driver-registrar:
Container ID: docker://4d35669df68a4b5a288829944d5aced6609baaf5bcbda1c90d3a2694a856d604
Image: quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
Image ID: docker-pullable://quay.io/k8scsi/csi-node-driver-registrar@sha256:9622c6a6dac7499a055a382930f4de82905a3c5735c0753f7094115c9c871309
Port: <none>
Host Port: <none>
Args:
--csi-address=$(ADDRESS)
--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
--v=5
State: Running
Started: Mon, 07 Nov 2022 18:17:46 +0800
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Mon, 07 Nov 2022 18:05:37 +0800
Finished: Mon, 07 Nov 2022 18:16:16 +0800
Ready: True
Restart Count: 1
Environment:
ADDRESS: /csi/csi.sock
DRIVER_REG_SOCK_PATH: /var/lib/kubelet/csi-plugins/csi.juicefs.com/csi.sock
Mounts:
/csi from plugin-dir (rw)
/registration from registration-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvhn6 (ro)
liveness-probe:
Container ID: docker://b2f649d02dfff2add5d295c00defd7c92d95bbbe1a1c1c2b62efe21746e307b1
Image: quay.io/k8scsi/livenessprobe:v1.1.0
Image ID: docker-pullable://quay.io/k8scsi/livenessprobe@sha256:dde617756e0f602adc566ab71fd885f1dad451ad3fb063ac991c95a2ff47aea5
Port: <none>
Host Port: <none>
Args:
--csi-address=$(ADDRESS)
--health-port=$(HEALTH_PORT)
State: Running
Started: Mon, 07 Nov 2022 18:17:50 +0800
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Mon, 07 Nov 2022 18:05:37 +0800
Finished: Mon, 07 Nov 2022 18:16:17 +0800
Ready: True
Restart Count: 1
Environment:
ADDRESS: /csi/csi.sock
HEALTH_PORT: 9909
Mounts:
/csi from plugin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvhn6 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kubelet-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet
HostPathType: Directory
plugin-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/csi-plugins/csi.juicefs.com/
HostPathType: DirectoryOrCreate
registration-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/plugins_registry/
HostPathType: Directory
device-dir:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType: Directory
jfs-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/juicefs/volume
HostPathType: DirectoryOrCreate
jfs-root-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/juicefs/config
HostPathType: DirectoryOrCreate
kube-api-access-qvhn6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events: <none>
The command below works fine in the csi-node Pods, getting JSON results as expected:
curl https://<hostIP>:10250/pods/ --insecure -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
Ok, could you recreate any pods and check if the csi-node works now?
not working
MountVolume.SetUp failed for volume "pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.juicefs.com not found in the list of registered CSI drivers
For the juicefs-csi-node pod, there have been no new logs since Nov 22nd.
It seems you have 3 csi-node replicas, could you provide all logs of the 3 pods?
There is only one pod as far as I can see via kubectl:
# kubectl get pod -n kube-system | grep juicefs
juicefs-csi-controller-0 3/3 Running 6 (3d14h ago) 16d
juicefs-csi-node-q9n4v 3/3 Running 6 (3d14h ago) 16d
juicefs-gpu4030-juicefs-static-pv-rmzbmp 1/1 Running 1 (16d ago) 16d
juicefs-gpu4030-pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a-abpiyj 1/1 Running 1 (16d ago) 16d
Right, my mistake.
Can you provide the log of the node-driver-registrar container in the csi-node pod?
kubectl -n kube-system logs juicefs-csi-node-q9n4v node-driver-registrar
I1107 10:17:48.002158 1 main.go:110] Version: v1.3.0-0-g6e9fff3e
I1107 10:17:48.047934 1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1107 10:17:48.069512 1 connection.go:151] Connecting to unix:///csi/csi.sock
W1107 10:17:58.069684 1 connection.go:170] Still connecting to unix:///csi/csi.sock
I1107 10:17:59.904574 1 main.go:127] Calling CSI driver to discover driver name
I1107 10:17:59.904602 1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I1107 10:17:59.904609 1 connection.go:181] GRPC request: {}
I1107 10:18:00.069575 1 connection.go:183] GRPC response: {"name":"csi.juicefs.com","vendor_version":"v0.16.1"}
I1107 10:18:00.070185 1 connection.go:184] GRPC error: <nil>
I1107 10:18:00.070193 1 main.go:137] CSI driver name: "csi.juicefs.com"
I1107 10:18:00.070266 1 node_register.go:51] Starting Registration Server at: /registration/csi.juicefs.com-reg.sock
I1107 10:18:00.070404 1 node_register.go:60] Registration Server started at: /registration/csi.juicefs.com-reg.sock
I1107 10:18:01.291839 1 main.go:77] Received GetInfo call: &InfoRequest{}
I1107 10:18:01.828748 1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I1107 10:18:04.807622 1 main.go:77] Received GetInfo call: &InfoRequest{}
I1107 10:18:06.925381 1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1120 17:24:37.377562 1 connection.go:129] Lost connection to unix:///csi/csi.sock.
@shangdibufashi, hi, it seems the juicefs-plugin container in the csi-node pod restarted and removed the socket before exiting. We fixed this in release v0.17.0. Please upgrade and retry.
Got it, will upgrade it right away
problem resolved, thank you guys
The issue occurred again: juicefs-csi-node logged Lost connection to unix:///csi/csi.sock. after a pod restart.
# kubectl -n kube-system logs juicefs-csi-node-mwgdg node-driver-registrar
I1124 09:45:35.400499 1 main.go:110] Version: v1.3.0-0-g6e9fff3e
I1124 09:45:35.401534 1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1124 09:45:35.402220 1 connection.go:151] Connecting to unix:///csi/csi.sock
I1124 09:45:35.416008 1 main.go:127] Calling CSI driver to discover driver name
I1124 09:45:35.416033 1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I1124 09:45:35.416042 1 connection.go:181] GRPC request: {}
I1124 09:45:35.439612 1 connection.go:183] GRPC response: {"name":"csi.juicefs.com","vendor_version":"v0.17.2"}
I1124 09:45:35.440050 1 connection.go:184] GRPC error: <nil>
I1124 09:45:35.440058 1 main.go:137] CSI driver name: "csi.juicefs.com"
I1124 09:45:35.440076 1 node_register.go:51] Starting Registration Server at: /registration/csi.juicefs.com-reg.sock
I1124 09:45:35.440233 1 node_register.go:60] Registration Server started at: /registration/csi.juicefs.com-reg.sock
I1124 09:45:35.725717 1 main.go:77] Received GetInfo call: &InfoRequest{}
I1124 09:45:35.774214 1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1208 03:10:03.028675 1 connection.go:129] Lost connection to unix:///csi/csi.sock.
version: juicedata/juicefs-csi-driver:v0.17.2
Meanwhile, the health check seems OK:
I1208 03:37:34.615489 1 main.go:71] Health check succeeded
I1208 03:37:44.614920 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:37:44.615571 1 main.go:71] Health check succeeded
I1208 03:37:54.614959 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:37:54.615634 1 main.go:71] Health check succeeded
I1208 03:38:04.615641 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:04.616337 1 main.go:71] Health check succeeded
I1208 03:38:14.614819 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:14.615469 1 main.go:71] Health check succeeded
I1208 03:38:24.614566 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:24.615267 1 main.go:71] Health check succeeded
I1208 03:38:34.615021 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:34.615639 1 main.go:71] Health check succeeded
I1208 03:38:44.615280 1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:44.615987 1 main.go:71] Health check succeeded
There is no health checker for node-driver-registrar
- name: node-driver-registrar
image: quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
args:
- '--csi-address=$(ADDRESS)'
- '--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)'
- '--v=5'
env:
- name: ADDRESS
value: /csi/csi.sock
- name: DRIVER_REG_SOCK_PATH
value: /var/lib/kubelet/csi-plugins/csi.juicefs.com/csi.sock
resources: {}
volumeMounts:
- name: plugin-dir
mountPath: /csi
- name: registration-dir
mountPath: /registration
- name: kube-api-access-dnjk4
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
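As a possible mitigation (an assumption on my part, not something this manifest ships): newer node-driver-registrar releases (v2.x) can expose a /healthz endpoint via --http-endpoint, so a liveness probe could restart the registrar when it gets stuck after losing the socket. A sketch against the container spec above; the port number is arbitrary, and this flag does not exist in v1.3.0:

```yaml
- name: node-driver-registrar
  image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.5.0  # assumption: a v2.x image
  args:
    - '--csi-address=$(ADDRESS)'
    - '--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)'
    - '--http-endpoint=:9810'   # assumption: v2.x flag serving /healthz
    - '--v=5'
  ports:
    - containerPort: 9810
      name: healthz
  livenessProbe:
    httpGet:
      path: /healthz
      port: healthz
    initialDelaySeconds: 10
    periodSeconds: 20
```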
What's the log of juicefs-plugin ?
The log is missing, as the pod had to be restarted to fix the problem. Will send the log when the error occurs again.
Closing; please reopen if there is further feedback.
What happened: Using JuiceFS with MinIO in Kubernetes (single node, 4 disks), after the machine is shut down and restarted, creating a new pod fails with the error MountVolume.SetUp failed for volume "juicefs-static-pv" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.juicefs.com not found in the list of registered CSI drivers
What you expected to happen: The pod deploys normally.
How to reproduce it (as minimally and precisely as possible): Use JuiceFS with MinIO in Kubernetes on a single node with 4 disks; after the machine is shut down and restarted, create a new pod whose deployment mounts JuiceFS as a static PV.
Anything else we need to know?
Environment:
- JuiceFS version (use juicefs --version) or Hadoop Java SDK version:
- OS (e.g. cat /etc/os-release):
- Kernel (e.g. uname -a):