juicedata / juicefs-csi-driver

JuiceFS CSI Driver
https://github.com/juicedata/juicefs
Apache License 2.0

driver name csi.juicefs.com not found in the list of registered CSI drivers #452

Closed shangdibufashi closed 1 year ago

shangdibufashi commented 1 year ago

What happened: JuiceFS is used together with MinIO in Kubernetes on a single node with 4 disks. After the machine is shut down and restarted, creating a new pod fails with the error: MountVolume.SetUp failed for volume "juicefs-static-pv" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.juicefs.com not found in the list of registered CSI drivers

What you expected to happen: The pod is deployed normally.

How to reproduce it (as minimally and precisely as possible): Use JuiceFS together with MinIO in Kubernetes on a single node with 4 disks. After the machine is shut down and restarted, create a new pod whose Deployment mounts JuiceFS through a static PV.

Anything else we need to know?

Environment:

zwwhdls commented 1 year ago

Hi @shangdibufashi, please check your cluster by referring to this doc. If that does not help, please upload the kubelet logs from the node and we will take a look.

shangdibufashi commented 1 year ago
  1. ps -ef | grep kubelet | grep root-dir (output is empty)
  2. kubectl -n kube-system get po -owide | grep juicefs
    juicefs-csi-controller-0                                          3/3     Running   0                 18h    10.244.0.135     gpu4030
    juicefs-csi-node-2pfjp                                            3/3     Running   0                 18h    10.244.0.134     gpu4030
    juicefs-gpu4030-juicefs-static-pv-rmzbmp                          1/1     Running   0                 18h    10.244.0.136     gpu4030
    juicefs-gpu4030-pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a-abpiyj   1/1     Running   0                 18h    10.244.0.138     gpu4030

  3. kubectl get po -owide -A | grep 'pod-ui-56d7b79fc8-t2wdm'
    default             pod-ui-56d7b79fc8-t2wdm                               0/1     ContainerCreating   0                 18h                gpu4030
    
zwwhdls commented 1 year ago

Hi @shangdibufashi , can you provide:

  1. the version of csi (which image)
  2. the complete event of pod
  3. the log of juicefs csi node
  4. the log of kubelet
shangdibufashi commented 1 year ago

Sure, thanks for your reply. Once it occurs again, I will collect all of the data and logs you mentioned above.

shangdibufashi commented 1 year ago
zwwhdls commented 1 year ago

It seems that when the node restarted, the CSI Node pod started after the application pod had already tried to mount its volume, so the CSI driver was not yet registered with kubelet.

At the same time, the CSI Node pod logs "doReconcile GetNodeRunningPods: invalid character 'U' looking for beginning of value", which suggests it cannot get a valid response from kubelet.
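
The "invalid character 'U' looking for beginning of value" message is what a JSON decoder reports when the response body is not JSON at all. As a rough illustration in Python, assuming the kubelet answered with a plain-text body that starts with "U" such as Unauthorized (the thread does not show the actual body, so that string is an assumption), a decoder fails on the very first character:

```python
import json


def json_decode_problem(body):
    """Describe why `body` is not valid JSON, or return None if it parses.

    This only mimics the shape of the Go error quoted in the thread; it is
    not the CSI driver's own code.
    """
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        if err.pos >= len(body):
            return "unexpected end of input"
        return "invalid character %r at offset %d" % (body[err.pos], err.pos)
    return None


# "Unauthorized" is an assumed example of a non-JSON, plain-text reply.
print(json_decode_problem("Unauthorized"))
# A well-formed JSON reply produces no complaint.
print(json_decode_problem('{"items": []}'))
```

A plain-text error reply therefore shows up as a decoding error on the client side, not as an explicit authorization error.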

Hexilee commented 1 year ago

Hi @shangdibufashi , it seems the request from your csi-node is denied by kubelet. Could you provide the kubelet configuration?

For reference https://stackoverflow.com/questions/52268367/how-to-check-kubelet-configurations-currently-in-use

shangdibufashi commented 1 year ago

thanks @Hexilee

here it is:

{
    "kubeletconfig": {
        "enableServer": true,
        "staticPodPath": "/etc/kubernetes/manifests",
        "syncFrequency": "1m0s",
        "fileCheckFrequency": "20s",
        "httpCheckFrequency": "20s",
        "address": "0.0.0.0",
        "port": 10250,
        "tlsCertFile": "/var/lib/kubelet/pki/kubelet.crt",
        "tlsPrivateKeyFile": "/var/lib/kubelet/pki/kubelet.key",
        "rotateCertificates": true,
        "authentication": {
            "x509": {
                "clientCAFile": "/etc/kubernetes/pki/ca.crt"
            },
            "webhook": {
                "enabled": true,
                "cacheTTL": "2m0s"
            },
            "anonymous": {
                "enabled": false
            }
        },
        "authorization": {
            "mode": "Webhook",
            "webhook": {
                "cacheAuthorizedTTL": "5m0s",
                "cacheUnauthorizedTTL": "30s"
            }
        },
        "registryPullQPS": 5,
        "registryBurst": 10,
        "eventRecordQPS": 5,
        "eventBurst": 10,
        "enableDebuggingHandlers": true,
        "healthzPort": 10248,
        "healthzBindAddress": "127.0.0.1",
        "oomScoreAdj": -999,
        "clusterDomain": "cluster.local",
        "clusterDNS": ["10.96.0.10"],
        "streamingConnectionIdleTimeout": "4h0m0s",
        "nodeStatusUpdateFrequency": "10s",
        "nodeStatusReportFrequency": "5m0s",
        "nodeLeaseDurationSeconds": 40,
        "imageMinimumGCAge": "2m0s",
        "imageGCHighThresholdPercent": 85,
        "imageGCLowThresholdPercent": 80,
        "volumeStatsAggPeriod": "1m0s",
        "cgroupsPerQOS": true,
        "cgroupDriver": "systemd",
        "cpuManagerPolicy": "none",
        "cpuManagerReconcilePeriod": "10s",
        "memoryManagerPolicy": "None",
        "topologyManagerPolicy": "none",
        "topologyManagerScope": "container",
        "runtimeRequestTimeout": "2m0s",
        "hairpinMode": "promiscuous-bridge",
        "maxPods": 110,
        "podPidsLimit": -1,
        "resolvConf": "/etc/resolv.conf",
        "cpuCFSQuota": true,
        "cpuCFSQuotaPeriod": "100ms",
        "nodeStatusMaxImages": 50,
        "maxOpenFiles": 1000000,
        "contentType": "application/vnd.kubernetes.protobuf",
        "kubeAPIQPS": 5,
        "kubeAPIBurst": 10,
        "serializeImagePulls": true,
        "evictionHard": {
            "imagefs.available": "15%",
            "memory.available": "100Mi",
            "nodefs.available": "10%",
            "nodefs.inodesFree": "5%"
        },
        "evictionPressureTransitionPeriod": "5m0s",
        "enableControllerAttachDetach": true,
        "makeIPTablesUtilChains": true,
        "iptablesMasqueradeBit": 14,
        "iptablesDropBit": 15,
        "failSwapOn": true,
        "memorySwap": {},
        "containerLogMaxSize": "10Mi",
        "containerLogMaxFiles": 5,
        "configMapAndSecretChangeDetectionStrategy": "Watch",
        "enforceNodeAllocatable": ["pods"],
        "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
        "logging": {
            "format": "text"
        },
        "enableSystemLogHandler": true,
        "shutdownGracePeriod": "0s",
        "shutdownGracePeriodCriticalPods": "0s",
        "enableProfilingHandler": true,
        "enableDebugFlagsHandler": true,
        "seccompDefault": false,
        "memoryThrottlingFactor": 0.8
    }
}
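
For a dump like the one above, only a few fields matter for the question of whether kubelet rejects unauthenticated requests. A minimal Python sketch that pulls them out, assuming the JSON has the same "kubeletconfig" layout as shown here (the function name is hypothetical):

```python
import json


def kubelet_auth_summary(configz):
    """Extract the authentication/authorization settings from a kubelet config dump."""
    cfg = configz.get("kubeletconfig", {})
    authn = cfg.get("authentication", {})
    return {
        "anonymous_enabled": authn.get("anonymous", {}).get("enabled"),
        "webhook_authn_enabled": authn.get("webhook", {}).get("enabled"),
        "authorization_mode": cfg.get("authorization", {}).get("mode"),
    }


# Trimmed copy of the relevant part of the configuration posted above.
sample = json.loads("""
{
  "kubeletconfig": {
    "authentication": {
      "webhook": {"enabled": true, "cacheTTL": "2m0s"},
      "anonymous": {"enabled": false}
    },
    "authorization": {"mode": "Webhook"}
  }
}
""")
print(kubelet_auth_summary(sample))
```

With anonymous access disabled and Webhook authorization, a client has to present a token that the API server accepts.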
Hexilee commented 1 year ago

@shangdibufashi Fine. It seems your kubelet has anonymous access disabled. Could you describe the csi-node Pods with kubectl describe -n kube-system pods -l app=juicefs-csi-node? And if you can open a shell in a csi-node Pod with kubectl exec, could you run the following command?

> curl https://<hostIP>:10250/pods/ --insecure -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
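
If curl is not available in the container, the same request can be sketched with the Python standard library. This is only an illustration of the check above: the function names are hypothetical, the host IP must be supplied by the caller, and certificate verification is switched off to mirror --insecure.

```python
import ssl
import urllib.error
import urllib.request

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_kubelet_pods_request(host_ip, token, port=10250):
    """Build the GET /pods/ request with the service account bearer token."""
    url = "https://%s:%d/pods/" % (host_ip, port)
    return urllib.request.Request(url, headers={"Authorization": "Bearer " + token})


def fetch_kubelet_pods(host_ip, token_path=TOKEN_PATH):
    """Return (status_code, body) for the kubelet /pods/ endpoint."""
    with open(token_path) as f:
        token = f.read().strip()
    # Equivalent of curl --insecure: skip hostname and certificate checks.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = build_kubelet_pods_request(host_ip, token)
    try:
        with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 401/403 arrives here; the body is usually a short plain-text message.
        return err.code, err.read().decode("utf-8", errors="replace")
```

A 200 status with a JSON body means the token is accepted; a 401 or 403 points at the authentication or authorization settings of the kubelet.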
shangdibufashi commented 1 year ago

describe csi-node Pods:


Name:                 juicefs-csi-node-q9n4v
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 gpu4030/192.168.67.179
Start Time:           Mon, 07 Nov 2022 18:05:35 +0800
Labels:               app=juicefs-csi-node
                      app.kubernetes.io/instance=juicefs-csi-driver
                      app.kubernetes.io/name=juicefs-csi-driver
                      app.kubernetes.io/version=master
                      controller-revision-hash=d9476846
                      pod-template-generation=1
Annotations:          <none>
Status:               Running
IP:                   10.244.0.174
IPs:
  IP:           10.244.0.174
Controlled By:  DaemonSet/juicefs-csi-node
Containers:
  juicefs-plugin:
    Container ID:  docker://5de96793ef20a8515b257a6113642c32b7ebc94d16407dff77bbf7f42289dc84
    Image:         juicedata/juicefs-csi-driver:v0.16.1
    Image ID:      docker-pullable://juicedata/juicefs-csi-driver@sha256:f6e438db11db8ae17bc6865f9fa96cae89a27a41acf3dd863fb51693d4334338
    Port:          9909/TCP
    Host Port:     0/TCP
    Args:
      --endpoint=$(CSI_ENDPOINT)
      --logtostderr
      --nodeid=$(NODE_NAME)
      --v=5
      --enable-manager=true
    State:          Running
      Started:      Mon, 21 Nov 2022 01:25:21 +0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 21 Nov 2022 01:24:57 +0800
      Finished:     Mon, 21 Nov 2022 01:24:57 +0800
    Ready:          True
    Restart Count:  4
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  512Mi
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Environment:
      CSI_ENDPOINT:             unix:/csi/csi.sock
      NODE_NAME:                 (v1:spec.nodeName)
      JUICEFS_MOUNT_NAMESPACE:  kube-system (v1:metadata.namespace)
      POD_NAME:                 juicefs-csi-node-q9n4v (v1:metadata.name)
      HOST_IP:                   (v1:status.hostIP)
      KUBELET_PORT:             10250
      JUICEFS_MOUNT_PATH:       /var/lib/juicefs/volume
      JUICEFS_CONFIG_PATH:      /var/lib/juicefs/config
    Mounts:
      /csi from plugin-dir (rw)
      /dev from device-dir (rw)
      /jfs from jfs-dir (rw)
      /registration from registration-dir (rw)
      /root/.juicefs from jfs-root-dir (rw)
      /var/lib/kubelet from kubelet-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvhn6 (ro)
  node-driver-registrar:
    Container ID:  docker://4d35669df68a4b5a288829944d5aced6609baaf5bcbda1c90d3a2694a856d604
    Image:         quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
    Image ID:      docker-pullable://quay.io/k8scsi/csi-node-driver-registrar@sha256:9622c6a6dac7499a055a382930f4de82905a3c5735c0753f7094115c9c871309
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
      --v=5
    State:          Running
      Started:      Mon, 07 Nov 2022 18:17:46 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 07 Nov 2022 18:05:37 +0800
      Finished:     Mon, 07 Nov 2022 18:16:16 +0800
    Ready:          True
    Restart Count:  1
    Environment:
      ADDRESS:               /csi/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/csi-plugins/csi.juicefs.com/csi.sock
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvhn6 (ro)
  liveness-probe:
    Container ID:  docker://b2f649d02dfff2add5d295c00defd7c92d95bbbe1a1c1c2b62efe21746e307b1
    Image:         quay.io/k8scsi/livenessprobe:v1.1.0
    Image ID:      docker-pullable://quay.io/k8scsi/livenessprobe@sha256:dde617756e0f602adc566ab71fd885f1dad451ad3fb063ac991c95a2ff47aea5
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=$(ADDRESS)
      --health-port=$(HEALTH_PORT)
    State:          Running
      Started:      Mon, 07 Nov 2022 18:17:50 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Mon, 07 Nov 2022 18:05:37 +0800
      Finished:     Mon, 07 Nov 2022 18:16:17 +0800
    Ready:          True
    Restart Count:  1
    Environment:
      ADDRESS:      /csi/csi.sock
      HEALTH_PORT:  9909
    Mounts:
      /csi from plugin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qvhn6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/csi-plugins/csi.juicefs.com/
    HostPathType:  DirectoryOrCreate
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  Directory
  device-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  Directory
  jfs-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/juicefs/volume
    HostPathType:  DirectoryOrCreate
  jfs-root-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/juicefs/config
    HostPathType:  DirectoryOrCreate
  kube-api-access-qvhn6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>
shangdibufashi commented 1 year ago

The command below works fine in the csi-node Pod and returns JSON results as expected:

curl https://<hostIP>:10250/pods/ --insecure -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"

Hexilee commented 1 year ago

Ok, could you recreate any pods and check if the csi-node works now?

shangdibufashi commented 1 year ago

not working

MountVolume.SetUp failed for volume "pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name csi.juicefs.com not found in the list of registered CSI drivers
shangdibufashi commented 1 year ago

The juicefs-csi-node pod has not produced any new logs since Nov 22nd.

Hexilee commented 1 year ago

It seems you have 3 csi-node replicas, could you provide all logs of the 3 pods?

shangdibufashi commented 1 year ago

There is only one pod, as far as I can see via kubectl.

shangdibufashi commented 1 year ago
# kubectl get pod -n kube-system | grep juicefs
juicefs-csi-controller-0                                          3/3     Running   6 (3d14h ago)    16d
juicefs-csi-node-q9n4v                                            3/3     Running   6 (3d14h ago)    16d
juicefs-gpu4030-juicefs-static-pv-rmzbmp                          1/1     Running   1 (16d ago)      16d
juicefs-gpu4030-pvc-8c4ac23c-30ce-48d1-91de-6b7872eaa14a-abpiyj   1/1     Running   1 (16d ago)      16d
Hexilee commented 1 year ago
Right, my mistake.

zwwhdls commented 1 year ago

Can you provide the log of the node-driver-registrar container in the CSI node pod?

kubectl -n kube-system logs juicefs-csi-node-q9n4v node-driver-registrar
shangdibufashi commented 1 year ago

kubectl -n kube-system logs juicefs-csi-node-q9n4v node-driver-registrar
I1107 10:17:48.002158       1 main.go:110] Version: v1.3.0-0-g6e9fff3e
I1107 10:17:48.047934       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1107 10:17:48.069512       1 connection.go:151] Connecting to unix:///csi/csi.sock
W1107 10:17:58.069684       1 connection.go:170] Still connecting to unix:///csi/csi.sock
I1107 10:17:59.904574       1 main.go:127] Calling CSI driver to discover driver name
I1107 10:17:59.904602       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I1107 10:17:59.904609       1 connection.go:181] GRPC request: {}
I1107 10:18:00.069575       1 connection.go:183] GRPC response: {"name":"csi.juicefs.com","vendor_version":"v0.16.1"}
I1107 10:18:00.070185       1 connection.go:184] GRPC error: <nil>
I1107 10:18:00.070193       1 main.go:137] CSI driver name: "csi.juicefs.com"
I1107 10:18:00.070266       1 node_register.go:51] Starting Registration Server at: /registration/csi.juicefs.com-reg.sock
I1107 10:18:00.070404       1 node_register.go:60] Registration Server started at: /registration/csi.juicefs.com-reg.sock
I1107 10:18:01.291839       1 main.go:77] Received GetInfo call: &InfoRequest{}
I1107 10:18:01.828748       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I1107 10:18:04.807622       1 main.go:77] Received GetInfo call: &InfoRequest{}
I1107 10:18:06.925381       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1120 17:24:37.377562       1 connection.go:129] Lost connection to unix:///csi/csi.sock.
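
The pattern in this log (a successful registration followed, much later, by a lost connection) can be picked out mechanically. A small Python sketch, assuming the log lines have the same layout as the output above; the function name and the returned keys are made up for illustration:

```python
import re

# Severity letter, MMDD, time, thread id, source location, then the message.
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)"
    r"\s+\d+ (?P<src>[^\]]+)\]\s*(?P<msg>.*)$"
)


def summarize_registrar_log(lines):
    """Return the last successful registration and the last lost-connection time."""
    summary = {"last_registered": None, "lost_connection": None}
    for line in lines:
        m = KLOG_LINE.match(line.strip())
        if not m:
            continue
        stamp = m.group("mmdd") + " " + m.group("time")
        if "PluginRegistered:true" in m.group("msg"):
            summary["last_registered"] = stamp
        elif m.group("sev") == "E" and "Lost connection to" in m.group("msg"):
            summary["lost_connection"] = stamp
    return summary


sample_log = [
    "I1107 10:18:06.925381       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}",
    "E1120 17:24:37.377562       1 connection.go:129] Lost connection to unix:///csi/csi.sock.",
]
print(summarize_registrar_log(sample_log))
```

A lost-connection timestamp later than the last registration, with no new registration after it, matches the state described in this thread.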
zwwhdls commented 1 year ago

Hi @shangdibufashi, it seems the juicefs-plugin container in the CSI node pod restarted and removed the socket before exiting. We fixed this in release 0.17.0. Please upgrade and try again.
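
One way to check for a missing socket on the node is to look at the socket files directly. The following Python sketch is only an illustration: the two host paths are derived from the pod description and registrar log earlier in this thread and may differ on other clusters, and the function name is hypothetical.

```python
import os
import stat

# Host paths taken from the describe output and registrar log in this thread.
CSI_SOCKETS = {
    "driver": "/var/lib/kubelet/csi-plugins/csi.juicefs.com/csi.sock",
    "registration": "/var/lib/kubelet/plugins_registry/csi.juicefs.com-reg.sock",
}


def socket_status(path):
    """Return 'socket', 'missing', or 'not-a-socket' for the given path."""
    try:
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


def report(paths=None):
    """Map each named socket path to its current status."""
    paths = CSI_SOCKETS if paths is None else paths
    return {name: socket_status(p) for name, p in paths.items()}
```

Calling report() on the node returns a status for each path; a "missing" driver socket while the pod is Running would be consistent with the socket having been removed on exit.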

shangdibufashi commented 1 year ago

Got it, will upgrade it right away

shangdibufashi commented 1 year ago

problem resolved, thank you guys

shangdibufashi commented 1 year ago

The issue occurred again. After a pod restart, juicefs-csi-node reports "Lost connection to unix:///csi/csi.sock."

# kubectl -n kube-system logs juicefs-csi-node-mwgdg node-driver-registrar
I1124 09:45:35.400499       1 main.go:110] Version: v1.3.0-0-g6e9fff3e
I1124 09:45:35.401534       1 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1124 09:45:35.402220       1 connection.go:151] Connecting to unix:///csi/csi.sock
I1124 09:45:35.416008       1 main.go:127] Calling CSI driver to discover driver name
I1124 09:45:35.416033       1 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I1124 09:45:35.416042       1 connection.go:181] GRPC request: {}
I1124 09:45:35.439612       1 connection.go:183] GRPC response: {"name":"csi.juicefs.com","vendor_version":"v0.17.2"}
I1124 09:45:35.440050       1 connection.go:184] GRPC error: <nil>
I1124 09:45:35.440058       1 main.go:137] CSI driver name: "csi.juicefs.com"
I1124 09:45:35.440076       1 node_register.go:51] Starting Registration Server at: /registration/csi.juicefs.com-reg.sock
I1124 09:45:35.440233       1 node_register.go:60] Registration Server started at: /registration/csi.juicefs.com-reg.sock
I1124 09:45:35.725717       1 main.go:77] Received GetInfo call: &InfoRequest{}
I1124 09:45:35.774214       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E1208 03:10:03.028675       1 connection.go:129]  Lost connection to unix:///csi/csi.sock.

shangdibufashi commented 1 year ago

version: juicedata/juicefs-csi-driver:v0.17.2

shangdibufashi commented 1 year ago

Meanwhile, the health check seems OK:


I1208 03:37:34.615489       1 main.go:71] Health check succeeded
I1208 03:37:44.614920       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:37:44.615571       1 main.go:71] Health check succeeded
I1208 03:37:54.614959       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:37:54.615634       1 main.go:71] Health check succeeded
I1208 03:38:04.615641       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:04.616337       1 main.go:71] Health check succeeded
I1208 03:38:14.614819       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:14.615469       1 main.go:71] Health check succeeded
I1208 03:38:24.614566       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:24.615267       1 main.go:71] Health check succeeded
I1208 03:38:34.615021       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:34.615639       1 main.go:71] Health check succeeded
I1208 03:38:44.615280       1 main.go:53] Sending probe request to CSI driver "csi.juicefs.com"
I1208 03:38:44.615987       1 main.go:71] Health check succeeded
shangdibufashi commented 1 year ago

There is no health check configured for node-driver-registrar:

- name: node-driver-registrar
  image: quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
  args:
    - '--csi-address=$(ADDRESS)'
    - '--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)'
    - '--v=5'
  env:
    - name: ADDRESS
      value: /csi/csi.sock
    - name: DRIVER_REG_SOCK_PATH
      value: /var/lib/kubelet/csi-plugins/csi.juicefs.com/csi.sock
  resources: {}
  volumeMounts:
    - name: plugin-dir
      mountPath: /csi
    - name: registration-dir
      mountPath: /registration
    - name: kube-api-access-dnjk4
      readOnly: true
      mountPath: /var/run/secrets/kubernetes.io/serviceaccount
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  imagePullPolicy: IfNotPresent
zwwhdls commented 1 year ago

What does the juicefs-plugin log show?

shangdibufashi commented 1 year ago

The log is gone because the pod had to be restarted to fix the problem. I will send the log when the error occurs again.

zwwhdls commented 1 year ago

Closing for now. Please reopen if you have further feedback.