Still connecting to unix:///var/lib/kubelet/csi-plugins/*.csi.alibabacloud.com/csi.sock

lliiang commented 3 months ago

What happened:

On two nodes of the cluster, the csi-plugin pod (csi-plugin-h4qhz) keeps failing and restarting. Log screenshots follow.

Container logs:

- csi-plugin-h4qhz-nas-driver-registrar.log
- csi-plugin-h4qhz-disk-driver-registrar.log
- csi-plugin-h4qhz-csi-plugin.log
- csi-plugin-h4qhz-oss-driver-registrar.log

What you expected to happen:

The cluster has more than ten nodes, and only two of them report this error. The DaemonSet YAML is below:

```yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: csi-plugin
  namespace: kube-system
  uid: 509d3cfc-0dbe-4ebd-8d79-3b8c52774d17
  resourceVersion: '601102482'
  generation: 5
  creationTimestamp: '2023-03-21T14:45:10Z'
  annotations:
    deprecated.daemonset.template.generation: '5'
spec:
  selector:
    matchLabels:
      app: csi-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: csi-plugin
      annotations:
        kubectl.kubernetes.io/restartedAt: '2024-06-19T22:22:37+08:00'
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      serviceAccountName: csi-admin
      hostPID: true
      schedulerName: default-scheduler
      hostNetwork: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - name: disk-driver-registrar
          image: >-
            registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.1-038aeb6-aliyun
          args:
            - '--v=5'
            - --csi-address=/var/lib/kubelet/csi-plugins/diskplugin.csi.alibabacloud.com/csi.sock
            - --kubelet-registration-path=/var/lib/kubelet/csi-plugins/diskplugin.csi.alibabacloud.com/csi.sock
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
            - name: registration-dir
              mountPath: /registration
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: nas-driver-registrar
          image: >-
            registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.1-038aeb6-aliyun
          args:
            - '--v=5'
            - --csi-address=/var/lib/kubelet/csi-plugins/nasplugin.csi.alibabacloud.com/csi.sock
            - --kubelet-registration-path=/var/lib/kubelet/csi-plugins/nasplugin.csi.alibabacloud.com/csi.sock
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet/
            - name: registration-dir
              mountPath: /registration
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - name: oss-driver-registrar
          image: >-
            registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-node-driver-registrar:v2.3.1-038aeb6-aliyun
          args:
            - '--v=5'
            - --csi-address=/var/lib/kubelet/csi-plugins/ossplugin.csi.alibabacloud.com/csi.sock
            - --kubelet-registration-path=/var/lib/kubelet/csi-plugins/ossplugin.csi.alibabacloud.com/csi.sock
          resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 10m
              memory: 16Mi
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet/
            - name: registration-dir
              mountPath: /registration
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
        - resources:
            limits:
              cpu: 500m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          readinessProbe:
            httpGet:
              path: /healthz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 5
          terminationMessagePath: /dev/termination-log
          name: csi-plugin
          livenessProbe:
            httpGet:
              path: /healthz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 5
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: >-
                unix://var/lib/kubelet/csi-plugins/driverplugin.csi.alibabacloud.com-replace/csi.sock
            - name: MAX_VOLUMES_PERNODE
              value: '15'
            - name: SERVICE_TYPE
              value: plugin
            - name: ACCESS_KEY_ID
              value: '<redacted>'
            - name: ACCESS_KEY_SECRET
              value: '<redacted>'
          securityContext:
            privileged: true
            allowPrivilegeEscalation: true
          ports:
            - name: healthz
              hostPort: 11260
              containerPort: 11260
              protocol: TCP
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: kubelet-dir
              mountPath: /var/lib/kubelet/
              mountPropagation: Bidirectional
            - name: etc
              mountPath: /host/etc
            - name: host-log
              mountPath: /var/log/
            - name: ossconnectordir
              mountPath: /host/usr/
            - name: container-dir
              mountPath: /var/lib/container
              mountPropagation: Bidirectional
            - name: host-dev
              mountPath: /dev
              mountPropagation: HostToContainer
            - name: addon-token
              readOnly: true
              mountPath: /var/addon
            - name: fuse-metrics-dir
              mountPath: /host/var/run/
          terminationMessagePolicy: File
          image: >-
            registry-cn-hangzhou.ack.aliyuncs.com/acs/csi-plugin:v1.24.9-74f8490-aliyun
          args:
            - '--endpoint=$(CSI_ENDPOINT)'
            - '--v=2'
            - '--driver=oss,nas,disk'
      serviceAccount: csi-admin
      volumes:
        - name: fuse-metrics-dir
          hostPath:
            path: /var/run/
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins_registry
            type: DirectoryOrCreate
        - name: container-dir
          hostPath:
            path: /var/lib/container
            type: DirectoryOrCreate
        - name: kubelet-dir
          hostPath:
            path: /var/lib/kubelet
            type: Directory
        - name: host-dev
          hostPath:
            path: /dev
            type: ''
        - name: host-log
          hostPath:
            path: /var/log/
            type: ''
        - name: etc
          hostPath:
            path: /etc
            type: ''
        - name: ossconnectordir
          hostPath:
            path: /usr/
            type: ''
        - name: addon-token
          secret:
            secretName: addon.csi.token
            items:
              - key: addon.token.config
                path: token-config
            defaultMode: 420
            optional: true
      dnsPolicy: ClusterFirst
      tolerations:
        - operator: Exists
      priorityClassName: system-node-critical
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 20%
      maxSurge: 0
  revisionHistoryLimit: 10
status:
  currentNumberScheduled: 15
  numberMisscheduled: 0
  desiredNumberScheduled: 15
  numberReady: 13
  observedGeneration: 5
  updatedNumberScheduled: 15
  numberAvailable: 13
  numberUnavailable: 2
```

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

CSI driver version (image tag of csi-plugin container):
Deployment method (where you got the YAML files, what modifications you made, etc.):
Kubernetes version (use kubectl version): k8s 1.26
Cloud provider or hardware configuration (e.g. Alibaba Cloud ECS instance type): the cluster nodes are Alibaba Cloud ECS instances
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Network plugin and version (if this is a network-related bug):
Others:

huww98 commented 3 months ago

Why is your filesystem read-only? Is it intentional? What OS are you using?

lliiang commented 3 months ago

> Why is your filesystem read-only? Is it intentional? What OS are you using?

My cluster is OpenShift 4.13.

The node OS is CoreOS.

Comparing logs between normal pods and abnormal pods.

huww98 commented 3 months ago

OK, maybe we should never write files into /usr, which is expected to be managed by the OS package manager.
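For anyone checking whether their nodes are affected in the same way, a minimal sketch of a read-only check follows. It assumes a `/proc/mounts`-style file (device, mount point, filesystem type, options) and picks the longest mount point that covers the given path. The helper name `mount_mode` is hypothetical and is not part of the CSI driver.

```shell
#!/bin/sh
# mount_mode MOUNTS_FILE PATH
# Prints "ro" or "rw" for the mount that contains PATH, based on a
# /proc/mounts-style file. Hypothetical helper, for illustration only.
mount_mode() {
    awk -v target="$2" '
        BEGIN { best_len = 0 }
        {
            mp = $2
            # A mount point covers the target if it is "/", equals the
            # target, or is a parent directory of the target.
            if (mp == "/" || target == mp || index(target, mp "/") == 1) {
                # Prefer the longest mount point; on ties the later entry wins.
                if (length(mp) >= best_len) {
                    best_len = length(mp)
                    mode = "rw"
                    n = split($4, opts, ",")
                    for (i = 1; i <= n; i++) {
                        if (opts[i] == "ro") mode = "ro"
                    }
                }
            }
        }
        END { if (mode != "") print mode }
    ' "$1"
}
```

On a node, `mount_mode /proc/mounts /usr` would report the mode of the mount backing `/usr`. Inside the csi-plugin container from the DaemonSet above, the host's `/usr/` is mounted at `/host/usr/`, so that path would be the one to check there.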

You can try setting the env DISABLE_CSIPLUGIN_CONNECTOR=true. Or upgrade CSI; newer versions limit the number of retries to 5.
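One way to apply the suggested environment variable is sketched below. It assumes the variable is read by the `csi-plugin` container of the `csi-plugin` DaemonSet in `kube-system` (names taken from the YAML posted above); the thread does not state which container reads it. The function name `disable_connector` is hypothetical.

```shell
#!/bin/sh
# Sets DISABLE_CSIPLUGIN_CONNECTOR=true on the csi-plugin container.
# Assumption: the csi-plugin container is the one that reads this variable.
# Changing the pod template triggers a rolling update of the DaemonSet.
disable_connector() {
    kubectl -n kube-system set env daemonset/csi-plugin \
        --containers=csi-plugin DISABLE_CSIPLUGIN_CONNECTOR=true
}
```

After calling `disable_connector`, `kubectl -n kube-system rollout status daemonset/csi-plugin` can be used to watch the pods restart with the new setting.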

> Comparing logs between normal pods and abnormal pods.

I think these logs come from different CSI versions.

lliiang commented 3 months ago

Hello, does csi-plugin have a debug log setting? How do I enable debug logging? I want to collect the debug logs into our platform.

huww98 commented 3 months ago

No. The default log level already outputs almost all the logs.

huww98 commented 3 months ago

> OK, maybe we should never write files into /usr, which is expected to be managed by the OS package manager.

We decided not to fix this one, because we plan to remove the connector altogether in the future.

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

kubernetes-sigs / alibaba-cloud-csi-driver