longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] `csi-resizer` restarts due to `"Lost connection" address="unix:///csi/csi.sock"` #9509

Closed: georglauterbach closed this issue 1 month ago

georglauterbach commented 1 month ago

Describe the bug

Like #8427, my csi-resizer pod restarts spontaneously. The log of the previous container instance, from just before it was restarted, reads:

I0914 11:06:10.769349       1 feature_gate.go:254] feature gates: {map[]}
I0914 11:06:15.512545       1 common.go:143] "Probing CSI driver for readiness"
I0914 11:06:15.515084       1 main.go:162] "CSI driver name" driverName="driver.longhorn.io"
I0914 11:06:15.518549       1 leaderelection.go:250] attempting to acquire leader lease longhorn-system/external-resizer-driver-longhorn-io...
I0914 11:06:34.155383       1 leaderelection.go:260] successfully acquired lease longhorn-system/external-resizer-driver-longhorn-io
I0914 11:06:34.155540       1 leader_election.go:184] "became leader, starting"
I0914 11:06:34.155598       1 controller.go:244] "Starting external resizer" controller="driver.longhorn.io"
I0914 11:06:34.158161       1 reflector.go:359] Caches populated for *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:160
I0914 11:06:34.158425       1 reflector.go:359] Caches populated for *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:160
E0920 19:41:58.206131       1 connection.go:208] "Lost connection" address="unix:///csi/csi.sock"
I0920 19:41:58.206171       1 event.go:389] "Event occurred" object="mail/mail-state" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="Resizing" message="External resizer is resizing volume pvc-fb743448-5ca2-4d32-ae99-dd54329da390"
E0920 19:41:58.207093       1 connection.go:116] "Lost connection to CSI driver, exiting"

According to the timestamps, this happened right after I started resizing a PVC. The new pod then performs the resize successfully:

I0920 19:41:59.353957       1 feature_gate.go:254] feature gates: {map[]}
I0920 19:41:59.357081       1 common.go:143] "Probing CSI driver for readiness"
I0920 19:41:59.359588       1 main.go:162] "CSI driver name" driverName="driver.longhorn.io"
I0920 19:41:59.362041       1 leaderelection.go:250] attempting to acquire leader lease longhorn-system/external-resizer-driver-longhorn-io...
I0920 19:41:59.372109       1 leaderelection.go:260] successfully acquired lease longhorn-system/external-resizer-driver-longhorn-io
I0920 19:41:59.372367       1 leader_election.go:184] "became leader, starting"
I0920 19:41:59.372802       1 controller.go:244] "Starting external resizer" controller="driver.longhorn.io"
I0920 19:41:59.375858       1 reflector.go:359] Caches populated for *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:160
I0920 19:41:59.376339       1 reflector.go:359] Caches populated for *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:160
I0920 19:41:59.479611       1 event.go:389] "Event occurred" object="mail/mail-state" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="Resizing" message="External resizer is resizing volume pvc-fb743448-5ca2-4d32-ae99-dd54329da390"
I0920 19:42:13.613483       1 event.go:389] "Event occurred" object="mail/mail-state" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="VolumeResizeSuccessful" message="Resize volume succeeded"

As in #8427, the operation succeeds, and I would not have noticed the restart if I had not recently looked at all pods and their restart counts.
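For context, the resize that triggered this was an ordinary PVC expansion, i.e. raising spec.resources.requests.storage on the existing claim. A minimal sketch of what that looks like follows; only the mail/mail-state name comes from the logs above, while the storage class, access mode, and 15Gi target size are assumptions:

```yaml
# Illustrative PVC expansion: only the namespace/name are taken from the logs
# above; the storage class, access mode, and 15Gi target size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mail-state
  namespace: mail
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 15Gi   # raised from the previous value to request the expansion
```

Such an expansion is only accepted when the StorageClass sets allowVolumeExpansion: true, which the Longhorn chart's default class normally does.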

To Reproduce

I don't know how, to be honest. I am reporting it because it is akin to #8427.

Expected behavior

No restart is observed because the connection is not lost.

Support bundle for troubleshooting

https://github.com/user-attachments/files/17084623/supportbundle_68d6448a-9d96-4c64-ad69-2f6d3727f36f_2024-09-21T15-26-09Z.zip

Environment

Additional context

My Kustomize/Helm Options

```yaml
---
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

resources:
  - resources/01-namespaces.yaml
  - resources/11-secrets.yaml
  #- resources/41-reccurring_jobs.yaml # applied by `stage1.sh`

generatorOptions:
  disableNameSuffixHash: true
  immutable: false

helmCharts:
  - name: longhorn
    # https://github.com/longhorn/longhorn/blob/master/README.md#releases
    version: 1.7.1
    namespace: longhorn-system
    includeCRDs: true
    releaseName: longhorn
    repo: https://charts.longhorn.io
    # a complete values.yaml file can be found under
    # https://github.com/longhorn/longhorn/blob/master/chart/values.yaml
    valuesInline:
      networkPolicies:
        enabled: true
        type: k3s
      persistence:
        defaultClass: true
        defaultFsType: ext4
        defaultClassReplicaCount: 1
        reclaimPolicy: Delete
      service:
        ui:
          type: ClusterIP
      longhornUI:
        replicas: 1
      ingress:
        enabled: false
      preUpgradeChecker:
        jobEnabled: true
        upgradeVersionCheck: true
      csi:
        attacherReplicaCount: 1
        provisionerReplicaCount: 1
        resizerReplicaCount: 1
        snapshotterReplicaCount: 1
      defaultSettings:
        backupTarget: REDACTED
        backupTargetCredentialSecret: REDACTED
        defaultLonghornStaticStorageClass: longhorn
        defaultReplicaCount: 1
        logLevel: Warn
        concurrentAutomaticEngineUpgradePerNodeLimit: 1
        defaultDataPath: /k8s-data/
        priorityClass: system-cluster-critical
        upgradeChecker: false

patches:
  - target:
      kind: Deployment
      name: longhorn-ui
    patch: |-
      - op: add
        path: /spec/template/metadata/labels/app
        value: longhorn-ui
      - op: add
        path: /spec/template/metadata/labels/app.traefik.io~1communication
        value: 'true'
      - op: add
        path: /spec/template/spec/dnsConfig
        value:
          options:
            - name: ndots
              value: '1'
            - name: trust-ad
            - name: edns0
  - target:
      kind: DaemonSet
      name: longhorn-manager
    patch: |-
      - op: add
        path: /spec/template/spec/dnsConfig
        value:
          options:
            - name: ndots
              value: '1'
            - name: trust-ad
            - name: edns0
```

I am using Cilium instead of Flannel as the CNI. I also run only one replica each for the attacher, provisioner, resizer, and snapshotter. Moreover, I adjusted the DNS settings of the manager pod (maybe this should not be done?). I also use k8tz, which altered the pod's timezone (it should not have, but it did; I need to fix this). I hope the time zone is not a problem here.

Workaround and Mitigation

N/A

georglauterbach commented 1 month ago

supportbundle_68d6448a-9d96-4c64-ad69-2f6d3727f36f_2024-09-21T15-26-09Z.zip

shuo-wu commented 1 month ago

Weird, I just checked the support bundle; the csi-resizer image you are using is longhornio/csi-resizer:v1.11.1, which should already contain the fix (the csi-lib-utils update) mentioned in ticket #8427. The cause of ticket #8427 was the introduction of a default connection idle timeout. Do you have any similar network-related settings in your cluster?

It seems that only the csi-resizer encounters the restart issue. Have you tried creating/attaching multiple volumes?

BTW, the resizer pod csi-resizer-677bc7c46d-zzp87 containing the error log has a restartCount of only 1.
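For reference, that count comes from the pod status captured in the bundle; a hypothetical excerpt of what such a status looks like (only restartCount: 1 is taken from the bundle, the termination details are assumptions):

```yaml
# Hypothetical status excerpt for csi-resizer-677bc7c46d-zzp87: only the
# restartCount of 1 is from the support bundle; the termination details are assumed.
status:
  containerStatuses:
    - name: csi-resizer
      ready: true
      restartCount: 1
      lastState:
        terminated:
          reason: Error   # corresponds to the "Lost connection to CSI driver, exiting" log
          exitCode: 1     # assumed non-zero exit code
```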

georglauterbach commented 1 month ago

> Weird, I just checked the support bundle; the csi-resizer image you are using is longhornio/csi-resizer:v1.11.1, which should already contain the fix (the csi-lib-utils update) mentioned in ticket #8427. The cause of ticket #8427 was the introduction of a default connection idle timeout. Do you have any similar network-related settings in your cluster?

I have Cilium Network Policies, but not in this namespace. For longhorn-system, only the policies provided by the Helm chart are applied.

> It seems that only the csi-resizer encounters the restart issue. Have you tried creating/attaching multiple volumes?

Possibly two at the same time; that could well have happened.

> BTW, the resizer pod csi-resizer-677bc7c46d-zzp87 containing the error log has a restartCount of only 1.

To be honest, I have reported it here to do my due diligence. I have not seen a restart since, and I will happily close this issue if you think it's not an issue :)

shuo-wu commented 1 month ago

> To be honest, I have reported it here to do my due diligence. I have not seen a restart since, and I will happily close this issue if you think it's not an issue :)

I am not sure for now. Anyway, I will keep an eye on this. If you see more and more restarts of the CSI sidecar pods after volume creation/deletion/attachment/expansion in the future, please let us know.

georglauterbach commented 1 month ago

I will close this for now and reopen it if the issue reappears.