Closed: georglauterbach closed this 1 month ago
Weird, I just checked the support bundle. The CSI resizer image you are using is `longhornio/csi-resizer:v1.11.1`, which should already contain the fix (the csi-lib-utils update) mentioned in ticket #8427. The cause of #8427 is related to the introduction of a default connection idle timeout. Do you have similar network-related settings in your cluster?
It seems that only the csi-resizer encounters the restart issue. Have you tried to create/attach multiple volumes?
BTW, the resizer pod `csi-resizer-677bc7c46d-zzp87` containing the error log has `restartCount: 1` only.
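For reference, restart counts across the namespace can be checked by dumping the pod list with `kubectl -n longhorn-system get pods -o json` and scanning it; a minimal sketch (a throwaway helper, not part of Longhorn):

```python
# Hypothetical helper (not part of Longhorn): scan the JSON output of
# `kubectl -n longhorn-system get pods -o json` for restarted containers.
import json

def nonzero_restarts(pods_json):
    """Return (pod, container, restartCount) for every container with restarts."""
    hits = []
    for pod in pods_json.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("restartCount", 0) > 0:
                hits.append((pod["metadata"]["name"], cs["name"], cs["restartCount"]))
    return hits

# Usage (assuming the pod list was saved to a file):
#   nonzero_restarts(json.load(open("pods.json")))
```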
> Weird, I just checked the support bundle. The CSI resizer image you are using is `longhornio/csi-resizer:v1.11.1`, which should already contain the fix (the csi-lib-utils update) mentioned in ticket #8427. The cause of #8427 is related to the introduction of a default connection idle timeout. Do you have similar network-related settings in your cluster?
I have Cilium Network Policies, but not in this namespace. For `longhorn-system`, only the policies provided by the Helm chart are applied.
> It seems that only the csi-resizer encounters the restart issue. Have you tried to create/attach multiple volumes?
Possibly two at the same time; that could have happened.
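If it helps reproduction, concurrent creation can be approximated by applying two PVCs in a single manifest (names and sizes below are illustrative, not taken from my cluster):

```yaml
# Illustrative only: two PVCs applied together to exercise concurrent
# provisioning/attachment through the Longhorn CSI sidecars.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-a
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-b
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```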
> BTW, the resizer pod `csi-resizer-677bc7c46d-zzp87` containing the error log has `restartCount: 1` only.
To be honest, I have reported it here to do my due diligence. I have not seen a restart since, and I will happily close this issue if you think it's not an issue :)
I am not sure for now. Anyway, I will keep an eye on this. If, in the future, you see more and more restarts of the CSI sidecar pods after volume creation/deletion/attachment/expansion, please let us know.
I will close this for now and re-open when the issue re-appears.
Describe the bug
Like #8427, my `csi-resizer` pod has spontaneous restarts. The log from the last pod, before it was restarted, reads:

According to the timestamps, this happened after I started a PVC resize. The new pod succeeds in performing the operation:

Like in #8427, the operation succeeds, and I wouldn't have noticed if I had not recently looked at all pods and their restart counts.
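For context, the resize was requested the usual way, by raising the PVC's storage request (the name and sizes below are illustrative, not my actual values):

```yaml
# Illustrative: a PVC expansion is requested by editing
# spec.resources.requests.storage; the bump is what engages csi-resizer.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data          # hypothetical name
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi      # e.g. raised from 10Gi
```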
To Reproduce
I don't know how, to be honest. I am reporting it because it is akin to #8427.
Expected behavior
No restart is observed because the connection is not lost.
Support bundle for troubleshooting
https://github.com/user-attachments/files/17084623/supportbundle_68d6448a-9d96-4c64-ad69-2f6d3727f36f_2024-09-21T15-26-09Z.zip
Environment
Additional context
My Kustomize/Helm Options
```yaml
---
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component

resources:
  - resources/01-namespaces.yaml
  - resources/11-secrets.yaml
  #- resources/41-reccurring_jobs.yaml # applied by `stage1.sh`

generatorOptions:
  disableNameSuffixHash: true
  immutable: false

helmCharts:
  - name: longhorn
    # https://github.com/longhorn/longhorn/blob/master/README.md#releases
    version: 1.7.1
    namespace: longhorn-system
    includeCRDs: true
    releaseName: longhorn
    repo: https://charts.longhorn.io
    # a complete values.yaml file can be found under
    # https://github.com/longhorn/longhorn/blob/master/chart/values.yaml
    valuesInline:
      networkPolicies:
        enabled: true
        type: k3s
      persistence:
        defaultClass: true
        defaultFsType: ext4
        defaultClassReplicaCount: 1
        reclaimPolicy: Delete
      service:
        ui:
          type: ClusterIP
      longhornUI:
        replicas: 1
      ingress:
        enabled: false
      preUpgradeChecker:
        jobEnabled: true
        upgradeVersionCheck: true
      csi:
        attacherReplicaCount: 1
        provisionerReplicaCount: 1
        resizerReplicaCount: 1
        snapshotterReplicaCount: 1
      defaultSettings:
        backupTarget: REDACTED
        backupTargetCredentialSecret: REDACTED
        defaultLonghornStaticStorageClass: longhorn
        defaultReplicaCount: 1
        logLevel: Warn
        concurrentAutomaticEngineUpgradePerNodeLimit: 1
        defaultDataPath: /k8s-data/
        priorityClass: system-cluster-critical
        upgradeChecker: false

patches:
  - target:
      kind: Deployment
      name: longhorn-ui
    patch: |-
      - op: add
        path: /spec/template/metadata/labels/app
        value: longhorn-ui
      - op: add
        path: /spec/template/metadata/labels/app.traefik.io~1communication
        value: 'true'
      - op: add
        path: /spec/template/spec/dnsConfig
        value:
          options:
            - name: ndots
              value: '1'
            - name: trust-ad
            - name: edns0
  - target:
      kind: DaemonSet
      name: longhorn-manager
    patch: |-
      - op: add
        path: /spec/template/spec/dnsConfig
        value:
          options:
            - name: ndots
              value: '1'
            - name: trust-ad
            - name: edns0
```

I am using Cilium instead of Flannel as the CNI. I also use only one replica each for the attacher, provisioner, resizer and snapshotter.
Moreover, I adjusted the DNS settings of the manager pod (maybe this should not be done?). I also use k8tz, which altered the pod's time zone (it actually should not have, but it did; I need to fix this). I hope the time zone is not a problem here.
Workaround and Mitigation
N/A