hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

DeadlineExceeded error #52

Closed: stepanovdg closed this issue 8 months ago

stepanovdg commented 8 months ago

TrueNAS-SCALE-23.10.1

helm upgrade --install truenas-csp truenas-csp/truenas-csp --create-namespace -n storage --set logDebug=true --set hpe-csi-driver.logLevel=debug

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: iscsi
provisioner: csi.hpe.com
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: truenas-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: storage
  csi.storage.k8s.io/controller-publish-secret-name: truenas-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: storage
  csi.storage.k8s.io/node-publish-secret-name: truenas-secret
  csi.storage.k8s.io/node-publish-secret-namespace: storage
  csi.storage.k8s.io/node-stage-secret-name: truenas-secret
  csi.storage.k8s.io/node-stage-secret-namespace: storage
  csi.storage.k8s.io/provisioner-secret-name: truenas-secret
  csi.storage.k8s.io/provisioner-secret-namespace: storage
  csi.storage.k8s.io/fstype: ext4
  allowOverrides: sparse,compression,deduplication,volblocksize,sync,description
  root: Vault/K8S/iscsi
reclaimPolicy: Retain
allowVolumeExpansion: true

apiVersion: v1
kind: Secret
metadata:
  name: truenas-secret
  namespace: storage
stringData:
  serviceName: truenas-csp-svc
  servicePort: "8080"
  username: ladmin
  password: "token"
  backend: 10.5.6.8

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  namespace: default
spec:
  accessModes:
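The spec above is cut off after accessModes. A claim matching the CreateVolume request in the provisioner log below (1 GiB, single-node writer, i.e. ReadWriteOnce) would look roughly like this; this is a reconstruction, not the exact manifest used:

```shell
# Reconstruction of the test claim based on the CreateVolume request below;
# the original spec is truncated above, so the values are inferred, not copied.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce          # access_mode mode:1 (SINGLE_NODE_WRITER) in the request
  resources:
    requests:
      storage: 1Gi           # required_bytes 1073741824 in the request
  storageClassName: iscsi
EOF
```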

[Screenshot 2023-12-27 at 23:35:54 attached]

1 controller.go:1359] provision "default/test" class "iscsi": started
I1228 07:24:59.228498 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", APIVersion:"v1", ResourceVersion:"3177265", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test"
I1228 07:24:59.237224 1 connection.go:193] GRPC call: /csi.v1.Controller/CreateVolume
I1228 07:24:59.237284 1 connection.go:194] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","parameters":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","csi.storage.k8s.io/pv/name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","csi.storage.k8s.io/pvc/name":"test","csi.storage.k8s.io/pvc/namespace":"default","root":"Vault/K8S/iscsi2"},"secrets":"stripped","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I1228 07:24:59.297555 1 reflector.go:376] sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller/controller.go:845: forcing resync
I1228 07:25:29.238248 1 connection.go:200] GRPC response: {}
I1228 07:25:29.238326 1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:25:29.238448 1 controller.go:816] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:25:29.239004 1 controller.go:1082] Temporary error received, adding PVC e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0 to claims in progress
W1228 07:25:29.239079 1 controller.go:934] Retrying syncing claim "e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", failure 10
E1228 07:25:29.239173 1 controller.go:957] error syncing claim "e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0": failed to provision volume with StorageClass "iscsi": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:25:29.239294 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", APIVersion:"v1", ResourceVersion:"3177265", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "iscsi": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:26:30.583236 1 controller.go:1359] provision "default/test" class "iscsi": started
I1228 07:26:30.587423 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", APIVersion:"v1", ResourceVersion:"3177265", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test"
I1228 07:26:30.602225 1 connection.go:193] GRPC call: /csi.v1.Controller/CreateVolume
I1228 07:26:30.602283 1 connection.go:194] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","parameters":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","csi.storage.k8s.io/pv/name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","csi.storage.k8s.io/pvc/name":"test","csi.storage.k8s.io/pvc/namespace":"default","root":"Vault/K8S/iscsi2"},"secrets":"stripped","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I1228 07:26:49.311519 1 reflector.go:788] sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller/controller.go:845: Watch close - *v1.PersistentVolume total 5 items received
I1228 07:27:00.604031 1 connection.go:200] GRPC response: {}
I1228 07:27:00.604142 1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:27:00.604287 1 controller.go:816] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:27:00.604623 1 controller.go:1082] Temporary error received, adding PVC e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0 to claims in progress
W1228 07:27:00.604731 1 controller.go:934] Retrying syncing claim "e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", failure 11

Not sure what is wrong and which logs to check.

datamattsson commented 8 months ago

Something is timing out. I don't think the request is reaching the CSP. What do the logs from the CSP say? Also check the logs from the CSI controller driver.
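Something along these lines should pull both, assuming the default deployment and container names from the chart and the storage namespace used above:

```shell
# CSP logs (the TrueNAS Container Storage Provider itself)
kubectl logs -n storage deploy/truenas-csp

# CSI controller logs; the pod runs several containers, so pick them explicitly
kubectl logs -n storage deploy/hpe-csi-controller -c hpe-csi-driver
kubectl logs -n storage deploy/hpe-csi-controller -c csi-provisioner
```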

stepanovdg commented 8 months ago

Thank you. Those were the logs from hpe-csi-controller/csi-driver. I added static routes on TrueNAS for the VLAN interface to the pods, and now it works.
I have masquerading disabled, and for some reason everything else was working fine (even NFS on the same TrueNAS IP).

msilcher commented 8 months ago

Same here. I upgraded to TrueNAS-SCALE-23.10.1 and then K8s to 1.29. I thought it was related to the latter, but now I'm not sure. The truenas-csp pod is reporting "worker exit errors". Adding a few logs: csi-proisioner.txt truenas-csp.txt hpe-csi-driver.txt

stepanovdg commented 8 months ago

@msilcher Oh, I thought it was only me and that something was broken in my setup. I run k3s + Cilium with native routing.
To debug, I connected to the truenas-csp pod and tried to ping around to see whether my server was reachable. I found I could ping any server in my K8s VLAN 10.5.6.0/24 except the TrueNAS interface 10.5.6.8, while at the same time TrueNAS was reachable on its other IPs (through the router). Using Hubble I could see the requests being forwarded through the node. I then attached Wireshark to the node and listened via sshdump, and noticed the traffic is not NATed: packets with source 10.42.1.212 (pod IP) and destination 10.5.6.8 were failing, while the same source to 10.5.6.2 or any other server worked fine.
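Roughly what I ran, assuming the CSP deployment is named truenas-csp in the storage namespace and the image has ping available:

```shell
# Ping the TrueNAS backend from inside the CSP pod (this failed)
kubectl exec -n storage deploy/truenas-csp -- ping -c 3 10.5.6.8

# Compare against another host on the same VLAN (this worked)
kubectl exec -n storage deploy/truenas-csp -- ping -c 3 10.5.6.2
```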

I'm not sure whether this is related to the TrueNAS network configuration (I'm not great at networking), so I just added static routes in TrueNAS for the pod CIDRs, e.g. 10.42.0.0/24 -> 10.5.6.101 (node 1), 10.42.1.0/24 -> 10.5.6.102 (node 2), and so on for all nodes.
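From a shell on TrueNAS that's roughly equivalent to the following (the UI keeps them persistent under Network > Static Routes; the node addresses are just my examples):

```shell
# Route each node's pod CIDR back via that node's address on the VLAN,
# so replies to pod-sourced traffic don't leave through the default gateway.
ip route add 10.42.0.0/24 via 10.5.6.101   # node 1
ip route add 10.42.1.0/24 via 10.5.6.102   # node 2
# ...and so on for every node's pod CIDR
```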

If I run into this kind of problem again with another resource, I'll probably switch Cilium from native routing to a BGP configuration. It just seems strange to me that other services work fine, like NFS on the same 10.5.6.8 IP.

msilcher commented 8 months ago


I had a working setup with TrueNAS 23.10.0.1 and K8s 1.28.4 (using CRI-O). Today I upgraded TrueNAS to 23.10.1 and Kubernetes to 1.29, so I really don't know where the issue might be. Since you worked yours out, I suspect my issue is related to K8s 1.29. I posted under your thread because I also saw "DeadlineExceeded" errors, but those might be too generic to tell whether the root cause is the same. I'm using Calico as the CNI and have had no networking issues so far; I do outgoing NAT on the host so the container network can reach the outside world.
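If it does turn out to be the same pod-to-TrueNAS routing issue, I guess a quick check would be whether Calico is actually SNATing traffic leaving the pod network; a rough sketch, assuming calicoctl is installed and the usual default pool name:

```shell
# Show whether outgoing NAT is enabled on the Calico IP pool
# (pool name is the common default; yours may differ)
calicoctl get ippool default-ipv4-ippool -o yaml | grep -i natoutgoing
```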

datamattsson commented 8 months ago

@msilcher The HPE CSI Driver passed e2e tests on Kubernetes 1.29, so we're good there, but TrueNAS SCALE 23.10 is untested at this time. I will try to get a testbed up in the next few days.