hpe-storage / truenas-csp

TrueNAS Container Storage Provider for HPE CSI Driver for Kubernetes
https://scod.hpedev.io
MIT License

DeadlineExceeded error #52

Closed: stepanovdg closed this issue 8 months ago

stepanovdg commented 8 months ago

TrueNAS-SCALE-23.10.1

helm upgrade --install truenas-csp truenas-csp/truenas-csp --create-namespace -n storage --set logDebug=true --set hpe-csi-driver.logLevel=debug

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: iscsi
provisioner: csi.hpe.com
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: truenas-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: storage
  csi.storage.k8s.io/controller-publish-secret-name: truenas-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: storage
  csi.storage.k8s.io/node-publish-secret-name: truenas-secret
  csi.storage.k8s.io/node-publish-secret-namespace: storage
  csi.storage.k8s.io/node-stage-secret-name: truenas-secret
  csi.storage.k8s.io/node-stage-secret-namespace: storage
  csi.storage.k8s.io/provisioner-secret-name: truenas-secret
  csi.storage.k8s.io/provisioner-secret-namespace: storage
  csi.storage.k8s.io/fstype: ext4
  allowOverrides: sparse,compression,deduplication,volblocksize,sync,description
  root: Vault/K8S/iscsi
reclaimPolicy: Retain
allowVolumeExpansion: true

apiVersion: v1
kind: Secret
metadata:
  name: truenas-secret
  namespace: storage
stringData:
  serviceName: truenas-csp-svc
  servicePort: "8080"
  username: ladmin
  password: "token"
  backend: 10.5.6.8

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  namespace: default
spec:
  accessModes:
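The spec above is cut off after accessModes. A claim matching the CreateVolume request in the provisioner log below (1 GiB, single-node writer, i.e. ReadWriteOnce) would look roughly like this; this is a reconstruction, not the exact manifest used:

```shell
# Reconstruction of the test claim based on the CreateVolume request below;
# the original spec is truncated above, so the values are inferred, not copied.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce          # access_mode mode:1 (SINGLE_NODE_WRITER) in the request
  resources:
    requests:
      storage: 1Gi           # required_bytes 1073741824 in the request
  storageClassName: iscsi
EOF
```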

[Screenshot 2023-12-27 at 23:35:54 attached]

1 controller.go:1359] provision "default/test" class "iscsi": started
I1228 07:24:59.228498 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", APIVersion:"v1", ResourceVersion:"3177265", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test"
I1228 07:24:59.237224 1 connection.go:193] GRPC call: /csi.v1.Controller/CreateVolume
I1228 07:24:59.237284 1 connection.go:194] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","parameters":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","csi.storage.k8s.io/pv/name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","csi.storage.k8s.io/pvc/name":"test","csi.storage.k8s.io/pvc/namespace":"default","root":"Vault/K8S/iscsi2"},"secrets":"stripped","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I1228 07:24:59.297555 1 reflector.go:376] sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller/controller.go:845: forcing resync
I1228 07:25:29.238248 1 connection.go:200] GRPC response: {}
I1228 07:25:29.238326 1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:25:29.238448 1 controller.go:816] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:25:29.239004 1 controller.go:1082] Temporary error received, adding PVC e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0 to claims in progress
W1228 07:25:29.239079 1 controller.go:934] Retrying syncing claim "e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", failure 10
E1228 07:25:29.239173 1 controller.go:957] error syncing claim "e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0": failed to provision volume with StorageClass "iscsi": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:25:29.239294 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", APIVersion:"v1", ResourceVersion:"3177265", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "iscsi": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:26:30.583236 1 controller.go:1359] provision "default/test" class "iscsi": started
I1228 07:26:30.587423 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test", UID:"e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", APIVersion:"v1", ResourceVersion:"3177265", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test"
I1228 07:26:30.602225 1 connection.go:193] GRPC call: /csi.v1.Controller/CreateVolume
I1228 07:26:30.602283 1 connection.go:194] GRPC request: {"capacity_range":{"required_bytes":1073741824},"name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","parameters":{"allowOverrides":"sparse,compression,deduplication,volblocksize,sync,description","csi.storage.k8s.io/pv/name":"pvc-e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0","csi.storage.k8s.io/pvc/name":"test","csi.storage.k8s.io/pvc/namespace":"default","root":"Vault/K8S/iscsi2"},"secrets":"stripped","volume_capabilities":[{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}}]}
I1228 07:26:49.311519 1 reflector.go:788] sigs.k8s.io/sig-storage-lib-external-provisioner/v9/controller/controller.go:845: Watch close - *v1.PersistentVolume total 5 items received
I1228 07:27:00.604031 1 connection.go:200] GRPC response: {}
I1228 07:27:00.604142 1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:27:00.604287 1 controller.go:816] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Background: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I1228 07:27:00.604623 1 controller.go:1082] Temporary error received, adding PVC e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0 to claims in progress
W1228 07:27:00.604731 1 controller.go:934] Retrying syncing claim "e9cfd1f7-3c72-4fa9-9a74-4b48fdd638e0", failure 11

Not sure what is wrong and which logs to check.

datamattsson commented 8 months ago

Something is timing out. I don't think the request is reaching the CSP. What do the logs from the CSP say? Also check the logs from the CSI controller driver.
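Something along these lines should pull both, assuming the default deployment and container names from the chart and the storage namespace used above:

```shell
# CSP logs (the TrueNAS Container Storage Provider itself)
kubectl logs -n storage deploy/truenas-csp

# CSI controller logs; the pod runs several containers, so pick them explicitly
kubectl logs -n storage deploy/hpe-csi-controller -c hpe-csi-driver
kubectl logs -n storage deploy/hpe-csi-controller -c csi-provisioner
```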

stepanovdg commented 8 months ago

Thank you. Those were the logs from hpe-csi-controller/csi-driver. I added static routes on TrueNAS for the VLAN interface to the pods, and now it works.
I have masquerading disabled, and for some reason everything else was working fine (even NFS on the same TrueNAS IP).

msilcher commented 8 months ago

Same here. I upgraded to TrueNAS-SCALE-23.10.1 and then K8s to 1.29. I thought it was related to the latter, but now I'm not sure. The truenas-csp pod is reporting "worker exit errors". Adding a few logs: csi-proisioner.txt truenas-csp.txt hpe-csi-driver.txt

stepanovdg commented 8 months ago

@msilcher Oh, I thought it was only me and that something was broken in my setup. I run k3s + Cilium with native routing.
To debug, I connected to the truenas-csp pod and tried to ping around to see whether my server was reachable. I found I could ping any server in my K8s VLAN 10.5.6.0/24 except the TrueNAS interface 10.5.6.8, while at the same time TrueNAS was reachable on its other IPs (through the router). Using Hubble I could see the requests being forwarded through the node. I then attached Wireshark to the node and listened via sshdump, and noticed the traffic is not NATed: packets with source 10.42.1.212 (pod IP) and destination 10.5.6.8 were failing, while the same source to 10.5.6.2 or any other server worked fine.
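Roughly what I ran, assuming the CSP deployment is named truenas-csp in the storage namespace and the image has ping available:

```shell
# Ping the TrueNAS backend from inside the CSP pod (this failed)
kubectl exec -n storage deploy/truenas-csp -- ping -c 3 10.5.6.8

# Compare against another host on the same VLAN (this worked)
kubectl exec -n storage deploy/truenas-csp -- ping -c 3 10.5.6.2
```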

I'm not sure whether this is related to the TrueNAS network configuration (I'm not great at networking), so I just added static routes in TrueNAS for the pod CIDRs, e.g. 10.42.0.0/24 -> 10.5.6.101 (node 1), 10.42.1.0/24 -> 10.5.6.102 (node 2), and so on for all nodes.
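From a shell on TrueNAS that's roughly equivalent to the following (the UI keeps them persistent under Network > Static Routes; the node addresses are just my examples):

```shell
# Route each node's pod CIDR back via that node's address on the VLAN,
# so replies to pod-sourced traffic don't leave through the default gateway.
ip route add 10.42.0.0/24 via 10.5.6.101   # node 1
ip route add 10.42.1.0/24 via 10.5.6.102   # node 2
# ...and so on for every node's pod CIDR
```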

If I run into this kind of problem again with another resource, I'll probably switch Cilium from native routing to a BGP configuration. It just seems strange to me that other services work fine, like NFS on the same 10.5.6.8 IP.

msilcher commented 8 months ago


I had a working setup with TrueNAS 23.10.0.1 and K8s 1.28.4 (using CRI-O). Today I upgraded TrueNAS to 23.10.1 and Kubernetes to 1.29, so I really don't know where the issue might be. Since you worked yours out, I suspect my issue is related to K8s 1.29. I posted under your thread because I also saw "DeadlineExceeded" errors, but those might be too generic to tell whether the root cause is the same. I'm using Calico as the CNI and have had no networking issues so far; I do outgoing NAT on the host so the container network can reach the outside world.
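If it does turn out to be the same pod-to-TrueNAS routing issue, I guess a quick check would be whether Calico is actually SNATing traffic leaving the pod network; a rough sketch, assuming calicoctl is installed and the usual default pool name:

```shell
# Show whether outgoing NAT is enabled on the Calico IP pool
# (pool name is the common default; yours may differ)
calicoctl get ippool default-ipv4-ippool -o yaml | grep -i natoutgoing
```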

datamattsson commented 8 months ago

@msilcher The HPE CSI Driver passed e2e tests on Kubernetes 1.29, so we're good there, but TrueNAS SCALE 23.10 is untested at this time. I will try to get a testbed up in the next few days.