kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

failed to get CsiNodeTopology for the node: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container. #2284

Closed hirendave47 closed 5 months ago

hirendave47 commented 1 year ago

/kind bug

What happened: Trying to install vSphere CSI drivers v2.7.0 with RKE2 cluster v1.24.10+rke2r1.

$ cat /etc/rancher/rke2/config.yaml
cloud-provider-name: external

$ cat csi-vsphere.conf
[Global]
cluster-id = "${CLUSTER_NAME}"
cluster-distribution = "Kubernetes"

[VirtualCenter "172.16.16.110"]
insecure-flag = "true"
user = "user1@vsphere.local"
password = "password12345"
port = "443"
datacenters = "datacenter1"

root@urnpk8sm60:~# kubectl --namespace=vmware-system-csi get all
NAME                                          READY   STATUS             RESTARTS      AGE
pod/vsphere-csi-controller-7589ccbcf8-6w7pw   0/7     Pending            0             3m2s
pod/vsphere-csi-controller-7589ccbcf8-phl5c   0/7     Pending            0             3m2s
pod/vsphere-csi-controller-7589ccbcf8-wwwfc   0/7     Pending            0             3m2s
pod/vsphere-csi-node-6vljg                    2/3     CrashLoopBackOff   4 (79s ago)   3m2s
pod/vsphere-csi-node-dpnh9                    2/3     CrashLoopBackOff   5 (7s ago)    3m2s
pod/vsphere-csi-node-jd4wt                    2/3     CrashLoopBackOff   4 (78s ago)   3m2s
pod/vsphere-csi-node-wtlp7                    2/3     CrashLoopBackOff   4 (72s ago)   3m2s

NAME                             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/vsphere-csi-controller   ClusterIP   10.43.162.210   <none>        2112/TCP,2113/TCP   3m2s

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
daemonset.apps/vsphere-csi-node           4         4         0       4            0           kubernetes.io/os=linux     3m2s
daemonset.apps/vsphere-csi-node-windows   0         0         0       0            0           kubernetes.io/os=windows   3m2s

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vsphere-csi-controller   0/3     3            0           3m2s

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/vsphere-csi-controller-7589ccbcf8   3         3         0       3m2s

root@urnpk8sm60:~# kubectl --namespace=vmware-system-csi logs pod/vsphere-csi-node-wtlp7
Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I0315 11:27:48.852737       1 main.go:166] Version: v2.5.1
I0315 11:27:48.852835       1 main.go:167] Running node-driver-registrar in mode=registration
I0315 11:27:48.854993       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0315 11:27:48.855119       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0315 11:27:48.859495       1 main.go:198] Calling CSI driver to discover driver name
I0315 11:27:48.859554       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0315 11:27:48.859566       1 connection.go:184] GRPC request: {}
I0315 11:27:48.875719       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v2.7.0"}
I0315 11:27:48.876170       1 connection.go:187] GRPC error: <nil>
I0315 11:27:48.876774       1 main.go:208] CSI driver name: "csi.vsphere.vmware.com"
I0315 11:27:48.877323       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0315 11:27:48.878695       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0315 11:27:48.879412       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0315 11:27:49.996391       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0315 11:27:49.998477       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0315 11:27:50.069958       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "urnpk8sm60". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0315 11:27:50.070058       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "urnpk8sm60". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.
root@urnpk8sm60:~#

After changing improved-volume-topology: 'true' to 'false' in vsphere-csi-driver.yaml, the vsphere-csi-node pods are running, but the vsphere-csi-controller pods are still in Pending state due to node affinity/selector:

Warning FailedScheduling 26s default-scheduler 0/4 nodes are available: 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
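For reference, the workaround above corresponds to flipping one key in the driver's feature-state ConfigMap (the ConfigMap name `internal-feature-states.csi.vsphere.vmware.com` appears in the controller logs later in this thread; the sketch below is illustrative, not a complete manifest, and the other keys are omitted):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: vmware-system-csi
data:
  # only the flag changed for the workaround is shown here
  improved-volume-topology: "false"
```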

What you expected to happen: The same steps work fine with vanilla Kubernetes, so I expected them to work with RKE2 as well.
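A likely reason for the Pending controller pods on RKE2 (an assumption to verify with `kubectl get nodes --show-labels`): the controller Deployment pins itself to control-plane nodes with a nodeSelector along the lines below, and the selector has to match the label value RKE2 actually sets on its server nodes.

```yaml
# Illustrative snippet of the controller Deployment's scheduling constraints
# (exact keys/values vary by manifest version -- verify against your yaml):
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""   # RKE2 may set this label with value "true"
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
```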

Environment:

jingxu97 commented 1 year ago

We hit similar bugs with vSphere CSI driver 3.0. More details follow.

Starting from 3.0, the CNS topology feature flag is removed and can no longer be turned off (OSS change). Comparing the passing and failing logs, it looks like there is a race condition during node registration. The registration logic has no retry, so the subsequent logic that depends on it fails.

vsphere-csi-node-mbdjf                                       1/2     CrashLoopBackOff 

Node driver registrar log

2023-03-31T12:28:22.245291043Z I0331 12:28:22.245133       1 main.go:102] Received GetInfo call: &InfoRequest{}
2023-03-31T12:28:22.245783165Z I0331 12:28:22.245690       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
2023-03-31T12:29:22.269924126Z I0331 12:29:22.269721       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = timed out while waiting for topology labels to be updated in "c01057a88824-qual-323-0afbb584" CSINodeTopology instance.,}
2023-03-31T12:29:22.269969201Z E0331 12:29:22.269834       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = timed out while waiting for topology labels to be updated in "c01057a88824-qual-323-0afbb584" CSINodeTopology instance., restarting registration container.

vsphere-csi-driver log

2023-03-31T11:33:28.319787659Z {"level":"info","time":"2023-03-31T11:33:28.319728628Z","caller":"kubernetes/kubernetes.go:395","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"a7e5dbf2-7c1d-4fb3-9fb3-730e47f69001"}
2023-03-31T11:33:28.345686928Z {"level":"info","time":"2023-03-31T11:33:28.344293416Z","caller":"k8sorchestrator/topology.go:727","msg":"Topology service initiated successfully","TraceId":"a7e5dbf2-7c1d-4fb3-9fb3-730e47f69001"}
2023-03-31T11:33:28.372612710Z {"level":"info","time":"2023-03-31T11:33:28.37245886Z","caller":"k8sorchestrator/topology.go:895","msg":"Successfully created a CSINodeTopology instance for NodeName: \"c01057a88824-qual-323-0afbb584\"","TraceId":"a7e5dbf2-7c1d-4fb3-9fb3-730e47f69001"}
2023-03-31T11:34:28.375379515Z {"level":"error","time":"2023-03-31T11:34:28.374319044Z","caller":"k8sorchestrator/topology.go:837","msg":"timed out while waiting for topology labels to be updated in \"c01057a88824-qual-323-0afbb584\" CSINodeTopology instance.","TraceId":"a7e5dbf2-7c1d-4fb3-9fb3-730e47f69001","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/common/commonco/k8sorchestrator.(*nodeVolumeTopology).GetNodeTopologyLabels\n\t/build/pkg/csi/service/common/commonco/k8sorchestrator/topology.go:837\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).NodeGetInfo\n\t/build/pkg/csi/service/node.go:429\ngithub.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:6231\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
2023-03-31T11:34:30.358756419Z {"level":"info","time":"2023-03-31T11:34:30.358584895Z","caller":"service/node.go:338","msg":"NodeGetInfo: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"f21e5f14-20bf-4bb0-a70b-2322ee91fa34"}
2023-03-31T11:34:30.372726452Z {"level":"info","time":"2023-03-31T11:34:30.37258219Z","caller":"k8sorchestrator/topology.go:892","msg":"CSINodeTopology instance already exists for NodeName: \"c01057a88824-qual-323-0afbb584\"","TraceId":"f21e5f14-20bf-4bb0-a70b-2322ee91fa34"}
2023-03-31T11:35:30.375537038Z {"level":"error","time":"2023-03-31T11:35:30.374942664Z","caller":"k8sorchestrator/topology.go:837","msg":"timed out while waiting for topology labels to be updated in \"c01057a88824-qual-323-0afbb584\" CSINodeTopology instance.","TraceId":"f21e5f14-20bf-4bb0-a70b-2322ee91fa34","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/common/commonco/k8sorchestrator.(*nodeVolumeTopology).GetNodeTopologyLabels\n\t/build/pkg/csi/service/common/commonco/k8sorchestrator/topology.go:837\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).NodeGetInfo\n\t/build/pkg/csi/service/node.go:429\ngithub.com/container-storage-interface/spec/lib/go/csi._Node_NodeGetInfo_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:6231\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
"kubectl_logs_vsphere-csi-node-mbdjf_--container_vsphere-csi-node_--kubeconfig_.tmp.user-kubeconfig-56089268_--request-timeout_30s_--namespace_kube-system_--timestamps" 63L, 29225B

Some logs from vsphere-csi-controller:

  1. Passing test. Notice that "Successfully registered VC mtv-qual-vc03.anthos:443" happens first; all the subsequent node VMs can then be found:
2023-03-30T03:27:10.857292378Z {"level":"info","time":"2023-03-30T03:27:10.857253925Z","caller":"vsphere/virtualcentermanager.go:123","msg":"Successfully registered VC mtv-qual-vc03.anthos:443"}
2023-03-30T03:27:10.857373990Z {"level":"info","time":"2023-03-30T03:27:10.857306174Z","caller":"vsphere/virtualcenter.go:283","msg":"VirtualCenter.connect() creating new client"}
2023-03-30T03:27:10.860343329Z {"level":"info","time":"2023-03-30T03:27:10.860255209Z","caller":"node/manager.go:128","msg":"Discovering the node vm using uuid: \"4211ec99-cb15-a5cd-3193-49bb2883a3fb\"","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.860354679Z {"level":"info","time":"2023-03-30T03:27:10.860281279Z","caller":"vsphere/virtualmachine.go:159","msg":"Initiating asynchronous datacenter listing with uuid 4211ec99-cb15-a5cd-3193-49bb2883a3fb","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.883477322Z {"level":"info","time":"2023-03-30T03:27:10.883355912Z","caller":"k8sorchestrator/k8sorchestrator.go:644","msg":"configMapAdded: Internal feature state values from \"internal-feature-states.csi.vsphere.vmware.com\" stored successfully: map[async-query-volume:true block-volume-snapshot:true csi-migration:true csi-windows-support:true improved-csi-idempotency:true online-volume-extend:true trigger-csi-fullsync:false]","TraceId":"bd419e1b-0881-4e2e-b9d4-b90f744c9a1b"}
2023-03-30T03:27:10.914441786Z {"level":"info","time":"2023-03-30T03:27:10.914259728Z","caller":"vsphere/virtualcenter.go:202","msg":"New session ID for 'VSPHERE.LOCAL\\herc-32210b8fe0bd' = 52611e95-9457-8d31-c11e-9edbca63b82e"}
2023-03-30T03:27:10.914464578Z {"level":"info","time":"2023-03-30T03:27:10.914314422Z","caller":"vsphere/virtualcenter.go:291","msg":"VirtualCenter.connect() successfully created new client"}
2023-03-30T03:27:10.914468743Z {"level":"info","time":"2023-03-30T03:27:10.914338151Z","caller":"vsphere/virtualcenter.go:606","msg":"vCenterInstance initialized"}
2023-03-30T03:27:10.914549752Z {"level":"info","time":"2023-03-30T03:27:10.914476906Z","caller":"volume/manager.go:193","msg":"Initializing new defaultManager..."}
2023-03-30T03:27:10.914677990Z {"level":"info","time":"2023-03-30T03:27:10.914582011Z","caller":"syncer/metadatasyncer.go:417","msg":"Adding watch on path: \"/etc/cloud\""}
2023-03-30T03:27:10.914813271Z {"level":"info","time":"2023-03-30T03:27:10.914732425Z","caller":"volume/manager.go:190","msg":"Retrieving existing defaultManager..."}
2023-03-30T03:27:10.917155580Z {"level":"info","time":"2023-03-30T03:27:10.917045953Z","caller":"kubernetes/kubernetes.go:79","msg":"k8s client using kubeconfig from /etc/kubernetes/kubeconfig.conf"}
2023-03-30T03:27:10.917932242Z {"level":"info","time":"2023-03-30T03:27:10.917828978Z","caller":"kubernetes/kubernetes.go:395","msg":"Setting client QPS to 100.000000 and Burst to 100."}
2023-03-30T03:27:10.924330377Z {"level":"info","time":"2023-03-30T03:27:10.924234917Z","caller":"vsphere/datacenter.go:154","msg":"Publishing datacenter Datacenter [Datacenter: Datacenter:datacenter-2 @ /mtv-qual-vc03, VirtualCenterHost: mtv-qual-vc03.anthos]","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.924393747Z {"level":"info","time":"2023-03-30T03:27:10.924310978Z","caller":"vsphere/virtualmachine.go:196","msg":"AsyncGetAllDatacenters with uuid 4211ec99-cb15-a5cd-3193-49bb2883a3fb sent a dc Datacenter [Datacenter: Datacenter:datacenter-2 @ /mtv-qual-vc03, VirtualCenterHost: mtv-qual-vc03.anthos]","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.933106473Z {"level":"info","time":"2023-03-30T03:27:10.933012951Z","caller":"vsphere/virtualmachine.go:210","msg":"Found VM VirtualMachine:vm-623336 [VirtualCenterHost: mtv-qual-vc03.anthos, UUID: 4211ec99-cb15-a5cd-3193-49bb2883a3fb, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-2 @ /mtv-qual-vc03, VirtualCenterHost: mtv-qual-vc03.anthos]] given uuid 4211ec99-cb15-a5cd-3193-49bb2883a3fb on DC Datacenter [Datacenter: Datacenter:datacenter-2 @ /mtv-qual-vc03, VirtualCenterHost: mtv-qual-vc03.anthos]","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.933121137Z {"level":"info","time":"2023-03-30T03:27:10.93304699Z","caller":"vsphere/virtualmachine.go:221","msg":"Returning VM VirtualMachine:vm-623336 [VirtualCenterHost: mtv-qual-vc03.anthos, UUID: 4211ec99-cb15-a5cd-3193-49bb2883a3fb, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-2 @ /mtv-qual-vc03, VirtualCenterHost: mtv-qual-vc03.anthos]] for UUID 4211ec99-cb15-a5cd-3193-49bb2883a3fb","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.933125305Z {"level":"info","time":"2023-03-30T03:27:10.933063049Z","caller":"node/manager.go:151","msg":"Successfully discovered node with nodeUUID 4211ec99-cb15-a5cd-3193-49bb2883a3fb in vm VirtualMachine:vm-623336 [VirtualCenterHost: mtv-qual-vc03.anthos, UUID: 4211ec99-cb15-a5cd-3193-49bb2883a3fb, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-2 @ /mtv-qual-vc03, VirtualCenterHost: mtv-qual-vc03.anthos]]","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.933128956Z {"level":"info","time":"2023-03-30T03:27:10.933073905Z","caller":"node/manager.go:134","msg":"Successfully discovered node: \"32210b8fe0bd-qual-private306-15003d10\" with nodeUUID \"4211ec99-cb15-a5cd-3193-49bb2883a3fb\"","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
2023-03-30T03:27:10.933136796Z {"level":"info","time":"2023-03-30T03:27:10.933081969Z","caller":"node/manager.go:136","msg":"Successfully registered node: \"32210b8fe0bd-qual-private306-15003d10\" with nodeUUID \"4211ec99-cb15-a5cd-3193-49bb2883a3fb\"","TraceId":"a360dbae-71de-43a2-8b2b-38fff67feebb"}
  2. Failing test (more logs: https://gist.github.com/jingxu97/cc013868270f4d05497a7aba2b59221c). Before the VC is registered, there are "VM not found" errors; after the VC is registered, the remaining nodes can be found:
2023-03-31T11:33:19.880026310Z {"level":"info","time":"2023-03-31T11:33:19.691025909Z","caller":"vsphere/virtualcentermanager.go:74","msg":"Initializing defaultVirtualCenterManager..."}
2023-03-31T11:33:19.880028344Z {"level":"info","time":"2023-03-31T11:33:19.69104239Z","caller":"vsphere/virtualcentermanager.go:76","msg":"Successfully initialized defaultVirtualCenterManager"}
2023-03-31T11:33:19.880031590Z {"level":"error","time":"2023-03-31T11:33:19.691146997Z","caller":"vsphere/virtualmachine.go:227","msg":"Returning VM not found err for UUID 420efa50-4b8b-f108-347a-0a8f21fdb714","TraceId":"973eef74-6eac-4af0-abeb-5cd217b8eaa1","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualMachineByUUID\n\t/build/pkg/common/cns-lib/vsphere/virtualmachine.go:227\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).DiscoverNode\n\t/build/pkg/common/cns-lib/node/manager.go:145\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).RegisterNode\n\t/build/pkg/common/cns-lib/node/manager.go:129\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).nodeAdd\n\t/build/pkg/common/cns-lib/node/nodes.go:69\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/controller.go:232\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:818\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75"}
2023-03-31T11:33:19.880046207Z {"level":"error","time":"2023-03-31T11:33:19.691213382Z","caller":"node/manager.go:147","msg":"Couldn't find VM instance with nodeUUID 420efa50-4b8b-f108-347a-0a8f21fdb714, failed to discover with err: virtual machine wasn't found","TraceId":"973eef74-6eac-4af0-abeb-5cd217b8eaa1","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).DiscoverNode\n\t/build/pkg/common/cns-lib/node/manager.go:147\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).RegisterNode\n\t/build/pkg/common/cns-lib/node/manager.go:129\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).nodeAdd\n\t/build/pkg/common/cns-lib/node/nodes.go:69\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/controller.go:232\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:818\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75"}
2023-03-31T11:33:19.880051146Z {"level":"error","time":"2023-03-31T11:33:19.691256924Z","caller":"node/manager.go:131","msg":"failed to discover VM with uuid: \"420efa50-4b8b-f108-347a-0a8f21fdb714\" for node: \"c01057a88824-qual-323-0afbb5d7\"","TraceId":"973eef74-6eac-4af0-abeb-5cd217b8eaa1","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).RegisterNode\n\t/build/pkg/common/cns-lib/node/manager.go:131\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).nodeAdd\n\t/build/pkg/common/cns-lib/node/nodes.go:69\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/controller.go:232\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:818\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75"}
2023-03-31T11:33:19.880056226Z {"level":"warn","time":"2023-03-31T11:33:19.691282823Z","caller":"node/nodes.go:72","msg":"failed to register node:\"c01057a88824-qual-323-0afbb5d7\". err=virtual machine wasn't found","TraceId":"973eef74-6eac-4af0-abeb-5cd217b8eaa1"}
2023-03-31T11:33:19.880058280Z {"level":"info","time":"2023-03-31T11:33:19.691529217Z","caller":"node/manager.go:128","msg":"Discovering the node vm using uuid: \"420e47c7-ba23-d2a8-6656-526432f8313b\"","TraceId":"2ceea003-59f4-4aa9-a37c-b2af5e4c48be"}
2023-03-31T11:33:19.880060013Z {"level":"info","time":"2023-03-31T11:33:19.691588529Z","caller":"vsphere/virtualmachine.go:159","msg":"Initiating asynchronous datacenter listing with uuid 420e47c7-ba23-d2a8-6656-526432f8313b","TraceId":"2ceea003-59f4-4aa9-a37c-b2af5e4c48be"}
2023-03-31T11:33:19.880066836Z {"level":"error","time":"2023-03-31T11:33:19.691638263Z","caller":"vsphere/virtualmachine.go:227","msg":"Returning VM not found err for UUID 420e47c7-ba23-d2a8-6656-526432f8313b","TraceId":"2ceea003-59f4-4aa9-a37c-b2af5e4c48be","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualMachineByUUID\n\t/build/pkg/common/cns-lib/vsphere/virtualmachine.go:227\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).DiscoverNode\n\t/build/pkg/common/cns-lib/node/manager.go:145\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).RegisterNode\n\t/build/pkg/common/cns-lib/node/manager.go:129\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).nodeAdd\n\t/build/pkg/common/cns-lib/node/nodes.go:69\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/controller.go:232\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:818\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75"}
2023-03-31T11:33:19.880069351Z {"level":"error","time":"2023-03-31T11:33:19.691715308Z","caller":"node/manager.go:147","msg":"Couldn't find VM instance with nodeUUID 420e47c7-ba23-d2a8-6656-526432f8313b, failed to discover with err: virtual machine wasn't found","TraceId":"2ceea003-59f4-4aa9-a37c-b2af5e4c48be","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).DiscoverNode\n\t/build/pkg/common/cns-lib/node/manager.go:147\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).RegisterNode\n\t/build/pkg/common/cns-lib/node/manager.go:129\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).nodeAdd\n\t/build/pkg/common/cns-lib/node/nodes.go:69\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/controller.go:232\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:818\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75"}
2023-03-31T11:33:19.880080141Z {"level":"error","time":"2023-03-31T11:33:19.691763489Z","caller":"node/manager.go:131","msg":"failed to discover VM with uuid: \"420e47c7-ba23-d2a8-6656-526432f8313b\" for node: \"c01057a88824-qual-323-0afbb584\"","TraceId":"2ceea003-59f4-4aa9-a37c-b2af5e4c48be","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*defaultManager).RegisterNode\n\t/build/pkg/common/cns-lib/node/manager.go:131\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/node.(*Nodes).nodeAdd\n\t/build/pkg/common/cns-lib/node/nodes.go:69\nk8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/controller.go:232\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:818\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/go/pkg/mod/k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75"}
2023-03-31T11:33:19.880082445Z {"level":"warn","time":"2023-03-31T11:33:19.691811169Z","caller":"node/nodes.go:72","msg":"failed to register node:\"c01057a88824-qual-323-0afbb584\". err=virtual machine wasn't found","TraceId":"2ceea003-59f4-4aa9-a37c-b2af5e4c48be"}
2023-03-31T11:33:19.880084559Z {"level":"info","time":"2023-03-31T11:33:19.692076979Z","caller":"vsphere/virtualcentermanager.go:123","msg":"Successfully registered VC atl-qual-vc06.anthos:443"}
2023-03-31T11:33:19.880086253Z {"level":"info","time":"2023-03-31T11:33:19.692285863Z","caller":"node/manager.go:128","msg":"Discovering the node vm using uuid: \"420e8daf-1b9e-e49f-e9f9-2c59ab10c59f\"","TraceId":"c42c3ab6-ed72-4328-9ec1-ea34824bdbd4"}
2023-03-31T11:33:19.880088266Z {"level":"info","time":"2023-03-31T11:33:19.692375572Z","caller":"vsphere/virtualmachine.go:159","msg":"Initiating asynchronous datacenter listing with uuid 420e8daf-1b9e-e49f-e9f9-2c59ab10c59f","TraceId":"c42c3ab6-ed72-4328-9ec1-ea34824bdbd4"}
2023-03-31T11:33:19.880090110Z {"level":"info","time":"2023-03-31T11:33:19.692324796Z","caller":"vsphere/virtualcenter.go:283","msg":"VirtualCenter.connect() creating new client"}
jingxu97 commented 1 year ago

/cc @divyenpatel @xing-yang @msau42 @gnufied @jsafrane

jingxu97 commented 1 year ago

Similar issue opened before https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1661

divyenpatel commented 1 year ago

Before the driver can create CR instances, the CSINodeTopology CRD has to be registered. Once that CRD has been registered, the CSI node DaemonSet pods can create CSINodeTopology instances for their nodes, and node discovery then takes place. If the CRD registration has not happened yet, you may see the node DaemonSet pods in a CrashLoopBackOff state.
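A quick way to tell which side of that race you are on is to check whether the CRD exists at all, e.g. `kubectl get crd -o name | grep csinodetopolog` (the CRD name is inferred from the `cns.vmware.com` group in the error, so verify it in your cluster). The filter itself, shown on sample output:

```shell
# Demonstration of the filter on sample `kubectl get crd -o name` output;
# the CRD names below are assumptions inferred from the API group in the error.
printf '%s\n' \
  'customresourcedefinition.apiextensions.k8s.io/csinodetopologies.cns.vmware.com' \
  'customresourcedefinition.apiextensions.k8s.io/cnsvolumeoperationrequests.cns.vmware.com' |
  grep -c 'csinodetopolog'
# -> 1 (one matching CRD)
```

If the count is 0 in a failing cluster, the node pods are crashing because the syncer has not registered the CRD yet.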

jingxu97 commented 1 year ago

Which component registers the CSINodeTopology CRD? And how can we make sure the CRD is successfully registered before the CSI node DaemonSet pods create the instances?

divyenpatel commented 1 year ago

The syncer component in the driver registers the CRD. Refer to https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/f762a45e4db2b59e36e80ee943bc42a84f4980cc/pkg/syncer/cnsoperator/manager/init.go#L205-L212

divyenpatel commented 1 year ago

How can we make sure the CRD is successfully registered before the CSI node DaemonSet pods create the instances?

You can install the CSI controller Pod first, let all the required CRDs register, wait for initialization to complete, and then deploy the CSI node DaemonSet pods.

jingxu97 commented 1 year ago

Please see more logs https://gist.github.com/jingxu97/cc013868270f4d05497a7aba2b59221c

From what I can tell, some of the logic that discovers and registers the VM has no retry, which causes the subsequent "VM not found" errors.

jingxu97 commented 1 year ago

How can we make sure the CRD is successfully registered before the CSI node DaemonSet pods create the instances?

You can install the CSI controller Pod first, let all the required CRDs register, wait for initialization to complete, and then deploy the CSI node DaemonSet pods.

Would it be possible to add retry logic instead of a strict ordering requirement? I think it is hard to enforce that ordering when deploying the controller and the driver.
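For comparison, here is what a deployment-side retry barrier could look like (plain POSIX shell; the kubectl usage in the comment, including the CRD name, is an assumption to adapt to your cluster):

```shell
# wait_until <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds, sleeping between attempts;
# returns 1 if it never succeeds.
wait_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Intended real-world usage, after applying the controller manifest
# (assumed CRD name, not run here):
#   wait_until 60 5 kubectl get crd csinodetopologies.cns.vmware.com
#   kubectl apply -f vsphere-csi-node-daemonset.yaml
```

An equivalent one-liner would be `kubectl wait --for condition=established --timeout=120s crd/csinodetopologies.cns.vmware.com` before applying the node DaemonSet, though a retry inside the driver itself would still be the more robust fix.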

gabrieletosca commented 1 year ago

guys, I have the same problem; I am, however, using version 3.0.0, and I have pods in CrashLoopBackOff:

vsphere-csi-controller-68c65dbdd5-cb9jb   0/7     Pending            0             19m
vsphere-csi-controller-68c65dbdd5-whswk   0/7     Pending            0             19m
vsphere-csi-node-9qlc6                    2/3     CrashLoopBackOff   5 (28s ago)   3m40s
vsphere-csi-node-h9hkq                    2/3     CrashLoopBackOff   5 (30s ago)   3m40s
vsphere-csi-node-nbvfp                    2/3     CrashLoopBackOff   5 (45s ago)   3m40s

and in the logs of one of the pods I get this:

Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I0403 22:57:51.418542       1 main.go:167] Version: v2.7.0
I0403 22:57:51.418588       1 main.go:168] Running node-driver-registrar in mode=registration
I0403 22:57:51.419473       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0403 22:57:51.419515       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0403 22:57:51.420762       1 main.go:199] Calling CSI driver to discover driver name
I0403 22:57:51.420772       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0403 22:57:51.420776       1 connection.go:184] GRPC request: {}
I0403 22:57:51.424195       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.0.0"}
I0403 22:57:51.424239       1 connection.go:187] GRPC error: <nil>
I0403 22:57:51.424247       1 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
I0403 22:57:51.424312       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0403 22:57:51.424466       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0403 22:57:51.424537       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0403 22:57:52.522333       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0403 22:57:52.522670       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0403 22:57:52.533985       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "k8s-worker02". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0403 22:57:52.534009       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "k8s-worker02". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

I'm following step by step this guide: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphere-csp-getting-started/GUID-54BB79D2-B13F-4673-8CC2-63A772D17B3C.html

My env consists of: k8s cluster 1.26.3 1 master node 2 worker node esxi 7.0.3 vcenter 7.0.3

divyenpatel commented 1 year ago

@gabrieletosca I see you have pending vsphere-csi-controller Pods. Can you get the CSI controller Pod up and running first, and then check the CSI node DaemonSet Pod status?

gabrieletosca commented 1 year ago

@gabrieletosca I see you have pending vsphere-csi-controller Pods. Can you get the CSI controller Pod up and running first, and then check the CSI node DaemonSet Pod status?

I can't, unfortunately... this is the output of kubectl describe pods vsphere-csi-controller-68c65dbdd5-cb9jb --namespace=vmware-system-csi:

0/3 nodes are available: 3 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling..

and this is the grep for kubectl describe nodes | egrep "Taints:|Name:":

Name:               k8s-master01
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               k8s-worker01
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Name:               k8s-worker02
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

I see that the guide says to taint only the master (https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphere-csp-getting-started/GUID-4E47B9F1-B250-4B36-8FEC-8F45E6529D23.html), but this page (https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphere-csp-getting-started/GUID-0AB6E692-AA47-4B6A-8CEA-38B754E16567.html) says to do it on all nodes... I also tried removing the taint from the workers, but it still doesn't work.
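For reference, the `node.cloudprovider.kubernetes.io/uninitialized` taint is applied when kubelet starts with an external cloud provider, and it is normally removed by the vSphere Cloud Provider (CPI) once CPI has initialized each node, so getting CPI running is the usual fix. As a stopgap, a workload could tolerate the taint; this is a hedged sketch of that toleration, not part of the shipped manifest:

```yaml
# Sketch only: lets a Pod schedule onto nodes still carrying the
# "uninitialized" taint. The proper fix is to install vSphere CPI,
# which removes this taint after initializing the node.
tolerations:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
```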

divyenpatel commented 1 year ago

@jingxu97 During our debugging session we observed that some of the feature gates were disabled in the v3.0.0 release you were using.

Do you see this issue getting resolved on Anthos setup after enabling all required feature gates for the release v3.0.0?

lethargosapatheia commented 1 year ago

I am running into the same issue with vanilla (kubeadm) kubernetes version 1.26.3 when installing the csi-driver version 3.0 (using this manifest: https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.0.0/manifests/vanilla/vsphere-csi-driver.yaml):

I0424 16:09:37.991893       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0424 16:09:38.030150       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "omni-kube-controlplane-1". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0424 16:09:38.030228       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "omni-kube-controlplane-1". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

From what I understand reading this thread, there's no workaround for this in the meantime, right?

lethargosapatheia commented 1 year ago

I tried installing only the csi controller deployment first, but I've come across these logs (that might have existed initially too):

vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.945536568Z","caller":"vsphere/virtualcenter.go:171","msg":"failed to create new client with err: Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"36e48351-db5d-4deb-9681-9e8f7fca695b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).NewClient\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:171\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:284\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:259\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:645\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:234\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.945727725Z","caller":"vsphere/virtualcenter.go:285","msg":"failed to create govmomi client with err: Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"36e48351-db5d-4deb-9681-9e8f7fca695b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:285\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:259\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:645\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:234\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.946527487Z","caller":"vsphere/virtualcenter.go:287","msg":"failed to connect to vCenter using CA file: \"\"","TraceId":"36e48351-db5d-4deb-9681-9e8f7fca695b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:287\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:259\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:645\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:234\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.947057249Z","caller":"vsphere/virtualcenter.go:261","msg":"Cannot connect to vCenter with err: Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"36e48351-db5d-4deb-9681-9e8f7fca695b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:261\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:645\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:234\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.947548438Z","caller":"vsphere/virtualcenter.go:647","msg":"failed to connect to VirtualCenter host: \"vmcenter.example.com\". Err: Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"36e48351-db5d-4deb-9681-9e8f7fca695b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:647\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:234\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.947945134Z","caller":"vanilla/controller.go:236","msg":"failed to get vCenterInstance for vCenter \"vmcenter.example.com\"err=Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"36e48351-db5d-4deb-9681-9e8f7fca695b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:236\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.948286142Z","caller":"service/driver.go:189","msg":"failed to init controller. Error: failed to get vCenterInstance for vCenter \"vmcenter.example.com\"err=Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"734c882c-aa75-4f16-be5c-21146bdf9e4b","TraceId":"e0546129-339a-4946-a9cc-195f5b5549b3","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:189\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"info","time":"2023-04-24T16:38:27.948620496Z","caller":"service/driver.go:109","msg":"Configured: \"csi.vsphere.vmware.com\" with clusterFlavor: \"VANILLA\" and mode: \"controller\"","TraceId":"734c882c-aa75-4f16-be5c-21146bdf9e4b","TraceId":"e0546129-339a-4946-a9cc-195f5b5549b3"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-csi-controller {"level":"error","time":"2023-04-24T16:38:27.948939301Z","caller":"service/driver.go:203","msg":"failed to run the driver. Err: +failed to get vCenterInstance for vCenter \"vmcenter.example.com\"err=Post \"https://vmcenter.example.com:443/sdk\": dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused","TraceId":"734c882c-aa75-4f16-be5c-21146bdf9e4b","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:203\nmain.main\n\t/build/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
Stream closed EOF for vmware-system-csi/vsphere-csi-controller-68c65dbdd5-z6g85 (vsphere-csi-controller)
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-syncer {"level":"info","time":"2023-04-24T16:35:30.08636366Z","caller":"logger/logger.go:41","msg":"Setting default log level to :\"PRODUCTION\""}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-syncer {"level":"info","time":"2023-04-24T16:35:30.090380782Z","caller":"syncer/main.go:86","msg":"Version : v3.0.0","TraceId":"920d74a9-f82a-47d6-8a0d-20633809c584"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-syncer {"level":"info","time":"2023-04-24T16:35:30.090656443Z","caller":"syncer/main.go:103","msg":"Starting container with operation mode: METADATA_SYNC","TraceId":"920d74a9-f82a-47d6-8a0d-20633809c584"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-syncer {"level":"info","time":"2023-04-24T16:35:30.090779546Z","caller":"kubernetes/kubernetes.go:86","msg":"k8s client using in-cluster config","TraceId":"920d74a9-f82a-47d6-8a0d-20633809c584"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-syncer {"level":"info","time":"2023-04-24T16:35:30.091272287Z","caller":"kubernetes/kubernetes.go:395","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"920d74a9-f82a-47d6-8a0d-20633809c584"}
vsphere-csi-controller-68c65dbdd5-z6g85 vsphere-syncer {"level":"info","time":"2023-04-24T16:35:30.092658207Z","caller":"syncer/main.go:125","msg":"Starting the http server to expose Prometheus metrics..","TraceId":"920d74a9-f82a-47d6-8a0d-20633809c584"}

What stands out is:

dial tcp: lookup vmcenter.example.com on 127.0.0.1:53: read udp 127.0.0.1:38377->127.0.0.1:53: read: connection refused

From what I understand, the pod is trying to connect to a DNS server on localhost, which isn't responding. That doesn't make any sense to me, because it's supposed to be connecting to CoreDNS. The pod has its own IP, so it's using a separate network namespace. It's rather hard to follow the logic of it all.

divyenpatel commented 1 year ago

@lethargosapatheia Once you fix connectivity between the vSphere CSI controller Pod and the vCenter server, the "CSINodeTopology CRD not found" issue will be resolved.

divyenpatel commented 1 year ago

Basically, registration of the CRDs happens in the syncer container, and if it crashes before registering them, the node DaemonSet Pods will be unable to create the required CRD instances.

lethargosapatheia commented 1 year ago

@divyenpatel You're right, something isn't actually working properly, I had tinkered with the cluster a little bit before and the coredns service doesn't respond correctly, even if the pods themselves work. I'll have to have a look at that and get back if it all works ok. Thank you for your fast answer!

lethargosapatheia commented 1 year ago

Ok, I've actually mixed some things up. The thing is, there's no connectivity issue to the DNS. The service actually works fine. Entering the network namespace of a random container (csi-snapshotter) inside the same pod works perfectly:

nsenter -n -t 107665 dig A vmcenter.example.com
; <<>> DiG 9.18.12-0ubuntu0.22.04.1-Ubuntu <<>> A vmcenter.example.com @10.96.0.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18982
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 56dddadc2663ae20 (echoed)
;; QUESTION SECTION:
;vmcenter.example.com.      IN  A

;; ANSWER SECTION:
vmcenter.example.com.   30  IN  A   10.0.0.1

;; Query time: 4 msec
;; SERVER: 10.96.0.10#53(10.96.0.10) (UDP)
;; WHEN: Mon Apr 24 17:42:53 UTC 2023
;; MSG SIZE  rcvd: 91

Is the container explicitly trying to connect to a different DNS server (localhost) than the one it's supposed to (coredns service)? I know it's a stupid question, but I don't get the error at all :)

I also see that the deployment assumes there are three control-plane nodes. I have two control-plane nodes (and three etcd nodes), so I don't need more. Do you strictly need three controller pods, or would it work with two replicas too? Or should I maybe let them run on the worker nodes as well?

lethargosapatheia commented 1 year ago

Having looked again at that pod, I see that the bind mount of /etc/resolv.conf leads to a file on the host whose contents are:

nameserver 127.0.0.1

On a normal deployment, I should have something like:

nameserver 10.96.0.10
options ndots:5

On the exact same node I have a calico-kube-controller pod which also has the right resolver (10.96.0.10).
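For what it's worth, there are two common ways a pod can end up with the host's resolv.conf, offered here as possibilities rather than a diagnosis of this cluster: the pod runs with `hostNetwork: true` without the matching DNS policy, or kubelet's `--resolv-conf` points at a host file that lists 127.0.0.1 (the systemd-resolved stub). A sketch of the pod-spec fields involved:

```yaml
# Sketch only: a host-networked pod inherits the node's /etc/resolv.conf
# unless dnsPolicy is set as below; non-host-networked pods get the
# cluster DNS via the default "ClusterFirst" policy.
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
```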

lethargosapatheia commented 1 year ago

I've filed what I consider to be a separate issue here: https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2354. DNS doesn't seem to behave correctly inside the pod, but I'm not sure how this is happening.

divyenpatel commented 1 year ago

We will move the registration of the CRDs into the deployment YAML file so that there are no internal container dependencies while the Pod is coming up.

cc: @vdkotkar

divyenpatel commented 1 year ago

/assign @vdkotkar

k8s-ci-robot commented 1 year ago

@divyenpatel: GitHub didn't allow me to assign the following users: vdkotkar.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2284#issuecomment-1548680312):

> /assign @vdkotkar

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
vdkotkar commented 1 year ago

/assign vdkotkar

vu3oim commented 1 year ago

Hello Guys,

I am seeing the same problem as discussed in this issue, with both 3.0.0 and 3.0.2.

I also had a look at https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1661

I wanted to check whether there are other workarounds or a real fix for this, or whether I am making a mistake in configuring the vSphere CSI driver.

I am following: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-6DBD2645-FFCF-4076-80BE-AD44D7141521.html

More details =>

Pods are in CrashLoopBackOff:

vmware-system-csi   vsphere-csi-controller-5867b9fc45-5kft8                 0/7     Pending          
vmware-system-csi   vsphere-csi-controller-5867b9fc45-hc9c8                 0/7     Pending          
vmware-system-csi   vsphere-csi-controller-5867b9fc45-z7pqf                 0/7     Pending          
vmware-system-csi   vsphere-csi-node-8zlhk                                  2/3     CrashLoopBackOff 
vmware-system-csi   vsphere-csi-node-pr4bf                                  2/3     CrashLoopBackOff 
vmware-system-csi   vsphere-csi-node-t7m7b                                  2/3     CrashLoopBackOff

My setup is:

kubectl get no
NAME      STATUS   ROLES                       AGE   VERSION
rke2vm1   Ready    control-plane,etcd,master   41m   v1.25.10+rke2r1
rke2vm2   Ready    control-plane,etcd,master   37m   v1.25.10+rke2r1
rke2vm3   Ready    control-plane,etcd,master   37m   v1.25.10+rke2r1
kubectl -n vmware-system-csi logs vsphere-csi-node-8zlhk
Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I0809 15:57:23.745216       1 main.go:167] Version: v2.7.0
I0809 15:57:23.745254       1 main.go:168] Running node-driver-registrar in mode=registration
I0809 15:57:23.745845       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0809 15:57:23.745873       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0809 15:57:23.746532       1 main.go:199] Calling CSI driver to discover driver name
I0809 15:57:23.746545       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0809 15:57:23.746553       1 connection.go:184] GRPC request: {}
I0809 15:57:23.751808       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.0.2"}
I0809 15:57:23.751927       1 connection.go:187] GRPC error: <nil>
I0809 15:57:23.751939       1 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
I0809 15:57:23.752785       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0809 15:57:23.753406       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0809 15:57:23.753607       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0809 15:57:25.116945       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0809 15:57:25.117192       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0809 15:57:25.134647       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "rke2vm2". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0809 15:57:25.134679       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "rke2vm2". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

When trying with CSI v3.0.2, I have used as-is this => https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.0.2/manifests/vanilla/vsphere-csi-driver.yaml

Any help please?

vdkotkar commented 1 year ago

Hi Venkat, I see that your controller pods are in the Pending state, which is eventually causing the node pods to go into CrashLoopBackOff. Please check why your controller pods are Pending. You can paste the output of kubectl describe pod vsphere-csi-controller-5867b9fc45-5kft8 -n vmware-system-csi. Check whether some affinity or anti-affinity rules on the pod are causing this issue.

Also, in the kubectl get nodes output I can see only master nodes. Don't you have any worker nodes?

vu3oim commented 1 year ago

Hi Vipul, thanks for the reply. I got past that problem and finished successfully (vSphere CSI driver v3.0.2). I have a 3-node RKE2 cluster with no dedicated compute nodes; control plane and compute run on the same VMs.

But I had a question..

In this https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.0.2/manifests/vanilla/vsphere-csi-driver.yaml YAML, I see =>

      nodeSelector:
        node-role.kubernetes.io/control-plane: ""

Because of that, those pods were not coming up.

Labels on the nodes are like this =>

# kubectl get nodes --show-labels
NAME      STATUS   ROLES                       AGE   VERSION           LABELS
rke2vm1   Ready    control-plane,etcd,master   67m   v1.25.10+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=rke2vm1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,product=mantaray
rke2vm2   Ready    control-plane,etcd,master   64m   v1.25.10+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=rke2vm2,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,product=mantaray
rke2vm3   Ready    control-plane,etcd,master   63m   v1.25.10+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=rke2vm3,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,product=mantaray

So, I changed to =>

      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"

Then all PODS came up and everything is OK.

I wonder why the earlier selector did not select my nodes. I was thinking that since the selector value was empty, it would act as a wildcard? Any help?
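To address the wildcard question: a nodeSelector is an exact key/value match, so node-role.kubernetes.io/control-plane: "" matches only nodes whose label value is literally the empty string (the kubeadm convention), while RKE2 sets the value to "true", so nothing matches. Matching on label presence regardless of value requires node affinity with the Exists operator; a hedged sketch:

```yaml
# Sketch: schedule onto any node that *has* the control-plane label,
# whatever its value ("" on kubeadm, "true" on RKE2).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
```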

Anyway, thanks for all the help.

divyenpatel commented 1 year ago

@vu3oim only the key is needed in the label node-role.kubernetes.io/control-plane https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane

How are you deploying k8s cluster?

vu3oim commented 1 year ago

Hi Divyen, I am installing RKE2

RKE2 config file => /etc/rancher/rke2/config.yaml =>

token: mytoken
write-kubeconfig-mode: "0644"
cluster-cidr: "10.128.0.0/14,fd02::/48"
service-cidr: "172.30.0.0/16,fd03::/112"
tls-san:
  - rke2vm1
  - rke2vm2
  - rke2vm3
node-label:
  - "product=test"
disable-cloud-controller: "true"
debug: true

It's a 3-node K8s cluster (running both control plane and compute)

# kubectl get nodes --show-labels
NAME      STATUS   ROLES                       AGE   VERSION           LABELS
rke2vm1   Ready    control-plane,etcd,master   67m   v1.25.10+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=rke2vm1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,product=test
rke2vm2   Ready    control-plane,etcd,master   64m   v1.25.10+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=rke2vm2,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,product=test
rke2vm3   Ready    control-plane,etcd,master   63m   v1.25.10+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=rke2vm3,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-6gb.os-unknown,product=test

Then I just follow the vSphere CPI/CSI 3.0 documentation instructions.

Strangely, I needed to edit the vSphere CSI driver YAML (vsphere-csi-driver.yaml) to use node-role.kubernetes.io/control-plane: "true" to get a successful installation of the vSphere CSI driver.

qdrddr commented 12 months ago

Getting a similar issue:

/kind bug

What happened: Trying to install vSphere CSI drivers v3.0.2 with RKE2 cluster v1.26.9+rke2r1

csi-vsphere.conf is added to the secret with kubectl create secret generic vsphere-config-secret --from-file=/etc/kubernetes/secret.conf --namespace=vmware-system-csi. I can ping the vCenter's IP from any of the k8s nodes.

[Global]
cluster-id = "my-k8s-vmw"
cluster-distribution = "native"

[VirtualCenter "172.16.1.1"]
insecure-flag = "true"
user = "k8s-vsphere-csi@local"
password = "mypwd"
port = "443"
datacenters = "mydc"

After deploying the CSI driver:

kubectl --namespace=vmware-system-csi get all
NAME                                          READY   STATUS             RESTARTS      AGE
pod/vsphere-csi-controller-5d88977cdf-6bb7w   0/7     Pending            0             12m
pod/vsphere-csi-controller-5d88977cdf-gfq52   0/7     Pending            0             12m
pod/vsphere-csi-controller-78cb9ff564-x8djs   0/7     Pending            0             12m
pod/vsphere-csi-node-2pg4x                    2/3     CrashLoopBackOff   7 (25s ago)   11m
pod/vsphere-csi-node-ffhc6                    2/3     CrashLoopBackOff   7 (35s ago)   11m
pod/vsphere-csi-node-grbwh                    2/3     CrashLoopBackOff   7 (41s ago)   11m
pod/vsphere-csi-node-k8jkb                    2/3     CrashLoopBackOff   7 (31s ago)   11m
pod/vsphere-csi-node-w8bps                    2/3     CrashLoopBackOff   7 (23s ago)   11m
pod/vsphere-csi-node-xkxvg                    2/3     CrashLoopBackOff   7 (27s ago)   11m

NAME                             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/vsphere-csi-controller   ClusterIP   10.43.84.182   <none>        2112/TCP,2113/TCP   3d5h

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
daemonset.apps/vsphere-csi-node           6         6         0       6            0           kubernetes.io/os=linux     3d5h
daemonset.apps/vsphere-csi-node-windows   0         0         0       0            0           kubernetes.io/os=windows   3d5h

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vsphere-csi-controller   0/3     1            0           3d5h

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/vsphere-csi-controller-5d88977cdf   2         2         0       3d5h
replicaset.apps/vsphere-csi-controller-78cb9ff564   1         1         0       44m

Getting the error: failed to get CsiNodeTopology

Controller is not ready

kubectl get deployment --namespace=vmware-system-csi
NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
vsphere-csi-controller   0/3     1            0           3d5h
kubectl logs -n vmware-system-csi vsphere-csi-node-xkxvg
Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I1009 02:18:59.471111       1 main.go:167] Version: v2.7.0
I1009 02:18:59.471155       1 main.go:168] Running node-driver-registrar in mode=registration
I1009 02:18:59.471675       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1009 02:18:59.471696       1 connection.go:154] Connecting to unix:///csi/csi.sock
I1009 02:18:59.472696       1 main.go:199] Calling CSI driver to discover driver name
I1009 02:18:59.472764       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I1009 02:18:59.472785       1 connection.go:184] GRPC request: {}
I1009 02:18:59.474901       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.0.2"}
I1009 02:18:59.475020       1 connection.go:187] GRPC error: <nil>
I1009 02:18:59.475050       1 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
I1009 02:18:59.475132       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I1009 02:18:59.475484       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I1009 02:18:59.475806       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1009 02:19:00.760178       1 main.go:102] Received GetInfo call: &InfoRequest{}
I1009 02:19:00.760487       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I1009 02:19:00.779742       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "so-m002". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E1009 02:19:00.779785       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "so-m002". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

Controller is not starting

kubectl logs -n vmware-system-csi  vsphere-csi-controller-5d88977cdf-6bb7w
Defaulted container "csi-attacher" out of: csi-attacher, csi-resizer, vsphere-csi-controller, liveness-probe, vsphere-syncer, csi-provisioner, csi-snapshotter

The CSINodeTopology resource type does not exist:

kubectl get csinodetopologies
error: the server doesn't have a resource type "csinodetopologies"

6 nodes: 3 masters & 3 workers

kubectl get nodes
NAME      STATUS   ROLES                       AGE   VERSION
so-m001   Ready    control-plane,etcd,master   12d   v1.26.9+rke2r1
so-m002   Ready    control-plane,etcd,master   12d   v1.26.9+rke2r1
so-m003   Ready    control-plane,etcd,master   12d   v1.26.9+rke2r1
so-w001   Ready    <none>                      12d   v1.26.9+rke2r1
so-w002   Ready    <none>                      12d   v1.26.9+rke2r1
so-w003   Ready    <none>                      12d   v1.26.9+rke2r1

Examples of SC & PVC

Example of storage class & PVC with block device:

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: example-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "vSAN Default Storage Policy"  #Optional Parameter
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: example-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-pvc
spec:
  containers:
  - name: test-container
    image: gcr.io/google_containers/busybox:1.24
    command: ["/bin/sh", "-c", "echo 'hello from appl' >> /mnt/volume1/index.html && while true ; do sleep 2 ; done"]
    volumeMounts:
    - name: volume1
      mountPath: /mnt/volume1
  restartPolicy: Always
  volumes:
  - name: volume1
    persistentVolumeClaim:
      claimName: example-pvc

Example of storage class & PVC with shared storage via NFS. The NFS service is enabled in vCenter and an IP is available; the VMs are added to the VLAN and can ping the NFS service IP on vSAN:

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi.vsphere.vmware.com
parameters:
  csi.storage.k8s.io/fstype: "nfs4"
  storagepolicyname: "ftt-1-vsan-eq"  #Optional Parameter
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: nfs-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-nfs-pvc1
spec:
  containers:
  - name: test-container-nfs
    image: gcr.io/google_containers/busybox:1.24
    command: ["/bin/sh", "-c", "echo 'hello from appl' >> /mnt/volume1/index.html && while true ; do sleep 2 ; done"]
    volumeMounts:
    - name: file-volume
      mountPath: /mnt/volume1
  restartPolicy: Always
  volumes:
  - name: file-volume
    persistentVolumeClaim:
      claimName: nfs-pvc
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-nfs-pvc2
spec:
  containers:
  - name: test-container-nfs
    image: gcr.io/google_containers/busybox:1.24
    command: ["/bin/sh", "-c", "echo 'hello from appl' >> /mnt/volume1/index.html && while true ; do sleep 2 ; done"]
    volumeMounts:
    - name: file-volume
      mountPath: /mnt/volume1
  restartPolicy: Always
  volumes:
  - name: file-volume
    persistentVolumeClaim:
      claimName: nfs-pvc

Environment:

csi-vsphere version: v3.0.2
Kubernetes version: v1.26.9+rke2r1
vSphere version: 7.0.3
OS (e.g. from /etc/os-release): Ubuntu 22.04
Kernel (e.g. uname -a): 5.15.0-86-generic
Install tools: NA
Others: MetalLB load balancer with BGP IPv4 & IPv6
CNI: Cilium with DualStack config

vdkotkar commented 12 months ago

I observed that your controller pods are in Pending state, which in turn causes the node pods to go into CrashLoopBackOff. Please check why your controller pods are Pending. You can paste the output of kubectl describe pod vsphere-csi-controller-5d88977cdf-6bb7w -n vmware-system-csi. Check whether affinity or anti-affinity rules on the pod are causing this issue.

qdrddr commented 12 months ago

@vdkotkar I'm using a freshly installed Rancher RKE2 k8s cluster; almost nothing is configured on it. I have k8s in DualStack with Cilium CNI and a MetalLB load balancer using BGP with IPv4+IPv6, and I installed the vSphere CSI driver.

describe pod vsphere-csi-controller

kubectl describe pod vsphere-csi-controller-5d88977cdf-6bb7w -n vmware-system-csi
Name:                 vsphere-csi-controller-5d88977cdf-6bb7w
Namespace:            vmware-system-csi
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      vsphere-csi-controller
Node:                 <none>
Labels:               app=vsphere-csi-controller
                      pod-template-hash=5d88977cdf
                      role=vsphere-csi
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/vsphere-csi-controller-5d88977cdf
Containers:
  csi-attacher:
    Image:      k8s.gcr.io/sig-storage/csi-attacher:v4.2.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
      --kube-api-qps=100
      --kube-api-burst=100
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
  csi-resizer:
    Image:      k8s.gcr.io/sig-storage/csi-resizer:v1.7.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --handle-volume-inuse-error=false
      --csi-address=$(ADDRESS)
      --kube-api-qps=100
      --kube-api-burst=100
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
  vsphere-csi-controller:
    Image:       gcr.io/cloud-provider-vsphere/csi/release/driver:v3.0.2
    Ports:       9808/TCP, 2112/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Liveness:  http-get http://:healthz/healthz delay=30s timeout=10s period=180s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:                     unix:///csi/csi.sock
      X_CSI_MODE:                       controller
      X_CSI_SPEC_DISABLE_LEN_CHECK:     true
      X_CSI_SERIAL_VOL_ACCESS_TIMEOUT:  3m
      VSPHERE_CSI_CONFIG:               /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                     PRODUCTION
      INCLUSTER_CLIENT_QPS:             100
      INCLUSTER_CLIENT_BURST:           100
      CSI_NAMESPACE:                    vmware-system-csi (v1:metadata.namespace)
    Mounts:
      /csi from socket-dir (rw)
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
  liveness-probe:
    Image:      k8s.gcr.io/sig-storage/livenessprobe:v2.9.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --csi-address=/csi/csi.sock
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
  vsphere-syncer:
    Image:      gcr.io/cloud-provider-vsphere/csi/release/syncer:v3.0.2
    Port:       2113/TCP
    Host Port:  0/TCP
    Args:
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Environment:
      FULL_SYNC_INTERVAL_MINUTES:  30
      VSPHERE_CSI_CONFIG:          /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                PRODUCTION
      INCLUSTER_CLIENT_QPS:        100
      INCLUSTER_CLIENT_BURST:      100
      GODEBUG:                     x509sha1=1
      CSI_NAMESPACE:               vmware-system-csi (v1:metadata.namespace)
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
  csi-provisioner:
    Image:      k8s.gcr.io/sig-storage/csi-provisioner:v3.4.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --kube-api-qps=100
      --kube-api-burst=100
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
      --default-fstype=ext4
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
  csi-snapshotter:
    Image:      k8s.gcr.io/sig-storage/csi-snapshotter:v6.2.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --kube-api-qps=100
      --kube-api-burst=100
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xmcst (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vsphere-config-secret
    Optional:    false
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-xmcst:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m45s (x133 over 11h)  default-scheduler  0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..

I can see the issue: 0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling. But I do not understand why it is happening or how to fix it. Would you be so kind as to help me in this direction?
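To make the mismatch concrete, here is a toy comparison (no cluster needed). The two values are assumptions: RKE2 reportedly sets the control-plane label to "true", while the stock manifest's nodeSelector expects an empty value:

```shell
# Toy illustration of why the scheduler reports
# "didn't match Pod's node affinity/selector".
node_label_value="true"   # assumed: what RKE2 puts on node-role.kubernetes.io/control-plane
selector_value=""         # assumed: what the stock vsphere-csi-driver.yaml nodeSelector expects
if [ "$node_label_value" = "$selector_value" ]; then
  echo "selector matches node label"
else
  echo "selector does not match node label"
fi
```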

Taint

kubectl describe nodes | egrep "Taints:|Name:"
Name:               so-m001
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               so-m002
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               so-m003
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               so-w001
Taints:             <none>
Name:               so-w002
Taints:             <none>
Name:               so-w003
Taints:             <none>

Labels

kubectl get nodes --show-labels
NAME      STATUS   ROLES                       AGE   VERSION          LABELS
so-m001   Ready    control-plane,etcd,master   13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-m001,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=rke2
so-m002   Ready    control-plane,etcd,master   13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-m002,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=rke2
so-m003   Ready    control-plane,etcd,master   13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-m003,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=rke2
so-w001   Ready    <none>                      13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-w001,kubernetes.io/os=linux,node.kubernetes.io/instance-type=rke2
so-w002   Ready    <none>                      13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-w002,kubernetes.io/os=linux,node.kubernetes.io/instance-type=rke2
so-w003   Ready    <none>                      13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-w003,kubernetes.io/os=linux,node.kubernetes.io/instance-type=rke2
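Given those labels (node-role.kubernetes.io/control-plane=true on the masters), one hedged way to unblock the Pending controller pods is to patch the Deployment's nodeSelector so its value matches what RKE2 actually sets. A sketch, with a hypothetical patch filename:

```yaml
# controller-nodeselector-patch.yaml (hypothetical filename)
# Aligns the controller's nodeSelector with RKE2's label value "true".
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
```

Applied with something like: kubectl -n vmware-system-csi patch deployment vsphere-csi-controller --patch-file controller-nodeselector-patch.yaml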
qdrddr commented 12 months ago

I also tested with csi v3.1.0 and got the same result.

Before installing the CSI driver, I tainted the master nodes so they match the pod toleration:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.1.0/manifests/vanilla/namespace.yaml

mkdir -p /root/vsan
cat <<'EOXF' > /root/vsan/csi-vsphere.conf
[Global]
cluster-id = "my-k8s-vmw"
cluster-distribution = "native"

[VirtualCenter "172.16.1.1"]
insecure-flag = "true"
user = "k8s-vsphere-csi@local"
password = "password"
port = "443"
datacenters = "dc01"
EOXF

kubectl create secret generic vsphere-config-secret --from-file=/root/vsan/csi-vsphere.conf --namespace=vmware-system-csi

kubectl taint nodes so-m001 node-role.kubernetes.io/control-plane=:NoSchedule
kubectl taint nodes so-m002 node-role.kubernetes.io/control-plane=:NoSchedule
kubectl taint nodes so-m003 node-role.kubernetes.io/control-plane=:NoSchedule

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.1.0/manifests/vanilla/vsphere-csi-driver.yaml

controller pod: didn't match Pod's node affinity/selector

kubectl get pods,sc,pvc,pv -n vmware-system-csi
NAME                                          READY   STATUS             RESTARTS      AGE
pod/vsphere-csi-controller-699f9799f8-7pq89   0/7     Pending            0             4m25s
pod/vsphere-csi-controller-699f9799f8-9vbcd   0/7     Pending            0             4m25s
pod/vsphere-csi-controller-699f9799f8-pdcc2   0/7     Pending            0             4m25s
pod/vsphere-csi-node-flwvd                    2/3     CrashLoopBackOff   5 (68s ago)   4m25s
pod/vsphere-csi-node-g6g5b                    2/3     CrashLoopBackOff   5 (63s ago)   4m25s
pod/vsphere-csi-node-hmgzf                    2/3     CrashLoopBackOff   5 (76s ago)   4m25s
pod/vsphere-csi-node-wt7sp                    2/3     CrashLoopBackOff   5 (58s ago)   4m25s
pod/vsphere-csi-node-wtgwj                    2/3     CrashLoopBackOff   5 (71s ago)   4m25s
pod/vsphere-csi-node-zvdl7                    2/3     CrashLoopBackOff   5 (80s ago)   4m25s

kubectl describe pod vsphere-csi-controller-699f9799f8-7pq89 -n vmware-system-csi
Name:                 vsphere-csi-controller-699f9799f8-7pq89
Namespace:            vmware-system-csi
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      vsphere-csi-controller
Node:                 <none>
Labels:               app=vsphere-csi-controller
                      pod-template-hash=699f9799f8
                      role=vsphere-csi
Annotations:          <none>
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/vsphere-csi-controller-699f9799f8
Containers:
  csi-attacher:
    Image:      registry.k8s.io/sig-storage/csi-attacher:v4.3.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
      --kube-api-qps=100
      --kube-api-burst=100
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
  csi-resizer:
    Image:      registry.k8s.io/sig-storage/csi-resizer:v1.8.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --handle-volume-inuse-error=false
      --csi-address=$(ADDRESS)
      --kube-api-qps=100
      --kube-api-burst=100
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
  vsphere-csi-controller:
    Image:       gcr.io/cloud-provider-vsphere/csi/release/driver:v3.1.0
    Ports:       9808/TCP, 2112/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Liveness:  http-get http://:healthz/healthz delay=30s timeout=10s period=180s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:                     unix:///csi/csi.sock
      X_CSI_MODE:                       controller
      X_CSI_SPEC_DISABLE_LEN_CHECK:     true
      X_CSI_SERIAL_VOL_ACCESS_TIMEOUT:  3m
      VSPHERE_CSI_CONFIG:               /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                     PRODUCTION
      INCLUSTER_CLIENT_QPS:             100
      INCLUSTER_CLIENT_BURST:           100
      CSI_NAMESPACE:                    vmware-system-csi (v1:metadata.namespace)
    Mounts:
      /csi from socket-dir (rw)
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
  liveness-probe:
    Image:      registry.k8s.io/sig-storage/livenessprobe:v2.10.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --csi-address=/csi/csi.sock
    Environment:  <none>
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
  vsphere-syncer:
    Image:      gcr.io/cloud-provider-vsphere/csi/release/syncer:v3.1.0
    Port:       2113/TCP
    Host Port:  0/TCP
    Args:
      --leader-election
      --leader-election-lease-duration=30s
      --leader-election-renew-deadline=20s
      --leader-election-retry-period=10s
      --fss-name=internal-feature-states.csi.vsphere.vmware.com
      --fss-namespace=$(CSI_NAMESPACE)
    Environment:
      FULL_SYNC_INTERVAL_MINUTES:  30
      VSPHERE_CSI_CONFIG:          /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                PRODUCTION
      INCLUSTER_CLIENT_QPS:        100
      INCLUSTER_CLIENT_BURST:      100
      GODEBUG:                     x509sha1=1
      CSI_NAMESPACE:               vmware-system-csi (v1:metadata.namespace)
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
  csi-provisioner:
    Image:      registry.k8s.io/sig-storage/csi-provisioner:v3.5.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --kube-api-qps=100
      --kube-api-burst=100
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
      --default-fstype=ext4
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
  csi-snapshotter:
    Image:      registry.k8s.io/sig-storage/csi-snapshotter:v6.2.2
    Port:       <none>
    Host Port:  <none>
    Args:
      --v=4
      --kube-api-qps=100
      --kube-api-burst=100
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --leader-election-lease-duration=120s
      --leader-election-renew-deadline=60s
      --leader-election-retry-period=30s
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xm6k4 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vsphere-config-secret
    Optional:    false
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-xm6k4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  5m3s  default-scheduler  0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling..

Labels & Taints I have on the cluster

kubectl describe nodes | egrep "Taints:|Name:"
Name:               so-m001
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               so-m002
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               so-m003
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
Name:               so-w001
Taints:             <none>
Name:               so-w002
Taints:             <none>
Name:               so-w003
Taints:             <none>

kubectl get nodes --show-labels
NAME      STATUS   ROLES                       AGE   VERSION          LABELS
so-m001   Ready    control-plane,etcd,master   13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-m001,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=rke2
so-m002   Ready    control-plane,etcd,master   13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-m002,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=rke2
so-m003   Ready    control-plane,etcd,master   13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-m003,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=true,node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/master=true,node.kubernetes.io/instance-type=rke2
so-w001   Ready    <none>                      13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-w001,kubernetes.io/os=linux,node.kubernetes.io/instance-type=rke2
so-w002   Ready    <none>                      13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-w002,kubernetes.io/os=linux,node.kubernetes.io/instance-type=rke2
so-w003   Ready    <none>                      13d   v1.26.9+rke2r1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=rke2,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=so-w003,kubernetes.io/os=linux,node.kubernetes.io/instance-type=rke2

failed to get CsiNodeTopology from node

kubectl logs -n vmware-system-csi vsphere-csi-node-flwvd
Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I1009 21:08:03.930857       1 main.go:167] Version: v2.8.0
I1009 21:08:03.930911       1 main.go:168] Running node-driver-registrar in mode=registration
I1009 21:08:03.931414       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I1009 21:08:03.931470       1 connection.go:164] Connecting to unix:///csi/csi.sock
I1009 21:08:03.932081       1 main.go:199] Calling CSI driver to discover driver name
I1009 21:08:03.932093       1 connection.go:193] GRPC call: /csi.v1.Identity/GetPluginInfo
I1009 21:08:03.932096       1 connection.go:194] GRPC request: {}
I1009 21:08:03.933865       1 connection.go:200] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.1.0"}
I1009 21:08:03.933873       1 connection.go:201] GRPC error: <nil>
I1009 21:08:03.933879       1 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
I1009 21:08:03.933919       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I1009 21:08:03.934026       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I1009 21:08:03.934616       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I1009 21:08:05.325659       1 main.go:102] Received GetInfo call: &InfoRequest{}
I1009 21:08:05.326700       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I1009 21:08:05.344075       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "so-m003". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E1009 21:08:05.344095       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "so-m003". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.
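As far as I can tell, this registrar error is a symptom rather than the root cause: the `csinodetopologies.cns.vmware.com` CRD is created at runtime by the driver (the syncer container in the controller pod), so while the controller pods stay Pending the CRD never appears and every node registrar fails. A quick check against the cluster (hypothetical commands, run on a live cluster):

```shell
# If the controller never came up, the CRD the node registrar needs
# will be missing -- expect a NotFound error here:
kubectl get crd csinodetopologies.cns.vmware.com

# Once controller scheduling is fixed and the CRD exists, the node
# pods re-register on their own, or can be nudged with:
kubectl -n vmware-system-csi rollout restart daemonset/vsphere-csi-node
```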

Controller pod logs (defaulted to the csi-attacher container) are empty, since the pod is still Pending:

kubectl logs -n vmware-system-csi vsphere-csi-controller-699f9799f8-7pq89
Defaulted container "csi-attacher" out of: csi-attacher, csi-resizer, vsphere-csi-controller, liveness-probe, vsphere-syncer, csi-provisioner, csi-snapshotter

And in the manifest file
https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/v3.1.0/manifests/vanilla/vsphere-csi-driver.yaml

I see only this podAntiAffinity rule, which is supposed to prevent multiple controller pods from running on the same node, and tolerations that are supposed to let the pods run on the control-plane (master) nodes:

spec:
      priorityClassName: system-cluster-critical # Guarantees scheduling for critical system pods
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - vsphere-csi-controller
              topologyKey: "kubernetes.io/hostname"
      serviceAccountName: vsphere-csi-controller
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

How do I fix `Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.` and `FailedScheduling ... 0/6 nodes are available: 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.`?

vdkotkar commented 11 months ago

I also tested with csi v3.1.0 and got the same result.

Before installing the CSI driver, I tainted the master nodes to match the pod tolerations:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.1.0/manifests/vanilla/namespace.yaml

mkdir -p /root/vsan
cat <<'EOXF' > /root/vsan/csi-vsphere.conf
[Global]
cluster-id = "my-k8s-vmw"
cluster-distribution = "native"

[VirtualCenter "172.16.1.1"]
insecure-flag = "true"
user = "k8s-vsphere-csi@local"
password = "password"
port = "443"
datacenters = "dc01"
EOXF

kubectl create secret generic vsphere-config-secret --from-file=/root/vsan/csi-vsphere.conf --namespace=vmware-system-csi

kubectl taint nodes so-m001 node-role.kubernetes.io/control-plane=:NoSchedule
kubectl taint nodes so-m002 node-role.kubernetes.io/control-plane=:NoSchedule
kubectl taint nodes so-m003 node-role.kubernetes.io/control-plane=:NoSchedule

kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/vsphere-csi-driver/v3.1.0/manifests/vanilla/vsphere-csi-driver.yaml

controller pod: didn't match Pod's node affinity/selector

kubectl get pods,sc,pvc,pv -n vmware-system-csi
NAME                                          READY   STATUS             RESTARTS      AGE
pod/vsphere-csi-controller-699f9799f8-7pq89   0/7     Pending            0             4m25s
pod/vsphere-csi-controller-699f9799f8-9vbcd   0/7     Pending            0             4m25s
pod/vsphere-csi-controller-699f9799f8-pdcc2   0/7     Pending            0             4m25s
pod/vsphere-csi-node-flwvd                    2/3     CrashLoopBackOff   5 (68s ago)   4m25s
pod/vsphere-csi-node-g6g5b                    2/3     CrashLoopBackOff   5 (63s ago)   4m25s
pod/vsphere-csi-node-hmgzf                    2/3     CrashLoopBackOff   5 (76s ago)   4m25s
pod/vsphere-csi-node-wt7sp                    2/3     CrashLoopBackOff   5 (58s ago)   4m25s
pod/vsphere-csi-node-wtgwj                    2/3     CrashLoopBackOff   5 (71s ago)   4m25s
pod/vsphere-csi-node-zvdl7                    2/3     CrashLoopBackOff   5 (80s ago)   4m25s


@qdrddr After looking at the labels on your nodes, I think you need to change the nodeSelector in the deployment YAML. The current nodeSelector is:

      nodeSelector:
        node-role.kubernetes.io/control-plane: ""

Try to change it as mentioned below and see if it works:

      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"

In my observation, an RKE cluster's master nodes carry the label node-role.kubernetes.io/control-plane=true, whereas a normal on-prem cluster's master nodes have node-role.kubernetes.io/control-plane= (empty value).

Please let us know whether pod scheduling works after this change. I recommend deleting the old deployment first and then redeploying the CSI driver with the change in place.
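If you would rather patch the live Deployment than edit the manifest, a strategic-merge patch along these lines should work (a sketch; the file name is arbitrary):

```yaml
# nodeselector-patch.yaml (hypothetical file name)
# Overwrites the controller's nodeSelector so it matches RKE2's
# "true"-valued control-plane label.
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
```

Applied with `kubectl -n vmware-system-csi patch deployment vsphere-csi-controller --patch-file nodeselector-patch.yaml`.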

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

lethargosapatheia commented 7 months ago

I am encountering the same issue. I'm running Kubernetes 1.29.1 with the --cloud-provider=external flag and vsphere-csi 3.0.0, and I have the same problem as gabrieletosca. Because the cloud provider is set to external, the nodes are tainted with:

node.cloudprovider.kubernetes.io/uninitialized

The csi-controller pods are stuck in Pending and won't be scheduled precisely because of this taint:

0/5 nodes are available: 5 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true

And the csi-node pods don't start because of the error in the issue title. So how are you supposed to get out of this cycle?

vdkotkar commented 6 months ago

> I am encountering the same issue. I'm running kubernetes version 1.29.1 with the --cloud-provider=external flag and running vsphere-csi version 3.0.0. I have the same problem as gabrieletosca does. Because I'm setting the cloud provider to external, the nodes are tainted with:
>
> node.cloudprovider.kubernetes.io/uninitialized
>
> The csi-controller are in pending state and won't start exactly because of this taint:
>
> 0/5 nodes are available: 5 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true
>
> And the csi-node pods don't start because of the error mentioned in the topic. So how are you supposed to get out of this issue?

I think you need to add the following toleration to your CSI driver deployment YAML (the vsphere-csi-controller Deployment), so that the csi-controller pods can be scheduled on the control-plane nodes before the external cloud provider has initialized them:

      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
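Putting it together, the controller Deployment's toleration list would look something like this (a sketch; the last entry is the addition):

```yaml
# Tolerations for the vsphere-csi-controller Deployment.
tolerations:
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  # Addition: tolerate the taint the external cloud provider leaves on
  # nodes it has not yet initialized, so the controller can start first.
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
```

Note that when `operator` is omitted it defaults to `Equal`, so `value: "true"` must match the taint's value exactly.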
k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 5 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2284#issuecomment-2047158013):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.