kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

failed to get CsiNodeTopology for the node #1661

Closed: bnason closed this issue 1 year ago

bnason commented 2 years ago

What happened: The node-driver-registrar container in the vsphere-csi-node DaemonSet fails with "failed to get CsiNodeTopology for the node":

I0322 12:55:25.883806       1 main.go:166] Version: v2.5.0
I0322 12:55:25.883841       1 main.go:167] Running node-driver-registrar in mode=registration
I0322 12:55:25.884289       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0322 12:55:25.884310       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0322 12:55:25.884693       1 main.go:198] Calling CSI driver to discover driver name
I0322 12:55:25.884717       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0322 12:55:25.884721       1 connection.go:184] GRPC request: {}
I0322 12:55:25.886858       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v2.5.0"}
I0322 12:55:25.886926       1 connection.go:187] GRPC error: <nil>
I0322 12:55:25.886933       1 main.go:208] CSI driver name: "csi.vsphere.vmware.com"
I0322 12:55:25.886971       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0322 12:55:25.887559       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0322 12:55:25.887693       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0322 12:55:27.616658       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0322 12:55:27.617124       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0322 12:55:27.636091       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "talos-10-120-8-82". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0322 12:55:27.636112       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "talos-10-120-8-82". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

What you expected to happen: The only information I can find on CSINodeTopology with respect to this driver is in the guide Deploying vSphere Container Storage Plug-in with Topology; however, I do NOT have the two arguments for the external-provisioner sidecar uncommented as that guide instructs. Beyond that, I can't even locate a CSINodeTopology cns.vmware.com/v1alpha1 CRD in my cluster.

How to reproduce it (as minimally and precisely as possible): Deploy the vsphere-csi-driver as instructed in Install vSphere Container Storage Plug-in.

Anything else we need to know?:

Environment:

/kind bug

shalini-b commented 2 years ago

@bnason Can you describe the vsphere-csi-controller pod running in the vmware-system-csi namespace and paste the output here? Can you also paste the output of the command kubectl logs <vsphere-csi-controller-pod-name> -n vmware-system-csi -c vsphere-syncer from all your controller pods?
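For reference, the two commands being requested look like this (the pod name is a placeholder):

kubectl describe pod <vsphere-csi-controller-pod-name> -n vmware-system-csi
kubectl logs <vsphere-csi-controller-pod-name> -n vmware-system-csi -c vsphere-syncer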

shalini-b commented 2 years ago

/assign

addyess commented 2 years ago

I've encountered this same issue. I see a vsphere-syncer container within the vsphere-csi-controller-797d86f788-7lxgd pod, but those pods are all Pending.

❯ kubectl get pods -n vmware-system-csi
NAME                                      READY   STATUS             RESTARTS      AGE
vsphere-csi-controller-797d86f788-7lxgd   0/7     Pending            0             19m
vsphere-csi-controller-797d86f788-gfc5d   0/7     Pending            0             19m
vsphere-csi-controller-797d86f788-n9vzb   0/7     Pending            0             19m
vsphere-csi-node-bm76f                    2/3     CrashLoopBackOff   6 (56s ago)   7m29s
vsphere-csi-node-grs57                    2/3     CrashLoopBackOff   6 (45s ago)   7m29s

In my case they are Pending due to:

Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  9m26s                 default-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  9m55s (x10 over 19m)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  82s (x7 over 8m22s)   default-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector.

addyess commented 2 years ago

Seems the issue is somewhere in here

    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - vsphere-csi-controller
              topologyKey: "kubernetes.io/hostname"
      serviceAccountName: vsphere-csi-controller
      nodeSelector:
        node-role.kubernetes.io/master: ""

Current taints on the nodes appear correct, so the node selectors should work:

❯ kubectl describe nodes | egrep "Taints:"
Taints:             node-role.kubernetes.io/master:NoSchedule
Taints:             <none>

addyess commented 2 years ago

Does it have anything to do with replicas: 3 on the vsphere-csi-controller deployment when there's only 1 master node?

related commit https://github.com/kubernetes-sigs/vsphere-csi-driver/commit/01ec59a33ad257911e600c4646ffe822e5aa318a

addyess commented 2 years ago

Looking over my nodes' labels, I noticed that the node selector doesn't actually match any of the nodes' labels:

❯ kubectl describe nodes | egrep "Labels:" -A 10
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=vsphere-vm.cpu-2.mem-4gb.os-ubuntu
                    beta.kubernetes.io/os=linux
                    juju-application=kubernetes-control-plane
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=juju-ab364b-0
                    kubernetes.io/os=linux
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.246.154.99
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 22 Mar 2022 12:37:53 -0500
--
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=vsphere-vm.cpu-4.mem-4gb.os-ubuntu
                    beta.kubernetes.io/os=linux
                    juju-application=kubernetes-worker
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=juju-ab364b-1
                    kubernetes.io/os=linux
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.246.154.111
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 22 Mar 2022 12:38:38 -0500

By labeling the node appropriately, I was able to get the deployment running.
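For anyone following along, the label the nodeSelector above expects can be added with something like this (the node name is from my environment; yours will differ):

# hypothetical example: give the control-plane node the master role label
kubectl label node juju-ab364b-0 node-role.kubernetes.io/master=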

addyess commented 2 years ago

However, I'm still having an issue with the driver not being able to identify the nodes:

Couldn't find VM instance with nodeUUID 0c712942-5d15-0d37-83c0-6120d7ca04c5, failed to discover with err: virtual machine wasn't found
...
Couldn't find VM instance with nodeUUID 98512942-a570-bd66-20a2-e1d93935175f, failed to discover with err: virtual machine wasn't found

and the vsphere-csi-node-* pods are in CrashLoopBackOff:

I0322 21:47:06.374636       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "juju-ab364b-1". Error: "failed to retrieve nodeVM \"98512942-a570-bd66-20a2-e1d93935175f\" using the node manager. Error: virtual machine wasn't found",}

So I'm still stumped.

tandrez commented 2 years ago

@bnason Can you describe the vsphere-csi-controller pod running in the vmware-system-csi namespace and paste the output here? Can you also paste the output of the command kubectl logs <vsphere-csi-controller-pod-name> -n vmware-system-csi -c vsphere-syncer from all your controller pods?

Hello,

I have a similar issue.

The node-driver-registrar container in the vsphere-csi-node DaemonSet fails with this error:

I0330 12:57:56.449872       1 main.go:118] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to create CSINodeTopology CR. Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0330 12:57:56.449909       1 main.go:120] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to create CSINodeTopology CR. Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

kubectl describe output for the vsphere-csi-controller pod:

Name:         vsphere-csi-controller-76b656958d-gbkc7
Namespace:    vmware-system-csi
Priority:     0
Node:         k8s-master/192.168.134.8
Start Time:   Wed, 30 Mar 2022 14:56:45 +0200
Labels:       app=vsphere-csi-controller
              pod-template-hash=76b656958d
              role=vsphere-csi
Annotations:  cni.projectcalico.org/containerID: ab0134062326ec1b46ffbccf83b01d2853a13b99fd5cbbc021e0d3f59bbbd370
              cni.projectcalico.org/podIP: 10.233.92.7/32
              cni.projectcalico.org/podIPs: 10.233.92.7/32
Status:       Running
IP:           10.233.92.7
IPs:
  IP:           10.233.92.7
Controlled By:  ReplicaSet/vsphere-csi-controller-76b656958d
Containers:
  csi-attacher:
    Container ID:  docker://c52985f21a7d2530319093f2b7727dcde11ff274167d757de0abfba8e5b6d02e
    Image:         k8s.gcr.io/sig-storage/csi-attacher:v3.3.0
    Image ID:      docker-pullable://k8s.gcr.io/sig-storage/csi-attacher@sha256:80dec81b679a733fda448be92a2331150d99095947d04003ecff3dbd7f2a476a
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --kube-api-qps=100
      --kube-api-burst=100
    State:          Running
      Started:      Wed, 30 Mar 2022 14:56:46 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4w2fd (ro)
  vsphere-csi-controller:
    Container ID:   docker://2c8f13ea1df0e3c18b39afaa043db404bab0d1f2ca114bb9b4651dd5e7dfe437
    Image:          gcr.io/cloud-provider-vsphere/csi/release/driver:v2.4.0
    Image ID:       docker-pullable://gcr.io/cloud-provider-vsphere/csi/release/driver@sha256:ff865128421c8e248675814798582d72c35f7f77d8c7450ac4f00429b5281514
    Ports:          9808/TCP, 2112/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 30 Mar 2022 14:59:47 +0200
      Finished:     Wed, 30 Mar 2022 14:59:47 +0200
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:healthz/healthz delay=10s timeout=3s period=5s #success=1 #failure=3
    Environment:
      CSI_ENDPOINT:                     unix:///var/lib/csi/sockets/pluginproxy/csi.sock
      X_CSI_MODE:                       controller
      X_CSI_SPEC_DISABLE_LEN_CHECK:     true
      X_CSI_SERIAL_VOL_ACCESS_TIMEOUT:  3m
      VSPHERE_CSI_CONFIG:               /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                     PRODUCTION
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
      /var/lib/csi/sockets/pluginproxy from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4w2fd (ro)
  liveness-probe:
    Container ID:  docker://24f8bebe747403fd7c66418758ffbb1a79f3807b3e94ac2e0375723028b3e4d9
    Image:         k8s.gcr.io/sig-storage/livenessprobe:v2.4.0
    Image ID:      docker-pullable://k8s.gcr.io/sig-storage/livenessprobe@sha256:529be2c9770add0cdd0c989115222ea9fc1be430c11095eb9f6dafcf98a36e2b
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=4
      --csi-address=$(ADDRESS)
    State:          Running
      Started:      Wed, 30 Mar 2022 14:56:46 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      ADDRESS:  /var/lib/csi/sockets/pluginproxy/csi.sock
    Mounts:
      /var/lib/csi/sockets/pluginproxy from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4w2fd (ro)
  vsphere-syncer:
    Container ID:  docker://51d9a45cdae188d4e9189a16fcb4fba42502712223901677846f3b1acd9247da
    Image:         gcr.io/cloud-provider-vsphere/csi/release/syncer:v2.4.0
    Image ID:      docker-pullable://gcr.io/cloud-provider-vsphere/csi/release/syncer@sha256:b6da4448adf8cc2eb363198748d9f26d17acb500a80c53629379d83cecb6cefe
    Port:          2113/TCP
    Host Port:     0/TCP
    Args:
      --leader-election
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 30 Mar 2022 15:00:13 +0200
      Finished:     Wed, 30 Mar 2022 15:00:13 +0200
    Ready:          False
    Restart Count:  5
    Environment:
      FULL_SYNC_INTERVAL_MINUTES:  30
      VSPHERE_CSI_CONFIG:          /etc/cloud/csi-vsphere.conf
      LOGGER_LEVEL:                PRODUCTION
    Mounts:
      /etc/cloud from vsphere-config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4w2fd (ro)
  csi-provisioner:
    Container ID:  docker://a8b952ba5cb26104ff604218383cd1ba62751b11288afafb6204b608c47b2814
    Image:         k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0
    Image ID:      docker-pullable://k8s.gcr.io/sig-storage/csi-provisioner@sha256:6477988532358148d2e98f7c747db4e9250bbc7ad2664bf666348abf9ee1f5aa
    Port:          <none>
    Host Port:     <none>
    Args:
      --v=4
      --timeout=300s
      --csi-address=$(ADDRESS)
      --leader-election
      --default-fstype=ext4
      --kube-api-qps=100
      --kube-api-burst=100
    State:          Running
      Started:      Wed, 30 Mar 2022 14:56:47 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      ADDRESS:  /csi/csi.sock
    Mounts:
      /csi from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4w2fd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  vsphere-config-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  vsphere-config-secret
    Optional:    false
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-4w2fd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              node-role.kubernetes.io/control-plane=
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m34s                 default-scheduler  0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) didn't match Pod's node affinity/selector.
  Normal   Scheduled         4m5s                  default-scheduler  Successfully assigned vmware-system-csi/vsphere-csi-controller-76b656958d-gbkc7 to k8s-master
  Warning  FailedScheduling  4m36s                 default-scheduler  0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) didn't match Pod's node affinity/selector.
  Normal   Created           4m4s                  kubelet            Created container liveness-probe
  Normal   Created           4m4s                  kubelet            Created container csi-attacher
  Normal   Started           4m4s                  kubelet            Started container csi-attacher
  Normal   Created           4m4s                  kubelet            Created container vsphere-syncer
  Normal   Pulled            4m4s                  kubelet            Container image "gcr.io/cloud-provider-vsphere/csi/release/syncer:v2.4.0" already present on machine
  Normal   Started           4m4s                  kubelet            Started container liveness-probe
  Normal   Pulled            4m4s                  kubelet            Container image "k8s.gcr.io/sig-storage/csi-attacher:v3.3.0" already present on machine
  Normal   Pulled            4m4s                  kubelet            Container image "k8s.gcr.io/sig-storage/livenessprobe:v2.4.0" already present on machine
  Normal   Started           4m3s                  kubelet            Started container csi-provisioner
  Normal   Started           4m3s                  kubelet            Started container vsphere-syncer
  Normal   Pulled            4m3s                  kubelet            Container image "k8s.gcr.io/sig-storage/csi-provisioner:v3.0.0" already present on machine
  Normal   Created           4m3s                  kubelet            Created container csi-provisioner
  Normal   Created           3m46s (x3 over 4m4s)  kubelet            Created container vsphere-csi-controller
  Normal   Pulled            3m46s (x3 over 4m4s)  kubelet            Container image "gcr.io/cloud-provider-vsphere/csi/release/driver:v2.4.0" already present on machine
  Normal   Started           3m45s (x3 over 4m4s)  kubelet            Started container vsphere-csi-controller
  Warning  BackOff           3m45s (x4 over 4m2s)  kubelet            Back-off restarting failed container

Logs from the vsphere-syncer container:

{"level":"error","time":"2022-03-30T13:08:05.324540426Z","caller":"kubernetes/kubernetes.go:430","msg":"Failed to update \"csinodetopologies.cns.vmware.com\" CRD with err: resource name may not be empty","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/kubernetes.createCustomResourceDefinition\n\t/build/pkg/kubernetes/kubernetes.go:430\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/kubernetes.CreateCustomResourceDefinitionFromManifest\n\t/build/pkg/kubernetes/kubernetes.go:392\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/syncer/cnsoperator/manager.InitCnsOperator\n\t/build/pkg/syncer/cnsoperator/manager/init.go:168\nmain.initSyncerComponents.func1.2\n\t/build/cmd/syncer/main.go:180"}
{"level":"error","time":"2022-03-30T13:08:05.324585513Z","caller":"manager/init.go:171","msg":"Failed to create \"csinodetopology\" CRD. Error: resource name may not be empty","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/syncer/cnsoperator/manager.InitCnsOperator\n\t/build/pkg/syncer/cnsoperator/manager/init.go:171\nmain.initSyncerComponents.func1.2\n\t/build/cmd/syncer/main.go:180"}
{"level":"error","time":"2022-03-30T13:08:05.324607236Z","caller":"syncer/main.go:181","msg":"Error initializing Cns Operator. Error: resource name may not be empty","stacktrace":"main.initSyncerComponents.func1.2\n\t/build/cmd/syncer/main.go:181"}

indreek commented 2 years ago

@tandrez what is the log output of the vsphere-csi-controller container?

tandrez commented 2 years ago

@tandrez what is the log output of the vsphere-csi-controller container?

{"level":"info","time":"2022-03-30T13:43:19.131765092Z","caller":"logger/logger.go:41","msg":"Setting default log level to :\"PRODUCTION\""}
{"level":"info","time":"2022-03-30T13:43:19.131881884Z","caller":"vsphere-csi/main.go:56","msg":"Version : v2.4.0","TraceId":"7415a968-2dd7-48b3-898f-7a55ec261e24"}
{"level":"info","time":"2022-03-30T13:43:19.13190827Z","caller":"commonco/utils.go:56","msg":"Defaulting feature states configmap name to \"internal-feature-states.csi.vsphere.vmware.com\"","TraceId":"7415a968-2dd7-48b3-898f-7a55ec261e24"}
{"level":"info","time":"2022-03-30T13:43:19.131925624Z","caller":"commonco/utils.go:60","msg":"Defaulting feature states configmap namespace to \"vmware-system-csi\"","TraceId":"7415a968-2dd7-48b3-898f-7a55ec261e24"}
{"level":"info","time":"2022-03-30T13:43:19.132307995Z","caller":"logger/logger.go:41","msg":"Setting default log level to :\"PRODUCTION\""}
{"level":"info","time":"2022-03-30T13:43:19.132533752Z","caller":"k8sorchestrator/k8sorchestrator.go:152","msg":"Initializing k8sOrchestratorInstance","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
{"level":"info","time":"2022-03-30T13:43:19.13256854Z","caller":"kubernetes/kubernetes.go:85","msg":"k8s client using in-cluster config","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
{"level":"info","time":"2022-03-30T13:43:19.133942631Z","caller":"kubernetes/kubernetes.go:352","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
{"level":"info","time":"2022-03-30T13:43:19.149838052Z","caller":"k8sorchestrator/k8sorchestrator.go:258","msg":"New internal feature states values stored successfully: map[async-query-volume:true block-volume-snapshot:false csi-auth-check:true csi-migration:false csi-windows-support:false improved-csi-idempotency:true improved-volume-topology:true online-volume-extend:true trigger-csi-fullsync:false]","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
{"level":"info","time":"2022-03-30T13:43:19.149926732Z","caller":"k8sorchestrator/k8sorchestrator.go:178","msg":"k8sOrchestratorInstance initialized","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
{"level":"info","time":"2022-03-30T13:43:19.153485356Z","caller":"config/config.go:339","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
{"level":"info","time":"2022-03-30T13:43:19.153592437Z","caller":"vanilla/controller.go:84","msg":"Initializing CNS controller","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.154030896Z","caller":"vsphere/utils.go:163","msg":"Defaulting timeout for vCenter Client to 5 minutes","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.154068346Z","caller":"vsphere/virtualcentermanager.go:73","msg":"Initializing defaultVirtualCenterManager...","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.154084005Z","caller":"vsphere/virtualcentermanager.go:75","msg":"Successfully initialized defaultVirtualCenterManager","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.15409378Z","caller":"vsphere/virtualcentermanager.go:121","msg":"Successfully registered VC \"vcenter.local\"","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.154110471Z","caller":"vanilla/controller.go:102","msg":"CSI Volume manager idempotency handling feature flag is enabled.","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.154125191Z","caller":"cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:80","msg":"Creating CnsVolumeOperationRequest definition on API server and initializing VolumeOperationRequest instance","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.153980969Z","caller":"k8sorchestrator/k8sorchestrator.go:484","msg":"configMapAdded: Internal feature state values from \"internal-feature-states.csi.vsphere.vmware.com\" stored successfully: map[async-query-volume:true block-volume-snapshot:false csi-auth-check:true csi-migration:false csi-windows-support:false improved-csi-idempotency:true improved-volume-topology:true online-volume-extend:true trigger-csi-fullsync:false]","TraceId":"bfe17697-8d9a-4288-bac7-788250fcd51f"}
{"level":"info","time":"2022-03-30T13:43:19.156870831Z","caller":"kubernetes/kubernetes.go:85","msg":"k8s client using in-cluster config","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"info","time":"2022-03-30T13:43:19.156994064Z","caller":"kubernetes/kubernetes.go:352","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6"}
{"level":"error","time":"2022-03-30T13:43:19.158244627Z","caller":"kubernetes/kubernetes.go:430","msg":"Failed to update \"cnsvolumeoperationrequests.cns.vmware.com\" CRD with err: resource name may not be empty","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/kubernetes.createCustomResourceDefinition\n\t/build/pkg/kubernetes/kubernetes.go:430\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/kubernetes.CreateCustomResourceDefinitionFromManifest\n\t/build/pkg/kubernetes/kubernetes.go:392\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/internalapis/cnsvolumeoperationrequest.InitVolumeOperationRequestInterface\n\t/build/pkg/internalapis/cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:83\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:103\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:142\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:246\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:130\nmain.main\n\t/build/cmd/vsphere-csi/main.go:72\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"error","time":"2022-03-30T13:43:19.158319466Z","caller":"cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:87","msg":"failed to create CnsVolumeOperationRequest CRD with error: resource name may not be empty","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/internalapis/cnsvolumeoperationrequest.InitVolumeOperationRequestInterface\n\t/build/pkg/internalapis/cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:87\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:103\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:142\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:246\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:130\nmain.main\n\t/build/cmd/vsphere-csi/main.go:72\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"error","time":"2022-03-30T13:43:19.158357616Z","caller":"vanilla/controller.go:109","msg":"failed to initialize VolumeOperationRequestInterface with error: resource name may not be empty","TraceId":"3a6a7955-76a3-46a1-ad85-55c98c4aabe6","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:109\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:142\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:246\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:130\nmain.main\n\t/build/cmd/vsphere-csi/main.go:72\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"error","time":"2022-03-30T13:43:19.158394961Z","caller":"service/driver.go:143","msg":"failed to init controller. Error: resource name may not be empty","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:143\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve.func1\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:246\nsync.(*Once).doSlow\n\t/usr/local/go/src/sync/once.go:68\nsync.(*Once).Do\n\t/usr/local/go/src/sync/once.go:59\ngithub.com/rexray/gocsi.(*StoragePlugin).Serve\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:211\ngithub.com/rexray/gocsi.Run\n\t/go/pkg/mod/github.com/rexray/gocsi@v1.2.2/gocsi.go:130\nmain.main\n\t/build/cmd/vsphere-csi/main.go:72\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225"}
{"level":"info","time":"2022-03-30T13:43:19.158431473Z","caller":"service/driver.go:109","msg":"Configured: \"csi.vsphere.vmware.com\" with clusterFlavor: \"VANILLA\" and mode: \"controller\"","TraceId":"761c3004-7437-473a-a499-bbdc4cfe778a"}
time="2022-03-30T13:43:19Z" level=info msg="removed sock file" path=/var/lib/csi/sockets/pluginproxy/csi.sock
time="2022-03-30T13:43:19Z" level=fatal msg="grpc failed" error="resource name may not be empty"

indreek commented 2 years ago

@tandrez show me your vsphere-config-secret (csi-vsphere.conf)

tandrez commented 2 years ago

@tandrez show me your vsphere-config-secret (csi-vsphere.conf)

[Global]
cluster-id = "kubernetes-cluster-id"

[VirtualCenter "vcenter.local"]
insecure-flag = "true"
user = "someuser@vsphere.local"
password = "somepassword"
port = "443"
datacenters = "DC1"

indreek commented 2 years ago

I'm new to this plugin also.

What I did: I tried to list all nodes and datastores with govc (https://github.com/vmware/govmomi/blob/master/govc/USAGE.md). Try whether you can do the same. I used AD accounts, and the username didn't work as written; eventually I had to use the username in DOMAIN\user format. What I discovered is that if something is missing in the conf, it gives similar errors.
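A sketch of that govc sanity check, assuming govc is pointed at the same vCenter as the driver config (URL and credentials below are placeholders):

export GOVC_URL='vcenter.local' GOVC_USERNAME='DOMAIN\user' GOVC_PASSWORD='...' GOVC_INSECURE=1
govc ls                # list datacenter inventory roots
govc find / -type m    # list all VMs; your node VMs should appear here
govc datastore.info    # list datastore details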

Are you following this manual to create the secret? https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.0/vmware-vsphere-csp-getting-started/GUID-BFF39F1D-F70A-4360-ABC9-85BDAFBE8864.html

tandrez commented 2 years ago

I can list objects with govc without any problem.

I am deploying with Kubespray but I also checked the VMware docs and as far as I can tell, the configuration seems OK.

bnason commented 2 years ago

For my setup, this issue was caused by the vSphere CPI not working correctly and therefore never removing the taints from the nodes, which prevented the CSI pods from running; I believe one of those pods is responsible for creating the CRD.

My CPI issue is documented here: https://github.com/kubernetes/cloud-provider-vsphere/issues/614
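A quick way to check whether CPI has done its job (a hedged sketch, not from the linked issue): the node.cloudprovider.kubernetes.io/uninitialized taint should be gone and each node should have a ProviderID:

kubectl get nodes -o custom-columns='NAME:.metadata.name,PROVIDER:.spec.providerID,TAINTS:.spec.taints[*].key'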

shalini-b commented 2 years ago

does it have anything to do with the replicas: 3 when there's only 1 master node for the vsphere-csi-controller deployment?

related commit 01ec59a

@addyess Yes, you are supposed to update the replica count in the CSI controller deployment to match the number of master nodes in your environment. After doing this, the controller pods will start running. The CSI node pods depend on the syncer container in the controller pod to fetch certain information about the underlying environment, so they will not come up until the controller provides it.
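For a single-master environment like the one above, that amounts to something like:

kubectl -n vmware-system-csi scale deployment vsphere-csi-controller --replicas=1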

shalini-b commented 2 years ago

However, I'm still having an issue with the driver not being able to identify the nodes:

Couldn't find VM instance with nodeUUID 0c712942-5d15-0d37-83c0-6120d7ca04c5, failed to discover with err: virtual machine wasn't found
...
Couldn't find VM instance with nodeUUID 98512942-a570-bd66-20a2-e1d93935175f, failed to discover with err: virtual machine wasn't found

and the vsphere-csi-node-* pods are in CrashLoopBackOff:

I0322 21:47:06.374636       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "juju-ab364b-1". Error: "failed to retrieve nodeVM \"98512942-a570-bd66-20a2-e1d93935175f\" using the node manager. Error: virtual machine wasn't found",}

So I'm still stumped.

Which version of vSphere CSI are you running? You should not have to set the node labels manually. If the cloud provider you are using is doing its job correctly, it will set the required labels and remove the NoSchedule taint on the nodes. Can you confirm that this is working?

Jellyfrog commented 2 years ago

Enabling "CSI full sync" causes this for me, disabling it and everything starts to work again

johnwc commented 2 years ago

Enabling Enable Improved Volume Topology causes this error for us. Removing the selection and redeploying brings it online and stable.

addyess commented 2 years ago

@shalini-b I think I'm making forward progress: my current deployment succeeds, but it fails to launch all the containers in the DaemonSet.

Each DaemonSet pod's node-driver-registrar container (e.g. vmware-system-csi/vsphere-csi-node-2pdr4) presents a similar log:

I0518 18:00:18.524418       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "juju-78116a-5". Error: "failed to retrieve nodeVM \"f3ef2942-6724-aff1-6ddb-cea417d0f5aa\" using the node manager. Error: virtual machine wasn't found",}
E0518 18:00:18.524476       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "juju-78116a-5". Error: "failed to retrieve nodeVM \"f3ef2942-6724-aff1-6ddb-cea417d0f5aa\" using the node manager. Error: virtual machine wasn't found", restarting registration container.

the machine's UUID:

ubuntu@juju-78116a-5:~$ sudo dmidecode | grep UUID
    UUID: 4229eff3-2467-f1af-6ddb-cea417d0f5aa

the machine's provider-id:

Name:               juju-78116a-5
ProviderID:                   vsphere://4229eff3-2467-f1af-6ddb-cea417d0f5aa

Both match. But the CSI node driver seems to swap the bytes around?

4229eff3-2467-f1af-6ddb-cea417d0f5aa  # from provider-id and dmidecode
f3ef2942-6724-aff1-6ddb-cea417d0f5aa # from container logs

if I swap the byte order within the first three groups (the first 8 bytes), they match

AABBCCDD-EEFF-GGHH-IIJJ-KKLLMMNNOOPP
DDCCBBAA-FFEE-HHGG-IIJJ-KKLLMMNNOOPP
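This appears to match the SMBIOS representation, where the first three UUID fields are stored little-endian. A quick sketch of the transformation with standard shell tools:

uuid=4229eff3-2467-f1af-6ddb-cea417d0f5aa
# reverse the byte order of the first three groups only
echo "$uuid" | sed -E 's/^(..)(..)(..)(..)-(..)(..)-(..)(..)/\4\3\2\1-\6\5-\8\7/'
# -> f3ef2942-6724-aff1-6ddb-cea417d0f5aa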

Edit: I think this is https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1629; updating to v2.5.1 images. Final edit: cool, I can make PVCs! Way to go.

shalini-b commented 2 years ago

Enabling Enable Improved Volume Topology causes this error for us. Removing the selection and redeploying brings it online and stable.

Please note that disabling an already enabled feature in vSphere CSI driver is not supported. Kindly root cause the actual issue.

anmolguptadevops commented 2 years ago

I am also facing a similar issue... can anyone help?

vsphere-csi-controller-76b656958d-7rftq   3/5   CrashLoopBackOff    8 (23s ago)     3m12s
vsphere-csi-controller-76b656958d-kfwqs   2/5   CrashLoopBackOff    21 (47s ago)    12m
vsphere-csi-controller-76b656958d-lmcg6   0/5   ContainerCreating   0               3m12s
vsphere-csi-node-4tvs8                    2/3   CrashLoopBackOff    7 (48s ago)     12m
vsphere-csi-node-5st28                    2/3   CrashLoopBackOff    7 (2m54s ago)   14m
vsphere-csi-node-8pcg9                    2/3   CrashLoopBackOff    7 (3m9s ago)    14m
vsphere-csi-node-fmfjt                    2/3   CrashLoopBackOff    7 (53s ago)     12m
vsphere-csi-node-h229j                    2/3   CrashLoopBackOff    7 (52s ago)     12m
vsphere-csi-node-l5fls                    2/3   CrashLoopBackOff    7 (47s ago)     12m
vsphere-csi-node-mlskw                    2/3   CrashLoopBackOff    7 (46s ago)     12m
vsphere-csi-node-xp97z                    2/3   CrashLoopBackOff    7 (47s ago)     12m

johnwc commented 2 years ago

@shalini-b

Enabling Enable Improved Volume Topology causes this error for us. Removing the selection and redeploying brings it online and stable.

Please note that disabling an already enabled feature in vSphere CSI driver is not supported. Kindly root cause the actual issue.

Without that disabled, it will not deploy successfully. So there is no way to disable a feature on an already deployed driver when the driver can't deploy in the first place... hence the use of the word "redeploy", not "reconfigure". If you want to fill me in on where to look to find the root cause of your driver failing when that feature is enabled, I'd be more than happy to.

anmolguptadevops commented 2 years ago

"Enable Improved Volume Topology": how do I disable it? I am not able to find it in the Kubespray code. Kindly see my node-driver-registrar error below:

kubectl logs vsphere-csi-node-4gfr9 -c node-driver-registrar
I0527 19:19:56.618396       1 main.go:166] Version: v2.4.0
I0527 19:19:56.618456       1 main.go:167] Running node-driver-registrar in mode=registration
I0527 19:19:56.619961       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0527 19:19:56.619997       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0527 19:19:56.620597       1 main.go:198] Calling CSI driver to discover driver name
I0527 19:19:56.620660       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0527 19:19:56.620670       1 connection.go:184] GRPC request: {}
I0527 19:19:56.625349       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"${VERSION}"}
I0527 19:19:56.625437       1 connection.go:187] GRPC error: <nil>
I0527 19:19:56.625446       1 main.go:208] CSI driver name: "csi.vsphere.vmware.com"
I0527 19:19:56.625523       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0527 19:19:56.626696       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0527 19:19:56.627075       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0527 19:19:58.092926       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0527 19:19:58.093235       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0527 19:19:58.106465       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = No Virtual Center hosts defined,}
E0527 19:19:58.106531       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = No Virtual Center hosts defined, restarting registration container.
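For what it's worth, "No Virtual Center hosts defined" usually indicates the driver parsed no [VirtualCenter "..."] section from csi-vsphere.conf (and the literal ${VERSION} in the log above suggests unexpanded variables in the manifest). A minimal, hedged sketch of recreating the secret, with all values as placeholders:

cat <<'EOF' > csi-vsphere.conf
[Global]
cluster-id = "my-cluster-id"

[VirtualCenter "vcenter.example.com"]
insecure-flag = "true"
user = "user@vsphere.local"
password = "password"
port = "443"
datacenters = "DC1"
EOF
kubectl -n vmware-system-csi delete secret vsphere-config-secret
kubectl -n vmware-system-csi create secret generic vsphere-config-secret --from-file=csi-vsphere.conf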

shalini-b commented 2 years ago

@shalini-b

Enabling Enable Improved Volume Topology causes this error for us. Removing the selection and redeploying brings it online and stable.

Please note that disabling an already enabled feature in vSphere CSI driver is not supported. Kindly root cause the actual issue.

Without that disabled, it will not deploy successfully. So there is no way to disable a feature on an already deployed driver when the driver can't deploy in the first place... hence the use of the word "redeploy", not "reconfigure". If you want to fill me in on where to look to find the root cause of your driver failing when that feature is enabled, I'd be more than happy to.

Which driver version are you using? Kindly use v2.4.1+ to get rid of this issue. If you are still unable to deploy the driver, collect the following logs. For each vSphere CSI controller pod running in your env:

kubectl logs <controller-pod-name> -c vsphere-syncer -n vmware-system-csi

For any of the vSphere CSI nodes in a CrashLoopBackOff:

kubectl logs <node-pod-name> -c vsphere-csi-node -n vmware-system-csi

brathina-spectro commented 2 years ago

We ran into a similar issue with vsphere-csi-driver v2.5.2

The root cause was that the cloud account used did not have enough permissions to read zone info from vCenter. CPI tried to read tags to set failure-domain labels on the nodes and failed because it lacked the permissions. When this happened, we saw the exact error:

I0824 19:54:27.091878       1 main.go:166] Version: v2.5.0
I0824 19:54:27.091907       1 main.go:167] Running node-driver-registrar in mode=registration
I0824 19:54:27.092263       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0824 19:54:27.092288       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0824 19:54:27.092662       1 main.go:198] Calling CSI driver to discover driver name
I0824 19:54:27.092682       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0824 19:54:27.092686       1 connection.go:184] GRPC request: {}
I0824 19:54:27.094514       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v2.5.2"}
I0824 19:54:27.094581       1 connection.go:187] GRPC error: <nil>
I0824 19:54:27.094587       1 main.go:208] CSI driver name: "csi.vsphere.vmware.com"
I0824 19:54:27.094639       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0824 19:54:27.094799       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0824 19:54:27.094859       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0824 19:54:28.397818       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0824 19:54:28.397991       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0824 19:54:28.413900       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "res-cal-p8-cp-np6cq". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0824 19:54:28.413942       1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "res-cal-p8-cp-np6cq". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

vsphere-cloud-controller-manager had the error below:

I0824 19:55:12.021205       1 node_controller.go:390] Initializing node res-cal-p8-cp-np6cq with cloud provider
I0824 19:55:12.021237       1 instances.go:113] instances.InstanceID() CACHED with res-cal-p8-cp-np6cq
I0824 19:55:12.021244       1 instances.go:83] instances.NodeAddressesByProviderID() CACHED with 42274b7c-bfd9-960e-1c91-7f0fa5b014c6
E0824 19:55:12.185474       1 zones.go:195] Failed to get host system properties. err: NoPermission
E0824 19:55:12.205266       1 zones.go:124] Failed to get host system properties. err: NoPermission
E0824 19:55:12.205304       1 node_controller.go:212] error syncing 'res-cal-p8-cp-np6cq': failed to get instance metadata for node res-cal-p8-cp-np6cq: failed to get zone from cloud provider: Zone: Error fetching by providerID: NoPermission Error fetching by NodeName: NoPermission, requeuing
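A quick, hedged way to verify the account can read tags at all is govc, assuming it is configured with the same credentials as the CPI/CSI config:

govc tags.category.ls   # list tag categories (zone/region categories should be visible)
govc tags.ls            # list tags the account can read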

omniproc commented 2 years ago

If you don't need the feature, you may set improved-volume-topology: 'false' in ConfigMaps/internal-feature-states.csi.vsphere.vmware.com. Otherwise this can fail for multiple reasons (e.g., as pointed out, because of missing permissions in vCenter). Simply disabling the feature we didn't want to use fixed the issue for us. It seems it is enabled by default in more recent vSphere CSI releases. I'm not sure why that is, since the manifest still has the commented-out args you'd need to enable topology awareness. The new feature gates are not very well documented.
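A sketch of that change with kubectl (restarting the workloads afterwards so they pick it up; resource names as in the shipped manifest):

kubectl -n vmware-system-csi patch configmap internal-feature-states.csi.vsphere.vmware.com \
  --type merge -p '{"data":{"improved-volume-topology":"false"}}'
kubectl -n vmware-system-csi rollout restart deployment/vsphere-csi-controller daemonset/vsphere-csi-node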

waxling commented 2 years ago

I can confirm that I've hit this problem also (new deployment, vSphere 7.0u3, k3s v1.24.4+k3s1).

As mentioned here, and in https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1948, the default setting of improved-volume-topology: 'true' in vsphere-csi-driver.yaml seems to be the cause; changing it to false allows the pods to deploy.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

agtogna commented 1 year ago

As mentioned here, and in #1948, the default setting of improved-volume-topology: 'true' in vsphere-csi-driver.yaml seems to be the cause; changing it to false allows the pods to deploy.

I can confirm it fixes the problem for me too.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/1661#issuecomment-1436052066):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

praction1988 commented 1 year ago

Same issue with CSI for me...

root@k8s-master01:/etc/kubernetes# kubectl get pods -n vmware-system-csi
NAME                                      READY   STATUS             RESTARTS      AGE
vsphere-csi-controller-5664789fc9-gg4wn   6/7     CrashLoopBackOff   1 (4s ago)    17s
vsphere-csi-controller-5664789fc9-mz5b8   6/7     CrashLoopBackOff   1 (3s ago)    17s
vsphere-csi-controller-5664789fc9-q47jz   6/7     CrashLoopBackOff   1 (3s ago)    17s
vsphere-csi-node-7rqdd                    2/3     CrashLoopBackOff   1 (6s ago)    17s
vsphere-csi-node-rk497                    2/3     CrashLoopBackOff   1 (7s ago)    17s
vsphere-csi-node-rl2ws                    2/3     Error              1 (10s ago)   17s
vsphere-csi-node-vm8tw                    2/3     CrashLoopBackOff   1 (7s ago)    17s
vsphere-csi-node-vv9x5                    2/3     CrashLoopBackOff   1 (7s ago)    17s
vsphere-csi-node-wbmcp                    2/3     CrashLoopBackOff   1 (9s ago)    17s
vsphere-csi-node-wkk27                    2/3     CrashLoopBackOff   1 (9s ago)    17s

root@k8s-master01:/etc/kubernetes# kubectl logs vsphere-csi-controller-5664789fc9-mz5b8 -n vmware-system-csi -c vsphere-syncer
{"level":"info","time":"2023-03-19T22:18:04.163624732Z","caller":"logger/logger.go:41","msg":"Setting default log level to :\"PRODUCTION\""}
{"level":"info","time":"2023-03-19T22:18:04.164377027Z","caller":"syncer/main.go:76","msg":"Version : v2.7.0","TraceId":"29e0f1e2-c4a5-4d43-85b3-1bbc9c441343"}
{"level":"info","time":"2023-03-19T22:18:04.164637218Z","caller":"syncer/main.go:93","msg":"Starting container with operation mode: METADATA_SYNC","TraceId":"29e0f1e2-c4a5-4d43-85b3-1bbc9c441343"}
{"level":"info","time":"2023-03-19T22:18:04.164802222Z","caller":"kubernetes/kubernetes.go:85","msg":"k8s client using in-cluster config","TraceId":"29e0f1e2-c4a5-4d43-85b3-1bbc9c441343"}
{"level":"info","time":"2023-03-19T22:18:04.165106234Z","caller":"syncer/main.go:115","msg":"Starting the http server to expose Prometheus metrics..","TraceId":"29e0f1e2-c4a5-4d43-85b3-1bbc9c441343"}
{"level":"info","time":"2023-03-19T22:18:04.165457343Z","caller":"kubernetes/kubernetes.go:389","msg":"Setting client QPS to 100.000000 and Burst to 100.","TraceId":"29e0f1e2-c4a5-4d43-85b3-1bbc9c441343"}
I0319 22:18:04.168818       1 leaderelection.go:248] attempting to acquire leader lease vmware-system-csi/vsphere-syncer...
I0319 22:18:04.198090       1 leaderelection.go:258] successfully acquired lease vmware-system-csi/vsphere-syncer
{"level":"error","time":"2023-03-19T22:18:04.201127391Z","caller":"config/config.go:459","msg":"error while reading config file: 1:1: illegal character U+0024 '$'","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/common/config.ReadConfig\n\t/build/pkg/common/config/config.go:459\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/common/config.GetCnsconfig\n\t/build/pkg/common/config/config.go:489\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.GetConfig\n\t/build/pkg/csi/service/common/util.go:281\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.InitConfigInfo\n\t/build/pkg/csi/service/common/util.go:292\nmain.initSyncerComponents.func1\n\t/build/cmd/syncer/main.go:161\ngithub.com/kubernetes-csi/csi-lib-utils/leaderelection.(leaderElection).Run.func1\n\t/go/pkg/mod/github.com/kubernetes-csi/csi-lib-utils@v0.11.0/leaderelection/leader_election.go:179"}
{"level":"error","time":"2023-03-19T22:18:04.201572149Z","caller":"config/config.go:491","msg":"failed to parse config. Err: 1:1: illegal character U+0024 '$'","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/common/config.GetCnsconfig\n\t/build/pkg/common/config/config.go:491\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.GetConfig\n\t/build/pkg/csi/service/common/util.go:281\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.InitConfigInfo\n\t/build/pkg/csi/service/common/util.go:292\nmain.initSyncerComponents.func1\n\t/build/cmd/syncer/main.go:161\ngithub.com/kubernetes-csi/csi-lib-utils/leaderelection.(leaderElection).Run.func1\n\t/go/pkg/mod/github.com/kubernetes-csi/csi-lib-utils@v0.11.0/leaderelection/leader_election.go:179"}
{"level":"error","time":"2023-03-19T22:18:04.201844658Z","caller":"common/util.go:294","msg":"failed to read config. Error: 1:1: illegal character U+0024 '$'","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service/common.InitConfigInfo\n\t/build/pkg/csi/service/common/util.go:294\nmain.initSyncerComponents.func1\n\t/build/cmd/syncer/main.go:161\ngithub.com/kubernetes-csi/csi-lib-utils/leaderelection.(leaderElection).Run.func1\n\t/go/pkg/mod/github.com/kubernetes-csi/csi-lib-utils@v0.11.0/leaderelection/leader_election.go:179"}
{"level":"error","time":"2023-03-19T22:18:04.202088917Z","caller":"syncer/main.go:163","msg":"failed to initialize the configInfo. Err: 1:1: illegal character U+0024 '$'","stacktrace":"main.initSyncerComponents.func1\n\t/build/cmd/syncer/main.go:163\ngithub.com/kubernetes-csi/csi-lib-utils/leaderelection.(leaderElection).Run.func1\n\t/go/pkg/mod/github.com/kubernetes-csi/csi-lib-utils@v0.11.0/leaderele
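The "illegal character U+0024 '$'" error means the config parser hit a literal $ in csi-vsphere.conf, often an unexpanded shell variable. A hedged way to inspect what the driver actually received:

kubectl -n vmware-system-csi get secret vsphere-config-secret -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d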

Coku2015 commented 1 year ago

I got a similar error in my k3s cluster.

vSphere version: 7.0.3

lei@leik3svSphere01:~$ sudo kubectl logs -n vmware-system-csi vsphere-csi-node-mrrxx
Defaulted container "node-driver-registrar" out of: node-driver-registrar, vsphere-csi-node, liveness-probe
I0509 12:13:58.557564       1 main.go:167] Version: v2.7.0
I0509 12:13:58.557609       1 main.go:168] Running node-driver-registrar in mode=registration
I0509 12:13:58.558041       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0509 12:13:58.558067       1 connection.go:154] Connecting to unix:///csi/csi.sock
I0509 12:13:58.558907       1 main.go:199] Calling CSI driver to discover driver name
I0509 12:13:58.558919       1 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0509 12:13:58.558923       1 connection.go:184] GRPC request: {}
I0509 12:13:58.561096       1 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.0.0"}
I0509 12:13:58.561142       1 connection.go:187] GRPC error: <nil>
I0509 12:13:58.561149       1 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
I0509 12:13:58.561221       1 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0509 12:13:58.561334       1 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0509 12:13:58.561415       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0509 12:13:59.747918       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0509 12:13:59.748377       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/csi.vsphere.vmware.com/registration"
I0509 12:13:59.765452       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "leik3svsphere01". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1",}
E0509 12:13:59.765492       1 main.go:123] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to get CsiNodeTopology for the node: "leik3svsphere01". Error: no matches for kind "CSINodeTopology" in version "cns.vmware.com/v1alpha1", restarting registration container.

EldinEgrlic commented 1 year ago

If you don't need the feature, you may set improved-volume-topology: 'false' in ConfigMaps/internal-feature-states.csi.vsphere.vmware.com. Otherwise this can fail for multiple reasons (e.g., as pointed out, because of missing permissions in vCenter). Simply disabling the feature we didn't want to use fixed the issue for us. It seems it is enabled by default in more recent vSphere CSI releases. I'm not sure why that is, since the manifest still has the commented-out args you'd need to enable topology awareness. The new feature gates are not very well documented.

@omniproc Thanks, this helped me!