kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

Crashing vsphere-csi-controller with RWX (ReadWriteMany) PV #2755

Closed dzanto closed 1 month ago

dzanto commented 8 months ago

/kind bug

What happened:

I installed the vSphere Cloud Provider Interface (CPI) and the vSphere Container Storage Interface (CSI) driver on a Kubernetes cluster from Rancher Apps. Mounting an RWO (ReadWriteOnce) volume works fine. But when I create a PVC (PersistentVolumeClaim) with the RWX (ReadWriteMany) access mode, csi-provisioner and vsphere-csi-controller start crash-looping with the logs below.
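For reference, a minimal RWX PVC along these lines is enough to trigger the crash (the name, namespace, and StorageClass are taken from the provisioner log below; the size is only an approximation of the logged required_bytes):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-storage
  namespace: default
spec:
  storageClassName: vsphere-csi-sc
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 11Mi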

csi-provisioner:

controller.go:860] Started provisioner controller csi.vsphere.vmware.com_vsphere-csi-controller-568b9cb986-vpv7c_aa38ac7f-1fce-4826-bf03-38744d6cbf38!
controller.go:1337] provision "default/rwx-storage" class "vsphere-csi-sc": started
controller.go:568] skip translation of storage class for plugin: csi.vsphere.vmware.com
event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"rwx-storage", UID:"72ae5017-3dfc-4e2d-ba54-d228584064ab", APIVersion:"v1", ResourceVersion:"22586568", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/rwx-storage"
controller.go:1082] Temporary error received, adding PVC 72ae5017-3dfc-4e2d-ba54-d228584064ab to claims in progress
controller.go:934] Retrying syncing claim "72ae5017-3dfc-4e2d-ba54-d228584064ab", failure 0
controller.go:957] error syncing claim "72ae5017-3dfc-4e2d-ba54-d228584064ab": failed to provision volume with StorageClass "vsphere-csi-sc": rpc error: code = Unavailable desc = error reading from server: EOF
event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"rwx-storage", UID:"72ae5017-3dfc-4e2d-ba54-d228584064ab", APIVersion:"v1", ResourceVersion:"22586568", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "vsphere-csi-sc": rpc error: code = Unavailable desc = error reading from server: EOF
controller.go:1337] provision "default/rwx-storage" class "vsphere-csi-sc": started
controller.go:568] skip translation of storage class for plugin: csi.vsphere.vmware.com
connection.go:132] Lost connection to unix:///csi/csi.sock.
event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"rwx-storage", UID:"72ae5017-3dfc-4e2d-ba54-d228584064ab", APIVersion:"v1", ResourceVersion:"22586568", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/rwx-storage"
connection.go:87] Lost connection to CSI driver, exiting

vsphere-csi-controller:

{"level":"info","time":"2024-01-13T15:56:11.197118888Z","caller":"vanilla/controller.go:2718","msg":"ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"f14c9ed4-b8b4-4994-8ea7-acc4e868637b"}
{"level":"info","time":"2024-01-13T15:56:31.050911804Z","caller":"vanilla/controller.go:1805","msg":"CreateVolume: called with args {Name:pvc-72ae5017-3dfc-4e2d-ba54-d228584064ab CapacityRange:required_bytes:11534336  VolumeCapabilities:[mount:<fs_type:\"ext4\" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > ] Parameters:map[] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"fd9fdf44-45fa-4112-ae25-1b0efd285d4d"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1b22a25]

goroutine 419 [running]:
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).createFileVolume(0xc0000a20a0, {0x26af658, 0xc000247920}, 0xc0004ec2a0)
        /build/pkg/csi/service/vanilla/controller.go:1736 +0xd05
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume.func1()
        /build/pkg/csi/service/vanilla/controller.go:1848 +0x3d7
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume(0xc0000a20a0, {0x26af658, 0xc000b07b60}, 0xc0004ec2a0)
        /build/pkg/csi/service/vanilla/controller.go:1858 +0x1bb
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x229bba0?, 0xc0000a20a0}, {0x26af658, 0xc000b07b60}, 0xc00022ef00, 0x0)
        /go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5671 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00022aa80, {0x26b6218, 0xc000d16d00}, 0xc00052fd40, 0xc000d48d20, 0x38db8a0, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283 +0xcfe
google.golang.org/grpc.(*Server).handleStream(0xc00022aa80, {0x26b6218, 0xc000d16d00}, 0xc00052fd40, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620 +0xa2f
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:920 +0x28a

Environment:

chethanv28 commented 8 months ago

@dzanto Did you disable the multi-vcenter-csi-topology flag in the internal-feature-states.csi.vsphere.vmware.com configmap after the driver was first initialized?

dzanto commented 8 months ago

The multi-vcenter-csi-topology option is absent from the internal-feature-states.csi.vsphere.vmware.com configmap:

kind: ConfigMap
apiVersion: v1
metadata:
  annotations:
    meta.helm.sh/release-name: vsphere-csi
    meta.helm.sh/release-namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: Helm
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system
data:
  async-query-volume: 'false'
  block-volume-snapshot: 'false'
  cnsmgr-suspend-create-volume: 'false'
  csi-auth-check: 'false'
  csi-migration: 'false'
  csi-windows-support: 'false'
  improved-csi-idempotency: 'false'
  improved-volume-topology: 'false'
  list-volumes: 'false'
  max-pvscsi-targets-per-vm: 'false'
  online-volume-extend: 'false'
  pv-to-backingdiskobjectid-mapping: 'false'
  topology-preferential-datastores: 'false'
  trigger-csi-fullsync: 'false'
  use-csinode-id: 'true'

I created a custom StorageClass with csi.storage.k8s.io/fstype: nfs4 and the crashes went away. The default StorageClass (from the Rancher Helm chart) does not contain this parameter.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-nfs
parameters:
  csi.storage.k8s.io/fstype: nfs4
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

But when I create a PVC with this StorageClass, the PV doesn't appear:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-client-pvc
spec:
  storageClassName: vsphere-nfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Mi

RWX works only when I create the PVC and PV manually, as in https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/master/example/vanilla-k8s-RWM-filesystem-volumes/example-static-fileshare-provisioning.yaml
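For reference, a condensed sketch along the lines of that example, assuming an already existing vSAN file share (the volumeHandle is a placeholder, not a real share ID):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: static-file-share-pv
  annotations:
    pv.kubernetes.io/provisioned-by: csi.vsphere.vmware.com
  labels:
    static-pv-label-key: static-pv-label-value
spec:
  capacity:
    storage: 10Mi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: csi.vsphere.vmware.com
    volumeAttributes:
      type: "vSphere CNS File Volume"
    volumeHandle: "file:<existing-file-share-id>"   # placeholder: ID of the pre-created CNS file share
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-file-share-pvc
spec:
  storageClassName: ""
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Mi
  selector:
    matchLabels:
      static-pv-label-key: static-pv-label-value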

How can the PV be created automatically?

shalini-b commented 8 months ago

Looks like the vSphere CSI driver was not deployed properly. vSphere CSI driver v3.0.1 has the multi-vcenter-csi-topology feature gate set to true. Refer to https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/v3.0.1/manifests/vanilla/vsphere-csi-driver.yaml#L164C39-L164C39

dzanto commented 8 months ago

I use rancher's helm chart: https://artifacthub.io/packages/helm/rke2-charts/rancher-vsphere-csi

In this chart, the topology option is disabled by default.

https://artifacthub.io/packages/helm/rke2-charts/rancher-vsphere-csi?modal=template&template=controller/deployment.yaml (line 212)

shalini-b commented 8 months ago

The topology flag in the provisioner is set to false by default in our YAML as well. It is only set to true when a customer chooses to use topology in their environment.

The multi-vcenter-csi-topology feature gate we are talking about is present in a configmap named internal-feature-states.csi.vsphere.vmware.com in the namespace vmware-system-csi. It should be set to true if you are using vSphere CSI driver v3.0.1.
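For reference, this is roughly what the relevant fragment of that configmap looks like in the upstream v3.0.1 manifest (the Rancher chart deploys it to kube-system instead, as shown above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: vmware-system-csi
data:
  multi-vcenter-csi-topology: "true"
  # ...the remaining feature gates stay as shipped in the manifest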

dzanto commented 8 months ago

I added multi-vcenter-csi-topology: true to the configmap, but it didn't help. I also tried multi-vcenter-csi-topology: false. vsphere-csi-controller crashed again after creating the PVC.

shalini-b commented 8 months ago

Can you post the logs from when you set multi-vcenter-csi-topology to true in the configmap?

dzanto commented 8 months ago
{"level":"info","time":"2024-01-30T06:39:53.042383033Z","caller":"vanilla/controller.go:1805","msg":"CreateVolume: called with args {Name:pvc-13e4061b-61d6-4f6a-ad6a-ef7d1425dc4e CapacityRange:required_bytes:10485760  VolumeCapabilities:[mount:<fs_type:\"nfs4\" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > ] Parameters:map[] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"2177c87f-8ff3-406f-8ffa-b366a1d14a12"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1a4f182]

goroutine 654 [running]:
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/common.(*AuthManager).GetFsEnabledClusterToDsMap(0x0, {0x26af658?, 0xc0000585b8?})
        /build/pkg/csi/service/common/authmanager.go:137 +0x62
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).createFileVolume(0xc00034a190, {0x26af658, 0xc000632300}, 0xc0001b2770)
        /build/pkg/csi/service/vanilla/controller.go:1734 +0xcf3
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume.func1()
        /build/pkg/csi/service/vanilla/controller.go:1836 +0x2c5
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume(0xc00034a190, {0x26af658, 0xc000fd8db0}, 0xc0001b2770)
        /build/pkg/csi/service/vanilla/controller.go:1858 +0x1bb
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x229bba0?, 0xc00034a190}, {0x26af658, 0xc000fd8db0}, 0xc0003bb800, 0x0)
        /go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5671 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000196a80, {0x26b6218, 0xc000bea9c0}, 0xc0002797a0, 0xc000a37860, 0x38db8a0, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283 +0xcfe
google.golang.org/grpc.(*Server).handleStream(0xc000196a80, {0x26b6218, 0xc000bea9c0}, 0xc0002797a0, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620 +0xa2f
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:920 +0x28a
Midaxess commented 6 months ago

Hi, have you found a solution?

I have the same issue with my RKE2 and K3s clusters and the Helm chart rancher-vsphere-csi:103.0.0+up3.0.2-rancher1.

{"level":"info","time":"2024-03-15T15:34:05.026383613Z","caller":"vanilla/controller.go:2719","msg":"ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"91f15cdb-ed37-4014-b79e-ebd785e54684"}
{"level":"info","time":"2024-03-15T15:34:24.880189533Z","caller":"vanilla/controller.go:1806","msg":"CreateVolume: called with args {Name:pvc-2a06d694-903f-41da-85bb-1475e20d2ff9 CapacityRange:required_bytes:1073741824  VolumeCapabilities:[mount:<fs_type:\"nfs4\" > access_mode:<mode:MULTI_NODE_MULTI_WRITER > ] Parameters:map[] Secrets:map[] VolumeContentSource:<nil> AccessibilityRequirements:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}","TraceId":"a2fd7e91-f23e-44f9-8037-2a0c13595c03"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1b24e65]

goroutine 275 [running]:
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).createFileVolume(0xc0000a1c20, {0x26b2d98, 0xc000874660}, 0xc0004e2380)
        /build/pkg/csi/service/vanilla/controller.go:1737 +0xd05
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume.func1()
        /build/pkg/csi/service/vanilla/controller.go:1849 +0x3d7
sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).CreateVolume(0xc0000a1c20, {0x26b2d98, 0xc00085ce10}, 0xc0004e2380)
        /build/pkg/csi/service/vanilla/controller.go:1859 +0x1bb
github.com/container-storage-interface/spec/lib/go/csi._Controller_CreateVolume_Handler({0x229ed80?, 0xc0000a1c20}, {0x26b2d98, 0xc00085ce10}, 0xc0004fc660, 0x0)
        /go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5671 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0002cfc00, {0x26b9978, 0xc000557ba0}, 0xc0002eea20, 0xc000875ad0, 0x38e08a0, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283 +0xcfe
google.golang.org/grpc.(*Server).handleStream(0xc0002cfc00, {0x26b9978, 0xc000557ba0}, 0xc0002eea20, 0x0)
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620 +0xa2f
google.golang.org/grpc.(*Server).serveStreams.func1.2()
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:920 +0x28a

Anything else we need to know?:

The vSAN File Service is working: I created an NFS file share and was able to mount it manually on a node of the k8s cluster.

Environment:

Midaxess commented 6 months ago

OK, after a few weeks I found the solution.

Edit your ConfigMap:

csi-auth-check: 'false' -> csi-auth-check: 'true'

Then restart the pods of the vSphere plugin and recreate the PVC.
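A minimal sketch of the resulting ConfigMap fragment, assuming a Rancher deployment that keeps it in kube-system as shown earlier in this thread (the nil AuthManager in the panic above is consistent with the auth manager only being set up when csi-auth-check is enabled):

apiVersion: v1
kind: ConfigMap
metadata:
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system   # vmware-system-csi for the upstream manifests
data:
  csi-auth-check: 'true'   # was 'false'
  # ...leave the other feature gates unchanged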

@dzanto Let me know if this helped you.

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 month ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2755#issuecomment-2315384550):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.