kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0
289 stars 173 forks source link

Container Volumes not showing in vcenter after CSI driver upgrade #2863

Open NelSat opened 3 months ago

NelSat commented 3 months ago

/kind bug

What happened:

We have 5 kubernetes clusters running on rancher/RKE with vsphere CSI driver installed. We recently upgraded our clusters to kubernetes 1.24. We then decided to upgrade the CPI to 1.24, and the vsphere CSI from 2.6.4 to 3.0.3. Most clusters seem to be running ok after the upgrade, but in one specific cluster we ran into some trouble after a few days. We realized some statefulsets stopped working with errors like "Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[dshm data]: timed out waiting for the condition". While checking the monitor tab in vcenter, we realized all PV's from that specific cluster weren't being displayed anymore. The vsphere-syncer container shows the following errors in its log:

{"level":"error","time":"2024-04-22T22:25:14.807503295Z","caller":"migration/migration.go:225","msg":"failed to register volume for volumeSpec: &{/verdaccio/storage }, with err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).GetVolumeID\n\t/build/pkg/apis/migration/migration.go:225\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:98\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"error","time":"2024-04-22T22:25:14.807520119Z","caller":"syncer/fullsync.go:100","msg":"FullSync for VC pvcn01.sof.intra: Failed to get VolumeID from volumeMigrationService for spec: &{/verdaccio/storage }. Err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:100\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"info","time":"2024-04-22T22:25:14.807542385Z","caller":"syncer/metadatasyncer.go:551","msg":"CSI full sync failed with error: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\""}
{"level":"info","time":"2024-04-22T22:55:14.797778219Z","caller":"syncer/metadatasyncer.go:536","msg":"fullSync is triggered"}
{"level":"info","time":"2024-04-22T22:55:14.79886185Z","caller":"syncer/fullsync.go:46","msg":"FullSync for VC pvcn01.sof.intra: start"}
{"level":"error","time":"2024-04-22T22:55:14.805566227Z","caller":"migration/migration.go:449","msg":"failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).registerVolume\n\t/build/pkg/apis/migration/migration.go:449\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).GetVolumeID\n\t/build/pkg/apis/migration/migration.go:223\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:98\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"error","time":"2024-04-22T22:55:14.805718644Z","caller":"migration/migration.go:225","msg":"failed to register volume for volumeSpec: &{/verdaccio/storage }, with err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/apis/migration.(*volumeMigration).GetVolumeID\n\t/build/pkg/apis/migration/migration.go:225\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:98\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"error","time":"2024-04-22T22:55:14.805769516Z","caller":"syncer/fullsync.go:100","msg":"FullSync for VC pvcn01.sof.intra: Failed to get VolumeID from volumeMigrationService for spec: &{/verdaccio/storage }. Err: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\"","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.CsiFullSync\n\t/build/pkg/syncer/fullsync.go:100\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/syncer.InitMetadataSyncer.func11\n\t/build/pkg/syncer/metadatasyncer.go:549"}
{"level":"info","time":"2024-04-22T22:55:14.805823601Z","caller":"syncer/metadatasyncer.go:551","msg":"CSI full sync failed with error: failed to extract datastore name from in-tree volume path: \"/verdaccio/storage\""}

Looking back at vcenter events, right after the update we started getting lots of "Sync volume" and "Attach container volume" events from the nodes belonging to the problematic cluster.

I googled similar errors that ocurred in much older versions of CSI, but none of the workarounds, like reconciling the CNS catalog, worked for me. Also, from what I understand, all in-tree volumes should have been migrated to CSI long before this last update, since we were using the 2.6.4 version with migration enabled.

Tried updating CSI to 3.0.3rc1, but only thing that changed was the frequency of the fullsync calls by the syncer.

I'm able to create new volumes, and they all show in vcenter normally.

What you expected to happen:

CSI migration to run smoothly.

How to reproduce it (as minimally and precisely as possible):

Upgraded CPI to 1.24 via helm chart, and then CSI driver to 3.0.3 by applying the manifests according to vmware documentation.

Anything else we need to know?: Attacher logs:

I0422 23:25:51.516461       1 controller.go:167] Ignoring VolumeAttachment "csi-60dadb2af1c06faf14306ee6cbad41d4202178189ac514a2f776f5b66a9dd129" change
I0422 23:25:51.517738       1 csi_handler.go:607] Saving detach error to "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085"
I0422 23:25:51.519699       1 csi_handler.go:618] Saved detach error to "csi-bd4c20e084df667d96b57fc08842d7faa9d36d01894163805e4da34469e09e9f"
I0422 23:25:51.519743       1 csi_handler.go:234] Error processing "csi-bd4c20e084df667d96b57fc08842d7faa9d36d01894163805e4da34469e09e9f": failed to detach: rpc error: code = Internal desc = volumeID "ce02b2b4-425b-4eb9-903b-9356f3360c0f" not found in QueryVolume
I0422 23:25:51.520500       1 controller.go:167] Ignoring VolumeAttachment "csi-bd4c20e084df667d96b57fc08842d7faa9d36d01894163805e4da34469e09e9f" change
I0422 23:25:51.522020       1 csi_handler.go:607] Saving detach error to "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79"
I0422 23:25:51.522652       1 controller.go:167] Ignoring VolumeAttachment "csi-47a42b20c1708c01cad92d6f99c3f2bbb0e7c281be6df195a8a90ca6cd528268" change
I0422 23:25:51.522784       1 csi_handler.go:618] Saved detach error to "csi-47a42b20c1708c01cad92d6f99c3f2bbb0e7c281be6df195a8a90ca6cd528268"
I0422 23:25:51.522827       1 csi_handler.go:234] Error processing "csi-47a42b20c1708c01cad92d6f99c3f2bbb0e7c281be6df195a8a90ca6cd528268": failed to detach: rpc error: code = Internal desc = volumeID "eb1c60de-5225-482d-a0c1-8081f708ed99" not found in QueryVolume
I0422 23:25:51.526304       1 csi_handler.go:607] Saving detach error to "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4"
I0422 23:25:51.526340       1 csi_handler.go:618] Saved detach error to "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085"
I0422 23:25:51.526362       1 csi_handler.go:234] Error processing "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085": failed to detach: rpc error: code = Internal desc = volumeID "9bd79636-37f4-4022-90ef-56109d9aac1f" not found in QueryVolume
I0422 23:25:51.527388       1 controller.go:167] Ignoring VolumeAttachment "csi-5e8922476ef04758f8a871589d4c12c18fb9dd999fcbfdcf9dff4e40477fe085" change
I0422 23:25:51.529637       1 csi_handler.go:618] Saved detach error to "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79"
I0422 23:25:51.529673       1 csi_handler.go:234] Error processing "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79": failed to detach: rpc error: code = Internal desc = volumeID "0f472b9d-b6ac-4c48-9f37-4aa963b41154" not found in QueryVolume
I0422 23:25:51.530407       1 controller.go:167] Ignoring VolumeAttachment "csi-9972fa01b2c76c5dce6ca2a68202eea7f3c2a4ca99c6aad0dcbff9755db9cf79" change
I0422 23:25:51.530998       1 csi_handler.go:618] Saved detach error to "csi-60dadb2af1c06faf14306ee6cbad41d4202178189ac514a2f776f5b66a9dd129"
I0422 23:25:51.531043       1 csi_handler.go:234] Error processing "csi-60dadb2af1c06faf14306ee6cbad41d4202178189ac514a2f776f5b66a9dd129": failed to detach: rpc error: code = Internal desc = volumeID "d6184ece-e53a-4851-95dd-8c7a9605842b" not found in QueryVolume
I0422 23:25:51.532438       1 csi_handler.go:618] Saved detach error to "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4"
I0422 23:25:51.532471       1 csi_handler.go:234] Error processing "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4": failed to detach: rpc error: code = Internal desc = volumeID "b436a8f2-5c26-4243-8355-15699d91d5c6" not found in QueryVolume
I0422 23:25:51.533601       1 controller.go:167] Ignoring VolumeAttachment "csi-17c07cd0c45986e88f9d5b1648ca024f9dca183897845cdb4a591de3e17c60d4" change
I0422 23:25:54.565248       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
E0422 23:25:54.578944       1 controller.go:146] Failed to reconcile volume attachments: failed to ListVolumes: failed to list volumes: rpc error: code = FailedPrecondition desc = difference between number of K8s volumes and CNS volumes is greater than threshold.
I0422 23:26:44.549739       1 reflector.go:559] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.CSINode total 4 items received
I0422 23:26:54.581024       1 csi_handler.go:123] Reconciling VolumeAttachments with driver backend state
E0422 23:26:54.594484       1 controller.go:146] Failed to reconcile volume attachments: failed to ListVolumes: failed to list volumes: rpc error: code = FailedPrecondition desc = difference between number of K8s volumes and CNS volumes is greater than threshold.

csi-controller logs

{"level":"error","time":"2024-04-22T23:25:51.521555326Z","caller":"vanilla/controller.go:2215","msg":"volumeID \"0f472b9d-b6ac-4c48-9f37-4aa963b41154\" not found in QueryVolume","TraceId":"512f0c70-c74c-4e41-bff7-ea2d6dc2be21","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2215\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2286\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:51.521673344Z","caller":"vanilla/controller.go:2289","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"detach-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"512f0c70-c74c-4e41-bff7-ea2d6dc2be21","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2289\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:51.524805416Z","caller":"vanilla/controller.go:2215","msg":"volumeID \"b436a8f2-5c26-4243-8355-15699d91d5c6\" not found in QueryVolume","TraceId":"b1d8b022-30c6-4d6a-8b93-3e58502fe953","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2215\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2286\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:51.52607625Z","caller":"vanilla/controller.go:2289","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"detach-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"b1d8b022-30c6-4d6a-8b93-3e58502fe953","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ControllerUnpublishVolume\n\t/build/pkg/csi/service/vanilla/controller.go:2289\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ControllerUnpublishVolume_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5725\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:25:54.568463081Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc"}
{"level":"error","time":"2024-04-22T23:25:54.578449047Z","caller":"vanilla/controller.go:2519","msg":"difference between number of K8s volumes: 54, and CNS volumes: 0, is greater than threshold: 50, and completely out of sync.","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2519\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:54.578566327Z","caller":"vanilla/controller.go:2522","msg":"difference between number of K8s volumes and CNS volumes is greater than threshold.","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2522\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:25:54.578630482Z","caller":"vanilla/controller.go:2560","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"list-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"cb048756-458a-40a8-8988-9f3c6cbf74dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2560\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:26:54.582476526Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da"}
{"level":"error","time":"2024-04-22T23:26:54.593711122Z","caller":"vanilla/controller.go:2519","msg":"difference between number of K8s volumes: 54, and CNS volumes: 0, is greater than threshold: 50, and completely out of sync.","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2519\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:26:54.593825996Z","caller":"vanilla/controller.go:2522","msg":"difference between number of K8s volumes and CNS volumes is greater than threshold.","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2522\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:26:54.593916529Z","caller":"vanilla/controller.go:2560","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"list-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"bca4ef1a-7b86-4cd8-80e6-64c186d8c0da","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2560\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:27:54.598308233Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203"}
{"level":"error","time":"2024-04-22T23:27:54.609811978Z","caller":"vanilla/controller.go:2519","msg":"difference between number of K8s volumes: 54, and CNS volumes: 0, is greater than threshold: 50, and completely out of sync.","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2519\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:27:54.609907854Z","caller":"vanilla/controller.go:2522","msg":"difference between number of K8s volumes and CNS volumes is greater than threshold.","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes.func1\n\t/build/pkg/csi/service/vanilla/controller.go:2522\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2557\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"error","time":"2024-04-22T23:27:54.609957844Z","caller":"vanilla/controller.go:2560","msg":"Operation failed, reporting failure status to Prometheus. Operation Type: \"list-volume\", Volume Type: \"unknown\", Fault Type: \"csi.fault.Internal\"","TraceId":"8b8e00f8-1a4c-4c58-aa30-f20235996203","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).ListVolumes\n\t/build/pkg/csi/service/vanilla/controller.go:2560\ngithub.com/container-storage-interface/spec/lib/go/csi._Controller_ListVolumes_Handler\n\t/go/pkg/mod/github.com/container-storage-interface/spec@v1.7.0/lib/go/csi/csi.pb.go:5761\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1283\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:1620\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.47.0/server.go:922"}
{"level":"info","time":"2024-04-22T23:28:01.648430291Z","caller":"common/authmanager.go:279","msg":"No vSAN datastores found for vCenter \"pvcn01.sof.intra\"","TraceId":"c3ceb22e-97e0-4f34-b1a0-5e6b695b80fb"}
{"level":"info","time":"2024-04-22T23:28:01.648484693Z","caller":"common/authmanager.go:183","msg":"auth manager: newFsEnabledClusterToDsMap is updated to map[] for vCenter \"pvcn01.sof.intra\"","TraceId":"c3ceb22e-97e0-4f34-b1a0-5e6b695b80fb"}
{"level":"info","time":"2024-04-22T23:28:01.661403831Z","caller":"common/authmanager.go:163","msg":"auth manager: datastoreMapForBlockVolumes is updated to map[ds:///vmfs/volumes/65e9caaa-47491870-6801-d06726c49670/:Datastore: Datastore:datastore-221625, datastore URL: ds:///vmfs/volumes/65e9caaa-47491870-6801-d06726c49670/ ds:///vmfs/volumes/65fce6e8-53512aaa-a749-d06726c49670/:Datastore: Datastore:datastore-223172, datastore URL: ds:///vmfs/volumes/65fce6e8-53512aaa-a749-d06726c49670/ ds:///vmfs/volumes/66220883-eb34d14e-f70f-d06726c49670/:Datastore: Datastore:datastore-229535, datastore URL: ds:///vmfs/volumes/66220883-eb34d14e-f70f-d06726c49670/ ds:///vmfs/volumes/6622d6ec-b0d08c4a-5469-d06726c49670/:Datastore: Datastore:datastore-229556, datastore URL: ds:///vmfs/volumes/6622d6ec-b0d08c4a-5469-d06726c49670/] for vCenter \"pvcn01.sof.intra\"","TraceId":"93b7ed52-ed2c-48fd-be65-a31b77396e7f"}
{"level":"info","time":"2024-04-22T23:28:01.662262621Z","caller":"k8sorchestrator/topology.go:239","msg":"Refreshing preferred datastores information...","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}
{"level":"info","time":"2024-04-22T23:28:01.662975948Z","caller":"config/config.go:404","msg":"No Net Permissions given in Config. Using default permissions.","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}
{"level":"info","time":"2024-04-22T23:28:01.663038928Z","caller":"vsphere/utils.go:265","msg":"Defaulting timeout for vCenter Client to 5 minutes","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}
{"level":"warn","time":"2024-04-22T23:28:02.11051213Z","caller":"common/topology.go:159","msg":"failed to retrieve tags for category \"cns.vmware.topology-preferred-datastores\" in vCenter \"pvcn01.sof.intra\". Reason: GET https://pvcn01.sof.intra:443/rest/com/vmware/cis/tagging/category/id:cns.vmware.topology-preferred-datastores: 404 Not Found","TraceId":"349eb887-de9f-4863-aa8b-75c673449a3a"}

Environment:

NelSat commented 2 months ago

Update: after a day or so, newly created volumes also disappeared from vcenter container volumes list.

sathieu commented 2 weeks ago

probably same as https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/2802.

This may be related to csi-provisionner enabling topology support by default: https://github.com/kubernetes-csi/external-provisioner/pull/1167

sathieu commented 2 weeks ago

It seems that topology support is now enabled by default, but according to docs :

If you already use vSphere Container Storage Plug-in to run applications, but haven't used the topology feature, you must re-create the entire cluster and delete any existing PVCs in the system to be able to use the topology feature.

So this looks like a breaking change. Can we disable topology with topology-preferential-datastores=false?

@divyenpatel any idea?

NelSat commented 2 weeks ago

Update: after trying a lot of different things to recover the volume catalog in vcenter, we just gave up and manually migrated our data from the broken vsphere volumes to other kinds of storage. Then we did a full purge of the vsphere cpi/csi and all the vsphere volumes from that cluster, and re-installed it all using the rancher helm charts. So far it's working. We were completely clueless about topology settings being the root cause, since we never used it and none of the error messages we found led us this way.