kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

vSphere CSI migration fails for volumes with "The object or item referred to could not be found." #3082

Open gnufied opened 4 hours ago

gnufied commented 4 hours ago

One of our customers upgraded from a version of k8s where CSI migration was not enabled to a version where it is enabled. Now a bunch of those PVs are unusable with the new version of k8s.

When I dug further, I found that the syncer receives a vim.fault.NotFound error for the volume we are trying to register with CNS. The full error in the syncer looks like:

", fault: "(*types.LocalizedMethodFault)(0xc00252d500)({
 DynamicData: (types.DynamicData) {
 },
 Fault: (*types.NotFound)(0xc00252d520)({
  VimFault: (types.VimFault) {
   MethodFault: (types.MethodFault) {
    FaultCause: (*types.LocalizedMethodFault)(<nil>),
    FaultMessage: ([]types.LocalizableMessage) <nil>
   }
  }
 }),
 LocalizedMessage: (string) (len=50) \"The object or item referred to could not be found.\"

But when I open the vSAN logs in vCenter, I see:

2024-10-17T08:56:55.738Z info vsanvcmgmtd[16073] [vSAN@6876 sub=vmomi.soapStub[5] opID=91718591] SOAP request returned HTTP failure; <<io_obj p:0x00007f2d3c3f3000, h:30, <TCP '127.0.0.1 : 34092'>, <TCP '127.0.0.1 : 1080'>>, /sdk>, method: registerDisk; code: 500(Internal Server Error); fault: (vim.fault.AlreadyExists) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    name = "e241bc4f-b78b-4cd5-997f-3424eb561ef1"
-->    msg = "Received SOAP response fault from [<<io_obj p:0x00007f2d3c3f3000, h:30, <TCP '127.0.0.1 : 34092'>, <TCP '127.0.0.1 : 1080'>>, /sdk>]: registerDisk
--> The specified key, name, or identifier 'e241bc4f-b78b-4cd5-997f-3424eb561ef1' already exists."
--> }

2024-10-17T08:56:56.197Z error vsanvcmgmtd[16073] [vSAN@6876 sub=FcdService opID=91718591] Failed to find vol e241bc4f-b78b-4cd5-997f-3424eb561ef1 from volumeInfoCache
2024-10-17T08:56:56.203Z error vsanvcmgmtd[33985] [vSAN@6876 sub=Workflow opID=91718591] Workflow previous action has fault (vim.fault.NotFound) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>
-->    msg = "e241bc4f-b78b-4cd5-997f-3424eb561ef1"

So although the vSAN service thinks the volume is already registered (registerDisk fails with vim.fault.AlreadyExists), the volume is later not found in its volumeInfoCache, and hence vim.fault.NotFound is returned to the client:

2024-10-17T08:56:56.197Z error vsanvcmgmtd[16073] [vSAN@6876 sub=FcdService opID=91718591] Failed to find vol e241bc4f-b78b-4cd5-997f-3424eb561ef1 from volumeInfoCache
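
In case it helps anyone check how many volumes are hitting the same sequence, here is a rough Go sketch that scans vsanvcmgmtd log text and reports, per opID, the volumes where registerDisk returned AlreadyExists and a volumeInfoCache miss was logged afterwards. The function name `FindRegisterConflicts` and the matching strings are my own assumptions based only on the log lines quoted above; it also assumes each log entry sits on a single unwrapped line, so treat it as a triage aid, not a definitive parser.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	// Patterns inferred from the log lines in this issue (assumption).
	opIDRe      = regexp.MustCompile(`opID=([^\]\s]+)`)
	cacheMissRe = regexp.MustCompile(`Failed to find vol (\S+) from volumeInfoCache`)
)

// FindRegisterConflicts (hypothetical helper) returns, keyed by opID, the
// volume IDs for which a registerDisk AlreadyExists fault was logged and a
// volumeInfoCache miss followed under the same opID.
func FindRegisterConflicts(logText string) map[string][]string {
	sawAlreadyExists := map[string]bool{}
	conflicts := map[string][]string{}

	sc := bufio.NewScanner(strings.NewReader(logText))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // log lines can be long
	for sc.Scan() {
		line := sc.Text()
		m := opIDRe.FindStringSubmatch(line)
		if m == nil {
			continue // continuation lines ("-->") carry no opID
		}
		opID := m[1]
		if strings.Contains(line, "method: registerDisk") &&
			strings.Contains(line, "vim.fault.AlreadyExists") {
			sawAlreadyExists[opID] = true
		}
		// Only count cache misses that follow an AlreadyExists fault.
		if c := cacheMissRe.FindStringSubmatch(line); c != nil && sawAlreadyExists[opID] {
			conflicts[opID] = append(conflicts[opID], c[1])
		}
	}
	return conflicts
}

func main() {
	sample := "2024-10-17T08:56:55.738Z info vsanvcmgmtd[16073] [vSAN@6876 sub=vmomi.soapStub[5] opID=91718591] SOAP request returned HTTP failure; method: registerDisk; code: 500(Internal Server Error); fault: (vim.fault.AlreadyExists) {\n" +
		"2024-10-17T08:56:56.197Z error vsanvcmgmtd[16073] [vSAN@6876 sub=FcdService opID=91718591] Failed to find vol e241bc4f-b78b-4cd5-997f-3424eb561ef1 from volumeInfoCache\n"
	for opID, vols := range FindRegisterConflicts(sample) {
		fmt.Println(opID, vols)
	}
}
```

Grouping by opID is a guess at how the two log entries relate; both lines in the excerpt above share opID=91718591, which is why I keyed on it.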

This looks similar to a case we observed earlier: https://knowledge.broadcom.com/external/article?legacyId=91752

Is there a workaround we can use?

gnufied commented 4 hours ago

cc @divyenpatel

gnufied commented 4 hours ago

Another point: in this case the customer is on vCenter 8.0.3. I thought this issue was fixed in 8.0.2.