kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0
293 stars 177 forks source link

Fix volume provisioning when duplicate region or zone tags exist #2814

Closed dobsonj closed 5 months ago

dobsonj commented 6 months ago

What this PR does / why we need it:

This PR allows volume provisioning to work in two scenarios:

A) Both the datacenter and the cluster have a (duplicate) region tag

datacenter1: region-A
  cluster1: region-A, zone-C
    host1: 

B) Both the cluster and the host have a (duplicate) zone tag

datacenter1: region-A
  cluster1: zone-C
    host1: zone-C

In both cases, the volume fails to be provisioned with:

Warning ProvisioningFailed 7m42s (x9 over 12m) csi.vsphere.vmware.com_vmware-vsphere-csi-driver-controller-f9bf8757c-7mhdl_4c338b41-cb91-400f-8034-eb1cf1452e34 failed to provision volume with StorageClass "thin-csi": rpc error: code = Internal desc = failed to create volume. Errors encountered: [No compatible datastores found for accessibility requirements [map[topology.csi.vmware.com/openshift-region:region-A topology.csi.vmware.com/openshift-zone:zone-C]] pertaining to vCenter ...

Although this cluster is tagged incorrectly, this scenario did work on previous versions of the driver, which means upgrading the driver could break volume provisioning on existing clusters where this previously worked.

I propose a simple solution: when analyzing the tags, prefer the entity highest in the hierarchy and ignore the duplicate tag lower in the hierarchy. If the datacenter has a region tag, that takes precedent over any cluster underneath it. Likewise, if the cluster has a zone tag, that takes precedent over any host in that cluster.

This can be done with two changes:

1) Remove duplicate hosts in GetHostsForSegment

findCommonHostsforAllTopologyKeys only returns the host reference if the host is found exactly once in each segment. If the tag exists twice in the hierarchy, it fails and returns an empty slice (no common hosts) even though there is a common host in both region-A and zone-C. See this backtrace as an example. If we remove duplicate host entries before finding common hosts, it correctly returns the only common host for the region and zone tags.

2) Ignore duplicate tags in GetTopologyLabels

After changing GetHostsForSegment, it is still possible for provisioning to fail during GetTopologyLabels because the CSINodeTopology reports Error: duplicate values detected for category openshift-zone as \\\"zone-C\\\" and \\\"zone-C\\\" in its status. Since GetTopologyLabels iterates up the hierarchy through ancestors starting from the NodeVM, it could update the value instead of throwing an error, which means the higher-level entity's value takes precedent.

Which issue this PR fixes:

fixes #2681

Testing done:

Manually tested volume provisioning with and without this change in the two scenarios listed above.

Special notes for your reviewer:

/cc @divyenpatel @shalini-b @gnufied

Release note:

Fix volume provisioning when duplicate region or zone tags exist
divyenpatel commented 5 months ago

/ok-to-test

dobsonj commented 5 months ago

Hi @shalini-b @divyenpatel, do you feel comfortable accepting this change? Or do you have any additional feedback?

shalini-b commented 5 months ago

/approve

k8s-ci-robot commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dobsonj, shalini-b

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/master/OWNERS)~~ [shalini-b] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
jsafrane commented 5 months ago

/lgtm