kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

vSphere CSI: Storage DRS support #686

Open yuga711 opened 3 years ago

yuga711 commented 3 years ago

The vSphere CSI driver page states that the vSphere CSI driver and Cloud Native Storage in vSphere do not currently support the Storage DRS feature. Since the CSI driver already creates FCDs, can you please explain what the limitation is in supporting Storage DRS? Also, is this feature on the CSI driver roadmap? Thanks!


yuga711 commented 3 years ago

cc @divyenpatel @msau42 @jingxu97

divyenpatel commented 3 years ago

@yuga711

Since the CSI driver already creates FCDs, can you please explain what the limitation is in supporting Storage DRS?

FCD does not support SDRS yet.

Also, is this feature on the CSI driver roadmap?

Yes, SDRS support is on our roadmap.

SandeepPissay commented 3 years ago

@yuga711 Can you explain your use case or your customers' use cases for SDRS support? SDRS does storage load balancing between datastores based on performance, latency, capacity usage, etc. It also helps in moving storage objects (currently only VMs) out of a datastore when you put it into maintenance mode. We have commonly heard of the datastore maintenance mode and capacity load balancing use cases. So having some details on the immediate problems to be solved will help prioritize the feature. Please reach out to me in Kubernetes Slack (#provider-vsphere) if you would like to share some details.

limb-adobe commented 3 years ago

@SandeepPissay Our immediate needs for SDRS support fall under the common requests you have already stated: datastore maintenance mode and capacity load balancing.

SandeepPissay commented 3 years ago

@limb-adobe what version of vSphere are you using? I'm wondering whether you are fine with upgrading vSphere, or whether you are looking for capacity load balancing and datastore maintenance mode support on already released vSphere versions.

tgelter commented 3 years ago

@limb-adobe what version of vSphere are you using? I'm wondering whether you are fine with upgrading vSphere, or whether you are looking for capacity load balancing and datastore maintenance mode support on already released vSphere versions.

@SandeepPissay, @limb-adobe is my colleague, so I can field this question for now. We are using vCenter Server 7.0 Update 1c. Landon would need to provide guidance about future version timelines. I think we'd be open to upgrading, but we would likely prefer support for this version of vCenter, since we're currently in the process of globally migrating off of 6.7.

yuga711 commented 3 years ago

@yuga711 Can you explain your use case or your customers' use cases for SDRS support? SDRS does storage load balancing between datastores based on performance, latency, capacity usage, etc. It also helps in moving storage objects (currently only VMs) out of a datastore when you put it into maintenance mode. We have commonly heard of the datastore maintenance mode and capacity load balancing use cases. So having some details on the immediate problems to be solved will help prioritize the feature. Please reach out to me in Kubernetes Slack (#provider-vsphere) if you would like to share some details.

Our customers are looking for the common SDRS use cases stated here: capacity and IO load balancing, and storage-object migration.

SandeepPissay commented 3 years ago

@yuga711 Let's say we cannot support SDRS in the near future. Is there any preference on which of these use cases are more important to your customers:

  1. Datastore maintenance to relocate volumes (and VMs) out of the datastore. This should balance the capacity usage across the other available datastores.
  2. Capacity load balancing during the provisioning operation so that the datastores are balanced on usage.
  3. IO load balancing between datastores, dynamically moving volumes to load balance IO performance.

IMHO, (3) would require SDRS support, which may take a long time.

How would you (or your customers) prioritize these? A few more questions that come to mind:

  1. How many of your customers are asking for this feature?
  2. Which version of vSphere are they using?
  3. When do they need this feature, and are they open to upgrading vSphere?
  4. We can probably enhance CSI to address datastore maintenance and capacity load balancing sooner. Would that be a better short-term solution for your customers? (A StorageClass sketch of what static placement looks like today follows below.)
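
For context, a minimal sketch of what static placement looks like today (not taken from this thread): provisioning is steered per StorageClass, typically via an SPBM storage policy, so there is no dynamic capacity balancing. The policy name is hypothetical, and the parameter name should be verified against your driver version's documentation.

    # Sketch only: pin provisioning through a storage policy; no dynamic balancing.
    cat <<'EOF' | kubectl apply -f -
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: vsphere-pinned
    provisioner: csi.vsphere.vmware.com
    parameters:
      storagepolicyname: "k8s-gold"   # hypothetical SPBM policy name
    EOF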

mitchellmaler commented 3 years ago

We are working on upgrading to 7.0u2 and are investigating using vSphere CSI for our vanilla Kubernetes clusters; we're excited about the incoming online resize support.

We are mostly looking for capacity and IO load balancing support, and we would pick capacity/space load balancing as a good first SDRS feature to support.

SandeepPissay commented 3 years ago

@mitchellmaler can you answer these questions:

  1. How many datastores are present in the vCenter inventory?
  2. Are they all VMFS datastores? I'm wondering if you are looking at capacity load balancing for the same datastore type or between different datastore types like VMFS to NFS, VVOL to VMFS, VSAN to VMFS, etc.
  3. Which version of vSphere do you have?
  4. How many vSphere clusters do you have? Are you looking for capacity load balancing between datastores across datacenters?

mitchellmaler commented 3 years ago

@SandeepPissay

  1. The current SDRS datastore clusters that we will be working with contain around 15 to 20 datastores.
  2. They are all VMFS. We are only looking for the same type as we do not use the other types much.
  3. We are currently on 6.7u3 but looking to upgrade to 7 soon.
  4. We have multiple clusters per datacenter but only looking to load balance within the same datacenter and not across.

tgelter commented 3 years ago

@SandeepPissay

  1. The current SDRS datastore clusters that we will be working with contain around 15 to 20 datastores.
  2. They are all VMFS. We are only looking for the same type as we do not use the other types much.
  3. We are currently on 6.7u3 but looking to upgrade to 7 soon.
  4. We have multiple clusters per datacenter but only looking to load balance within the same datacenter and not across.

FWIW our configuration looks almost identical to this. We use a smaller number of datastores at present, but all are VMFS. Also, we're using vSphere 7.0u2.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

vrabbi commented 3 years ago

Any updates on this?

SandeepPissay commented 3 years ago

@vrabbi we are tracking this ask in our internal backlog. This seems to be a large work item, so I do not have visibility into when this feature will be released.

vrabbi commented 3 years ago

@SandeepPissay Thanks for the quick reply. This would be huge for many of our customers. If there is any information that would be helpful to get from me, please let me know. I get that this is a large work item and completely understand it may take some time. If there is anything we can do to help add context, use cases, references, etc. in order to help push this forward, I would be glad to help with that.

SandeepPissay commented 3 years ago

If there is anything we can do to help add context, use cases, references, etc. in order to help push this forward, I would be glad to help with that.

@vrabbi yes, this will be super useful! Could you send this info in an email to me? My email is ssrinivas@vmware.com. Thanks!

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

tgelter commented 3 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tgelter commented 3 years ago

/remove-lifecycle rotten

McAndersDK commented 2 years ago

Any news on this? And how do we recover when CNS tells us it can't find the VMDK after it has been moved by Storage DRS?

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

tgelter commented 2 years ago

/remove-lifecycle rotten

MarkMorsing commented 2 years ago

Any updates on this?

Having large Kubernetes environments without DRS is crazy. Also, enabling DRS on a Kubernetes datastore breaks the FCD path, thus breaking the persistent storage function.

jingxu97 commented 2 years ago

@MarkMorsing, could you provide some details on how enabling DRS on a kubernetes datastore breaks the FCD path? Thanks!

MarkMorsing commented 2 years ago

@jingxu97 SDRS breaks the FCD path when this occurs:

Server-A has a PV mounted. An SDRS migration is triggered and moves Server-A to a datastore with more free space. SDRS renames all VM files to ensure consistency, so the PV's GUID-named .vmdk is renamed to the server name (Server-A.vmdk), and the disk is also moved out of the FCD folder and into Server-A's folder.

Example: Datastore\FCD\PV-GUID\PV-GUID.vmdk is mounted on Server-A.

SDRS is triggered on Server-A.

PV-GUID.vmdk becomes Datastore\Server-A\Server-A00X.vmdk

And no other nodes can mount or unmount the PV.

Rather critical issue in my opinion - hope it makes sense
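
One way to check whether a backing VMDK has actually been moved out of the fcd folder is to list the datastore folders with govc. This is only a sketch: the datastore name, folder names, and govc flags are illustrative and may differ in your environment.

    # Before the SDRS move, the PV's backing disk should live under the fcd folder:
    govc datastore.ls -ds=Datastore fcd/PV-GUID
    # After the move, the renamed VMDK shows up under the VM's folder instead:
    govc datastore.ls -ds=Datastore Server-A/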

jingxu97 commented 2 years ago

@MarkMorsing just to confirm, are you using in-tree or CSI to provision the volume?

If using CSI, I think the PV uses a volume handle, which is not related to the path, for example:

    VolumeHandle:      73280b64-a261-49a3-942a-13618634a5df
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1643318253130-8081-csi.vsphere.vmware.com
                           type=vSphere CNS Block Volume
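
If it helps anyone checking their own environment, the handle can be pulled straight from the PV object (the PV name below is a placeholder):

    # Print the CSI volume handle referenced by the PV; it is an ID, not a datastore path.
    kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'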

MarkMorsing commented 2 years ago

@jingxu97 We are using CSI version 2.2.2, and it expects the FCD file path and name.

we’ve had multiple lost PVs due to this.

also we’re not using vsan but regular FC VMFS datastores

jingxu97 commented 2 years ago

I am using a VMFS datastore too. I can see the volume is under the fcd directory, but in the PV spec the volumeHandle does not use a path, just an ID (different from the VMDK file name).

MarkMorsing commented 2 years ago

@jingxu97 Yes, but if you attach the PV to a VM and then storage-migrate the VM, the PV will be moved out of the FCD folder, and everything goes bad from there.

Try it

jingxu97 commented 2 years ago

Yes, I tried it and it is working after storage migration. The vmdk file name and path change completely, but the volumeHandle remains the same.

But I didn't enable SDRS; I migrated the VM manually with a storage-only migration.

MarkMorsing commented 2 years ago

@jingxu97 Ours was done automatically, and after that we can no longer detach the volume or attach it to another node.

msau42 commented 2 years ago

@MarkMorsing can you paste your PV spec so we can verify which volume driver you are using? The CSI driver's spec uses an ID, not a path, in its attributes.

heilerich commented 2 years ago

I do not have a suitable test environment right now to verify, but, if I remember correctly, the problem should not be in the CSI driver itself.

The CSI driver indeed uses the ID of the CNS volume (an object within vSphere) in the PV spec. When the VMDK files are renamed and moved by SDRS, this connection is not affected. But in a second step, when the CSI driver asks vSphere to mount the disk, vSphere tries to find the disk belonging to the CNS volume object. At that point vSphere returns an error, because it can no longer find the disk's VMDK file.

To clarify: K8s PV object asks for {CNS-ID} -(1)-> vSphere CNS Volume object is found, it contains {DISK-UID} -(2)-> vSphere tries to mount fcd/{DISK-UID}/{DISK-UID}.vmdk.

Connection (1) is unaffected by this bug. Connection (2) breaks when the volume is moved by SDRS. So strictly speaking, this bug is not an incompatibility between SDRS and the CSI driver, but an incompatibility between vSphere SDRS and vSphere Cloud Native Storage (CNS).

Because the CSI driver relies on CNS, it is also affected by this bug. I suspect that this must be fixed in the (proprietary) vSphere CNS code, not the CSI driver code.

To my understanding from https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/565#issuecomment-891621272, VMware is aware of the problem.
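
To make step (2) concrete, here is a rough sketch of inspecting what CNS resolves the handle to. It assumes a govc build that includes the CNS volume.ls command; the names are placeholders.

    # (1) The handle stored in the PV spec -- unchanged by SDRS:
    kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}'
    # (2) The CNS volume object for that handle; vSphere resolves it to a backing
    #     VMDK path, and that resolution is what breaks after an SDRS move:
    govc volume.ls | grep <volume-handle>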

jingxu97 commented 2 years ago

For my testing, after storage migration, I deleted the pod (which triggers volume unmount and detach) and created new pods (which trigger volume attach and mount on a different node), and it is working.
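
For anyone repeating that check, a rough outline of the test (node and pod names are placeholders, and it assumes the pod is managed by a controller that recreates it):

    # Cordon the node currently running the pod so the replacement lands elsewhere.
    kubectl cordon <node-with-pod>
    # Deleting the pod triggers volume unmount/detach; the recreated pod triggers
    # attach/mount on a different node.
    kubectl delete pod <pod-name>
    # Verify the replacement pod is running on another node and the volume attached cleanly.
    kubectl get pods -o wide
    kubectl get volumeattachment
    kubectl uncordon <node-with-pod>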

vrabbi commented 2 years ago

@MarkMorsing what vCenter version are you on? Changes were made in CNS in 7u2 and 7u3 that help address storage migration. It's not 100% yet, but it does work in many use cases.

MarkMorsing commented 2 years ago

@msau42 Here are the PV specs:

    "csi": {
      "driver": "csi.vsphere.vmware.com",
      "fsType": "ext4",
      "volumeAttributes": {
        "storage.kubernetes.io/csiProvisionerIdentity": "1633738834268-8081-csi.vsphere.vmware.com",
        "type": "vSphere CNS Block Volume"
      },
      "volumeHandle": "dd6552f3-b1f1-40cc-bd38-59a2d68c2683"
    },

@jingxu97 that's strange, whenever we try to remount a volume, it states that it can't find it.

@vrabbi We're running vCenter 7u2d

vrabbi commented 2 years ago

I believe the actual change may have been only in u3, but I don't remember off the top of my head right now.

vrabbi commented 2 years ago

It has worked for me in 7u3c as well

MarkMorsing commented 2 years ago

@vrabbi Hmm, I don't see anything in the vCenter 7.0u3x release notes about improvements to CNS. Can you point me in the right direction before I start planning an upgrade?

Also, we kind of need a 100% fix; we can't run a production environment on something that works 80% of the time :)

vrabbi commented 2 years ago

As it's not 100%, I don't believe it was added to the release notes. This indeed needs more backend work, and then I'm sure they will announce it.

vrabbi commented 2 years ago

#1192 explains what has been added in 7u3, but it is not part of the CSI driver yet.

MarkMorsing commented 2 years ago

@vrabbi That's great, but it should really be a top priority to get it integrated into the CSI driver. It'd boost stability and usability quite a bit.

jingxu97 commented 2 years ago

I am using vCenter 6.7u3, and it works.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tgelter commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tgelter commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale