yuga711 opened this issue 3 years ago (status: Open)
cc @divyenpatel @msau42 @jingxu97
@yuga711
> Since CSI driver already creates FCDs, can you please explain what is the limitation in supporting the storage DRS feature?

FCD does not support SDRS yet.

> Also, is this feature in the CSI driver roadmap?

Yes, we have SDRS support in our roadmap.
@yuga711 Can you explain your use case or your customers' use cases for SDRS support? SDRS does storage load balancing between datastores based on performance, latency, capacity usage, etc. It also helps in moving storage objects (currently only VMs) out of the datastore when you put a datastore into maintenance mode. We have commonly heard of datastore maintenance mode support and capacity load balancing use cases. So having some details on the immediate problems to be solved will help prioritize the feature. Please reach out to me in Kubernetes Slack (#provider-vsphere) if you would like to share some details.
@SandeepPissay our immediate need for SDRS support falls under the common requests you have already stated: datastore maintenance mode and capacity load balancing.
@limb-adobe what version of vSphere are you using? I'm wondering if you are fine with upgrading vSphere, or if you are looking for capacity load balancing and datastore maintenance mode support in already released vSphere versions.
@SandeepPissay, @limb-adobe is my colleague, so I can field this question for now. We are using vCenter Server 7.0 Update 1c. Landon would need to provide guidance about future version timelines. I think we'd be open to it, but we would likely prefer support for this version of vCenter since we're currently in the process of globally migrating off of 6.7.
Our customers are looking for the common SDRS use cases stated above: capacity and IO load balancing, and storage-object migration.
@yuga711 Let's say we cannot support SDRS in the near future. Is there any preference on which use cases are more important to your customers:
IMHO (3) will mean we need SDRS support, which may take a long time.
How would you (or your customers) prioritize it? A few more questions that come to my mind are:
We are working on upgrading to 7.0u2 and are investigating using the vSphere CSI driver for our vanilla Kubernetes clusters, and we are excited about the incoming online resize support.
We are mostly looking for capacity and IO load balancing support, and we would pick capacity/space load balancing as a good first SDRS feature to support.
@mitchellmaler can you answer these questions:
@SandeepPissay
- The current SDRS datastore clusters that we will be working with contain around 15 to 20 datastores.
- They are all VMFS. We are only looking for the same type as we do not use the other types much.
- We are currently on 6.7u3 but looking to upgrade to 7 soon.
- We have multiple clusters per datacenter but only looking to load balance within the same datacenter and not across.
FWIW our configuration looks almost identical to this. We use a smaller number of datastores at present, but all are VMFS. Also, we're using vSphere 7.0u2.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Any updates on this?
@vrabbi we are tracking this ask in our internal backlog. This seems to be a large work item, so I do not have visibility into when this feature will be released.
@SandeepPissay thanks for the quick reply. This would be huge for many of our customers. If there is any information that would be helpful to get from me, please let me know. I get that this is a large work item and completely understand it may take some time. If there is anything we can do to help add context, use cases, references, etc. in order to help push this forward, I would be glad to help with that.
@vrabbi yes, this will be super useful! Could you send this info in an email to me? My email is ssrinivas@vmware.com. Thanks!
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle rotten
Any news on this? And how do we recover when CNS tells us it can't find the VMDK after it has been moved by DRS?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/remove-lifecycle rotten
Any updates on this?
Having large Kubernetes environments without DRS is crazy. Also, enabling DRS on a Kubernetes datastore breaks the FCD path, thus breaking the persistent storage function.
@MarkMorsing, could you provide some details on how enabling DRS on a Kubernetes datastore breaks the FCD path? Thanks!
@jingxu97 SDRS breaks the FCD path when this occurs:
Server-A has a PV mounted. SDRS is triggered and migrates Server-A to a datastore with more free space. SDRS renames all VM files to keep them consistent, so the PV's GUID-named .vmdk is renamed after the server (Server-A.vmdk), and the disk is also moved out of the FCD folder into Server-A's folder.
Example:
- Datastore\FCD\PV-GUID\PV-GUID.vmdk is mounted on Server-A
- SDRS is triggered on Server-A
- PV-GUID.vmdk becomes Datastore\Server-A\Server-A00X.vmdk
After that, no other nodes can mount or unmount the PV.
Rather critical issue in my opinion - hope it makes sense.
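To see this from the vSphere side, a rough check along these lines should show whether the backing VMDK is still registered as a first-class disk (this is only a sketch: the PV and datastore names are placeholders, it assumes govc is configured for your vCenter, and it assumes the PV's volumeHandle matches the FCD ID for block volumes):

```shell
# Placeholder PV name; prints the CNS volume ID the CSI driver stores in the PV spec.
VOLUME_ID=$(kubectl get pv pvc-xxxxxxxx -o jsonpath='{.spec.csi.volumeHandle}')

# List the first-class disks on the datastore and look for that ID.
# After an SDRS move like the one above, the backing disk may no longer show up here.
govc disk.ls -ds my-datastore | grep "$VOLUME_ID"
```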
@MarkMorsing just to confirm, are you using in-tree or CSI to provision the volume?
If using CSI, I think the PV uses a volumeHandle, which is not related to the path. For example:
VolumeHandle: 73280b64-a261-49a3-942a-13618634a5df
ReadOnly: false
VolumeAttributes: storage.kubernetes.io/csiProvisionerIdentity=1643318253130-8081-csi.vsphere.vmware.com
type=vSphere CNS Block Volume
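For example, something like this (the PV name is a placeholder) should print just the ID, with no datastore path in it:

```shell
kubectl get pv pvc-xxxxxxxx -o jsonpath='{.spec.csi.volumeHandle}{"\n"}'
```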
@jingxu97 we are using CSI version 2.2.2, and it expects the FCD file path and name.
We've had multiple lost PVs due to this.
Also, we're not using vSAN but regular FC VMFS datastores.
I am using a VMFS datastore too. I can see the volume is under the fcd directory, but in the PV spec, the volumeHandle does not use a path, just an ID (different from the VMDK file name).
@jingxu97 Yes, but if you attach the PV to a VM and then storage-migrate the VM, the PV will be moved out of the FCD folder, and everything will go bad from there.
Try it
Yes, I tried it and it is working after storage migration. The VMDK file name and path are changed completely, but the volumeHandle remains the same.
But I didn't enable SDRS; I migrated the VM manually with a storage-only migration.
@jingxu97 Ours was done automatically, and after that we can no longer detach the volume or attach it to another node.
@MarkMorsing can you paste your PV spec so we can verify which volume driver you are using? The CSI driver's spec uses an ID, not a path, in its attributes.
I do not have a suitable test environment right now to verify, but, if I remember correctly, the problem should not be in the CSI driver itself.
The CSI driver indeed uses the ID of the CNS Volume (an object within vSphere) in the PV spec. When the VMDK files are renamed and moved by SDRS, this connection is not affected. But in a second step, when the CSI driver asks vSphere to mount the disk, vSphere tries to find the disk belonging to the CNS Volume object. At that point vSphere returns an error, because it cannot find the disk's VMDK file anymore.
To clarify: the K8s PV object asks for {CNS-ID}
-(1)-> the vSphere CNS Volume object is found; it contains {DISK-UID}
-(2)-> vSphere tries to mount fcd/{DISK-UID}/{DISK-UID}.vmdk
Connection (1) is unaffected by this bug. Connection (2) breaks when the volume is moved by SDRS. So strictly speaking, this bug is not an incompatibility between SDRS and the CSI driver, but an incompatibility between vSphere SDRS and vSphere Cloud Native Storage (CNS).
Because the CSI driver relies on CNS, it is also affected by this bug. I suspect that this must be fixed in the (proprietary) vSphere CNS code, not the CSI driver code.
To my understanding from https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/565#issuecomment-891621272, VMware is aware of the problem.
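On the Kubernetes side, the break in connection (2) typically surfaces like this (placeholder names; the exact error text varies by vSphere and CSI driver version):

```shell
# Pod events show the attach/mount failing (look for FailedAttachVolume / FailedMount).
kubectl describe pod my-pod

# The VolumeAttachment for the affected PV never becomes attached, and its status
# carries the error message returned by CNS.
kubectl get volumeattachment
kubectl describe volumeattachment csi-xxxxxxxxxxxx
```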
In my testing, after storage migration I delete the pod, which triggers volume unmount and detach, and create a new pod, which triggers volume attach and mount on a different node; it is working.
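Roughly, the flow I used looks like this (placeholder names; cordoning the old node is just one way to force the new pod onto a different node):

```shell
# 1. Delete the pod so the volume is unmounted and detached from the old node.
kubectl delete pod my-pod

# 2. Cordon the original node so the replacement pod is scheduled elsewhere.
kubectl cordon old-node-name

# 3. Recreate the pod (or let its Deployment/StatefulSet do it) and watch it
#    attach and mount on a different node.
kubectl get events -w
```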
@MarkMorsing what vCenter version are you on? Changes were made in CNS in 7u2 and 7u3 that help address storage migration. It's not 100% yet, but it does work in many use cases.
@msau42 Here are the PV specs:
"csi": {
    "driver": "csi.vsphere.vmware.com",
    "fsType": "ext4",
    "volumeAttributes": {
        "storage.kubernetes.io/csiProvisionerIdentity": "1633738834268-8081-csi.vsphere.vmware.com",
        "type": "vSphere CNS Block Volume"
    },
    "volumeHandle": "dd6552f3-b1f1-40cc-bd38-59a2d68c2683"
},
@jingxu97 that's strange, whenever we try to remount a volume, it states that it can't find it.
@vrabbi We're running vCenter 7u2d
I believe the actual change may have been only in u3, but I don't remember off the top of my head right now.
It has worked for me in 7u3c as well
@vrabbi Hmm, I don't see anything in the vCenter 7.0u3x release notes about improvements to CNS. Can you point me in the right direction before I start planning an upgrade?
Also, we kind of need a 100% fix; we can't run a production environment on something that works 80% of the time :)
As it's not 100%, I don't believe it was added to the release notes. This indeed needs more backend work, and then I'm sure they will announce it.
@vrabbi That's great, but it should really be a top priority to get it integrated into the CSI driver. It'd boost stability and usability quite a bit.
I am using vCenter 6.7u3; it works.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The vSphere CSI driver page states:
> vSphere CSI driver and Cloud Native Storage in vSphere does not currently support Storage DRS feature in vSphere.

Since the CSI driver already creates FCDs, can you please explain what is the limitation in supporting the Storage DRS feature? Also, is this feature in the CSI driver roadmap? Thanks!