open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.

vmware clustering issues with ceph/rbd iscsi HA support #341

mikechristie opened this issue 6 years ago

mikechristie commented 6 years ago

This is a placeholder for a known issue with VMware HA and ceph/rbd iscsi HA support.

It is not an issue when using single-host VMware setups with ceph/rbd iscsi HA support, and it is not an issue when using VMware HA setups with non-HA ceph/rbd iSCSI setups.

The problem is that in ceph/rbd iscsi HA mode, rbd requires the rbd exclusive lock when executing IO. If a LUN contains multiple VMs that have different active hosts, and one host cannot access the active/optimized iscsi gateway, that host will fail over to one of the non-primary gateways while the other hosts continue to use the primary gateway. The rbd lock will then bounce between the gateways and cause performance problems and possibly crashes/hangs in the iscsi gateways.
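If you want to confirm you are hitting this, you can watch which client holds the image's exclusive lock while the VMs are doing IO. Below is a minimal sketch using the python-rbd bindings (the pool and image names are placeholders, and the lock_get_owners() behaviour is assumed from the librbd lock API); if the reported owner keeps flipping between the gateways' client IDs, you are seeing the ping-pong described above.

```python
# Hedged sketch: poll the rbd exclusive-lock owner to observe it bouncing
# between iscsi gateways. Requires python-rados and python-rbd.
import time
import rados
import rbd

POOL = "rbd"            # placeholder pool name
IMAGE = "vmware-lun0"   # placeholder image backing the shared VMware LUN

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
image = rbd.Image(ioctx, IMAGE, read_only=True)
try:
    last = None
    for _ in range(30):                      # watch for about a minute
        owners = list(image.lock_get_owners())
        owner = owners[0]["owner"] if owners else "<none>"
        if owner != last:                    # print only when the owner moves
            print("exclusive lock now held by:", owner)
            last = owner
        time.sleep(2)
finally:
    image.close()
    ioctx.close()
    cluster.shutdown()
```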

A kernel and tcmu-runner change that allows tcmu-runner to share an rbd image's lock between multiple iscsi gateways is being designed.

A temporary workaround for smaller setups would be to use a VM per LUN. There is no workaround for larger setups.

GoozeyX commented 6 years ago

Hi Mike,

Do you have information on where those design changes are being proposed or discussed?

mikechristie commented 6 years ago

No.

Do you want to work on it or are you just wondering? If you want to work on it, I can bug the rbd maintainer about the rbd parts and I can write up what needs to be done for the kernel.

mikechristie commented 6 years ago

We are switching our failover type to explicit:

https://github.com/ceph/ceph-iscsi-config/pull/54 https://github.com/open-iscsi/tcmu-runner/pull/407

For ESX, when the followover feature is enabled (alua_followover=on), we will not hit the above problem. There is a similar feature for Linux. I have not found one for Windows, though.
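On the Linux initiator side, the analogous knob is dm-multipath's failback policy. Here is a hedged multipath.conf sketch; the device section values are illustrative assumptions rather than the official ceph-iscsi recommendation, but the followover value is the relevant piece: it tells multipathd to only fail back when the target itself moves the active path, so the initiator does not fight an explicit failover.

```
devices {
    device {
        vendor               "LIO-ORG"
        hardware_handler     "1 alua"
        path_grouping_policy "failover"
        prio                 "alua"
        # followover: only fail back when the target moves the active path
        failback             followover
        no_path_retry        queue
    }
}
```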

GoozeyX commented 6 years ago

Mike, would this also address the issue with PGRs not working if you have 2 gateways? From what I understand, alua_followover=no would solve the potential issue of rbd locks being bounced between two gateway nodes, but this wouldn't necessarily address PGRs (i.e. the ability to share a LUN across multiple hosts without potential data corruption)?

mikechristie commented 6 years ago

Mike, would this also address the issue with PGRs not working if you have 2 gateways? From what I

No. It will not help. For PGRs, even though one node is in standby, it still needs to be able to set up and report the PGR state. I think the Windows cluster validation test you need to run when you set up a Windows cluster even tests for this.

understand, alua_followover=no would solve the potential issue of rbd locks being bounced

Just to make sure there is no mixup. That should be "on" and not "no" :)
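For reference, the PGR state being discussed can be inspected from an initiator with sg_persist from sg3_utils. A hedged example (device paths are placeholders) of the kind of check the cluster validation test effectively performs, i.e. that registrations and the reservation are reported consistently through every gateway/path:

```
# read registered keys and the current reservation through one path
sg_persist --in --read-keys /dev/sdx
sg_persist --in --read-reservation /dev/sdx
# repeat on a device path that goes through the other gateway; both
# gateways need to report the same PGR state for clustering to be safe
```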

DennisKonrad commented 6 years ago

Hi Mike,

Looking through the commits, I see you are working on this. It would be nice to have some information about the current usability of ceph/rbd iSCSI HA with VMware HA. Or is there a target release that is supposed to support the shared locking?

mikechristie commented 6 years ago

I am currently working on LIO changes and the tcmu kernel/user interface to support shared locking and to fix another possible data corruption issue with VMware HA, similar to the one found for the single-path case here: https://github.com/open-iscsi/tcmu-runner/pull/384.

For the 1.4.0 release, which I am trying to complete by the end of August, the data corruption issues will be fixed in all configurations.

The specific path ping-ponging issue with VMware HA that this report was created for will not be fixed in that release. For both shared locking and PGRs, the primary part is being able to tell tcmu-runner which I_T nexus (iscsi path, basically) a command is coming from. This is taking me a lot longer than I thought, and I am currently fixing up related bugs in the PGR code. Specifically, this patch:

https://www.spinics.net/lists/target-devel/msg16945.html

ended up leading to a lot of other changes.

iceman91176 commented 4 years ago

Hello @mikechristie - does this issue still exist? If yes, are there other workarounds besides one VM per LUN? What is the maximum number of LUNs that iscsi-gw supports?

mikechristie commented 4 years ago

MatthiasGrandl commented 3 years ago

What is the current status here, @lxbsz? I tagged you because I assume Mike is not working on this anymore, and I would like for somebody to shed light on this. This is a pretty big showstopper for us.

lxbsz commented 3 years ago

What is the current status here, @lxbsz? I tagged you because I assume Mike is not working on this anymore, and I would like for somebody to shed light on this. This is a pretty big showstopper for us.

I am mainly focused on and occupied by cephfs stuff, and have not gotten a chance to work on it yet.

lkuhn900 commented 1 year ago

I'm working on a ceph iscsi PoC currently. The goal is to have a multi-rbd shared datastore across 3 ESXi hosts. Currently only one host at a time can actually see and write to the datastore. Is this something that can work?