Closed by lixiaokeng 1 year ago
I can't reproduce.
If I do `echo offline > /sys/class/block/sdb/device/state`, multipathd logs show:
Nov 09 08:26:08 luzifer multipathd[2919]: sda: tur state = up
Nov 09 08:26:12 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:12 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:17 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:17 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:22 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:22 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:27 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:27 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:28 luzifer multipathd[2919]: sda: tur state = up
Nov 09 08:26:32 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:32 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:37 luzifer multipathd[2919]: sdb: state down, checker not called
This is how it's expected to behave.
I made a mistake, sorry. I echoed offline on the server, not on the client. I used targetcli to create lun0, and the real disk backing lun0 on the server is sdb; that is the disk I set offline.
This is the log on the client:
Nov 4 17:51:27 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: performing delayed actions
Nov 4 17:51:27 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: reload [0 31457280 multipath 0 1 alua 2 1 service-time 0 1 1 8:32 1 service-time 0 1 1 8:16 1]
Nov 4 17:57:55 localhost multipathd[10826]: sdc: mark as failed
Nov 4 17:57:55 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov 4 17:57:56 localhost multipathd[10826]: sdb: mark as failed
Nov 4 17:57:56 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 0
Nov 4 17:58:00 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdc - tur checker reports path is up
Nov 4 17:58:00 localhost multipathd[10826]: 8:32: reinstated
Nov 4 17:58:00 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov 4 17:58:01 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdb - tur checker reports path is up
Nov 4 17:58:01 localhost multipathd[10826]: 8:16: reinstated
Nov 4 17:58:01 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 2
Nov 4 17:58:01 localhost multipathd[10826]: sdc: mark as failed
Nov 4 17:58:01 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov 4 17:58:02 localhost multipathd[10826]: sdb: mark as failed
Nov 4 17:58:02 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 0
Nov 4 17:58:05 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdc - tur checker reports path is up
Nov 4 17:58:05 localhost multipathd[10826]: 8:32: reinstated
Nov 4 17:58:05 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov 4 17:58:06 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdb - tur checker reports path is up
Nov 4 17:58:06 localhost multipathd[10826]: 8:16: reinstated
Nov 4 17:58:06 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 2
Nov 4 17:58:06 localhost multipathd[10826]: sdc: mark as failed
Nov 4 17:58:06 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov 4 17:58:07 localhost multipathd[10826]: sdb: mark as failed
So this is iSCSI?
I'm not sure that the solution to this problem is in multipath. The device is presumably really returning a good status to the TUR checker, even though it can't complete IO. I'm not sure what, short of basically copying the work of the directio checker, multipath could do to find out that this is the case. To work around the issue, the shaky path detection methods should be able to stop this sort of ping-ponging.
I agree. The TUR checker can only work if a GOOD response from TUR actually means that the device is able to handle I/O. That doesn't seem to be the case here.
What target is this?
I use the target created by targetcli.
See target_core_spc.c:1305. LIO always reports GOOD status to TUR. You can't use the TUR checker with LIO. If you will, this is a deficiency of the Linux target.
We should make an entry to our hwtable for this.
Can you please review and test https://github.com/openSUSE/multipath-tools/commit/350af2cebd3ce9f89ac7b63cd145309b237a2474, which I've just pushed to https://github.com/openSUSE/multipath-tools/tree/tip ?
There is a problem. We should also set `.detect_checker = DETECT_CHECKER_OFF` and `.product = "disk0"`. DETECT_CHECKER_OFF disables checker detection (detect_alua).
Right for `.detect_checker` (I keep making this mistake), but I don't understand why we'd need `disk0`. The regexp we're using should match any product.
It is OK. I tested this in 0.8.7, where the `.product` is "RBD"; it has been changed since. There are no other questions.
So, we need to add the `.detect_checker` line and we're good?
Yes
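Until such a hwtable entry ships, a per-device override in /etc/multipath.conf can express the same thing. This is only a sketch: the vendor/product strings assume LIO's default "LIO-ORG" inquiry data, and directio is chosen here as a checker that actually issues I/O:

```
devices {
    device {
        vendor          "LIO-ORG"
        product         ".*"
        detect_checker  no
        path_checker    directio
    }
}
```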
Here is a test: we use the tur path_checker, echo offline to the backstores, and then write to the multipath devices on the client. The writing process goes into D+ state because the requests queue up in the kernel. The path flaps down->up->down->up, and no_path_retry doesn't work because the TUR check succeeds while the I/O fails.
The problem can be avoided by using the directio path_checker or by setting no_path_retry fail. However, we want to keep using the tur path_checker because there are thousands of multipath devices. Do you have a good idea for this problem?
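For illustration, the essence of a directio-style check (as opposed to TUR, which only asks the target whether it is ready) can be sketched in Python. The function name, parameters, and fallback behaviour are all hypothetical, not multipath-tools code:

```python
import os
import mmap

def directio_check(path, blocksize=4096):
    """Return True if a single aligned read from `path` succeeds.

    Rough sketch of what a directio-style checker does: it verifies
    the whole I/O path by actually reading a block, so a target that
    answers GOOD to TUR but cannot complete I/O is still seen as down.
    """
    # O_DIRECT is Linux-specific; fall back to 0 where it is absent.
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        # Some filesystems reject O_DIRECT; a real checker would treat
        # this as a configuration error, the sketch just falls back.
        fd = os.open(path, os.O_RDONLY)
    try:
        # O_DIRECT requires an aligned buffer; an anonymous mmap is
        # page-aligned by construction.
        buf = mmap.mmap(-1, blocksize)
        try:
            return os.readv(fd, [buf]) > 0
        finally:
            buf.close()
    except OSError:
        return False
    finally:
        os.close(fd)
```

A real checker would also have to bound the read with a timeout, as multipathd does, so that a hung path does not block the checker thread.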