opensvc / multipath-tools


IO process is D+ when path_checker is tur and there is some problem in backstores #54

Closed lixiaokeng closed 1 year ago

lixiaokeng commented 1 year ago

Here is a test. We use the tur path_checker and echo offline to the backstores, then write to the multipath devices on the client. The IO process goes into the D+ state because its requests are queued in the kernel. We find that the paths cycle down->up->down->up and no_path_retry doesn't take effect, because the tur check is OK while the IO fails.

This problem can be solved if we use the directio path_checker or set no_path_retry to fail. However, we want to use the tur path_checker when there are thousands of multipath devices. Do you have a good idea for solving this problem?
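For reference, the two workarounds mentioned above might look like this in /etc/multipath.conf. This is a minimal sketch, assuming the settings are applied globally in the defaults section (the report does not say where they were set). The two options are alternatives, and the detect_checker line is an addition based on the documented behavior that checker detection can switch ALUA-capable devices back to tur.

```
defaults {
    # Option 1: use a checker that issues a real read instead of TEST UNIT READY.
    path_checker   directio
    # With detect_checker yes (the default), ALUA-capable devices are
    # switched to the tur checker automatically, so turn detection off too.
    detect_checker no

    # Option 2 (alternative): fail IO at once when no path is left,
    # instead of queueing it in the kernel.
    # no_path_retry fail
}
```

After changing the file, running multipathd reconfigure makes the daemon reload it.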

mwilck commented 1 year ago

I can't reproduce.

If I do echo offline >/sys/class/block/sdb/device/state, multipathd logs show:

Nov 09 08:26:08 luzifer multipathd[2919]: sda: tur state = up
Nov 09 08:26:12 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:12 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:17 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:17 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:22 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:22 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:27 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:27 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:28 luzifer multipathd[2919]: sda: tur state = up
Nov 09 08:26:32 luzifer multipathd[2919]: sdb: state down, checker not called
Nov 09 08:26:32 luzifer multipathd[2919]: 36001405130d0940e1914873a58afb4ad: sdb - path offline
Nov 09 08:26:37 luzifer multipathd[2919]: sdb: state down, checker not called

This is how it's expected to behave.

lixiaokeng commented 1 year ago

I made a mistake and I'm sorry for it. I ran echo offline on the server, not on the client. I used targetcli to create lun0, and the real disk backing lun0 is sdb on the server. I set that sdb offline.

lixiaokeng commented 1 year ago

This is the log on the client.

Nov  4 17:51:27 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: performing delayed actions
Nov  4 17:51:27 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: reload [0 31457280 multipath 0 1 alua 2 1 service-time 0 1 1 8:32 1 service-time 0 1 1 8:16 1]
Nov  4 17:57:55 localhost multipathd[10826]: sdc: mark as failed
Nov  4 17:57:55 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov  4 17:57:56 localhost multipathd[10826]: sdb: mark as failed
Nov  4 17:57:56 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 0
Nov  4 17:58:00 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdc - tur checker reports path is up
Nov  4 17:58:00 localhost multipathd[10826]: 8:32: reinstated
Nov  4 17:58:00 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov  4 17:58:01 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdb - tur checker reports path is up
Nov  4 17:58:01 localhost multipathd[10826]: 8:16: reinstated
Nov  4 17:58:01 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 2
Nov  4 17:58:01 localhost multipathd[10826]: sdc: mark as failed
Nov  4 17:58:01 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov  4 17:58:02 localhost multipathd[10826]: sdb: mark as failed
Nov  4 17:58:02 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 0
Nov  4 17:58:05 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdc - tur checker reports path is up
Nov  4 17:58:05 localhost multipathd[10826]: 8:32: reinstated
Nov  4 17:58:05 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov  4 17:58:06 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: sdb - tur checker reports path is up
Nov  4 17:58:06 localhost multipathd[10826]: 8:16: reinstated
Nov  4 17:58:06 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 2
Nov  4 17:58:06 localhost multipathd[10826]: sdc: mark as failed
Nov  4 17:58:06 localhost multipathd[10826]: 3600140552e30369149849b69967aef8d: remaining active paths: 1
Nov  4 17:58:07 localhost multipathd[10826]: sdb: mark as failed

mwilck commented 1 year ago

So this is iSCSI?

bmarzins commented 1 year ago

I'm not sure that the solution to this problem is in multipath. The device is presumably really returning a good status to the TUR checker, even though it can't complete IO. I'm not sure what, short of basically copying the work of the directio checker, multipath could do to find out that this is the case. To work around the issue, the shaky path detection methods should be able to stop this sort of ping-ponging.
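As an illustration of the shaky path detection mentioned above, the marginal path settings from multipath.conf could be enabled roughly as follows. This is only a sketch: the numeric values are illustrative assumptions, not recommendations from this thread, and all four options need to be set to a positive value for the feature to become active.

```
defaults {
    # A path that fails twice within this many seconds is put under observation.
    marginal_path_double_failed_time   60
    # Length of the test period in seconds (values below 120 disable the feature).
    marginal_path_err_sample_time      120
    # Error rate, in units of 1/1000, above which the path is held back.
    marginal_path_err_rate_threshold   10
    # Seconds to wait before a held-back path is sampled again.
    marginal_path_err_recheck_gap_time 300
}
```

With settings like these, a path that keeps failing IO right after being reinstated should be kept out of the multipath map instead of ping-ponging between failed and active.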

mwilck commented 1 year ago

I agree. The TUR checker can only work if a GOOD response from TUR actually means that the device is able to handle I/O. That doesn't seem to be the case here.

mwilck commented 1 year ago

What target is this?

lixiaokeng commented 1 year ago

I use the target created by targetcli.

mwilck commented 1 year ago

See target_core_spc.c:1305. LIO always reports GOOD status to TUR. You can't use the TUR checker with LIO. If you will, this is a deficiency of the Linux target.

mwilck commented 1 year ago

We should add an entry to our hwtable for this.

mwilck commented 1 year ago

Can you please review and test https://github.com/openSUSE/multipath-tools/commit/350af2cebd3ce9f89ac7b63cd145309b237a2474, which I've just pushed to https://github.com/openSUSE/multipath-tools/tree/tip ?

lixiaokeng commented 1 year ago

There are some problems. We should also set ".detect_checker = DETECT_CHECKER_OFF" and ".product = "disk0"". DETECT_CHECKER_OFF turns off the detect_alua step.

mwilck commented 1 year ago

Right for .detect_checker (I keep making this mistake), but I don't understand why we'd need disk0. The regexp we're using should match any product.

lixiaokeng commented 1 year ago

It is OK. I tested this in 0.8.7, where the .product is "RBD". It has since been changed. I have no other questions.

mwilck commented 1 year ago

So, we need to add the .detect_checker line and we're good?

lixiaokeng commented 1 year ago

Yes
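Until a release with the new hwtable entry is installed, a similar effect can probably be had with a device section in /etc/multipath.conf. This is a hedged sketch, not the content of the commit: the vendor string "LIO-ORG" is an assumption (check what multipath -ll reports for your target), the choice of directio follows the workaround from the original report, and the product regexp simply matches any product as discussed above.

```
devices {
    device {
        # Assumed vendor string for LIO targets; verify on your system.
        vendor         "LIO-ORG"
        # Match any product name.
        product        ".*"
        # LIO answers TUR with GOOD even when the backstore is offline.
        path_checker   directio
        # Keep checker detection from switching the device back to tur.
        detect_checker no
    }
}
```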