Open AresChen1 opened 5 years ago
When you wrote "all image's IO is return to zero" do you mean all IOs will be failed by the multipath layer when the no_path_retry timer in /etc/multipath.conf expires?
Or, do you mean the IO returns successfully, but for READ commands the data in the buffer is zeros when it should be some non-zero data?
Also, can you tell me if you are using the ceph-iscsi tools like gwcli to setup or using targetcli directly?
If you are using ceph-iscsi could you tell me the version?
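The two cases asked about above (IO that fails versus a READ that succeeds but returns zeros) can be told apart with a quick check along these lines. This is only a sketch: `classify_read` is a hypothetical helper, it uses a buffered read rather than O_DIRECT, and if multipath is queueing IO the read may simply block instead of returning.

```python
import os

def classify_read(path, offset=0, length=4096):
    """Read `length` bytes at `offset` and report what happened.

    Returns "io-error" if the open or read fails, "empty" if nothing was
    returned, "all-zeros" if every byte is zero, otherwise "data".
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return "io-error"
    try:
        data = os.pread(fd, length, offset)
    except OSError:
        return "io-error"
    finally:
        os.close(fd)
    if not data:
        return "empty"
    # bytes.count(0) counts zero-valued bytes in the buffer
    return "all-zeros" if data.count(0) == len(data) else "data"
```

For example, `classify_read("/dev/mapper/mpathg")` would be run as root against the multipath device from this report.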
> When you wrote "all image's IO is return to zero" do you mean all IOs will be failed by the multipath layer when the no_path_retry timer in /etc/multipath.conf expires?
> Or, do you mean the IO returns successfully, but for READ commands the data in the buffer is zeros when it should be some non-zero data?
I can't read from or write to the multipath device. For example:
[root@ldap ~]# fio -filename=/dev/mapper/mpathg -direct=1 -iodepth 64 -thread -rw=randwrite -ioengine=libaio -bs=64K -size=1G -numjobs=1 -runtime=900 -group_reporting -name=mytest
mytest: (g=0): rw=randwrite, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 thread
Jobs: 1 (f=1): [w(1)][60.0%][r=0KiB/s,w=0KiB/s][r=0,w=0][eta 12m:00s]
I also can't read from or write to the multipath device exported from "pool2", which is not full. This is because when tcmu-runner gets a timeout error code from librbd for "pool1", which is full, it disables the tpg in tcmu_notify_conn_lost in the tcmu-runner project. If there are three multipath devices from "pool1" and their ALUA groups each belong to a different one of the three gateways, all three tpgs will be disabled.
> Also, can you tell me if you are using the ceph-iscsi tools like gwcli to setup or using targetcli directly?
> If you are using ceph-iscsi could you tell me the version?
I am using the ceph-iscsi tools, version 3.0. The targetcli listing looks like this:
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 0]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  | o- user:glfs .............................................................................................. [Storage Objects: 0]
  | o- user:qcow .............................................................................................. [Storage Objects: 0]
  | o- user:rbd ............................................................................................... [Storage Objects: 1]
  | | o- testpool.testimage .............................................. [testpool/testimage;osd_op_timeout=30 (5.0GiB) activated]
  | |   o- alua ................................................................................................... [ALUA Groups: 4]
  | |     o- ano2 ............................................................................... [ALUA state: Active/non-optimized]
  | |     o- ano3 ............................................................................... [ALUA state: Active/non-optimized]
  | |     o- ao ..................................................................................... [ALUA state: Active/optimized]
  | |     o- default_tg_pt_gp ....................................................................... [ALUA state: Active/optimized]
  | o- user:zbc ............................................................................................... [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 1]
  | o- iqn.2019-06.com.xitcorp.iscsi-gw:7669ac11e98c61a1 ................................................................. [TPGs: 3]
  |   o- tpg1 ........................................................................................................... [disabled]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 1]
  |   | | o- lun0 ................................................................................... [user/testpool.testimage (ao)]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 10.0.100.12:3260 ................................................................................................. [OK]
  |   o- tpg2 ........................................................................................................... [disabled]
  |   | o- acls .......................................................................................................... [ACLs: 0]
  |   | o- luns .......................................................................................................... [LUNs: 1]
  |   | | o- lun0 ................................................................................. [user/testpool.testimage (ano2)]
  |   | o- portals .................................................................................................... [Portals: 1]
  |   |   o- 10.0.100.11:3260 ................................................................................................. [OK]
  |   o- tpg3 .......................................................................................... [no-gen-acls, auth per-acl]
  |     o- acls .......................................................................................................... [ACLs: 1]
  |     | o- iqn.1994-05.com.redhat:a92e9f6e6e80 ...................................................... [1-way auth, Mapped LUNs: 1]
  |     |   o- mapped_lun0 ..................................................................... [lun0 user/testpool.testimage (rw)]
  |     o- luns .......................................................................................................... [LUNs: 1]
  |     | o- lun0 ................................................................................. [user/testpool.testimage (ano3)]
  |     o- portals .................................................................................................... [Portals: 1]
  |       o- 10.0.100.10:3260 ................................................................................................. [OK]
  o- loopback ......................................................................................................... [Targets: 0]
  o- vhost ............................................................................................................ [Targets: 0]
  o- xen-pvscsi ....................................................................................................... [Targets: 0]
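To see at a glance which tpgs are disabled in a listing like the one above, the text can be scanned with a small script. This is a rough sketch, not part of ceph-iscsi or targetcli: `tpg_states` is a hypothetical helper, and it assumes that a tpg whose bracketed status does not contain "disabled" is enabled and that each portal line follows the tpg it belongs to. `SAMPLE` is an abbreviated excerpt with shortened dot leaders.

```python
import re

# "o- tpg1 ..... [disabled]" or "o- tpg3 ..... [no-gen-acls, auth per-acl]"
TPG_RE = re.compile(r"o- (tpg\d+) \.+ \[(.*)\]")
# "o- 10.0.100.12:3260 ..... [OK]"
PORTAL_RE = re.compile(r"o- (\d{1,3}(?:\.\d{1,3}){3}:\d+) \.+ \[")

def tpg_states(listing):
    """Map each tpg name to {"enabled": bool, "portal": str or None}."""
    states = {}
    current = None
    for line in listing.splitlines():
        tpg = TPG_RE.search(line)
        if tpg:
            current = tpg.group(1)
            # Assumption: any status other than "disabled" means enabled.
            states[current] = {
                "enabled": "disabled" not in tpg.group(2),
                "portal": None,
            }
            continue
        portal = PORTAL_RE.search(line)
        if portal and current is not None:
            states[current]["portal"] = portal.group(1)
    return states

SAMPLE = """\
  |   o- tpg1 .......... [disabled]
  |   | o- portals .......... [Portals: 1]
  |   |   o- 10.0.100.12:3260 .......... [OK]
  |   o- tpg2 .......... [disabled]
  |   | o- portals .......... [Portals: 1]
  |   |   o- 10.0.100.11:3260 .......... [OK]
  |   o- tpg3 .......... [no-gen-acls, auth per-acl]
  |     o- portals .......... [Portals: 1]
  |       o- 10.0.100.10:3260 .......... [OK]
"""

for name, info in tpg_states(SAMPLE).items():
    print(name, "enabled" if info["enabled"] else "disabled", info["portal"])
```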
Ok, here are the options.
1. You can configure a target per pool. That way each LUN's iSCSI sessions will be separate. One pool hanging and timing out due to a pool full issue will not affect the other target and its pool.
2. You can run the gwcli disk reconfigure command to set the osd_op_timeout to a high value.
3. If you want to blanket turn it off then I will need to send a patch to modify the code, because if you try osd_op_timeout=0 then some other code is going to kick in and try to adjust it.
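The reasoning behind the target-per-pool option can be illustrated with a toy model. This is not tcmu-runner code: `Target`, `on_pool_timeout`, and `reachable_pools` are hypothetical names, and the model collapses gateways and tpgs into a single enabled flag per target purely to show how a timeout on one pool spreads.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str
    pools: set            # pools whose images are exported through this target
    enabled: bool = True  # stands in for the tpg enable state

def on_pool_timeout(targets, pool):
    """Toy stand-in for the tpg disable: a timeout on `pool` takes down
    every target that exports a LUN from that pool."""
    for target in targets:
        if pool in target.pools:
            target.enabled = False

def reachable_pools(targets):
    """Pools that still have at least one enabled target."""
    return {p for t in targets if t.enabled for p in t.pools}

# One shared target: the full pool1 also takes pool2 offline.
shared = [Target("shared", {"pool1", "pool2"})]
on_pool_timeout(shared, "pool1")
print(reachable_pools(shared))

# One target per pool: pool2 stays reachable.
split = [Target("t-pool1", {"pool1"}), Target("t-pool2", {"pool2"})]
on_pool_timeout(split, "pool1")
print(reachable_pools(split))
```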
Can I comment out the code that sets the tpg's enable to 0 in tgt_port_grp_recovery_thread_fn?
Yeah, as a temp hack that will be ok.
Env: tcmu-runner 1.4.1, centos 7.5, kernel 3.10, ceph 13.2.5
Root cause: When tcmu-runner gets a timeout error code from librbd, it disables the tpg, so the TCP connection from the initiator is closed. If all three gateways get the timeout error code, all three tpgs are disabled, so all the image's IO will return to zero.
How can I fix this? Is it possible not to disable the tpg?