open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0

Poor RBD performance as LIO-TCMU iSCSI target #359

Open DongyuanPan opened 6 years ago

DongyuanPan commented 6 years ago

Hi~ I am a senior university student and I've been learning Ceph and iSCSI recently.

I'm using fio to test RBD performance, but I see performance degradation when using RBDs with LIO-TCMU.

My test covers three cases: the performance of the RBD as a target using LIO-TCMU, the performance of the RBD itself (no iSCSI or LIO-TCMU), and the performance of the RBD as a target using TGT.

Details about the test environment:

I use targetcli (or tgtadm) to create the target device, log in from the initiator, and then run fio against the device.

1) The performance of the RBD itself (no iSCSI or LIO-TCMU):

rbd create image-10 --size 102400

(rbd default features = 3)

fio test config:

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd
rbdname=image-10
rw=randwrite
bs=4k
numjobs=4
buffered=0
runtime=180
group_reporting=1

[rbd_iodepth32]
iodepth=128
#write_iops_log=write_rbd_default_feature_one
#log_avg_msec=1000

performance: 35-40 K IOPS

2) The performance of the RBD as a target using TGT. Create the LUN:

tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store rbd/image-10 --bstype rbd

Initiator login:

iscsiadm -m node --targetname iqn.2018-01.com.example02:iscsi -p 192.168.x.x:3260 -l
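Before that login, the target portal is normally discovered first (a standard open-iscsi step; same portal address as in the login command):

```shell
# Ask the portal which targets it exports; the login above then uses
# one of the returned IQNs.
iscsiadm -m discovery -t sendtargets -p 192.168.x.x:3260
```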

The LUN appeared on the initiator as /dev/sdw.

fio test config:

[global]
bs=4k
ioengine=libaio
iodepth=128
direct=1
#sync=1
runtime=30
size=60G
buffered=0
#directory=/mnt
numjobs=4
filename=/dev/sdw
group_reporting=1

[rand-write]
time_based
write_iops_log=write_tgt_default_feature_three
log_avg_msec=1000
rw=randwrite
#stonewall

performance: 18-20K IOPS

3) The performance of the RBD as a target using LIO-TCMU. I use targetcli to create the LUN, with the TPG's default_cmdsn_depth=512. Initiator side: node.session.cmds_max = 2048, node.session.queue_depth = 1024.
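The targetcli side of this step might look roughly like the following sketch (the backstore name, size, and cfgstring syntax are assumptions that can vary by tcmu-runner version; the IQN is reused from the TGT test above):

```shell
# Create a TCMU-backed RBD backstore and export it through an iSCSI TPG
targetcli /backstores/user:rbd create name=image-10 size=100G cfgstring=rbd/image-10
targetcli /iscsi create iqn.2018-01.com.example02:iscsi
targetcli /iscsi/iqn.2018-01.com.example02:iscsi/tpg1/luns create /backstores/user:rbd/image-10
# Raise the TPG's CmdSN window, as described above
targetcli /iscsi/iqn.2018-01.com.example02:iscsi/tpg1 set attribute default_cmdsn_depth=512
```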

fio test config:
[global]
bs=4k
ioengine=libaio
iodepth=128
direct=1
#sync=1
runtime=180
size=50G
buffered=0
#directory=/mnt
numjobs=4
filename=/dev/sdv
group_reporting=1

[rand-write]
time_based
write_iops_log=write_tgt_default_feature_three
log_avg_msec=1000
rw=randwrite
#stonewall

/dev/sdv is backed by image-10.

performance: 7K IOPS

I found a similar issue reported on the mailing list, but I still haven't found the cause:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-October/044021.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045347.html

Thanks for any help anyone can provide!

mikechristie commented 6 years ago

We are just starting to investigate performance.

One known issue is that for LIO and open-iscsi you need to have node.session.cmds_max match the LIO default_cmdsn_depth setting. If they are not the same, then there seems to be a bug on the initiator side where IOs are requeued and do not get retried quickly like normal.
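A quick way to compare the two values on a running setup (these are the standard open-iscsi and LIO configfs paths; the target IQN and TPG number are placeholders):

```shell
# Initiator side: the per-session queue depth open-iscsi will request
grep node.session.cmds_max /etc/iscsi/iscsid.conf

# Target side: the TPG's CmdSN window
cat /sys/kernel/config/target/iscsi/<target_iqn>/tpgt_1/attrib/default_cmdsn_depth
```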

There is another issue for latency/IOPS type tests where one command slows others. The attached patch

runner-dont-wait.txt

is a hack around it, but it needs work because it can cause extra context switches.

For target_core_user there are other issues like its memory allocation in the main path, but you might not be hitting that with the fio arguments you are using.

DongyuanPan commented 6 years ago

Thank you~ @mikechristie

I retested the performance with tcmu-runner-1.3.0 and set node.session.cmds_max to match the LIO default_cmdsn_depth. There is some improvement (18.8K IOPS), and the performance is now the same as with TGT. If I use tcmu-runner-1.3.0 without further optimization, is this performance (the same as TGT) expected?

> There is another issue for latency/IOPs type of tests where one command slows others. The attached patch runner-dont-wait.txt is a hack around it but it needs work because it can cause extra switches.

If I test with the patch, the performance (32K IOPS) approaches that of the RBD itself. But is the patch only for testing? The wakeup argument is determined by aio_track->tracked_aio_ops. Must AIO be tracked? What might happen if I do not track AIO? Can this parameter be specified by the user?

mikechristie commented 6 years ago

Thanks for testing.

> if I test with the patch, the performance (32K IOPS) approach the RBD itself. But the patch is only for tests?

Yeah, the patch needs some cleanup, because of what you notice below.

> The argument wakeup is determined by aio_track->tracked_aio_ops. AIO must be tracked? What might occur if I do not track AIO? Can this parameter be specified by the user?

It is used during failover/failback and recovery to make sure IOs are not being executed in the handler modules (handler_rbd, handler_glfs, etc) when we execute a callout like lock() or (re)open().

So ideally, we have these issues:

  1. In aio_command_finish we do not want to batch commands like we do today. We can either completely drop the batching like in the patch attached in the previous comment, or we can try to add some setting to try and limit how long we wait before calling tcmulib_processing_complete. For example we could do something like:

if (!wakeup && current_batch_wait > batch_timeout)
        tcmulib_processing_complete(dev);
  2. We would like to remove the track_lock from the main IO path, but we still need a way to make sure IO is not running on the handler when we do the lock/open callouts. We can maybe replace the aio_wait_for_empty_queue calls with tcmu_flush_device calls.
MIZZ122 commented 6 years ago

@Github641234230 I noticed that your kernel version is 3.10.0-693.11.6.el7.x86_64. Did you add some patches to your kernel? Are you going to do HA? My kernel is 3.10.0-693.11.6.el7.x86_64, with tcmu-runner-1.3.0-re4. I get an IO error when I modify the kernel parameter enable = 1.

MIZZ122 commented 6 years ago

@mikechristie If our product can only use CentOS 7.4 (3.10.0-693.11.6.el7.x86_64) and I want to do HA, what should I do? Which patch can I use?

I've tried using targetcli to export RBDs on all gateway nodes. On the iSCSI client side, I use dm-multipath to find them, and it works well (both active/active and active/passive). Is there any problem using this method for HA? And in this issue https://github.com/open-iscsi/tcmu-runner/issues/356 active/active is not supported. I am very confused.

mikechristie commented 6 years ago

@MIZZ122

For upstream tcmu-runner/ceph-iscsi-cli HA support you have to use RHEL 7.5 beta or newer kernel or this kernel:

https://github.com/ceph/ceph-client.

HA is only supported with active/passive. You must use the settings here

http://docs.ceph.com/docs/master/rbd/iscsi-initiators/

Just because dm-multipath lets you set up active/active does not mean it is safe. You can end up with data corruption. Use the settings in the docs.

If you are doing single node (non HA) then you can do active/active across multiple portals on that one node.
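For reference, the multipath configuration the Ceph docs describe for active/passive is along these lines (an illustrative /etc/multipath.conf fragment from memory; take the authoritative values from the linked page):

```
devices {
        device {
                vendor                 "LIO-ORG"
                hardware_handler       "1 alua"
                path_grouping_policy   "failover"
                path_checker           "tur"
                prio                   "alua"
                prio_args              "exclusive_pref_bit"
                failback               60
                no_path_retry          "queue"
        }
}
```

The key parts are the failover grouping policy and ALUA-based priorities, which keep IO on the single active gateway instead of spreading it active/active.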

mikechristie commented 6 years ago

@MIZZ122 if you have other questions about active/active, can you open a new issue or discuss it in the existing active/active issue? This issue is for perf only.

dillaman commented 6 years ago

@mikechristie Any update on this issue?

mikechristie commented 6 years ago

@lxbsz was testing it out for Gluster with the perf team. lxbsz, did it help? Did you make the changes I requested, and were they needed, or was it ok to always complete right away?

It looks like you probably got busy with the resize work, so I can do the changes. Are you guys still working with the perf team, so we can get them tested?

lxbsz commented 6 years ago

@mikechristie Yes, we and the perf team together test this.

The environment is a PostgreSQL database running on a Gluster block volume in a CNS environment.

1. Changing node.session.cmds_max to match the LIO default_cmdsn_depth: the performance improved only slightly, about 5%.

2. Applying https://github.com/open-iscsi/tcmu-runner/files/1654757/runner-dont-wait.txt: the performance improved about 10%.

3. Changing the default_cmdsn_depth to 64: the performance improved about 27%.

So we are planning to do more testing on this later. These days we are busy with the RHGS release.

mikechristie commented 6 years ago

Ok, assume this is back on me.

lxbsz commented 6 years ago

We will test this by mixing them up later once we have enough time.

serjponomarev commented 6 years ago

Can I use this patch (https://github.com/open-iscsi/tcmu-runner/files/1654757/runner-dont-wait.txt) in a production ESXi environment? If it is not recommended, how can I help you investigate the performance so it can be fixed? I have all the needed hardware.

mikechristie commented 6 years ago

It is perfectly safe crash-wise but might cause other regressions. If you can test, I can give you a patch later this week that makes it configurable, so we can try to figure out if there is some balance between the two extreme settings (with and without the patch), or if it needs to be configurable per workload type.

serjponomarev commented 6 years ago

Ok, I am waiting for the patch and instructions on how to test it (Ceph, tcmu-runner, fio).

DongyuanPan commented 6 years ago

In my test environment for Ceph RBD, TGT performance is better than LIO-TCMU. So I created an IBLOCK backstore from a /dev/sda block device with targetcli, in order to test LIO performance without tcmu/tcmu-runner.

| workload | LIO + SSD disk | TGT + SSD disk |
| --- | --- | --- |
| 4K randwrite | IOPS=48.9k, BW=191MiB/s | IOPS=49.2k, BW=192MiB/s |
| 4K randread | IOPS=44.9k, BW=175MiB/s | IOPS=46.5k, BW=182MiB/s |
| 64K write | IOPS=6221, BW=389MiB/s | IOPS=9100, BW=569MiB/s |
| 64K read | IOPS=8389, BW=524MiB/s | IOPS=19.3k, BW=1208MiB/s |

The perf of TGT is better than LIO. It's strange. Thanks for any help anyone can provide!

wwba commented 6 years ago

@mikechristie In my Ceph cluster, the throughput of the SCSI disks is much lower than the RBDs'. I run the LIO iSCSI gateway in a VM with kernel version 4.16.0-0.rc6. In the VM, I compared the performance of tcmu-runner with KRBD using fio (sync=1, -ioengine=psync -bs=4M -numjobs=10).

| workload | KRBD | LIO + TCMU | TGT + rbd_bs |
| --- | --- | --- | --- |
| 4M seq write, one LIO gw for one RBD | BW=409MiB, avg lat=97ms | BW=131MiB, avg lat=305ms | BW=362MiB, avg lat=110ms |
| 4M seq read, one LIO gw for one RBD | BW=1571MiB, avg lat=25ms | BW=256MiB, avg lat=155ms | BW=1556MiB, avg lat=26ms |
| 4M seq write, one LIO gw for four RBDs | BW=205MiB, avg lat=190ms | BW=42MiB, avg lat=921ms | BW=193MiB, avg lat=206ms |
| 4M seq read, one LIO gw for four RBDs | BW=416MiB, avg lat=96ms | BW=148MiB, avg lat=270ms | BW=397MiB, avg lat=100ms |
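Reconstructed from the parameters quoted above, the fio invocation behind these numbers would look something like this sketch (the device path, job name, and runtime are placeholders, not taken from the report):

```shell
# Sequential 4M writes through psync with O_SYNC, 10 jobs, direct IO
fio -name=seq-write -filename=/dev/sdX -ioengine=psync -sync=1 -direct=1 \
    -bs=4M -numjobs=10 -rw=write -runtime=60 -group_reporting
```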

I get poor throughput for the SCSI disk when using TCMU; does this have something to do with the

> For target_core_user there are other issues like its memory allocation in the main path

that you mentioned?

shadowlinyf commented 6 years ago

@mikechristie Has the runner-dont-wait.txt patch already been merged into 1.4RC1?

mikechristie commented 6 years ago

Yes.

shadowlinyf commented 6 years ago

@mikechristie I am having a performance issue with an EC RBD as the backend store. I am using 1.4RC1. KRBD seq write speed is about 600MB/s; TCMU+RBD seq write speed is around 30MB/s.

NUABO commented 5 years ago

Hi @shadowlinyf, will you test it again later? Is tcmu performance still very poor?

Allenscript commented 5 years ago

Now I am hitting the same performance issue: fio with rbd directly gives about 500MB/s, but with tcmu's user:rbd handler the fio result is about 15MB/s. This performance is too poor. My env: kernel 5.0.4, tcmu latest release 1.4.1, Ceph 12.2.11.

deng-ruixuan commented 2 years ago

> now i meet the same performance like this , fio with rbd ,the result was about 500MB/s , if with tcmu of user:rbd , the fio test result was about 15MB/s ,this performance is too poor , my env is :kernel -5.0.4 , tcmu -lasest release 1.4.1 , ceph - 12.2.11

I hit the same performance issue and seem to have solved my problem, although there are other performance issues. We can try to use gwcli to set the following parameters for the disk:

/disks> reconfigure blockpool/image01 hw_max_sectors 8192
/disks> reconfigure blockpool/image01 max_data_area_mb 128

After setting these, tcmu performance can approximate librbd performance in HDD scenarios.