kaazoo closed this issue 5 years ago
I don't understand why the write performance is (much) faster behind the iSCSI export. Does this come from some write buffering? Did you test with the fsync convention?

dd if=/dev/urandom of=/dev/XZ bs=100G conv=fsync

This makes dd flush the data to disk before reporting, so the page cache cannot inflate the result. Does that result show a similar trend?
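The distinction matters for interpreting the numbers below: conv=fsync flushes once when dd finishes, while oflag=sync forces every block through to stable storage. A safe sketch comparing the two against a scratch file (the scratch path is a placeholder, not the raw device from the thread):

```shell
# Write 64 MiB to a temp file two ways. conv=fsync calls fsync() once
# before dd exits, so the elapsed time includes flushing the page cache;
# the plain run may finish with data still cached.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=64 2>/dev/null              # buffered
dd if=/dev/zero of="$tmp" bs=1M count=64 conv=fsync 2>/dev/null   # flushed once at exit
stat -c %s "$tmp"   # 64 MiB = 67108864 bytes
rm -f "$tmp"
```

For per-write synchronous behaviour (what the oflag=sync runs later in this thread measure), each 1 MiB write waits for the device individually, which is why those numbers come out much lower.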
@kaazoo Looking at your configuration (not too many NL-SAS, raidz2) your results are very good, even locally. Remember that each disk of this class has about 100 IOPS only. I wouldn't expect more using such disks and raidz2.
Have you tried running the iSCSI transfer for a long time (at least one hour)? Until the pool's dirty-data limit is reached you will get a write performance boost, because data is being written to RAM (when not writing synchronously). Then you can experience a sudden drop, because the data has to be moved to the HDDs. This is called the write throttle. You can read about it here:

http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/
https://github.com/zfsonlinux/zfs/blob/6e9710f7c33a7440904b39fb4e6b68b3325cfeca/module/zfs/vdev_queue.c
https://github.com/zfsonlinux/zfs/blob/6e9710f7c33a7440904b39fb4e6b68b3325cfeca/module/zfs/dsl_pool.c

If it turns out to be a problem, tune it with these zfs module parameters:

zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_delay_min_dirty_percent
zfs_delay_scale
zio_delay_max
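On Linux these tunables can be inspected (and, as root, changed) under /sys/module/zfs/parameters. A read-only sketch that degrades gracefully on hosts without the zfs module loaded:

```shell
# Print the current write-throttle tunables. Each parameter is a plain
# text file under /sys/module/zfs/parameters once the module is loaded;
# the -r guard keeps this safe to run on non-ZFS hosts.
for p in zfs_vdev_async_write_active_min_dirty_percent \
         zfs_vdev_async_write_active_max_dirty_percent \
         zfs_delay_min_dirty_percent \
         zfs_delay_scale \
         zio_delay_max; do
    f=/sys/module/zfs/parameters/$p
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$p" "$(cat "$f")"
    else
        printf '%s = (zfs module not loaded)\n' "$p"
    fi
done
```

Writing a new value (e.g. `echo 30 > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent` as root) takes effect immediately, which makes it convenient to experiment during a long-running transfer.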
Moreover, iSCSI can aggregate several operations from the queue into one, so you get more IOPS on the initiator side. Run a fully random test (fio) to exclude this.
In the context of slower reads: have you disabled readahead on the initiator side?
@zielony360 How do I disable readahead on the initiator side? I tried the following:
echo 0 > /sys/block/dm-2/queue/read_ahead_kb
echo 0 > /sys/block/sde/queue/read_ahead_kb
echo 0 > /sys/block/sdg/queue/read_ahead_kb
That doesn't seem to have any impact when testing with dd. I will also test with fio as soon as possible.
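For reference, blockdev offers the same control as the sysfs knob, but it counts in 512-byte sectors while read_ahead_kb is in KiB, and the dm-* multipath device carries its own setting on top of the underlying sd* paths, so all of them need changing. A sketch (/dev/sde is an assumption; substitute your iSCSI disk):

```shell
# /dev/sde is an assumption - use the iSCSI disk (and its dm-* multipath
# device) on your initiator. --getra/--setra count 512-byte sectors, so
# a value of 256 here equals read_ahead_kb = 128.
dev=/dev/sde
if [ -b "$dev" ]; then
    blockdev --getra "$dev"        # current readahead, in 512 B sectors
else
    echo "no such device: $dev"
fi
# Disabling readahead (needs root) - two equivalent ways:
#   blockdev --setra 0 /dev/sde
#   echo 0 > /sys/block/sde/queue/read_ahead_kb
```

Note that for the purely sequential dd runs readahead mostly helps; disabling it is expected to show up in the random-read fio results, not in dd.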
On the target, I switched to a zpool with 2x raidz2 (8 disks each, ashift=12). Then I created a zvol with volblocksize=128K.
dd locally on target:
dd if=/dev/zero bs=1M count=102400 of=/dev/zvol/data/iscsi conv=fsync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 275.299 s, 390 MB/s
dd of=/dev/null bs=1M count=102400 if=/dev/zvol/data/iscsi conv=fsync
dd: fsync failed for ‘/dev/null’: Invalid argument
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 149.828 s, 717 MB/s
dd on initiator:
dd if=/dev/zero bs=1M count=102400 of=/live_neu/test/testfile conv=fsync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 119.756 s, 897 MB/s
dd if=/dev/zero bs=1M count=102400 of=/live_neu/test/testfile2 oflag=sync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 366.874 s, 293 MB/s
dd of=/dev/null bs=1M count=102400 if=/live_neu/test/testfile conv=fsync
dd: fsync failed for ‘/dev/null’: Invalid argument
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 400.654 s, 268 MB/s
dd of=/dev/null bs=1M count=102400 if=/live_neu/test/testfile2 iflag=sync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 715.744 s, 150 MB/s
As I have no experience with fio so far, I followed http://n4f.siftusystems.com/index.php/2013/07/14/zfs-zvol-blocksize-and-esxi/ .
fio sequential write on initiator:
fio --rw=write --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
sequential: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W] [100.0% done] [0KB/368.0MB/0KB /s] [0/2944/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=14936: Wed Jan 13 17:22:15 2016
write: io=102400MB, bw=304557KB/s, iops=2379, runt=344295msec
clat (usec): min=39, max=599211, avg=414.62, stdev=5110.46
lat (usec): min=42, max=599214, avg=417.36, stdev=5110.46
clat percentiles (usec):
| 1.00th=[ 51], 5.00th=[ 65], 10.00th=[ 71], 20.00th=[ 74],
| 30.00th=[ 77], 40.00th=[ 81], 50.00th=[ 92], 60.00th=[ 171],
| 70.00th=[ 189], 80.00th=[ 195], 90.00th=[ 205], 95.00th=[ 211],
| 99.00th=[ 298], 99.50th=[27520], 99.90th=[42240], 99.95th=[51456],
| 99.99th=[220160]
bw (KB /s): min=24581, max=424704, per=100.00%, avg=311173.98, stdev=78388.57
lat (usec) : 50=0.80%, 100=52.47%, 250=45.21%, 500=0.69%, 750=0.02%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.07%, 50=0.65%
lat (msec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
cpu : usr=1.73%, sys=19.80%, ctx=359725, majf=0, minf=1128
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=819200/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=102400MB, aggrb=304557KB/s, minb=304557KB/s, maxb=304557KB/s, mint=344295msec, maxt=344295msec
fio sequential read on initiator:
fio --rw=read --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [301.9MB/0KB/0KB /s] [2415/0/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=14947: Wed Jan 13 17:29:18 2016
read : io=102400MB, bw=276221KB/s, iops=2157, runt=379615msec
clat (usec): min=16, max=362018, avg=460.98, stdev=4534.24
lat (usec): min=16, max=362018, avg=461.22, stdev=4534.24
clat percentiles (usec):
| 1.00th=[ 26], 5.00th=[ 31], 10.00th=[ 33], 20.00th=[ 36],
| 30.00th=[ 39], 40.00th=[ 41], 50.00th=[ 44], 60.00th=[ 47],
| 70.00th=[ 53], 80.00th=[ 145], 90.00th=[ 748], 95.00th=[ 1416],
| 99.00th=[ 4384], 99.50th=[11072], 99.90th=[52480], 99.95th=[112128],
| 99.99th=[193536]
bw (KB /s): min=43309, max=435604, per=100.00%, avg=279063.82, stdev=70748.45
lat (usec) : 20=0.03%, 50=65.09%, 100=14.18%, 250=2.61%, 500=4.16%
lat (usec) : 750=3.96%, 1000=2.69%
lat (msec) : 2=4.35%, 4=1.85%, 10=0.55%, 20=0.26%, 50=0.17%
lat (msec) : 100=0.05%, 250=0.06%, 500=0.01%
cpu : usr=0.70%, sys=14.14%, ctx=180416, majf=0, minf=1716
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=819200/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=102400MB, aggrb=276220KB/s, minb=276220KB/s, maxb=276220KB/s, mint=379615msec, maxt=379615msec
fio random write on initiator:
fio --direct=0 --rw=randwrite --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
randwrite: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/402.2MB/0KB /s] [0/3217/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=16484: Thu Jan 14 08:11:14 2016
write: io=102400MB, bw=248721KB/s, iops=1943, runt=421588msec
clat (usec): min=42, max=351883, avg=507.57, stdev=2101.34
lat (usec): min=43, max=351886, avg=510.05, stdev=2101.40
clat percentiles (usec):
| 1.00th=[ 78], 5.00th=[ 227], 10.00th=[ 266], 20.00th=[ 298],
| 30.00th=[ 314], 40.00th=[ 334], 50.00th=[ 350], 60.00th=[ 366],
| 70.00th=[ 390], 80.00th=[ 418], 90.00th=[ 474], 95.00th=[ 540],
| 99.00th=[ 772], 99.50th=[15680], 99.90th=[31616], 99.95th=[39168],
| 99.99th=[58624]
bw (KB /s): min= 1124, max=640000, per=100.00%, avg=250529.14, stdev=155233.74
lat (usec) : 50=0.03%, 100=2.20%, 250=4.97%, 500=85.31%, 750=6.41%
lat (usec) : 1000=0.37%
lat (msec) : 2=0.03%, 4=0.02%, 10=0.02%, 20=0.35%, 50=0.28%
lat (msec) : 100=0.02%, 250=0.01%, 500=0.01%
cpu : usr=1.63%, sys=18.54%, ctx=805263, majf=0, minf=1995
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=819200/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=102400MB, aggrb=248720KB/s, minb=248720KB/s, maxb=248720KB/s, mint=421588msec, maxt=421588msec
fio random read on initiator:
fio --direct=0 --rw=randread --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [r] [7.8% done] [12544KB/0KB/0KB /s] [98/0/0 iops] [eta 02h:06m:40s]
Jobs: 1 (f=1): [r] [8.6% done] [11776KB/0KB/0KB /s] [92/0/0 iops] [eta 02h:05m:28s]
Jobs: 1 (f=1): [r] [100.0% done] [53760KB/0KB/0KB /s] [420/0/0 iops] [eta 00m:02s]
randwrite: (groupid=0, jobs=1): err= 0: pid=16506: Thu Jan 14 10:29:52 2016
read : io=102400MB, bw=12890KB/s, iops=100, runt=8134640msec
clat (usec): min=46, max=111984, avg=9920.28, stdev=3906.41
lat (usec): min=46, max=111985, avg=9920.85, stdev=3906.41
clat percentiles (usec):
| 1.00th=[ 1288], 5.00th=[ 6304], 10.00th=[ 7392], 20.00th=[ 8384],
| 30.00th=[ 9024], 40.00th=[ 9536], 50.00th=[ 9792], 60.00th=[10176],
| 70.00th=[10432], 80.00th=[10688], 90.00th=[10944], 95.00th=[11584],
| 99.00th=[28544], 99.50th=[36608], 99.90th=[49920], 99.95th=[54016],
| 99.99th=[67072]
bw (KB /s): min= 6095, max=74602, per=100.00%, avg=12903.65, stdev=1555.26
lat (usec) : 50=0.01%, 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.29%
lat (msec) : 2=1.24%, 4=0.41%, 10=53.10%, 20=42.88%, 50=1.92%
lat (msec) : 100=0.10%, 250=0.01%
cpu : usr=0.16%, sys=1.59%, ctx=819798, majf=0, minf=5848
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=819200/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=102400MB, aggrb=12890KB/s, minb=12890KB/s, maxb=12890KB/s, mint=8134640msec, maxt=8134640msec
I'm not sure if my testing is correct, but it still seems that reading is performing worse than writing.
sequential (buffered?) write: 897 MB/s (dd conv=fsync)
sequential (unbuffered?) write: 293 MB/s (dd oflag=sync) 304 MB/s (fio)
random write: 245 MB/s (fio)
sequential (buffered?) read: 268 MB/s (dd conv=fsync)
sequential (unbuffered?) read: 150 MB/s (dd iflag=sync) 273 MB/s (fio)
random read: 13 MB/s (fio)
Do you think these results are OK or should they be better?
In order to have something to compare against, I created an ext4 filesystem instead of ZFS on the initiator.
ext4 doesn't seem to support a 128 KB blocksize (the maximum seems to be 64 KB), so I just went with the standard 4 KB, although this is not optimal.
# mkfs.ext4 /dev/mapper/mpath139
mke2fs 1.42.9 (4-Feb-2014)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=32 blocks, Stripe width=32 blocks
1335885824 inodes, 21374173184 blocks
1068708659 blocks (5.00%) reserved for the super user
First data block=0
652288 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
2560000000, 3855122432, 5804752896, 12800000000, 17414258688
Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
# mount -o noatime /dev/mapper/mpath139 /mnt/
I switched on readahead again:
# echo 16384 > /sys/block/dm-2/queue/read_ahead_kb
# echo 16384 > /sys/block/sd/queue/read_ahead_kb
# echo 16384 > /sys/block/sd/queue/read_ahead_kb
dd on initiator:
# dd if=/dev/zero bs=1M count=102400 of=/mnt/testfile conv=fsync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 241.338 s, 445 MB/s
# echo 3 > /proc/sys/vm/drop_caches
# dd of=/dev/null bs=1M count=102400 if=/mnt/testfile conv=fsync
dd: fsync failed for ‘/dev/null’: Invalid argument
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 204.507 s, 525 MB/s
# dd if=/dev/zero bs=1M count=10240 of=/mnt/testfile2 oflag=sync
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 433.752 s, 24.8 MB/s
# echo 3 > /proc/sys/vm/drop_caches
# dd of=/dev/null bs=1M count=10240 if=/mnt/testfile2 iflag=sync
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 14.9626 s, 718 MB/s
fio sequential write on initiator:
# fio --rw=write --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
sequential: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W] [100.0% done] [0KB/834.9MB/0KB /s] [0/6678/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=17231: Thu Jan 14 14:43:58 2016
write: io=102400MB, bw=544325KB/s, iops=4252, runt=192638msec
clat (usec): min=48, max=5062.4K, avg=232.55, stdev=10358.33
lat (usec): min=50, max=5062.4K, avg=234.12, stdev=10358.34
clat percentiles (usec):
| 1.00th=[ 64], 5.00th=[ 65], 10.00th=[ 67], 20.00th=[ 69],
| 30.00th=[ 76], 40.00th=[ 80], 50.00th=[ 84], 60.00th=[ 89],
| 70.00th=[ 96], 80.00th=[ 116], 90.00th=[ 159], 95.00th=[ 203],
| 99.00th=[ 350], 99.50th=[ 8640], 99.90th=[14784], 99.95th=[15680],
| 99.99th=[49920]
bw (KB /s): min= 35, max=1711872, per=100.00%, avg=626581.29, stdev=352171.63
lat (usec) : 50=0.01%, 100=72.86%, 250=24.97%, 500=1.20%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.54%, 20=0.37%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=1.22%, sys=41.73%, ctx=8049, majf=0, minf=672
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=819200/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=102400MB, aggrb=544324KB/s, minb=544324KB/s, maxb=544324KB/s, mint=192638msec, maxt=192638msec
Disk stats (read/write):
dm-2: ios=0/12512, merge=0/87329, ticks=0/23337968, in_queue=23435524, util=99.30%, aggrios=0/6269, aggrmerge=0/0, aggrticks=0/11595746, aggrin_queue=11660066, aggrutil=81.17%
sde: ios=0/6238, merge=0/0, ticks=0/11229532, in_queue=11348720, util=78.70%
sdf: ios=0/6300, merge=0/0, ticks=0/11961960, in_queue=11971412, util=81.17%
fio sequential read on initiator:
# echo 3 > /proc/sys/vm/drop_caches
# fio --rw=read --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [522.8MB/0KB/0KB /s] [4181/0/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=17246: Thu Jan 14 14:51:58 2016
read : io=102400MB, bw=543450KB/s, iops=4245, runt=192948msec
clat (usec): min=12, max=655945, avg=232.91, stdev=1707.33
lat (usec): min=12, max=655945, avg=233.18, stdev=1707.34
clat percentiles (usec):
| 1.00th=[ 13], 5.00th=[ 14], 10.00th=[ 15], 20.00th=[ 17],
| 30.00th=[ 22], 40.00th=[ 28], 50.00th=[ 31], 60.00th=[ 35],
| 70.00th=[ 39], 80.00th=[ 62], 90.00th=[ 145], 95.00th=[ 1080],
| 99.00th=[ 3216], 99.50th=[ 5728], 99.90th=[20096], 99.95th=[30336],
| 99.99th=[54016]
bw (KB /s): min=179851, max=743424, per=100.00%, avg=545326.82, stdev=87853.49
lat (usec) : 20=26.91%, 50=50.69%, 100=8.95%, 250=5.97%, 500=1.36%
lat (usec) : 750=0.42%, 1000=0.50%
lat (msec) : 2=2.17%, 4=2.42%, 10=0.32%, 20=0.19%, 50=0.09%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
cpu : usr=0.56%, sys=36.64%, ctx=57842, majf=0, minf=625
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=819200/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=102400MB, aggrb=543450KB/s, minb=543450KB/s, maxb=543450KB/s, mint=192948msec, maxt=192948msec
Disk stats (read/write):
dm-2: ios=51254/600, merge=51197/43, ticks=320060/8268, in_queue=328016, util=98.62%, aggrios=25629/317, aggrmerge=0/0, aggrticks=152020/4128, aggrin_queue=156150, aggrutil=49.03%
sde: ios=25653/328, merge=0/0, ticks=152832/4548, in_queue=157380, util=49.03%
sdf: ios=25606/307, merge=0/0, ticks=151208/3708, in_queue=154920, util=49.00%
fio random write on initiator:
# fio --direct=0 --rw=randwrite --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
randwrite: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/553.5MB/0KB /s] [0/4427/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=17254: Thu Jan 14 14:56:54 2016
write: io=102400MB, bw=408593KB/s, iops=3192, runt=256631msec
clat (usec): min=56, max=1199.9K, avg=309.51, stdev=3036.72
lat (usec): min=57, max=1199.9K, avg=311.13, stdev=3036.76
clat percentiles (usec):
| 1.00th=[ 68], 5.00th=[ 79], 10.00th=[ 95], 20.00th=[ 105],
| 30.00th=[ 113], 40.00th=[ 120], 50.00th=[ 129], 60.00th=[ 139],
| 70.00th=[ 163], 80.00th=[ 207], 90.00th=[ 258], 95.00th=[ 310],
| 99.00th=[ 5728], 99.50th=[ 9280], 99.90th=[16512], 99.95th=[26752],
| 99.99th=[87552]
bw (KB /s): min= 240, max=1632768, per=100.00%, avg=415720.78, stdev=301620.32
lat (usec) : 100=13.60%, 250=75.44%, 500=8.96%, 750=0.14%, 1000=0.05%
lat (msec) : 2=0.13%, 4=0.23%, 10=1.05%, 20=0.34%, 50=0.06%
lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 2000=0.01%
cpu : usr=1.24%, sys=46.85%, ctx=396770, majf=0, minf=2807
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=819200/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=102400MB, aggrb=408592KB/s, minb=408592KB/s, maxb=408592KB/s, mint=256631msec, maxt=256631msec
Disk stats (read/write):
dm-2: ios=0/601733, merge=0/46030, ticks=0/29710064, in_queue=29717184, util=99.51%, aggrios=0/301806, aggrmerge=0/0, aggrticks=0/14675630, aggrin_queue=14680310, aggrutil=74.27%
sde: ios=0/301762, merge=0/0, ticks=0/16050024, in_queue=16058292, util=74.27%
sdf: ios=0/301851, merge=0/0, ticks=0/13301236, in_queue=13302328, util=66.70%
fio random read on initiator:
# echo 3 > /proc/sys/vm/drop_caches
# fio --direct=0 --rw=randread --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
^Cbs: 1 (f=1): [r] [2.7% done] [12915KB/0KB/0KB /s] [100/0/0 iops] [eta 02h:38m:43s]
fio: terminating on signal 2
randwrite: (groupid=0, jobs=1): err= 0: pid=17270: Thu Jan 14 15:02:01 2016
read : io=2776.7MB, bw=10747KB/s, iops=83, runt=264558msec
clat (msec): min=1, max=2031, avg=11.90, stdev=22.27
lat (msec): min=1, max=2031, avg=11.90, stdev=22.27
clat percentiles (msec):
| 1.00th=[ 7], 5.00th=[ 8], 10.00th=[ 9], 20.00th=[ 10],
| 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 11],
| 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 16], 95.00th=[ 22],
| 99.00th=[ 40], 99.50th=[ 46], 99.90th=[ 98], 99.95th=[ 212],
| 99.99th=[ 930]
bw (KB /s): min= 194, max=13407, per=100.00%, avg=10935.34, stdev=1823.38
lat (msec) : 2=0.02%, 4=0.02%, 10=39.20%, 20=54.93%, 50=5.50%
lat (msec) : 100=0.24%, 250=0.05%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2000=0.01%, >=2000=0.01%
cpu : usr=0.10%, sys=0.88%, ctx=24149, majf=0, minf=1597
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=22213/w=0/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=2776.7MB, aggrb=10747KB/s, minb=10747KB/s, maxb=10747KB/s, mint=264558msec, maxt=264558msec
Disk stats (read/write):
dm-2: ios=24115/803, merge=0/58, ticks=262036/19312, in_queue=281284, util=99.17%, aggrios=12062/426, aggrmerge=0/0, aggrticks=129436/9642, aggrin_queue=139076, aggrutil=49.03%
sde: ios=12079/421, merge=0/0, ticks=129400/8816, in_queue=138212, util=49.01%
sdf: ios=12046/431, merge=0/0, ticks=129472/10468, in_queue=139940, util=49.03%
sequential (buffered?) write: 445 MB/s (dd conv=fsync)
sequential (unbuffered?) write: 25 MB/s (dd oflag=sync) 612 MB/s (fio)
random write: 406 MB/s (fio)
sequential (buffered?) read: 525 MB/s (dd conv=fsync)
sequential (unbuffered?) read: 718 MB/s (dd iflag=sync) 533 MB/s (fio)
random read: 11 MB/s (fio)
The results of ZFS vs. ext4 differ a lot. The ratio between (fio) sequential reads and sequential writes seems to be better with ext4; it matches the ratio you can see on the target. I don't know what's wrong with dd oflag=sync. Random reads perform badly for both.
I did further testing with 128 KB blocksize for zvol and fio.
The following commands were used for fio:
# fio --rw=write --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
# fio --rw=read --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
# fio --direct=0 --rw=randwrite --bs=128k --size=100G --numjobs=1 --group_reporting --name=random
# fio --direct=0 --rw=randread --bs=128k --size=100G --numjobs=1 --group_reporting --name=random
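The four invocations above can also be written as a single fio job file; this is a sketch using the same sizes and flags as the commands (stonewall serializes the jobs so they run one after another, e.g. `fio zvol-tests.ini`):

```ini
; Sketch of the four tests above as one job file. Sizes/flags match the
; command lines; nothing here beyond that is from the original thread.
[global]
bs=128k
size=100G
numjobs=1
group_reporting=1
ioengine=sync

[sequential-write]
rw=write
stonewall

[sequential-read]
rw=read
stonewall

[random-write]
rw=randwrite
direct=0
stonewall

[random-read]
rw=randread
direct=0
stonewall
```

A job file also makes it easy to vary bs in one place when repeating the 4 KB runs below.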
test | blocksize log/phy of dev | fio seq write | fio seq read | fio rand write | fio rand read |
---|---|---|---|---|---|
zfs on target | 4K | 827 MB/s | 1446 MB/s | 815 MB/s | 26 MB/s |
zfs on initiator on LIO iSCSI disk on zvol | 512 / 128K | 308 MB/s | 251 MB/s | 334 MB/s | 13 MB/s |
zfs on initiator on LIO iSCSI disk on zfs | 4K | 127 MB/s | 356 MB/s | 381 MB/s | 19 MB/s |
zfs on initiator on SCST iSCSI disk on zvol | 4K / 128K | 299 MB/s | 277 MB/s | 328 MB/s | 13 MB/s |
zfs on initiator on SCST iSCSI disk on zfs | 4K | 208 MB/s | 526 MB/s | 327 MB/s | 21 MB/s |
ext4 on initiator on LIO iSCSI disk on zvol | 512 / 128K | 739 MB/s | 381 MB/s | 413 MB/s | 10 MB/s |
ext4 on initiator on SCST iSCSI disk on zvol | 512 / 128K | 874 MB/s | 471 MB/s | 456 MB/s | 10 MB/s |
ext4 on initiator on SCST iSCSI disk on zvol | 4K / 128K | 807 MB/s | 514 MB/s | 435 MB/s | 10 MB/s |
Then I also tested fio with 4 KB blocksize on 128 KB zvol blocksize.
The following commands were used for fio:
# fio --rw=write --bs=4k --size=100G --numjobs=1 --group_reporting --name=sequential_4k
# fio --rw=read --bs=4k --size=100G --numjobs=1 --group_reporting --name=sequential_4k
# fio --direct=0 --rw=randwrite --bs=4k --size=100G --numjobs=1 --group_reporting --name=random_4k
# fio --direct=0 --rw=randread --bs=4k --size=100G --numjobs=1 --group_reporting --name=random_4k
test | blocksize log/phy of dev | fio seq write | fio seq read | fio rand write | fio rand read |
---|---|---|---|---|---|
zfs on target | 4K | 84 MB/s | 1061 MB/s | 305 KB/s | 842 KB/s |
zfs on initiator on LIO iSCSI disk on zvol | 512 / 128K | 50 MB/s | 288 MB/s | 200 KB/s | 1 MB/s |
zfs on initiator on SCST iSCSI disk on zvol | 4K / 128K | 52 MB/s | 299 MB/s | 176 KB/s | 632 KB/s |
ext4 on initiator on LIO iSCSI disk on zvol | 512 / 128K | 614 MB/s | 364 MB/s | 16 MB/s | 300 KB/s |
ext4 on initiator on SCST iSCSI disk on zvol | 4K / 128K | 732 MB/s | 436 MB/s | 46 MB/s | 10 MB/s |
Ext4 seems to perform much better than ZFS when running on an iSCSI disk.
We're seeing this too, and what's interesting is that the read speeds seem to indicate disk reads, not ARC hits as in the raw ZFS-on-target results. Could you post the contents of scst.conf and the saveconfig.json from LIO?
Curious that this is still open from 2016? It was among the first few hits on Google for ZFS and slow iSCSI.
A few recent commits seem to have smoothed out zvols a bit, but we are nowhere near the 0.6.4 numbers these days. Zvols in general suffer from performance issues due to copy-on-write on binary targets, but they were much faster in the past.
I would love to see some realistic numbers for that reason. Right now I'm looking at a deadlocked zvol; it happened as soon as I set sync=default. Why? Who knows, likely something I broke, but I don't even know where to start. Not only did it hang that zvol, but the entire pool. There are, I thought, at least two timers that should have kicked the process by now, but they haven't. The point is, I'd agree: zvols still have some rather odd bugs.
I have a setup with a head node (iSCSI initiator) and multiple storage appliances (iSCSI targets). Each target is standard server hardware (16x / 24x HGST 8 TB SAS3). All systems run Ubuntu 14.04 and ZFS 0.6.5.4. LIO (Linux IO target) is used for the target. Open-iSCSI is used for the initiator. Target and initiator are connected via 10 GBit Ethernet.
On the target for example a single raidz2 vdev with 16 disks is configured. With a single big zvol (for example volblocksize=128K) I get the following performance locally (compression disabled, dd 100 GB): write: 1100-4900 IOPS, 140-600 MB/s read: 5100-6200 IOPS, 650-780 MB/s
That zvol is exported via iSCSI and used as a vdev on the head node. According to 'zdb' ashift is set to 17. I get the following performance (compression disabled, dd 100 GB): write: 5600-7300 IOPS, 710-910 MB/s read: 1600-2300 IOPS, 205-300 MB/s
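ashift is the base-2 logarithm of the sector size ZFS assumes for the vdev, so ashift=17 means the head node treats the exported zvol as a device with 128 KiB sectors; it inherited the 128K volblocksize as the physical block size:

```shell
# ashift is log2(sector size): 2^17 = 131072 bytes = 128 KiB,
# which matches the volblocksize=128K of the exported zvol.
echo $((1 << 17))
# 131072
```

This means every sub-128K write on the head node turns into a read-modify-write of a full 128K block on the target, which is one plausible contributor to the asymmetry reported above.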
I already tried to change different parameters without improvement.
on target:
on initiator:
When I export raw disks on the iSCSI target instead of a single zvol, performance on the initiator side is as expected.