openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Low performance when zpool is based on iSCSI disk based on zvol/zfs #4211

Status: Closed (kaazoo closed this issue 5 years ago)

kaazoo commented 8 years ago

I have a setup with a head node (iSCSI initiator) and multiple storage appliances (iSCSI targets). Each target is standard server hardware (16x / 24x HGST 8 TB SAS3). All systems run Ubuntu 14.04 and ZFS 0.6.5.4. LIO (Linux IO target) is used for the target. Open-iSCSI is used for the initiator. Target and initiator are connected via 10 GBit Ethernet.

On the target, for example, a single raidz2 vdev with 16 disks is configured. With a single big zvol (for example volblocksize=128K) I get the following performance locally (compression disabled, dd 100 GB):

write: 1100-4900 IOPS, 140-600 MB/s
read: 5100-6200 IOPS, 650-780 MB/s

That zvol is exported via iSCSI and used as a vdev on the head node. According to 'zdb', ashift is set to 17. I get the following performance (compression disabled, dd 100 GB):

write: 5600-7300 IOPS, 710-910 MB/s
read: 1600-2300 IOPS, 205-300 MB/s
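For context, ashift is the base-2 logarithm of the sector size ZFS assumes for the vdev, so ashift=17 corresponds to 128 KiB sectors, matching the zvol's 128K volblocksize as reported through iSCSI. A quick sanity check:

```shell
# ashift is log2 of the vdev sector size: 2^17 bytes = 128 KiB
sector=$((1 << 17))
echo "${sector} bytes"    # prints "131072 bytes"
```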

I already tried changing various parameters, without improvement.

on target:

on initiator:

When I export raw disks on the iSCSI target instead of a single zvol, performance on the initiator side is as expected.

alexanderhaensch commented 8 years ago

I don't understand why the write performance is (much) faster behind the iSCSI export. Does this come from some write buffering? Did you test with fsync? dd if=/dev/urandom bs=1M count=102400 of=/dev/XZ conv=fsync runs the test with a synchronous flush at the end. Does that result show a similar trend?
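For reference, a sketch of the difference between the two dd sync options, run against a throwaway temp file rather than the real device: conv=fsync issues a single fsync() after all writes complete, while oflag=sync opens the output O_SYNC so every block is flushed individually, which is usually much slower on real disks.

```shell
# Demonstration on a temp file (not the real zvol): the same data is
# written both times; only the flush pattern differs.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=8 conv=fsync 2>/dev/null  # one fsync at the end
dd if=/dev/zero of="$tmp" bs=1M count=8 oflag=sync 2>/dev/null  # flush per 1 MiB block
bytes=$(stat -c%s "$tmp")
echo "$bytes"   # 8388608 (8 MiB) either way
rm -f "$tmp"
```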

zielony360 commented 8 years ago

@kaazoo Looking at your configuration (not too many NL-SAS drives, raidz2), your results are very good, even locally. Remember that each disk of this class delivers only about 100 IOPS. I wouldn't expect more from such disks and raidz2.

Have you tried running the iSCSI transfer for a long time (at least one hour)? Until the ZIL is full you will get a write performance boost, because data is being written to RAM (if not writing synchronously). Then you can experience a sudden drop, because the data has to be moved to HDD. This is called the write throttle. You can read about it here:

http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/
https://github.com/zfsonlinux/zfs/blob/6e9710f7c33a7440904b39fb4e6b68b3325cfeca/module/zfs/vdev_queue.c
https://github.com/zfsonlinux/zfs/blob/6e9710f7c33a7440904b39fb4e6b68b3325cfeca/module/zfs/dsl_pool.c

If this is the problem, tune it with these zfs module parameters:

zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_delay_min_dirty_percent
zfs_delay_scale
zio_delay_max
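These parameters can be set persistently via modprobe options, or changed at runtime through sysfs. A sketch only; the values below are illustrative, not recommendations, so check the defaults for your ZFS version before changing anything:

```shell
# /etc/modprobe.d/zfs.conf -- example values, tune per workload
options zfs zfs_vdev_async_write_active_min_dirty_percent=30
options zfs zfs_vdev_async_write_active_max_dirty_percent=60
options zfs zfs_delay_min_dirty_percent=60
options zfs zfs_delay_scale=500000

# or at runtime, e.g.:
# echo 60 > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
```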

Moreover, iSCSI can aggregate several operations from the queue into one, so you see more IOPS on the initiator side. Run a fully random test (fio) to rule this out.

Regarding the slower reads: have you disabled read-ahead on the initiator side?

kaazoo commented 8 years ago

@zielony360 How do I disable readahead on the initiator side? I tried the following:

echo 0 > /sys/block/dm-2/queue/read_ahead_kb 
echo 0 > /sys/block/sde/queue/read_ahead_kb 
echo 0 > /sys/block/sdg/queue/read_ahead_kb 

That doesn't seem to have any impact when testing with dd. I will also test with fio as soon as possible.
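One way to make such a read-ahead setting persist across reboots and device re-scans is a udev rule. A sketch only; the filename and the device names matched below are hypothetical and must be adapted to the actual setup:

```
# /etc/udev/rules.d/60-readahead.rules (hypothetical path) --
# disable read-ahead on the iSCSI-backed disks and the multipath device
ACTION=="add|change", KERNEL=="sde", ATTR{queue/read_ahead_kb}="0"
ACTION=="add|change", KERNEL=="sdg", ATTR{queue/read_ahead_kb}="0"
ACTION=="add|change", KERNEL=="dm-2", ATTR{queue/read_ahead_kb}="0"
```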

kaazoo commented 8 years ago

On the target, I switched to a zpool with 2x raidz2 (8 disks each, ashift=12). Then I created a zvol with volblocksize=128K.

dd locally on target:

dd if=/dev/zero bs=1M count=102400 of=/dev/zvol/data/iscsi conv=fsync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 275.299 s, 390 MB/s

dd of=/dev/null bs=1M count=102400 if=/dev/zvol/data/iscsi conv=fsync
dd: fsync failed for ‘/dev/null’: Invalid argument
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 149.828 s, 717 MB/s

dd on initiator:

dd if=/dev/zero bs=1M count=102400 of=/live_neu/test/testfile conv=fsync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 119.756 s, 897 MB/s

dd if=/dev/zero bs=1M count=102400 of=/live_neu/test/testfile2 oflag=sync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 366.874 s, 293 MB/s

dd of=/dev/null bs=1M count=102400 if=/live_neu/test/testfile conv=fsync
dd: fsync failed for ‘/dev/null’: Invalid argument
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 400.654 s, 268 MB/s

dd of=/dev/null bs=1M count=102400 if=/live_neu/test/testfile2 iflag=sync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 715.744 s, 150 MB/s

As I have no experience with fio so far, I followed http://n4f.siftusystems.com/index.php/2013/07/14/zfs-zvol-blocksize-and-esxi/ .
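The fio command lines used in these tests could also be captured in a job file and run as `fio sequential.fio`. A sketch; the filename and section names are arbitrary:

```ini
; sequential.fio -- mirrors the CLI flags used in these tests
[global]
bs=128k
size=100G
numjobs=1
group_reporting

[sequential-write]
rw=write

[sequential-read]
rw=read
```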

fio sequential write on initiator:

fio --rw=write --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
sequential: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W] [100.0% done] [0KB/368.0MB/0KB /s] [0/2944/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=14936: Wed Jan 13 17:22:15 2016
  write: io=102400MB, bw=304557KB/s, iops=2379, runt=344295msec
    clat (usec): min=39, max=599211, avg=414.62, stdev=5110.46
     lat (usec): min=42, max=599214, avg=417.36, stdev=5110.46
    clat percentiles (usec):
     |  1.00th=[   51],  5.00th=[   65], 10.00th=[   71], 20.00th=[   74],
     | 30.00th=[   77], 40.00th=[   81], 50.00th=[   92], 60.00th=[  171],
     | 70.00th=[  189], 80.00th=[  195], 90.00th=[  205], 95.00th=[  211],
     | 99.00th=[  298], 99.50th=[27520], 99.90th=[42240], 99.95th=[51456],
     | 99.99th=[220160]
    bw (KB  /s): min=24581, max=424704, per=100.00%, avg=311173.98, stdev=78388.57
    lat (usec) : 50=0.80%, 100=52.47%, 250=45.21%, 500=0.69%, 750=0.02%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.07%, 50=0.65%
    lat (msec) : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=1.73%, sys=19.80%, ctx=359725, majf=0, minf=1128
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=819200/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=304557KB/s, minb=304557KB/s, maxb=304557KB/s, mint=344295msec, maxt=344295msec

fio sequential read on initiator:

fio --rw=read --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [301.9MB/0KB/0KB /s] [2415/0/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=14947: Wed Jan 13 17:29:18 2016
  read : io=102400MB, bw=276221KB/s, iops=2157, runt=379615msec
    clat (usec): min=16, max=362018, avg=460.98, stdev=4534.24
     lat (usec): min=16, max=362018, avg=461.22, stdev=4534.24
    clat percentiles (usec):
     |  1.00th=[   26],  5.00th=[   31], 10.00th=[   33], 20.00th=[   36],
     | 30.00th=[   39], 40.00th=[   41], 50.00th=[   44], 60.00th=[   47],
     | 70.00th=[   53], 80.00th=[  145], 90.00th=[  748], 95.00th=[ 1416],
     | 99.00th=[ 4384], 99.50th=[11072], 99.90th=[52480], 99.95th=[112128],
     | 99.99th=[193536]
    bw (KB  /s): min=43309, max=435604, per=100.00%, avg=279063.82, stdev=70748.45
    lat (usec) : 20=0.03%, 50=65.09%, 100=14.18%, 250=2.61%, 500=4.16%
    lat (usec) : 750=3.96%, 1000=2.69%
    lat (msec) : 2=4.35%, 4=1.85%, 10=0.55%, 20=0.26%, 50=0.17%
    lat (msec) : 100=0.05%, 250=0.06%, 500=0.01%
  cpu          : usr=0.70%, sys=14.14%, ctx=180416, majf=0, minf=1716
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=819200/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=102400MB, aggrb=276220KB/s, minb=276220KB/s, maxb=276220KB/s, mint=379615msec, maxt=379615msec

fio random write on initiator:

fio --direct=0 --rw=randwrite --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
randwrite: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/402.2MB/0KB /s] [0/3217/0 iops] [eta 00m:00s]  
randwrite: (groupid=0, jobs=1): err= 0: pid=16484: Thu Jan 14 08:11:14 2016
  write: io=102400MB, bw=248721KB/s, iops=1943, runt=421588msec
    clat (usec): min=42, max=351883, avg=507.57, stdev=2101.34
     lat (usec): min=43, max=351886, avg=510.05, stdev=2101.40
    clat percentiles (usec):
     |  1.00th=[   78],  5.00th=[  227], 10.00th=[  266], 20.00th=[  298],
     | 30.00th=[  314], 40.00th=[  334], 50.00th=[  350], 60.00th=[  366],
     | 70.00th=[  390], 80.00th=[  418], 90.00th=[  474], 95.00th=[  540],
     | 99.00th=[  772], 99.50th=[15680], 99.90th=[31616], 99.95th=[39168],
     | 99.99th=[58624]
    bw (KB  /s): min= 1124, max=640000, per=100.00%, avg=250529.14, stdev=155233.74
    lat (usec) : 50=0.03%, 100=2.20%, 250=4.97%, 500=85.31%, 750=6.41%
    lat (usec) : 1000=0.37%
    lat (msec) : 2=0.03%, 4=0.02%, 10=0.02%, 20=0.35%, 50=0.28%
    lat (msec) : 100=0.02%, 250=0.01%, 500=0.01%
  cpu          : usr=1.63%, sys=18.54%, ctx=805263, majf=0, minf=1995
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=819200/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=248720KB/s, minb=248720KB/s, maxb=248720KB/s, mint=421588msec, maxt=421588msec

fio random read on initiator:

fio --direct=0 --rw=randread --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [r] [7.8% done] [12544KB/0KB/0KB /s] [98/0/0 iops] [eta 02h:06m:40s] 
Jobs: 1 (f=1): [r] [8.6% done] [11776KB/0KB/0KB /s] [92/0/0 iops] [eta 02h:05m:28s] 
Jobs: 1 (f=1): [r] [100.0% done] [53760KB/0KB/0KB /s] [420/0/0 iops] [eta 00m:02s]   
randwrite: (groupid=0, jobs=1): err= 0: pid=16506: Thu Jan 14 10:29:52 2016
  read : io=102400MB, bw=12890KB/s, iops=100, runt=8134640msec
    clat (usec): min=46, max=111984, avg=9920.28, stdev=3906.41
     lat (usec): min=46, max=111985, avg=9920.85, stdev=3906.41
    clat percentiles (usec):
     |  1.00th=[ 1288],  5.00th=[ 6304], 10.00th=[ 7392], 20.00th=[ 8384],
     | 30.00th=[ 9024], 40.00th=[ 9536], 50.00th=[ 9792], 60.00th=[10176],
     | 70.00th=[10432], 80.00th=[10688], 90.00th=[10944], 95.00th=[11584],
     | 99.00th=[28544], 99.50th=[36608], 99.90th=[49920], 99.95th=[54016],
     | 99.99th=[67072]
    bw (KB  /s): min= 6095, max=74602, per=100.00%, avg=12903.65, stdev=1555.26
    lat (usec) : 50=0.01%, 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%
    lat (usec) : 1000=0.29%
    lat (msec) : 2=1.24%, 4=0.41%, 10=53.10%, 20=42.88%, 50=1.92%
    lat (msec) : 100=0.10%, 250=0.01%
  cpu          : usr=0.16%, sys=1.59%, ctx=819798, majf=0, minf=5848
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=819200/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=102400MB, aggrb=12890KB/s, minb=12890KB/s, maxb=12890KB/s, mint=8134640msec, maxt=8134640msec

I'm not sure if my testing is correct, but it still seems that reading is performing worse than writing.

sequential (buffered?) write: 897 MB/s (dd conv=fsync)

sequential (unbuffered?) write: 293 MB/s (dd oflag=sync), 304 MB/s (fio)

random write: 245 MB/s (fio)

sequential (buffered?) read: 268 MB/s (dd conv=fsync)

sequential (unbuffered?) read: 150 MB/s (dd iflag=sync), 273 MB/s (fio)

random read: 13 MB/s (fio)

Do you think these results are OK or should they be better?

kaazoo commented 8 years ago

In order to have something to compare against, I created an ext4 filesystem instead of ZFS on the initiator.

ext4 doesn't seem to support a 128 KB block size (the maximum seems to be 64 KB), so I just went with the standard 4 KB, although this is not optimal.

# mkfs.ext4 /dev/mapper/mpath139
mke2fs 1.42.9 (4-Feb-2014)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=32 blocks, Stripe width=32 blocks
1335885824 inodes, 21374173184 blocks
1068708659 blocks (5.00%) reserved for the super user
First data block=0
652288 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
    102400000, 214990848, 512000000, 550731776, 644972544, 1934917632, 
    2560000000, 3855122432, 5804752896, 12800000000, 17414258688

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done       

# mount -o noatime /dev/mapper/mpath139 /mnt/

I switched on readahead again:

# echo 16384 > /sys/block/dm-2/queue/read_ahead_kb
# echo 16384 > /sys/block/sd/queue/read_ahead_kb 
# echo 16384 > /sys/block/sd/queue/read_ahead_kb 

dd on initiator:

# dd if=/dev/zero bs=1M count=102400 of=/mnt/testfile conv=fsync
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 241.338 s, 445 MB/s

# echo 3 > /proc/sys/vm/drop_caches

# dd of=/dev/null bs=1M count=102400 if=/mnt/testfile conv=fsync
dd: fsync failed for ‘/dev/null’: Invalid argument
102400+0 records in
102400+0 records out
107374182400 bytes (107 GB) copied, 204.507 s, 525 MB/s

# dd if=/dev/zero bs=1M count=10240 of=/mnt/testfile2 oflag=sync
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 433.752 s, 24.8 MB/s

# echo 3 > /proc/sys/vm/drop_caches

# dd of=/dev/null bs=1M count=10240 if=/mnt/testfile2 iflag=sync
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 14.9626 s, 718 MB/s

fio sequential write on initiator:

# fio --rw=write --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
sequential: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W] [100.0% done] [0KB/834.9MB/0KB /s] [0/6678/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=17231: Thu Jan 14 14:43:58 2016
  write: io=102400MB, bw=544325KB/s, iops=4252, runt=192638msec
    clat (usec): min=48, max=5062.4K, avg=232.55, stdev=10358.33
     lat (usec): min=50, max=5062.4K, avg=234.12, stdev=10358.34
    clat percentiles (usec):
     |  1.00th=[   64],  5.00th=[   65], 10.00th=[   67], 20.00th=[   69],
     | 30.00th=[   76], 40.00th=[   80], 50.00th=[   84], 60.00th=[   89],
     | 70.00th=[   96], 80.00th=[  116], 90.00th=[  159], 95.00th=[  203],
     | 99.00th=[  350], 99.50th=[ 8640], 99.90th=[14784], 99.95th=[15680],
     | 99.99th=[49920]
    bw (KB  /s): min=   35, max=1711872, per=100.00%, avg=626581.29, stdev=352171.63
    lat (usec) : 50=0.01%, 100=72.86%, 250=24.97%, 500=1.20%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.54%, 20=0.37%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%, >=2000=0.01%
  cpu          : usr=1.22%, sys=41.73%, ctx=8049, majf=0, minf=672
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=819200/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=544324KB/s, minb=544324KB/s, maxb=544324KB/s, mint=192638msec, maxt=192638msec

Disk stats (read/write):
    dm-2: ios=0/12512, merge=0/87329, ticks=0/23337968, in_queue=23435524, util=99.30%, aggrios=0/6269, aggrmerge=0/0, aggrticks=0/11595746, aggrin_queue=11660066, aggrutil=81.17%
  sde: ios=0/6238, merge=0/0, ticks=0/11229532, in_queue=11348720, util=78.70%
  sdf: ios=0/6300, merge=0/0, ticks=0/11961960, in_queue=11971412, util=81.17%

fio sequential read on initiator:

# echo 3 > /proc/sys/vm/drop_caches

# fio --rw=read --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
sequential: (g=0): rw=read, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [R] [100.0% done] [522.8MB/0KB/0KB /s] [4181/0/0 iops] [eta 00m:00s]
sequential: (groupid=0, jobs=1): err= 0: pid=17246: Thu Jan 14 14:51:58 2016
  read : io=102400MB, bw=543450KB/s, iops=4245, runt=192948msec
    clat (usec): min=12, max=655945, avg=232.91, stdev=1707.33
     lat (usec): min=12, max=655945, avg=233.18, stdev=1707.34
    clat percentiles (usec):
     |  1.00th=[   13],  5.00th=[   14], 10.00th=[   15], 20.00th=[   17],
     | 30.00th=[   22], 40.00th=[   28], 50.00th=[   31], 60.00th=[   35],
     | 70.00th=[   39], 80.00th=[   62], 90.00th=[  145], 95.00th=[ 1080],
     | 99.00th=[ 3216], 99.50th=[ 5728], 99.90th=[20096], 99.95th=[30336],
     | 99.99th=[54016]
    bw (KB  /s): min=179851, max=743424, per=100.00%, avg=545326.82, stdev=87853.49
    lat (usec) : 20=26.91%, 50=50.69%, 100=8.95%, 250=5.97%, 500=1.36%
    lat (usec) : 750=0.42%, 1000=0.50%
    lat (msec) : 2=2.17%, 4=2.42%, 10=0.32%, 20=0.19%, 50=0.09%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=0.56%, sys=36.64%, ctx=57842, majf=0, minf=625
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=819200/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=102400MB, aggrb=543450KB/s, minb=543450KB/s, maxb=543450KB/s, mint=192948msec, maxt=192948msec

Disk stats (read/write):
    dm-2: ios=51254/600, merge=51197/43, ticks=320060/8268, in_queue=328016, util=98.62%, aggrios=25629/317, aggrmerge=0/0, aggrticks=152020/4128, aggrin_queue=156150, aggrutil=49.03%
  sde: ios=25653/328, merge=0/0, ticks=152832/4548, in_queue=157380, util=49.03%
  sdf: ios=25606/307, merge=0/0, ticks=151208/3708, in_queue=154920, util=49.00%

fio random write on initiator:

# fio --direct=0 --rw=randwrite --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randwrite, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
randwrite: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/553.5MB/0KB /s] [0/4427/0 iops] [eta 00m:00s]
randwrite: (groupid=0, jobs=1): err= 0: pid=17254: Thu Jan 14 14:56:54 2016
  write: io=102400MB, bw=408593KB/s, iops=3192, runt=256631msec
    clat (usec): min=56, max=1199.9K, avg=309.51, stdev=3036.72
     lat (usec): min=57, max=1199.9K, avg=311.13, stdev=3036.76
    clat percentiles (usec):
     |  1.00th=[   68],  5.00th=[   79], 10.00th=[   95], 20.00th=[  105],
     | 30.00th=[  113], 40.00th=[  120], 50.00th=[  129], 60.00th=[  139],
     | 70.00th=[  163], 80.00th=[  207], 90.00th=[  258], 95.00th=[  310],
     | 99.00th=[ 5728], 99.50th=[ 9280], 99.90th=[16512], 99.95th=[26752],
     | 99.99th=[87552]
    bw (KB  /s): min=  240, max=1632768, per=100.00%, avg=415720.78, stdev=301620.32
    lat (usec) : 100=13.60%, 250=75.44%, 500=8.96%, 750=0.14%, 1000=0.05%
    lat (msec) : 2=0.13%, 4=0.23%, 10=1.05%, 20=0.34%, 50=0.06%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 2000=0.01%
  cpu          : usr=1.24%, sys=46.85%, ctx=396770, majf=0, minf=2807
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=819200/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=408592KB/s, minb=408592KB/s, maxb=408592KB/s, mint=256631msec, maxt=256631msec

Disk stats (read/write):
    dm-2: ios=0/601733, merge=0/46030, ticks=0/29710064, in_queue=29717184, util=99.51%, aggrios=0/301806, aggrmerge=0/0, aggrticks=0/14675630, aggrin_queue=14680310, aggrutil=74.27%
  sde: ios=0/301762, merge=0/0, ticks=0/16050024, in_queue=16058292, util=74.27%
  sdf: ios=0/301851, merge=0/0, ticks=0/13301236, in_queue=13302328, util=66.70%

fio random read on initiator:

# echo 3 > /proc/sys/vm/drop_caches

# fio --direct=0 --rw=randread --bs=128k --size=100G --numjobs=1 --group_reporting --name=randwrite
randwrite: (g=0): rw=randread, bs=128K-128K/128K-128K/128K-128K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
^Cbs: 1 (f=1): [r] [2.7% done] [12915KB/0KB/0KB /s] [100/0/0 iops] [eta 02h:38m:43s]
fio: terminating on signal 2

randwrite: (groupid=0, jobs=1): err= 0: pid=17270: Thu Jan 14 15:02:01 2016
  read : io=2776.7MB, bw=10747KB/s, iops=83, runt=264558msec
    clat (msec): min=1, max=2031, avg=11.90, stdev=22.27
     lat (msec): min=1, max=2031, avg=11.90, stdev=22.27
    clat percentiles (msec):
     |  1.00th=[    7],  5.00th=[    8], 10.00th=[    9], 20.00th=[   10],
     | 30.00th=[   10], 40.00th=[   11], 50.00th=[   11], 60.00th=[   11],
     | 70.00th=[   11], 80.00th=[   12], 90.00th=[   16], 95.00th=[   22],
     | 99.00th=[   40], 99.50th=[   46], 99.90th=[   98], 99.95th=[  212],
     | 99.99th=[  930]
    bw (KB  /s): min=  194, max=13407, per=100.00%, avg=10935.34, stdev=1823.38
    lat (msec) : 2=0.02%, 4=0.02%, 10=39.20%, 20=54.93%, 50=5.50%
    lat (msec) : 100=0.24%, 250=0.05%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2000=0.01%, >=2000=0.01%
  cpu          : usr=0.10%, sys=0.88%, ctx=24149, majf=0, minf=1597
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=22213/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=2776.7MB, aggrb=10747KB/s, minb=10747KB/s, maxb=10747KB/s, mint=264558msec, maxt=264558msec

Disk stats (read/write):
    dm-2: ios=24115/803, merge=0/58, ticks=262036/19312, in_queue=281284, util=99.17%, aggrios=12062/426, aggrmerge=0/0, aggrticks=129436/9642, aggrin_queue=139076, aggrutil=49.03%
  sde: ios=12079/421, merge=0/0, ticks=129400/8816, in_queue=138212, util=49.01%
  sdf: ios=12046/431, merge=0/0, ticks=129472/10468, in_queue=139940, util=49.03%

sequential (buffered?) write: 445 MB/s (dd conv=fsync)

sequential (unbuffered?) write: 25 MB/s (dd oflag=sync), 612 MB/s (fio)

random write: 406 MB/s (fio)

sequential (buffered?) read: 525 MB/s (dd conv=fsync)

sequential (unbuffered?) read: 718 MB/s (dd iflag=sync), 533 MB/s (fio)

random read: 11 MB/s (fio)

The results of ZFS vs. ext4 differ a lot. The ratio between (fio) sequential reads and sequential writes seems better with ext4; it matches the ratio you can see on the target. I don't know what's wrong with dd oflag=sync. Random reads perform badly for both.

kaazoo commented 8 years ago

I did further testing with a 128 KB blocksize for the zvol and for fio.

The following commands were used for fio:

# fio --rw=write --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
# fio --rw=read --bs=128k --size=100G --numjobs=1 --group_reporting --name=sequential
# fio --direct=0 --rw=randwrite --bs=128k --size=100G --numjobs=1 --group_reporting --name=random
# fio --direct=0 --rw=randread --bs=128k --size=100G --numjobs=1 --group_reporting --name=random
| test | blocksize log/phy of dev | fio seq write | fio seq read | fio rand write | fio rand read |
|---|---|---|---|---|---|
| zfs on target | 4K | 827 MB/s | 1446 MB/s | 815 MB/s | 26 MB/s |
| zfs on initiator on LIO iSCSI disk on zvol | 512 / 128K | 308 MB/s | 251 MB/s | 334 MB/s | 13 MB/s |
| zfs on initiator on LIO iSCSI disk on zfs | 4K | 127 MB/s | 356 MB/s | 381 MB/s | 19 MB/s |
| zfs on initiator on SCST iSCSI disk on zvol | 4K / 128K | 299 MB/s | 277 MB/s | 328 MB/s | 13 MB/s |
| zfs on initiator on SCST iSCSI disk on zfs | 4K | 208 MB/s | 526 MB/s | 327 MB/s | 21 MB/s |
| ext4 on initiator on LIO iSCSI disk on zvol | 512 / 128K | 739 MB/s | 381 MB/s | 413 MB/s | 10 MB/s |
| ext4 on initiator on SCST iSCSI disk on zvol | 512 / 128K | 874 MB/s | 471 MB/s | 456 MB/s | 10 MB/s |
| ext4 on initiator on SCST iSCSI disk on zvol | 4K / 128K | 807 MB/s | 514 MB/s | 435 MB/s | 10 MB/s |

Then I also tested fio with 4 KB blocksize on 128 KB zvol blocksize.

The following commands were used for fio:

# fio --rw=write --bs=4k --size=100G --numjobs=1 --group_reporting --name=sequential_4k
# fio --rw=read --bs=4k --size=100G --numjobs=1 --group_reporting --name=sequential_4k
# fio --direct=0 --rw=randwrite --bs=4k --size=100G --numjobs=1 --group_reporting --name=random_4k
# fio --direct=0 --rw=randread --bs=4k --size=100G --numjobs=1 --group_reporting --name=random_4k
| test | blocksize log/phy of dev | fio seq write | fio seq read | fio rand write | fio rand read |
|---|---|---|---|---|---|
| zfs on target | 4K | 84 MB/s | 1061 MB/s | 305 KB/s | 842 KB/s |
| zfs on initiator on LIO iSCSI disk on zvol | 512 / 128K | 50 MB/s | 288 MB/s | 200 KB/s | 1 MB/s |
| zfs on initiator on SCST iSCSI disk on zvol | 4K / 128K | 52 MB/s | 299 MB/s | 176 KB/s | 632 KB/s |
| ext4 on initiator on LIO iSCSI disk on zvol | 512 / 128K | 614 MB/s | 364 MB/s | 16 MB/s | 300 KB/s |
| ext4 on initiator on SCST iSCSI disk on zvol | 4K / 128K | 732 MB/s | 436 MB/s | 46 MB/s | 10 MB/s |

Ext4 seems to perform much better than ZFS when running on an iSCSI disk.

sempervictus commented 8 years ago

We're seeing this too, and what's interesting is that the read speeds seem to indicate actual disk reads rather than ARC hits, unlike the raw ZFS-on-target results. Could you post the contents of scst.conf and the saveconfig.json from LIO?

h1z1 commented 6 years ago

Curious this is still open from 2016? It was the first few hits on Google for zfs and slow iscsi.

sempervictus commented 6 years ago

A few recent commits seem to have smoothed out zvols a bit, but we are nowhere near the 0.6.4 numbers these days. Zvols in general suffer from performance issues due to copy-on-write overhead on binary targets, but they were much faster in the past.

h1z1 commented 6 years ago

Would love to see some realistic numbers for that reason. Right now I'm looking at a deadlocked zvol; it happened as soon as I set sync=default. Why? Who knows, likely something I broke, but I don't even know where to start. Not only did it hang that zvol, it hung the entire pool. I thought there were at least two timers that should have kicked the process by now, but they haven't. Point is, I'd agree: zvols still have some rather odd bugs.