openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Very high cpu load writing to a zvol #7631

Open akschu opened 6 years ago

akschu commented 6 years ago

System information:
Distro: customized slackware64-14.2
Kernel: 4.14.49
zfs/spl: 0.7.9-1
CPUs: (2) E5-2690 v3
Controller: HP P440ar RAID controller (using ZFS for volume management/compression)

Also tried on (with same results):
Distro: customized slackware64-14.2
Kernel: 4.9.101
zfs/spl: 0.7.9-1
CPU: (1) E3-1230
Controller: LSI 2008 in IT mode with 4 SAS disks

The issue is that I get poor write performance to a ZVOL, and the zvol kernel threads burn lots of CPU, causing very high load averages on the machine. At first I was seeing the issue in libvirt/qemu while doing a virtual machine block copy, but I reduced it down to this:

# dd if=/datastore/vm/dng-smokeping/dng-smokeping.raw of=/dev/zvol/datastore/vm/test bs=1M
51200+0 records in
51200+0 records out
53687091200 bytes (50.0GB) copied, 318.477527 seconds, 160.8MB/s

Speed isn't great, but the real issue is the load average goes through the roof:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     15503 44.2  0.0  17660  2956 pts/1    D+   12:12   1:22 dd if=/datastore/vm/dng-smokeping/dng-smokeping.raw of=/dev/zvol/datastore/vm/test bs=1M
root     15506 18.6  0.0      0     0 ?        R<   12:12   0:33 [zvol]
root     15505 18.6  0.0      0     0 ?        D<   12:12   0:33 [zvol]
root     48390 17.2  0.0      0     0 ?        D<   12:00   2:42 [zvol]
root     48296 17.2  0.0      0     0 ?        R<   11:59   2:43 [zvol]
root     48290 17.2  0.0      0     0 ?        R<   11:59   2:43 [zvol]
root     48289 17.2  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48287 17.2  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48282 17.2  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48280 17.2  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48274 17.2  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48273 17.2  0.0      0     0 ?        R<   11:59   2:43 [zvol]
root     48271 17.2  0.0      0     0 ?        R<   11:59   2:43 [zvol]
root     48298 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48297 17.1  0.0      0     0 ?        D<   11:59   2:42 [zvol]
root     48295 17.1  0.0      0     0 ?        D<   11:59   2:42 [zvol]
root     48293 17.1  0.0      0     0 ?        R<   11:59   2:42 [zvol]
root     48292 17.1  0.0      0     0 ?        R<   11:59   2:43 [zvol]
root     48291 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48288 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48286 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48284 17.1  0.0      0     0 ?        D<   11:59   2:42 [zvol]
root     48283 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48281 17.1  0.0      0     0 ?        D<   11:59   2:42 [zvol]
root     48279 17.1  0.0      0     0 ?        R<   11:59   2:43 [zvol]
root     48278 17.1  0.0      0     0 ?        D<   11:59   2:42 [zvol]
root     48277 17.1  0.0      0     0 ?        R<   11:59   2:42 [zvol]
root     48276 17.1  0.0      0     0 ?        D<   11:59   2:42 [zvol]
root     48275 17.1  0.0      0     0 ?        R<   11:59   2:42 [zvol]
root     48272 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root     48270 17.1  0.0      0     0 ?        D<   11:59   2:43 [zvol]
root       800 13.9  0.0      0     0 ?        D<   11:12   8:47 [zvol]
root     47832 12.2  0.0      0     0 ?        R<   11:53   2:43 [zvol]
root      3798  0.0  0.0  16764  1200 pts/0    S+   12:15   0:00 egrep USER|zvol
root      1432  0.0  0.0      0     0 ?        S    11:13   0:00 [z_zvol]

# uptime
 12:15:47 up  1:03,  2 users,  load average: 44.88, 25.17, 19.15

Now, if I go in the opposite direction, it's much faster and the load average isn't nearly as high:

# dd of=/datastore/vm/dng-smokeping/dng-smokeping.raw if=/dev/zvol/datastore/vm/test bs=1M
51200+0 records in
51200+0 records out
53687091200 bytes (50.0GB) copied, 94.473277 seconds, 542.0MB/s

There is only a single busy zvol thread, and the load average is normal:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     45782 67.0  0.0  17660  2928 pts/1    R+   12:19   1:00 dd of=/datastore/vm/dng-smokeping/dng-smokeping.raw if=/dev/zvol/datastore/vm/test bs=1M
root       800 14.6  0.0      0     0 ?        S<   11:12  10:02 [zvol]
root      1432  0.0  0.0      0     0 ?        S    11:13   0:00 [z_zvol]
root      1303  0.0  0.0  16764  1032 pts/0    S+   12:21   0:00 egrep USER|zvol

# uptime
 12:21:14 up  1:08,  2 users,  load average: 3.57, 16.60, 18.11

What is also interesting is that both of these are in the same pool:

# zfs list
NAME                         USED  AVAIL  REFER  MOUNTPOINT
datastore                    299G  1.78T  15.1G  /datastore
datastore/vm                 284G  1.78T  18.0G  /datastore/vm
datastore/vm/test           51.6G  1.83T  1007M  -

So I'm not sure what to look at. As it stands, I can't really write to a zvol without killing the machine, so I'm using raw disk images on a mounted ZFS filesystem to avoid the double COW.

Thanks!

behlendorf commented 6 years ago

@akschu try decreasing the default number of zvol threads. You can do this by setting the zvol_threads module option to the maximum number of allowed threads. You will need to unload and reload the modules for the change to take effect. If possible it would be very helpful if you could test a few values, say 1, 2, 4, 8, 16, and 32, for your workload and report your results. This is a change we've been meaning to make and we'd appreciate the additional performance data to settle on a good value for most systems. Could you also please include how many cores your system has?
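
For reference, one test iteration could look roughly like this (a sketch; it assumes the pool is exported first so the modules can be unloaded, with the thread count changed per run):

# zpool export datastore
# modprobe -r zfs
# modprobe zfs zvol_threads=8
# zpool import datastore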

akschu commented 6 years ago

This machine has (2) 12-core CPUs, so 24 physical cores, and 48 logical CPUs listed in Linux due to hyperthreading.

I can do some benchmarking to help out, but I can already tell that 1 or 2 is going to absolutely kill performance. The single zvol thread test has been running for over half an hour and the zvol thread is only hitting 10% CPU.

What I don't understand is why reading/writing to/from a file and a zvol in the same pool is so different in CPU load and performance when only the direction changes. It seems like I should see nearly identical performance.

If an image file on the filesystem is 5 times faster than a zvol, that leads me to believe there is some other issue, or perhaps I shouldn't use a zvol.

Anyway, I'll benchmark a little more, but some information about why the direction matters so much would be helpful for me, and probably for others who see their machine come to its knees with nothing more than a zvol write.

behlendorf commented 6 years ago

@akschu it would be helpful to run perf top during the high CPU load to determine where that CPU time is being spent.

akschu commented 6 years ago

I tried to cancel my single-zvol-thread dd test after an hour: I tried ctrl-c and then kill -9, and it's stuck. I'd say that on my system a single-thread zvol isn't even usable.

In the meantime, here is what the system is busy doing:

Samples: 129K of event 'cycles:ppp', Event count (approx.): 13399755376                                                                                                                     
Overhead  Shared Object            Symbol                                                                                                                                                   
  14.27%  [kernel]                 [k] queued_spin_lock_slowpath
   6.69%  [kernel]                 [k] fletcher_4_avx2_native
   3.95%  [kernel]                 [k] _raw_spin_lock_irqsave
   3.90%  [kernel]                 [k] osq_lock
   2.40%  [kernel]                 [k] _raw_spin_lock
   2.26%  [kernel]                 [k] zio_create
   1.95%  [kernel]                 [k] taskq_thread
   1.66%  [kernel]                 [k] try_to_wake_up
   1.55%  [kernel]                 [k] arc_write
   1.52%  [kernel]                 [k] menu_select
   1.52%  [kernel]                 [k] zio_bookmark_compare
   1.50%  [kernel]                 [k] __schedule
   1.47%  [kernel]                 [k] __slab_free
   1.33%  [kernel]                 [k] new_slab
   1.31%  [kernel]                 [k] mutex_lock
   1.10%  [kernel]                 [k] abd_free
   1.04%  [kernel]                 [k] select_task_rq_fair
   1.03%  [kernel]                 [k] avl_add
   1.01%  [kernel]                 [k] dbuf_sync_list
   0.91%  [kernel]                 [k] zio_write_compress
   0.86%  [kernel]                 [k] mutex_spin_on_owner
   0.73%  [kernel]                 [k] cpuidle_enter_state
   0.72%  [kernel]                 [k] enqueue_entity
   0.71%  [kernel]                 [k] __wake_up_common
   0.66%  [kernel]                 [k] dbuf_write.isra.19
   0.65%  [kernel]                 [k] zio_execute
   0.64%  [kernel]                 [k] __indirect_thunk_start
   0.63%  [kernel]                 [k] do_idle
   0.57%  [kernel]                 [k] sched_clock
   0.56%  [kernel]                 [k] lapic_next_deadline
   0.55%  [kernel]                 [k] native_apic_msr_write
   0.54%  [kernel]                 [k] avl_first
   0.53%  [kernel]                 [k] _raw_spin_unlock_irqrestore
   0.53%  [kernel]                 [k] kmem_cache_alloc
   0.53%  [kernel]                 [k] update_cfs_rq_h_load
   0.51%  [kernel]                 [k] kmem_cache_free
   0.51%  [kernel]                 [k] ktime_get

I'll have more information tomorrow.

Thanks for the help and working on ZFS

dweeezil commented 6 years ago

I'd like to point out that dd-ing data to a block device essentially fills memory with data and then relies on the kernel to flush it out, which effectively subverts the various throttles built into ZFS. Try adding oflag=direct to the dd command.
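
Applied to the dd command from the original report, that would be:

# dd if=/datastore/vm/dng-smokeping/dng-smokeping.raw of=/dev/zvol/datastore/vm/test bs=1M oflag=direct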

GregorKopka commented 6 years ago

@akschu did you export/import the pool between the tests?

The reverse copy could well have been served the data in the zvol purely from ARC (assuming it is big enough), without any reads from the physical disks, which could explain the speedup (and especially the lower load) on the reverse copy.

Also: should the prerequisites for the nop_write feature (a strong checksum and active compression) be met on datastore/vm, the data would not actually be written to /datastore/vm/dng-smokeping/dng-smokeping.raw at all, since identical data already exists there (ZFS sees this and drops the rewrites, so there is no need to write anything to the physical disks). This would be another way to explain the speedup (and part of the lower load) you noticed on the copy back from the zvol to the image on the filesystem.
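
Whether those prerequisites are in place can be checked with something along the lines of (dataset name taken from the zfs list output above):

# zfs get checksum,compression datastore/vm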

akschu commented 6 years ago

@dweeezil that might be the issue, because when I use that flag, load averages are way more normal and I only get a single zvol thread regardless of how many I specify when loading the zfs module.

The reason for dd was that I was just trying to reduce the test to the lowest common denominator and determine whether this was a qemu issue or a ZFS issue. When I saw the exact same behavior with dd, I figured it was probably a ZFS issue and reported it here.

The command that I was running when I discovered this was:

virsh blockcopy testvm vda --blockdev /dev/zvol/datastore/vm/testvm --wait --verbose --pivot --format raw

This command simply copies a virtual machine raw image to a block device and then switches out the storage (pivot) to migrate it from a raw image to a zvol. When I copy from a file to a zvol, it pretty much grinds the system to a halt due to the crazy high load average. When I copy from a zvol to a file, it's very fast with little load.

I'm not sure how to go about fixing this in the real world. The workload that qemu creates with a blockcopy writing to a zvol absolutely crushes the server. I think I might limit zvol_threads to 8 and see if that imposes enough of a throttle to prevent the copy from crushing the machine.
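
A minimal sketch of making that limit persistent (assuming the options are read from /etc/modprobe.d and the modules are reloaded, or the machine rebooted, afterwards):

# /etc/modprobe.d/zfs.conf
options zfs zvol_threads=8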

akschu commented 6 years ago

@GregorKopka, no, I wasn't exporting/importing between tests. I can try that. Your theory makes sense as to why it works so much faster one way than the other. oflag=direct seems to confirm that as well: with that flag set, I see 40% better read vs. write performance, instead of 300% better read performance without it.

So caching is certainly playing a role. I think the big issue for me is that writing a lot of data to a zvol hurts the machine enough to cause outages. That's why I reported it as a bug, because I don't think writing to a zvol should be able to completely take out the system, especially not a 24 core machine.

akschu commented 6 years ago

I am using cache=none and I did post to the mailing list without reply. Once I saw a simple DD crushing the server, I figured it would be reasonable to open a bug.

Here is the benchmarking data behlendorf asked for. Perhaps it will be of some use:

Testing for 1 thread.
=========================================================
modprobe zfs zvol_threads=1
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 147.408 s, 233 MB/s
 11:49:43 up 1 day, 37 min,  3 users,  load average: 5.91, 2.44, 2.33

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 215.478 s, 159 MB/s
 11:54:19 up 1 day, 41 min,  3 users,  load average: 8.54, 4.99, 3.31

Testing for 2 threads.
=========================================================
modprobe zfs zvol_threads=2
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 152.271 s, 226 MB/s
 11:58:20 up 1 day, 45 min,  3 users,  load average: 4.62, 4.24, 3.31

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 230.676 s, 149 MB/s
 12:03:11 up 1 day, 50 min,  3 users,  load average: 7.28, 5.86, 4.15

Testing for 4 threads.
=========================================================
modprobe zfs zvol_threads=4
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 146.52 s, 235 MB/s
 12:07:06 up 1 day, 54 min,  3 users,  load average: 4.64, 4.82, 4.05

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 166.434 s, 206 MB/s
 12:10:52 up 1 day, 58 min,  3 users,  load average: 12.58, 8.24, 5.45

Testing for 8 threads.
=========================================================
modprobe zfs zvol_threads=8
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 145.9 s, 236 MB/s
 12:14:48 up 1 day,  1:02,  3 users,  load average: 5.64, 6.60, 5.35

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 136.174 s, 252 MB/s
 12:18:04 up 1 day,  1:05,  3 users,  load average: 17.63, 11.35, 7.33

Testing for 16 threads.
=========================================================
modprobe zfs zvol_threads=16
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 141.739 s, 242 MB/s
 12:21:55 up 1 day,  1:09,  3 users,  load average: 6.99, 8.42, 6.94

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 135.107 s, 254 MB/s
 12:25:10 up 1 day,  1:12,  3 users,  load average: 20.02, 13.43, 9.05

Testing for 24 threads.
=========================================================
modprobe zfs zvol_threads=24
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 144.755 s, 237 MB/s
 12:29:04 up 1 day,  1:16,  3 users,  load average: 5.37, 8.33, 7.87

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 140.809 s, 244 MB/s
 12:32:25 up 1 day,  1:19,  3 users,  load average: 32.49, 17.96, 11.48

Testing for 32 threads.
=========================================================
modprobe zfs zvol_threads=32
TEST READ from zvol: dd if=/dev/zvol/datastore/vm/test of=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 134.234 s, 256 MB/s
 12:36:08 up 1 day,  1:23,  3 users,  load average: 5.67, 10.82, 9.90

TEST WRITE to zvol: dd of=/dev/zvol/datastore/vm/test if=/datastore/vm/test bs=1M
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 143.323 s, 240 MB/s
 12:39:31 up 1 day,  1:26,  3 users,  load average: 37.75, 22.46, 14.37

Seems like anything over 8 on my system just results in higher load averages.

dweeezil commented 6 years ago

@akschu After reading #7787, I ran your fio test from there to see what's happening. For reference, the command is:

fio --name=randwrite --filename=/dev/zvol/tank/v1 --ioengine=libaio --direct=1 --rw=write --size=30G --bs=4k --iodepth=8 --numjobs=8 --group_reporting --end_fsync=1

I added a --runtime=60 to speed things up a bit.

Before going into my findings, I'd like to repeat a bit of history here for anyone else following along with these and related issues: ZoL originally had the "zvol" taskqs, but they were eventually removed in the restructuring of 37f9dac. Later, due to their apparent need in some workloads, they were reinstated in 692e55b (issue #5824), but a tunable was added to revert to the previous behavior (see below).

It looks like an fio test like yours is an example of a workload that's hurt by the zvol taskqs. It spawns 8 processes, each of which attempts to maintain an IO depth of 8. This yields very high load averages due to excessive taskq dispatch and the use of spinlocks. Here's a flame graph showing the excessive CPU use (fg1): almost 50% of the entire CPU time is spent spinning on locks while trying to dispatch to the 32 zvol taskqs.

Setting zvol_request_sync=1 to effectively revert to the post-37f9dac, pre-692e55b behavior yields the following flame graph (fg2). The CPU usage still seems a bit high to me, but I suppose with 8 processes, each trying to keep 8 4K IOs in flight, the CPU usage is bound to be pretty high.

Now for the performance numbers (the pool is simply 32 7200RPM drives on the same SAS expander, arranged as 16 2-drive mirrors). Here are the relevant fio stats for the first case:

# fio --name=randwrite --filename=/dev/zvol/tank/v1 --ioengine=libaio --direct=1 --rw=write --size=30G --bs=4k --iodepth=8 --numjobs=8 --group_reporting --end_fsync=1 --runtime=60
...
  write: io=24331MB, bw=400274KB/s, iops=100068, runt= 62244msec
...
  cpu          : usr=3.22%, sys=86.45%, ctx=82592, majf=0, minf=267
...

It wrote 400.2MB/s at 100K IOPS.

For the second case, in which the zvol taskqs were not used:

# fio --name=randwrite --filename=/dev/zvol/tank/v1 --ioengine=libaio --direct=1 --rw=write --size=30G --bs=4k --iodepth=8 --numjobs=8 --group_reporting --end_fsync=1 --runtime=60
...
  write: io=40862MB, bw=661216KB/s, iops=165303, runt= 63282msec
...
  cpu          : usr=8.11%, sys=65.32%, ctx=8459491, majf=0, minf=928
...

It wrote 661.2MB/s at 165K IOPS.

Finally, since you had mentioned how much faster it was to use the file system, I created a recordsize=8k file system (8k being what zvols use) and ran the test against a 30GiB sparse pre-created file (which is essentially what a zvol is):

fio --name=randwrite --filename=/tank/fs/testfile --ioengine=libaio --direct=1 --rw=write --size=30G --bs=4k --iodepth=8 --numjobs=8 --group_reporting --end_fsync=1 --runtime=60
...
  write: io=32452MB, bw=549072KB/s, iops=137267, runt= 60521msec
...
  cpu          : usr=6.53%, sys=66.56%, ctx=8217389, majf=0, minf=554
...

It wrote 549.0MB/s at 137K IOPS, somewhat worse than the zvol-without-taskq case, and its CPU utilization was also similar.
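
For anyone wanting to repeat that last comparison, the setup was along these lines (a sketch; the exact dataset and file names are assumed from the fio command above):

# zfs create -o recordsize=8k tank/fs
# truncate -s 30G /tank/fs/testfile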

If there's a problem here, I think it would be whether zvol_request_sync=0 should be the default setting, or the fact that high-concurrency dispatch of threads to a taskq consumes lots of CPU in spinlocks. I'll note, too, that I did not record the load average during these tests, but I did observe CPU loads with "htop" and they were quite a bit higher in the first case, when the zvol taskqs were being used.

Finally, I'll note that your use of libvirt's "blockcopy" command is likely not using direct IO, which means it will suffer from all the ills you'd see when performing non-direct bulk writes to any other block device, but with the extra penalty of the overhead from all the taskqs.

richardelling commented 6 years ago

@dweeezil using --rw=write is a sequential workload and therefore subject to merging at the block layer, so the number of IOPS reported by fio is not the IOPS seen by the zvol. Direct IO doesn't really matter for zvols per se; it is there to satisfy the aio engine.

One approach that we use elsewhere is to size the number of threads by the number of CPUs. If you only have a few CPUs, then adding a bunch of threads won't help.
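
For what it's worth, a variant of the earlier fio command that largely takes block-layer merging out of the picture would use random writes instead (a sketch, reusing the same hypothetical zvol path):

fio --name=randwrite --filename=/dev/zvol/tank/v1 --ioengine=libaio --direct=1 --rw=randwrite --size=30G --bs=4k --iodepth=8 --numjobs=8 --group_reporting --end_fsync=1 --runtime=60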

richardelling commented 6 years ago

Also, it is worth noting that flamegraphs, by default, elide idle time, so you have to compare the actual rates between two different flamegraphs.

dweeezil commented 6 years ago

I wasn't actually terribly interested in the absolute performance numbers, particularly given what this test is actually doing. I think the most interesting finding is that much of the excessive CPU time is being used during spinlock contention while dispatching to the taskq. I'm still not convinced that the zvol taskqs are beneficial for most (user-land) workloads.

richardelling commented 6 years ago

Agreed. I've tried to add some guidance to the wiki: https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zvol_threads

behlendorf commented 6 years ago

Agreed. We should consider promoting the module option to a dataset property so this can be controlled per volume. To reduce the lock contention, we could additionally make the zvol taskqs per volume, instead of global, and decrease the default number of threads.

behlendorf commented 6 years ago

Making them per-dataset would potentially result in more total threads, but it would also entirely prevent lock contention between two volumes being actively used. Alternatively, we could do something per-pool like the zio taskqs, where the tasks are spread over multiple taskqs. That would let you bound the total number of threads and significantly reduce the contention. We'd want to experiment with what works best.

akschu commented 6 years ago

I think for my workload (virtual machines, each in their own dataset), having the ability to set the number of taskqs per dataset would help a lot. That way I can spread the taskqs between VMs instead of having one VM stampede all of the taskqs and make all of the other VMs unusable.

The current proposed solution of zvol_request_sync=1 certainly lowers the load average, but I'm not sure how much that will help if all of the I/O is consumed by one VM. I'll be testing some stuff tonight when it's less intrusive.

richardelling commented 6 years ago

Setting zvol_request_sync=1 will ensure that each zvol can only accept 1 I/O request at a time. Others will queue at the block layer. In some ways, for multiple zvols this is like round-robin scheduling, but not really, since the amount of effort and resources needed to handle an I/O can vary widely.
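
For anyone who wants to experiment with it, zvol_request_sync is an ordinary module parameter, so a sketch of toggling it at runtime (assuming the running module exposes it under /sys/module/zfs/parameters) would be:

echo 1 > /sys/module/zfs/parameters/zvol_request_sync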

aaronjwood commented 6 years ago

I'm seeing the same thing as the OP on my server. I noticed that zvols take up no space when first created and that space is continuously allocated from the pool as the zvol is written to. My system load also goes through the roof (about 5x my core count), and the high CPU usage (~85%) in this situation is in kernel mode.

I need to go back and double-check, but I don't believe there is much of a CPU hit at all when reading/writing to the volume when no allocation is being done.
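
One way to watch the allocation grow while writing would be something like this (the zvol name here is hypothetical):

# zfs get volsize,used,referenced,refreservation tank/myvol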

jonathanspw commented 5 years ago

I was running into issues similar to this on one server out of 4 similar ones doing similar tasks (just backup storage).

When the zvol processes went crazy on my end, they would make the system completely inaccessible, and the issue only seemed to occur when gzip compression was enabled. The gzip threads weren't what went haywire, though; it was always the zvol threads themselves.

I limited the number of threads to equal the number of HT CPU cores and so far the issue hasn't shown itself again.

Screenshot_2019-07-31_17-05-44

Solid red is what it did to the CPU (system time gone haywire). The gap/odd data is when it was totally unresponsive. The time before/after the solid red chunks is when everything was running fine; the solid red ends where I rebooted the server and changed the thread limit to 12, and it's been fine since. This is with an E5-2620 v3 and 8x 6TB drives.

shaneshort commented 4 years ago

I wouldn't mind some guidance here also. I'm running a pair of EPYC 7502P machines with NVMe storage; whenever I do any form of heavy writing to a zvol (be it qemu-img or a storage migration), I see extremely high load averages and the machine gets quite unresponsive. I think my saving grace is having 64 CPU threads on hand; otherwise I think the machine would be completely unresponsive.

I've seen comments around limiting the number of zvol threads, as well as setting zvol_request_sync, and/or disabling compression on the pool.

Can someone steer me in the right direction, perhaps a safe ZFS thread limit to start with?

shaneshort commented 4 years ago

@fixyourcodeplease222 it looks like I've got 'none' on all the zvols:

# cat /sys/block/zd*/queue/scheduler
none
none
none
none
none
none
none
none
none
none
none
none
none
none
none

shaneshort commented 4 years ago

The current proposed solution of zvol_request_sync=1 certainly lowers the load average, but I'm not sure how much that will help if all of the I/O is consumed by one VM. I'll be testing some stuff tonight when it's less intrusive.

@akschu How did this end up going for you? Did you see any improvement?

richardelling commented 4 years ago

@shaneshort start here http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

then, once you've determined whether you care about the "load average" and know what it is showing for your experiment, look at:
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_sync_taskq_batch_pct
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_zil_clean_taskq_nthr_pct
https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zio_taskq_batch_pct
then revisit https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zvol_threads

There are many untunable parts and pieces that we know will not scale well to large CPU counts. So if you notice the thundering herds and can isolate them to specific operations (reads, writes, ZIL, prefetch), it would help in developing more scalable algorithms for thread counts.
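
A quick way to dump the current values of the parameters mentioned above (names as listed in the wiki; which of them exist depends on the module version):

for p in zvol_threads zvol_request_sync zfs_sync_taskq_batch_pct zfs_zil_clean_taskq_nthr_pct zio_taskq_batch_pct; do
    printf '%s: ' "$p"; cat /sys/module/zfs/parameters/$p
done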

fixyourcodeplease commented 4 years ago

@shaneshort I meant the disk scheduler of the actual storage devices, not the zvol.
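
For example, for the backing devices rather than the zvols (device names here are just examples):

cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/scheduler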

shaneshort commented 4 years ago

Hi @richardelling,

Thanks for the input. I'd actually read Brendan's post before; the load average comment was simply to mention that I was seeing similar behaviour. I actually had user complaints of poor performance when doing a simple single-threaded volume copy from one pool to another.

Best I can tell, things hum along nicely (and the ARC starts to fill), then the copy stalls and I see the load average skyrocket, along with other performance suffering. My best guess is that the zvol thread count is simply overwhelming the underlying I/O subsystem with writes (on both MLC and TLC based flash volumes), which adversely affects its ability to read/write from the pool.

As this machine is in production and moving workloads off it is a bit of a pain, I'll see if I can set up a similar machine in my lab and reproduce it there. For now I plan to limit the zvol threads to 8 as well as set zvol_request_sync to 1 to see if that helps in the interim.

Thanks again for your reply, it's certainly appreciated!

PowderedToastMan commented 4 years ago

This is actually quite easy for me to reproduce. Because of I/O contention, I decided to migrate all my VMs to raw image files, away from zvols. Example:

qemu-img convert -f raw -O raw /dev/zvol/tank/testvm testvm.img

The above zvol was 40GiB and the conversion basically shut down the entire zpool until it completed. This is on a pair of Samsung 1TB SSDs in RAID1.
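
In line with the earlier oflag=direct suggestion, the same conversion can be run with host caching disabled on both ends, e.g. (a sketch using qemu-img's cache options; whether it helps in this situation is untested):

qemu-img convert -f raw -O raw -T none -t none /dev/zvol/tank/testvm testvm.img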

richardelling commented 4 years ago

@shaneshort the number of threads is not related to the number of concurrent I/Os submitted to devices. The latter is controlled by the ZIO scheduler and, by default, capped at 10 I/Os per device per I/O class. Therefore, it is unlikely to be related to the performance problem you see. See https://github.com/zfsonlinux/zfs/wiki/ZIO-Scheduler and, of course, the number of issued I/Os is readily available in iostat or, better yet, node_exporter and telegraf.

Finally, if you set zvol_request_sync=1, then only one zvol thread will be used. Limiting the number of threads to 1 can reduce the thundering herd, since there won't be a lot of threads waiting to become active, but this can be very difficult to measure because the issues it causes are not directly visible to the OS. It will also mean your number of outstanding read I/Os will be directly limited to <=1 per device per volume. Whether this helps your situation or not is difficult to predict.

NB, the issue tracker is not an appropriate place for discussion. The email list is better.

shaneshort commented 4 years ago

I'd just like to leave a comment on this: through many different iterations of testing, etc., I've had to abandon zvols for storage at the moment, as any attempt to do any kind of sequential writing tanks the machine and causes IO stalls in other VMs. Storing raw files inside a ZFS directory doesn't have this issue.

I've been able to replicate this issue on multiple machines now, so my conclusion is that zvol on ZoL has some kind of scheduling defect making it unusable for my application.

If someone would like to work with me on attempting to find a solution, let me know.

MichaelHierweck commented 3 years ago

In my tests the situation could be improved by opening the ZVOL with O_DIRECT. Furthermore (at least in my setup) the load spiked to zvol_threads + n, with n between 1 and 3. (Linux 5.9, ZFS 0.8.6)

plantroon commented 3 years ago

I'd just like to leave a comment on this: through many different iterations of testing, etc., I've had to abandon zvols for storage at the moment, as any attempt to do any kind of sequential writing tanks the machine and causes IO stalls in other VMs. Storing raw files inside a ZFS directory doesn't have this issue.

I've been able to replicate this issue on multiple machines now, so my conclusion is that zvol on ZoL has some kind of scheduling defect making it unusable for my application.

If someone would like to work with me on attempting to find a solution, let me know.

I just encountered this on current Debian Sid when trying to use zvols for my KVM machines. When running a benchmark in the guests, the whole host freezes up. At first I thought it was a RAM issue (the OOM killer was triggered), so I massively reduced the amount of hugepages and ARC and dropped caches to ensure I had enough free RAM. It didn't help; the host still freezes, especially on writes (tested with dd and CrystalDiskMark in a VM that lives on this storage). The disk being benchmarked is NVMe (Intel 660p), if that could be a problem.

I changed the module parameter zvol_threads from 32 (default) to 2 and it runs much better (not perfectly though).

Distro: Debian Sid (up-to-date as of 2021-01-13), Kernel: 5.10.0-1, zfs/spl: 2.0.1-1

I was wondering: could this be because my system only has 2 cores, so running 32 threads just produces bad results?

colttt commented 3 years ago

I have similar issues. The ZFS storage (Debian stretch + backports) is connected via iSCSI (LIO) to VMware ESXi, and a lot of machines (~120) are running. It works fine until I try to restore a machine (with Veeam backup): the load goes up, the machine becomes unresponsive, the LUNs become inaccessible from the (3) ESXi servers, and then all the VMs crash.

image

image

shaneshort commented 3 years ago

I have similar issues. The ZFS storage (Debian stretch + backports) is connected via iSCSI (LIO) to VMware ESXi, and a lot of machines (~120) are running. It works fine until I try to restore a machine (with Veeam backup): the load goes up, the machine becomes unresponsive, the LUNs become inaccessible from the (3) ESXi servers, and then all the VMs crash.

Yeah, I've basically given up on zvols with ZoL; it seems they're broken and there's no real interest in getting them fixed. I might suggest using OmniOS or something Solaris/BSD-based.

colttt commented 3 years ago

I'm looking forward to bcachefs ;-)

plantroon commented 3 years ago

could the ones having problems try Proxmox PVE on the same hardware? It uses ZFS zvols for VMs and may not (?) suffer from this.

shaneshort commented 3 years ago

could the ones having problems try Proxmox PVE on the same hardware? It uses ZFS zvols for VMs and may not (?) suffer from this.

I first reported the issue using proxmox.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

shaneshort commented 2 years ago

Please don't close this issue; to the best of my knowledge it's very much still a problem, with no clear fix or workaround apparent.