openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

[Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs #11407

Closed Binarus closed 2 years ago

Binarus commented 3 years ago

System information

Type | Version/Name
Linux | Debian
Distribution Name | debian
Distribution Version | buster (10.7)
Linux Kernel | 4.19.0-12-amd64
Architecture | amd64
ZFS Version | OpenZFS 2.0.0
SPL Version | (SPL integrated in OpenZFS 2.0.0)

Describe the problem you're observing

Setup

  • Supermicro X10DRU-i+
  • LSI 9361-8i connected to a 6 / 12 Gbps SATA/SAS backplane
  • 2 x Seagate ST4000NM000A (denoted sda and sdb), connected to the backplane
  • 1 x Seagate ST4000NM0035 (denoted sdc), connected to the backplane
  • 128 GB RAM (ECC, of course)

sda and sdb make up a mirrored ZFS VDEV. The O/S boots from this VDEV. There is only one pool, called rpool. rpool does not contain any other VDEVs besides that mirror. The root file system is on rpool/system.

There is no swap file on that system (yet).

rpool has been created using the following command:

zpool create -o ashift=12 -o altroot=/mnt -O acltype=posixacl -O canmount=off -O checksum=on -O compression=off -O mountpoint=none -O sharesmb=off -O sharenfs=off -O xattr=sa rpool mirror /dev/disk/by-id/ata-ST4000...-part1 /dev/disk/by-id/ata-ST4000...-part1

That is, the pool and the VDEV have ashift=12.

rpool/system has been created using the following command:

zfs create -o aclinherit=passthrough -o acltype=posixacl -o atime=on -o canmount=on -o checksum=on -o compression=off -o mountpoint=/ -o overlay=off -o primarycache=all -o redundant_metadata=all -o relatime=off -o secondarycache=none -o setuid=on -o sharesmb=off -o sharenfs=off -o logbias=latency -o snapdev=hidden -o snapdir=hidden -o sync=standard -o xattr=sa -o casesensitivity=sensitive -o normalization=none -o utf8only=off rpool/system

We further have created a ZVOL using the following command:

zfs create -b 4096 -o checksum=on -o compression=off -o primarycache=metadata -o redundant_metadata=all -o secondarycache=none -o logbias=latency -o snapdev=hidden -o sync=standard -V 100G rpool/zvol-test

An ext4 file system on that ZVOL is mounted on /blob.

sdc contains a partition with a normal ext4 file system which is mounted on /mnt. That file system just contains several dozen ISO files (average size about 6 GB).

On that machine, nothing runs other than the standard services the distribution installs. Notably, there is no VM running and nothing else which could produce substantial workload.

In this state, when starting watch -n 1 zpool iostat -ylv 1 1 and watching it for a while, there is indeed nearly no load on the ZFS disks. Every few seconds, some kilobytes hit the VDEV, which is expected.

Copying to the dataset (not the ZVOL): No problem

Now we open the iostats in one terminal window (watch -n 1 zpool iostat -ylv 1 1) and start to copy ISO files from sdc onto the ZFS dataset rpool/system in another terminal window (rsync --progress /mnt/*.iso ~/test, where ~/test is part of the root file system and thus is on rpool/system).

While the copy runs, rsync shows a few drops in bandwidth every now and then, but there are no noticeable holdups, and the drops in bandwidth are short. Likewise, zpool iostat shows that the two disks in the VDEV are hit with data rates which could be expected. The changes in disk load reported by zpool iostat are surprisingly high, though (the load constantly jumps between something like 30 MB/s and 300 MB/s), but there are no real holdups either. In summary, the copy on average runs at over 100 MB/s and does not stall for long.

We interrupted that test after 30 GB or so because we didn't expect anything new from letting it run longer. However, we repeated it several times, each time copying different ISO files and rebooting beforehand. The behavior was the same each time.

Copying to the ZVOL: Problem

When we do exactly the same thing, but copy to the ZVOL instead of the dataset (rsync --progress /mnt/*.iso /blob), the situation changes. rsync initially shows the copy running at roughly 190 MB/s for a few seconds, then it stalls. Thereafter it continues copying for a few seconds at the rate denoted above, then stalls again after a few seconds, and so on.

The problem is that the holdups last for a long time where absolutely nothing happens, up to several minutes (!). However, zpool iostat shows that the two ZFS disks are under heavy load during this time, constantly (more or less) being hit with over 100 MB/s. Even when we interrupt copying by hitting Ctrl-c in the terminal window where rsync runs, this high load lasts for several minutes until everything returns to normal.

There must be extreme write amplification somewhere, the amplification factor being somewhere between 5 and 10. For example, if we copy 40 GB that way, this would normally take about 5 minutes. But actually it takes at least half an hour, although the ZFS disks are under heavy load all that time.

For that reason, ZVOLs are currently just not usable for us, which imposes a major problem. What could be going on there?

Our own thoughts and what we have tried already:

First, we'd like to stress again that the ZVOL test did not happen within a VM. The problem is definitely not due to QEMU or (para)virtualization of data transfer.

Secondly, I am aware that it might not be the best idea to have ZFS running on disks which are attached to a RAID controller like the LSI 9361-8i, or to have it running on hybrid disks like the ones we have. However, we have configured that controller to JBOD mode, and the O/S sees the disks as individual ones as expected. But the ultimate key point regarding possible hardware problems is that copying large amounts of data to the ZFS dataset (rpool/system) works as expected. If the problems with the ZVOL were due to hardware, we would have the same problems with the dataset; this is not the case, though.

Thirdly, the problem is not due to ZFS versions. Debian buster comes with ZoL 0.7.12, and we originally have noticed the problem there. We desperately need ZVOLs working, so we have installed OpenZFS 2.0.0 on that machine, which did change exactly nothing with respect to that problem.

As a further test, we created the ZVOL with volblocksize=512 and did the tests again. Again, nothing changed. We repeated the process with volblocksizes of 8192, 16384 and 128k. Again, no luck: Maybe it stalled a few seconds earlier or later, longer or shorter in each test compared to the others, but the general situation remained the same. Between the stalls, the copy ran a few seconds with expected speed, then it stalled for a lot of seconds, mostly even a few minutes while iostat was showing a constant data rate of roughly 100 MB/s for each disk, and so on. After interrupting the copy, both ZFS disks continued to be hit with a data rate of 100 MB/s or more for several minutes.

Then we tested the ZVOL with sync=disabled. That didn't change anything. The same goes for primarycache=all (instead of metadata) (but at least this was expected), and for logbias=throughput (instead of latency).

Next, we thought that it may have something to do with the physical sector size of the ZFS disks being 512 bytes, while the pool (and the VDEVs) had ashift=12. Therefore, we destroyed the pool, re-created it with ashift=9, re-created all file systems / datasets as described above, and did all tests again. Once again, this didn't change anything.

We then went back to the original pool with ashift=12 and used it for the further tests. At this point, we were out of ideas about what to do next, so we read about the ZFS I/O scheduler (https://gist.github.com/szaydel/6244302) and tested a large number of combinations of zfs_dirty_data_max, zfs_delay_scale, zfs_vdev_async_write_max_active, zfs_vdev_async_write_min_active, zfs_vdev_async_write_active_max_dirty_percent, and zfs_vdev_async_write_active_min_dirty_percent.

To our surprise, the last five of these barely influenced the behavior. However, the first one (zfs_dirty_data_max), which originally was set to 4 GB, changed the situation when we set it to a low value, e.g. 512 MB. The improvement was that there were fewer long-lasting holdups: there were actually more holdups overall, but all of them were so short that it became acceptable. However, the average data rate did not increase, because the transfer rate rsync reported was now limited to about 30 MB/s, mostly hanging around 10 MB/s or 20 MB/s. There were no phases with high data rates any more.

So the copying was more "responsive" with low values of zfs_dirty_data_max, but that didn't help because the data rate per se was drastically limited. In summary, changing the I/O scheduler parameters which are explained in the document linked above did not lead anywhere.
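For reference, all of these knobs can be changed at runtime via /sys/module/zfs/parameters; a single combination from such a sweep looks roughly like the following (the values are placeholders close to the defaults, not a recommendation):

# one illustrative combination - values are placeholders, not recommendations
echo $((512*1024*1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max
echo 500000 > /sys/module/zfs/parameters/zfs_delay_scale
echo 10 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 60 > /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent
echo 30 > /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent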

The last thing we were looking into was zfs_txg_timeout. Setting it to a lower value didn't improve the situation with copying to the ZVOL (but increased the load which hit the ZFS disks when the system was completely idle). Setting it to a higher value didn't improve copying either (but reduced the load on the ZFS disks when the system was idle).

Now we are completely out of ideas. We probably could look into other parameters of the ZFS module (/sys/module/zfs/parameters) or the disk drivers (/sys/block/sdx). But this would be just wild guessing and a waste of time. Therefore, we are hoping that somebody is willing to give us some hints.

What we did not try, and why not

zfs_arc_max is set to 4 GB on that system, and we did not test larger values for the following reasons:

  1. That parameter is about reading, not writing, and the copy source is an ext4 partition of a physical disk, so no ZFS parameter would have any effect on the copy source.
  2. We clearly have a problem with writing here, not with reading (remembering that copying to the normal dataset (not the ZVOL) works normally).
  3. When we began working with ZFS some years ago, the first thing we had to solve was a system which started normally, but then became totally unresponsive and finally totally locked up within minutes. The cause of that problem was that ZFS was eating up all available RAM for its ARC cache until the machine crashed or hung. Since then, we always limit the ARC size (and never ever had any stability issues or crashes with ZFS again).
  4. Our goal is to run a bunch of VMs with ZVOL storage (the tests described above are just, eehm, tests before we put even more effort into switching completely to ZFS). The number of VMs and the memory they will be given is precisely known. It would not make any sense to test larger ARC sizes, because the ARC size at the end of the day couldn't be much larger than 4 GB.

We did not try to use a secondary cache (L2ARC). Again, the copy source is not on ZFS, and therefore this wouldn't make any sense, and furthermore, we have a writing problem here, not a reading problem.

We did not try to use an SLOG. This would not make any sense, because one of our tests was to set sync=disabled on the copy destination ZVOL, and this did not change the slightest bit in the behavior observed. Therefore, we know that our problem is not due to sync writes, and thus, an SLOG wouldn't improve the behavior.

Describe how to reproduce the problem

Install a system similar to the one described above, issue the commands described above, and watch the long-lasting holdups in the terminal window where rsync runs and the heavy disk load zpool iostat shows in the other terminal window, leading to high disk wear and low bandwidth.

Since it is not easy to set up a system like ours, we are willing to give remote access to one of these systems if somebody is interested in investigating the problem. In this case, please leave a comment stating how we can get into contact.

Include any warning/errors/backtraces from the system logs

If somebody tells us what exactly is needed here, we'll immediately do it :-). We guess zpool iostat or other tools produce output which is more valuable than the log files, but neither being Linux nor ZFS experts, we are a bit lost here. Notably, we don't know how to operate dtrace or strace properly. If somebody tells us what to do, we'll try our best.

IvanVolosyuk commented 3 years ago

You might have to decide what you value more - consistency or throughput. Decreasing zfs write buffers (zfs_dirty_data_max) will give you more consistency - consistently bad write speed with lots of TXGs.

If you want throughput, you don't want to measure it using minimum write speed during the write operation or seconds without visible progress (effectively what you seem to be doing). Instead I would suggest to pick a smaller test set, but write it fully and measure the time it took for completion.

For throughput I would pick larger volblocksize - default 8k should work better, keep default zfs_dirty_data_max, enable compression - it will benefit slow disks, sync=disabled to avoid wasting disk time on zil, primarycache=all to cache ext4 metadata.
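Something along these lines could be a starting point (the zvol name is just a placeholder, and lz4 is only one reasonable compression choice):

# hypothetical test zvol combining the suggestions above (8k is the default volblocksize)
zfs create -b 8192 -o compression=lz4 -o sync=disabled -o primarycache=all -V 100G rpool/zvol-test2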

Because of the in-memory write cache, initial writes will look faster and then stall when the in-memory buffers hit the hard limit while the disks are busy writing back data to free up RAM for more dirty data. But if you measure the total copy operation time, this is irrelevant for the overall throughput number you get.

Also, there should be minimal write amplification when you copy large ISO files.

Personal notes: with qemu you can use raw files, which can give slightly better performance than zvol in some cases. I use: -drive file=/somefile.img,format=raw,id=disk,if=none,cache=none,aio=threads,discard=unmap

Binarus commented 3 years ago

@IvanVolosyuk Thank you very much for your help.

Your first comment (which you have removed) proposed to use logbias=throughput instead of logbias=latency. We have to apologize that we forgot to mention that we already had tested this setting, too, but it didn't change anything. I'll add this in the description of the issue, i.e. to our first post.

With respect to your other proposals, I guess I'll have to explain some specialties regarding our setup and goals:

You might have to decide what you value more - consistency or throughput.

We understand that there always is a trade-off. But what we are talking about here is a decision between holdups of several minutes and extreme disk wear caused by extreme write amplification on one side, and a data rate of 5% to 10% of what the hardware is able to provide on the other side. Trade-offs of such magnitude are in no way acceptable.

If you want throughput, you don't want to measure it using minimum write speed during the write operation or seconds without visible progress (effectively what you seem to be doing).

The problem here is that the holdups are taking so long and are putting the disks under such heavy load that the system cannot be used reasonably during that time. In my first post, I have explained that we did the tests without any VMs running. This is true, because we wanted to rule out any issues related to QEMU, KVM and block device drivers.

However, of course, we also did additional tests with running VMs. It turned out that the bad situation, notably the holdups, could be provoked by just copying large files from the third disk to the VM ZVOL storage in a VM, and that these holdups made the other VMs freak out - they just couldn't write data to their (virtual) storage as needed because they hit timeouts after one or two minutes.

This basically means that you can have only one VM on your server (provided you want it to be on a ZVOL), unless you are willing to put your data at risk.

Instead I would suggest to pick a smaller test set, but write it fully and measure the time it took for completion.

The outcome would be interesting. However, it is not our use case; more precisely, we have other important use cases. One of them (an important one) is to have multiple VMs running on ZVOLs, where at least one VM will be used to copy large amounts of data, scattered across a few files, from a third disk to virtual ZVOL storage. It is completely unacceptable to have the other VMs freak out then, putting their data at risk.

Therefore we first have to solve the problems described in the first post before we proceed.

I have mentioned VMs solely to explain our issue in greater depth, and why it definitely is a show stopper in our case. Still, I would like to keep VM related problems out of this discussion, because the issue exists without any VM running, and adding VMs surely won't improve the situation.

For throughput I would pick larger volblocksize

This is one of the tests we did and described. We tested 512, 4096, 8192 and 16384 bytes as well as 128k. None of these settings changed the behavior in any way.

keep default zfs_dirty_data_max

Changing it was only for testing. Of course, we started testing with default values.

enable compression

We have compression disabled because (at a later stage) the ZVOL data will be encrypted (from within the VMs). Given that, compression won't do any good.

sync=disabled

We already have tested this setting without noticing any change in behavior (described in the first post I guess).

primarycache=all to cache ext4 metadata

Could you please elaborate? How could a ZFS setting influence reading from the third disk, which is not on ZFS? Did we miss something?

Apart from that, we already had tested primarycache=all, but it didn't change anything either. IMHO, this is expected because primarycache relates to reading data, while we obviously have a problem with writing.

But if you measure the total copy operation time, this is irrelevant for the overall throughput number you get.

Agreed, but the holdups would make VMs freak out (and hence, put data at risk) if we had several VMs running, which will be the case later.

Plus, the extreme write amplification we obviously experience will destroy our disks in no time.

Also, there should be minimal write amplification when you copy large ISO files.

This is exactly what we were convinced of when we began the tests. Imagine our surprise ...

We are seriously thinking of making a video which shows the two terminal windows for 10 minutes or so. Perhaps somebody could make sense of it. We'll first have to look for appropriate screen recording software, though (must be able to record cygwin terminal windows under Windows 10 at a reasonable frame rate (e.g. 10 frames / sec)).

Cheers,

Binarus

IvanVolosyuk commented 3 years ago

With primarycache=all for zvol - the filesystem on zvol (ext4 I assumed) will have its metadata cached in ARC.

I reproduced similar behavior with a copy of your settings, and with the set of changes I suggested I got some improvement when copying large files to zvol/ext4.

I think what you missed in your tuning is that the filesystem on the zvol will accumulate a lot of dirty data, as you have a lot of RAM. It will try to write it back when the Linux kernel decides that it has too many dirty pages. You can tune this down to see if it helps with write consistency, e.g. echo $[128*1024*1024] > /proc/sys/vm/dirty_bytes. This forces aggressive writeback in the filesystem on the zvol. Try this together with the other suggestions I gave before. It made a big difference for write consistency in my setup.

devZer0 commented 3 years ago

this is not really new to me, i have seen quite a few reports of zvol access performing much worse than access to ordinary files on zfs datasets - and it confirms my own negative experiences with zvols, which also include lockups/stalls etc.

this is the reason why i have completely avoided using zvols on proxmox for quite a while (they are still the default there)

see https://bugzilla.proxmox.com/show_bug.cgi?id=1453 or https://github.com/openzfs/zfs/issues/10095 for example

dswartz commented 3 years ago

I'm sadly familiar with awful zvol performance.


sempervictus commented 3 years ago

ZVOLs are pretty pathologically broken by design. A block device is a range of raw bits, full stop. A zvol is an abstraction presenting such a range but mapping it across noncontiguous space, with deterministic logic involved in linearizing it. So architecturally, it inherently will be slower to resolve a request. The fact that request paths elongate with data written prior to the zvol amplifies and exacerbates the poor design.

Atop that, zvols are almost never considered when changes are introduced, with performance regression after regression going into zfs for years and maintainers never having time or interest in the feature. You can find prior issues on this where we've discussed how the entire pipeline isn't even optimized for ssd, much less nvme, and the expected-by-design iowait hampers not only the dmu but multiplicatively the zvols atop it. I've got benchmarks from a few years back showing drastic reduction in throughput after a zvol is filled once (write a G to a G-sized zvol and then write to it again - it will make you sad).

IMO, zvols need a rethink from the ground up to actually be as thin an abstraction atop the bit ranges handled by the DMU as possible, with consistent throughput as a primary design goal and os-native block device interfaces (SG?) to avoid problems like you have today if you map SCST atop a zvol and beat it up with 100 random writers. Problem is, that's a lot of work for a shrinking number of consumers, because businesses have moved away from using iscsi zvols for one (iscsi in general, but zvols are now known as garbage for business use), and because other tech like ceph actually keeps up with hardware development and optimizes for modern storage busses and media (they dropped zfs as a backing store way back).
Unless the powers that be put their weight behind making zvols a primary member of the ecosystem again, we'll keep seeing issues like this every year. Thanks for filing this one, I'm frankly tired of begging for this to be resolved. Since @ryao left these discussions there's been no real improvement, and in the end the removal of the sg layer may have actually made performance worse (~0.6.4). I tried to force-sync the virtual devices way back to make writes more consistent from db workloads, but really they just need to behave like proper disks with the full range of scheduling parameters and commit/read semantic adjustments from the consumer side and a much thinner/faster underlying implementation. Anyone got a really good storage dev with time on their hands and a hefty budget to fund the work? Semper Victus would be open to joining others from the community to fund a bounty project to un... this mess, if maintainers agree that zvol performance will be a primary consideration when adopting new features or merging commits so the money and effort aren't wasted when draid hits a tag or whatever. If there are any takers, we'd even consider trading an engagement for a completed PR + $0.01 to bind a contract (anyone who contracts red or blue teams knows the cost associated) - feel free to reach out if this sounds appealing and we'll work out a scope.

sempervictus commented 3 years ago

@Binarus: for use under Qemu/libvirt, we've found that the most consistent throughput is achieved by directly mapping the ZVOL to the VM as a virtio-scsi device a la

      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap' detect_zeroes='off'/>
      <source dev='/dev/zvol/<pool_name>/<path_to_volume>' index='1'/>

This removes a lot of the intermediate buffers/copies, but still works without the ZVOL having proper SG interfaces. It allows the ZFS pipeline to deal with zeroes/compression natively, with AIO (io=native) between the visor and block device. Still a hack, not a solution, and still experiences the write amp and other issues, but at least thins out the visor interaction so you're not seeing amplification of amplifications. You may also want to revert 5731140eaf4aaf which i had intended to help with the performance degradation resulting from interactions with the linux-tier write merge and subsequent rewrites of the merged blocks in the ZVOL. If you have a lot of async writes, it should raise the peak performance, but beware of degradation over time and potentially deeper valleys.

prgwiz commented 3 years ago

The following settings helped us:

options zfs zfs_arc_min=8589934592
options zfs zfs_arc_max=106300440567
options zfs zfs_prefetch_disable=1
options zfs zfs_nocacheflush=1
options zfs zfs_arc_meta_limit_percent=95


sempervictus commented 3 years ago

By the way, having a SLOG is rather important for both latency and reducing fragmentation over time on sync writes. I wonder if the new special allocation classes could be used in some way to provide special short codepath areas for ZVOLs...?

sempervictus commented 3 years ago

@behlendorf @ahrens - with the new special allocation classes targeting specific workloads to VDEVs, could the paradigm be inverted similarly to how small blocks work in order to provide an allocation "arena" on all vdevs to be used for "intended to be constant latency" volume operations? Basically a storage slab with performance-oriented semantics managed by a subset of the DMU dedicated to ZVOLs. It might double as a safer (thinner?) proving ground for strategies and code to be pulled into the full ZPL as well.

Binarus commented 3 years ago

@IvanVolosyuk Thank you very much! Your comment was very helpful.

With primarycache=all for zvol - the filesystem on zvol (ext4 I assumed) will have its metadata cached in ARC.

I see. Thank you. You are right, the ZVOL is formatted with ext4 as well. However, we already had tried this, but to no avail. (I guess I had described it in my first post, which is why I initially didn't get what you meant).

I think what you missed in your tuning is that the filesystem on the zvol will accumulate a lot of dirty data, as you have a lot of RAM. It will try to write it back when the Linux kernel decides that it has too many dirty pages. You can tune this down to see if it helps with write consistency, e.g. echo $[128*1024*1024] > /proc/sys/vm/dirty_bytes

Now that was a real game changer. Thank you very much for that tip. We will do further research with this and similar settings: It didn't solve the problem yet, because it decreased throughput drastically. But at least, it made the system responsive again when copying large files to the ZVOL. We achieved further improvement by also setting dirty_background_bytes, dirty_writeback_centisecs and a few more to appropriate values.
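For anyone following along, these are the kinds of Linux writeback knobs we are experimenting with; the values below are illustrative placeholders, not the values we finally settled on:

# placeholder values for the writeback sysctls mentioned above
echo $((128*1024*1024)) > /proc/sys/vm/dirty_bytes
echo $((64*1024*1024)) > /proc/sys/vm/dirty_background_bytes
echo 100 > /proc/sys/vm/dirty_writeback_centisecs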

I hope that we will be able to work out a good balance between throughput and responsiveness, using various parameters.

We didn't come to the idea to change these parameters because everybody on the net says that ZFS circumvents the page cache. But I guess that this is true only for reading.

So let's see whether I get this right now:

When an application writes to a file system on a ZVOL (without O_SYNC and O_DIRECT), the data first goes into the normal page cache of Linux, and from there into the ZFS write cache (whose size is set by zfs_dirty_data_max and whose behavior is governed by the ZFS IO scheduler)?

If this is true, it explains the bad writing performance with this constellation, and tuning the six important ZFS IO scheduler parameters doesn't make much sense, because bad behavior will arise from the OS page cache, not the ZFS write cache. Did I get this right?

Binarus commented 3 years ago

@devZer0 @dswartz Thank you very much for your comments.

I already had seen the reports you mentioned. But I had the feeling that OpenZFS 2.0.0 would be a sort of restart or major improvement. I reported the issue again to let the developers know that the new version at least on our systems did not improve ZVOL behavior. In that sense, our report could be considered a confirmation of the issue for OpenZFS 2.0.0.

Furthermore, in one of my next comments, I'll describe another test which we did in the meantime and which is rarely described in other posts.

Binarus commented 3 years ago

@sempervictus Thank you very much for the many valuable comments! A few remarks:

ZVOLs are pretty pathologically broken by design. A block device is a range of raw bits, full stop. A zvol is an abstraction presenting such a range but mapping it across noncontiguous space, with deterministic logic involved in linearizing it. So architecturally, it inherently will be slower to resolve a request. The fact that request paths elongate with data written prior to the zvol amplifies and exacerbates the poor design.

You are completely right. However, if ZVOLs were usable in any way, we wouldn't care about the flaws in their architecture, and we would have no problem with sacrificing some throughput and responsiveness to use them, because this is technically unavoidable.

But what completely puzzles us is the magnitude of the performance decrease and the fact that writing one large file to one ZVOL stalls the host as well as other ZVOLs and VMs so much that they crash. I'll never get why this happens when writing to ZVOLs, but not when writing to normal datasets. I would have expected that a ZVOL (from ZFS's internal point of view) is just a huge file on the dataset which is presented to the O/S in a special way. Admittedly, there must be some overhead with translating block sizes, dealing with fragmentation etc., but that alone should decrease throughput and responsiveness by let's say 20%, not make the system unusable.

for use under Qemu/libvirt [...]

Thank you very much again - I'll follow that advice.

But currently, we want to keep out VM-related issues. Therefore, we are testing with no VMs running. When there is such misbehavior even without a real workload, we really don't need to even think about VMs. I have mentioned VMs only to make clear that this issue is not academic, but can crash VMs (and the host as well, according to other reports), and put their data at risk.

By the way, having a SLOG is rather important for both latency and reducing fragmentation over time on sync writes.

Unless I have missed something, an SLOG wouldn't help us. One of our tests was to set sync=disabled on the ZVOL, but this didn't change anything. Our current tests are very basic; I guess copying large files via rsync produces a very small fraction of sync requests, so we didn't wonder about the outcome of this test.

with the new special allocation classes targeting specific workloads to VDEVs [...]

That sounds interesting. Could you please give us a starting point? Actually, we don't have deep knowledge of ZFS and never heard about that, but we are curious to learn about everything which could help us out.

sempervictus commented 3 years ago

@Binarus: the performance degradations, or any level of performance quotient, have never really mattered to the project when they commit - AFAIK there's no performance testing for ZVOLs in the test suite, and definitely nothing covering rewrites and long-lived volumes across ZFS revisions. They've been unstable/unusable for production for a long time and appear to have suffered even more in the ZoL -> OpenZFS cycle. My understanding is that the commercial interests behind OpenZFS mostly make their money on database workload optimization, and that they work out of the non-GPL OSes, which have their own distinct block-layer semantics (so maybe they don't suck as much on ZVOLs). If you want to get really upset, make a file in the ZPL, map it to a block device, and observe how much faster a loopback blockdev atop a ZPL file is than a ZVOL.

As far as the SLOG and the comment about using allocation classes: ZFS writes transactional blocks every 5s. When sync IOs are issued by consumers in between those 5s, they have to go to disk. So the SLOG absorbs those sync writes and then shunts them out to the storage VDEVs at the 5s flush. With no SLOG, those writes have to be immediately committed to the storage VDEVs, resulting in jitter/stalls.
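If anyone wants to reproduce the loopback-vs-ZVOL comparison above, a rough sketch might look like this (pool, dataset and zvol names, sizes and the loop device number are all placeholders):

# file-backed loop device on a ZFS dataset vs. a zvol of the same size
truncate -s 10G ~/blockfile.img            # ~ is on rpool/system in this setup
losetup --find --show ~/blockfile.img      # prints the loop device, e.g. /dev/loop0
zfs create -V 10G rpool/zvol-bench         # throwaway zvol for the comparison

# identical write pattern against both, bypassing the page cache
dd if=/dev/zero of=/dev/loop0 bs=1M count=4096 oflag=direct status=progress
dd if=/dev/zero of=/dev/zvol/rpool/zvol-bench bs=1M count=4096 oflag=direct status=progress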

IvanVolosyuk commented 3 years ago

When an application writes to a file system on a ZVOL (without O_SYNC and O_DIRECT), the data first goes into the normal page cache of Linux, and from there into the ZFS write cache (whose size is set by zfs_dirty_data_max and whose behavior is governed by the ZFS IO scheduler)?

If this is true, it explains the bad writing performance with this constellation, and tuning the six important ZFS IO scheduler parameters doesn't make much sense, because bad behavior will arise from the OS page cache, not the ZFS write cache. Did I get this right?

I would imagine ext4 on top of the zvol should use the page cache even if ZFS doesn't use it. Dirty data in ZFS will grow and shrink, thus changing the amount of RAM available for dirty pages in the page cache. I think this can be the cause of the write speed oscillations we see.

Limiting dirty data in the page cache (/proc/sys/vm/dirty_bytes) should help avoid these oscillations and the uncontrolled growth of dirty pages.

sempervictus commented 3 years ago

Adding a filesystem atop the ZVOL makes this entire mess a lot worse - the ZVOL with that nomerge flag should be bypassing the Linux caches just fine, but the FS atop it will make "decisions" about the block-level interactions and impact how the ZVOL actually absorbs written data, and whether it is involved at all in reading data back if the FS atop it has cached the value of the IO request. If you write 2 512b files into the EXT4 layer at the same time/in rapid sequence, they will likely be merged into a 1024b iovec to the underlying ZVOL... Considering that many modern FS already have compression and encryption, their use atop ZFS with the same options enabled will therefore double the amount of time spent encrypting, and increase the compression time until the lower-level compressor "gives up" (as LZ4 will with already compressed blocks IIRC); the same concept applies to all of the caching and related scheduling.

My suggestion is to either keep the Linux FS' out of this, or separate their testing to a different kernel (like map the block device to a VM or an iSCSI target) so you're not (possibly) looking at Schroedinger's cat dancing on the keyboard.

EDIT: by the way, mapping ZVOLs to SCST's or LIO's loopback interfaces to give them the appearance of an SG device "works" up to the point where the overlaid block device appears to consume more IOs than it can push down to the underlying ZVOL, and then "bad things" happen.

sempervictus commented 3 years ago

In regards to the SLOG metadata class to be reserved on normal VDEVs, could the same sort of allocations be made for the size of the refreservation value? This would allow the offload of snapshot data to normal allocation classes when a snap is created by treating the snapshot action as a type of commit, and keep the rewrite overhead down. You'd have some issues finding free space as you get closer to fill levels on the allocated special spaces, but if the "working set" of slabs is all in that class, you're not having to check blocks for snapped status or prior data, as that would all have been offloaded/copied to the "permanent/normal" storage class once a snapshot is made. I think it would also make a good case for changing how the rather weird refreservation prop works - only apply it to the live set, don't reserve "extra space" in snapshots (always thought that to be kinda stupid to be honest, reserving space in snapshots is like padding a squashfs file out to the original disk size), and let it be used to "mean" how much "fast IO" space is reserved. ping @ahrens re ^^ - is that architecturally feasible? Does it violate tenets of operation i'm not considering? Are there some massive engineering challenges involved in this or just a ton of reasonably straightforward work atop the SLOG allocation class PR?

ahrens commented 3 years ago

I think you're saying that you have created a zvol, put some sort of filesystem on top of it (which? ext4?), and then you copy some files to it using rsync. You see that much more data is written to the ZFS storage (according to iostat or zpool iostat) than is "logically" copied (according to rsync). And it takes much longer than if you copy the files to a zpl filesystem. And you see periodic pauses in writes.

Do I understand the situation properly? If so, it sounds like something has gone wrong between the filesystem (ext4?) on top of the zvol and the zvol layer. That said, I'm surprised that volblocksize=512 plus sync=disabled doesn't fix all of those problems. I guess if ext4 is writing to random offsets, the cost of zfs updating the indirect blocks of the zvol could be substantial. In general I think we would need to understand the pattern of writes to the zvol. You could do a test with dd directly to the zvol (dd of=/dev/zvol/..., no ext4 involved) to see that zvol performance is in general reasonable.

sempervictus commented 3 years ago

@ahrens - i've filed many issues here before for ZVOLs with no FS on them showing the same issue using pure dd. Several years ago, the Linux ZVOL implementation was thinned out, with the "upper" half removed to reduce complexity and OS-specific implementation and to hand the IO scheduling down to the ZFS pipeline. However, the internal semantics of ZVOLs do not lend themselves well to constant performance requirements (they slow down, a lot, as they're filled, snapped, etc.), and they don't appear to be tested for performance regressions.

ahrens commented 3 years ago

@sempervictus I haven't noticed the problems you mentioned. We are using zvols with the iscsi target with good performance (after #10163).

It doesn't make sense to me that zvol performance would change much if they've been written to or not -- maybe needing to read the indirect block, or read-modify-write if there are partial-block writes? Presence of snapshots should have no impact on performance (same with filesystems). What specifically about the design of zvols needs to be changed to improve performance? Could you point me to the existing issues that describe the problems you're alluding to?

Binarus commented 3 years ago

@sempervictus Thank you very much again for your explanations and ideas.

As far as the SLOG and the comment about using allocation classes: ZFS writes transactional blocks every 5s. When sync IOs are issued by consumers in between those 5s, they have to go to disk. So the SLOG absorbs those sync writes and then shunts them out to the storage VDEVs at the 5s flush. With no SLOG, those writes have to be immediately committed to the storage VDEVs, resulting in jitter/stalls.

Yes, we have understood that. Therefore, we have set sync=disabled on the ZVOL in question. This should rule out any problems, holdups or jitter due to sync writes, shouldn't it? However, it did not change anything.

Adding a filesystem atop the ZVOL makes this entire mess a lot worse [...]

We have come to the same idea. Please see a few paragraphs below regarding new tests we did in the meantime.

@IvanVolosyuk Thanks again.

I would imagine ext4 on top of zvol should use page cache even if ZFS doesn't use it. [...]

We have come to the same conclusion. Please see a few paragraphs below regarding new tests we did in the meantime.

@ahrens Thank you very much for participating here!

Do I understand the situation properly?

Yes, exactly. However, in the meantime, we also thought that putting ext4 on top of a ZVOL is a bad idea for testing, and conducted other tests; please refer to the paragraphs below.

Regarding volblocksize=512: it didn't improve the situation. But even if it did, it would be hard to use, because it blows up a ZVOL by a factor of 1.5. That is, a ZVOL with 1 TB eats up 1.5 TB of disk space. That overhead gets much better with larger volblocksizes.

In the meantime, we did further tests:

The reason that we put ext4 on the ZVOL in question was that we wanted to see whether we could use ZVOLs as a replacement for physical block devices. Therefore, we partitioned it, put an ext4 file system on one of the partitions and conducted the rsync tests. For those tests, we chose large files because it is clear that we can't expect much from spinning disks if we have a lot of small files.
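For completeness, the setup described above amounts to roughly the following; the partition device name is how udev exposes zvol partitions on our system and may differ elsewhere:

# partition the zvol, create ext4 on the first partition and mount it on /blob
parted -s /dev/zvol/rpool/zvol-test mklabel gpt mkpart primary ext4 0% 100%
mkfs.ext4 /dev/zvol/rpool/zvol-test-part1
mount /dev/zvol/rpool/zvol-test-part1 /blob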

But actually, we are interested in using ZVOLs as VM storage. Once again, we currently don't test with VMs running, or from within VMs, because this would add further variables (QEMU, virtual storage layers and the like). However, several days ago, we realized that we had to employ another test method to simulate the throughput and the latency a ZVOL-backed VM could expect.

Therefore, we now test using dd with oflag=direct and using the ZVOL directly as destination:

dd if=/dev/zero of=/dev/zvol/rpool/zvol-test bs=4K count=2000000 status=progress oflag=direct

and let watch -n 1 zpool iostat -ylv 1 1 run in another terminal window.

This indeed circumvents the page cache (setting /proc/sys/vm/dirty_bytes and its friends to different values does not affect anything) and has improved the situation.

We still experience the disks running under high load for a while after the command above has finished or has been interrupted. We also noticed that the load, while it is high, is still way below what the disks could deliver. Therefore, we are quite sure now that we can further improve the behavior until we are happy with it, by tuning the parameters of the ZFS I/O scheduler.

However, one problem remains: When we start a few VMs (for testing) which are backed by ZVOLs in the same pool, and then run the command shown above, chances are that the VMs freak out. It seems that the load each ZVOL is allowed to put on the disks is not always balanced (which, by the way, leads to a further question which I'll post separately in the discussion forum: how do we tune the I/O scheduler if we have VDEVs with fundamentally different characteristics?). Obviously, putting one ZVOL under load may eventually put the disks under such load that other ZVOLs starve.

We are currently evaluating whether we could use cgroups to solve this problem.
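What we have in mind is roughly the cgroup v2 io controller; the sketch below is purely hypothetical (device major:minor and the limit are placeholders, and we don't know yet how well this plays with ZFS's own writeback threads):

# hypothetical sketch - requires cgroup v2 with the io controller available
mkdir /sys/fs/cgroup/bulkcopy
echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
# throttle writes to ~100 MB/s on the zvol's block device (find major:minor via ls -l /dev/zd*)
echo "230:0 wbps=104857600" > /sys/fs/cgroup/bulkcopy/io.max
echo $$ > /sys/fs/cgroup/bulkcopy/cgroup.procs    # move the copying shell into the cgroup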

To summarize, we have now nearly reached a point where we are happy with the behavior when large files are copied to one ZVOL, circumventing the O/S page cache. There are still holdups which are not nice, but they aren't a show stopper because they are short (usual behavior: throughput first drops to something like 10 MB/s during a few seconds, then copying stalls for a few seconds (but not for minutes), then it continues normally).

Final personal remark regarding ZVOLs vs. normal files as VM storage:

In general, we are not against using regular files instead of block devices as VM storage. In fact, we did that before we switched to ZFS. However, normal ZFS datasets AFAIK still do not honor O_DIRECT, while ZVOLs do. In my personal opinion, it would be a very bad idea to chain VM and host caches, which inevitably happens if O_DIRECT is not honored. Imagine a Windows Server VM with 32 GB RAM hosted on a bare metal server with 128 GB: writes that an application in the VM produces are first accumulated in Windows' internal cache, then go into the host's page cache, then go into the ZFS write cache. That is, you have chained three big caches, each one with its own characteristics and "eigenfrequency", leading to wild oscillations in throughput and latency (and possibly lockups) in the overall system.

You would probably need a PID closed-loop controller to make this work :-) I am aware that there are people who say "Just test it", and this is a valid point of view. But the problem is that, in principle, we can't test every eventuality and every combination of circumstances. Hence, we believe that it's better to avoid setups which are theoretically bad from the beginning, even though they might run satisfyingly for a while.

In other words, we'd like to have O_DIRECT supported, because we will run all VMs with QEMU option cache=none (old syntax) or cache.direct=on (new syntax), respectively, and therefore ZVOLs are our only option.
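
For illustration, this is roughly how we intend to wire it up (simplified; only the disk-related options are shown):

# old-style syntax
qemu-system-x86_64 ... \
  -drive file=/dev/zvol/rpool/zvol-test,format=raw,if=virtio,cache=none
# new-style syntax
qemu-system-x86_64 ... \
  -blockdev node-name=vm0,driver=host_device,filename=/dev/zvol/rpool/zvol-test,cache.direct=on \
  -device virtio-blk-pci,drive=vm0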

ahrens commented 3 years ago

dd if=/dev/zero of=/dev/zvol/rpool/zvol-test bs=4K

Makes sense. Note that this will produce partial-block writes with the default volblocksize=8k, but for the first write (where the zvol is empty) there will be no performance penalty. Subsequent runs on an already-written zvol will do read-modify-writes, so performance will be much worse. Using bs=8k, or a multiple of volblocksize, would avoid that.
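
For example, the same test with a block size that matches the default volblocksize:

dd if=/dev/zero of=/dev/zvol/rpool/zvol-test bs=8K count=1000000 status=progress oflag=direct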

we'd like to have O_DIRECT supported

For the purpose of controlling (minimizing) what's in the ARC cache, I think that O_DIRECT would have the same effect as primarycache=metadata. Could you use that instead?
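
That is, something along the lines of:

zfs set primarycache=metadata rpool/zvol-test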

sempervictus commented 3 years ago

@ahrens: the RMW amplification is severe and is impacted by other factors (like ashift=12 with a volblocksize<8k on a RAIDZ); the presence of a SLOG is night and day. The iSCSI use case is where we've had the most issues, except we don't get COMSTAR here - we have SCST and LIO nowadays, not quite the well-organized storage subsystem hierarchy you might envision from the Illumos world :). The block-level interactions after the removal of those SG interfaces from ZVOLs are not pleasant under SCST. LIO seems to handle it better, but you still generally want to stick an LVM layer between the ZVOL and the iSCSI export. All of this is compounded by the severe performance inconsistencies of ZVOLs underpinning the iSCSI transport, making them usable for VM OS disks in a cloud or something, but not for the data disks under Scylla or something like that (where having the TMR blocks deduped would be really nice, but obviously the dedup resolution overhead would make all of this even worse). We push our iSCSI SAN snaps to a lower tier of backup pools which do have dedup, so we do see the advantages of ZFS at a datacenter level, but the performance impact of all of these layers makes them unusable in a production sense.

filip-paczynski commented 3 years ago

I'm also all too well familiar with performance problems related to ZVOLs... For me ZVOLs seemed like a great solution for VM storage (we use xen), mainly because of slides/videos/diagrams I had seen at the time, which suggested that ZVOLs are more performant due to skipping all the "ZPL overhead" (at least that's what I gathered from the aforementioned sources). It mostly worked before 0.7.x. Afterwards more and more tuning seemed to be required.

I have yet to thoroughly test ZVOLs on 2.0.x; however, I can share some config options which, over the years, seem to work - which means performance is tolerable and VMs do not crash under heavy writes. All of these options/configs are in production use with:

Please note that these configs work for me; they might not work for everyone. Also, the specific options were turned on/modified over the years of exposure to performance-related tickets in this project. Some options might not be viable and/or needed nowadays.

Before I share what I do on VM storage machine, this is what I do not do with ZVOLs:

My general advice on ZFS:

Ok, sorry for the long read. These are the settings that we use; I hope You'll get some improvement by using them:
Linux cmdline / kernel boot parameters: libata.force=noncq scsi_mod.use_blk_mq=0

module options for zfs (modprobe.conf & friends):

# this is of course highly individual
options zfs     zfs_arc_max=6442450944

# increase from 5, it might depend on the number of drives
options zfs     zfs_txg_timeout=15

## this is to remedy earlier issues with xen compatibility
# dev MAJOR number (xen: too high number = problems) 12 is free, https://www.kernel.org/doc/Documentation/devices.txt
options zfs     zvol_major=12

# Prefetch - in my experience it doesn't work all that well on ZVOLs (YMMV)
options zfs     zfs_prefetch_disable=1
options zfs     l2arc_noprefetch=1

#### Aggregation 
options zfs     zfs_vdev_read_gap_limit=262144
options zfs     zfs_vdev_write_gap_limit=2097152

#### ZFS IO scheduler
options zfs     zfs_vdev_async_read_min_active=2  zfs_vdev_async_write_min_active=2 zfs_vdev_sync_read_min_active=4 zfs_vdev_sync_write_min_active=10
options zfs     zfs_vdev_async_read_max_active=32  zfs_vdev_async_write_max_active=10 zfs_vdev_sync_read_max_active=128 zfs_vdev_sync_write_max_active=16

#### SPL
# reduces context switches (?)
options spl     spl_taskq_thread_bind=1
# I have 4 cores on xen's dom0, let's have 3x as many zvol threads
options zfs     zvol_threads=12

### PERFORMANCE ISSUES - TEMPORARY MITIGATIONS
# this might not be needed on 2.0.x
options zfs zfs_abd_scatter_enabled=0

sempervictus commented 3 years ago

@filip-paczynski: Thanks for the input and settings dump. I've not tried the GPL module bit in 2.0 yet, but IIRC it did benefit things in 0.8.4, esp. after Linus went all Bobbit on exports... From an industry viewpoint, the bloody GPL is a poison pill these days, and future OSes will never make this mistake again, especially in single-despot-merge-gate mode. Unfortunately, part of the poison nature is that it cannot be undone without literally raising folks from the grave to get their sign-off, even if you could get everyone to agree. The scsi_mod.use_blk_mq=0 bit is also interesting to me - in most of our workloads, we do the exact opposite, to the point of having our KConfigs default to 1. High-throughput log writers (OSSEC head-end pumping 100k+ ev/m) have required flipping that on some systems due to core starvation. ABD has also been a godsend since the early days - we've been using it since @tuxoko first PR'd it.

@Binarus: The "freak out" you describe may be related to what we've observed with "oversubscription" of the block devices by consumers. This is why we keep LVM atop the ZVOLs for iSCSI, it seems to return enough "backpressure" up the stack to tell consumers the disk can't take any more writes at the moment. My theory on this is that ZFS isn't communicating load properly back to Linux consumers (possibly because it doesnt expect ZVOLs to be as slow as they are), and those consumers submit more requests than will be handled. The fact that it started happening after the SG-layer removal from ZoL ZVOLs seems to support that theory. That said, the abysmal throughput of ZVOLs compared to their backing stores is the real problem - they shouldn't fail to commit so far short of the IOP/write throughput of their backing devices.

@ahrens: In the "big picture" view of ZFS objectives, are there any performance quotients defined for ZVOLs as compared to their backing VDEV? The simplest case being a full-disk zpool on a single SSD which can support 500/500 @ 90K iops having nothing but a ZVOL on it, the real-world case being more like a span of 10 2-nvme pen drive mirrors in some dual-socket AMD-backed whitebox, exporting ZVOLs on bonds of 100 or 200Gbit network links to a bunch of compute nodes via iSER (or iSCSI in the more classic ecosystem)?

Binarus commented 3 years ago

@ahrens Thank you very much again.

Makes sense. Note that this will produce partial-block writes with the default volblocksize=8k [...]

We have used other volblocksizes than 4k only for testing. We are usually using 4k, because we will use VMs with NTFS in them (with default cluster size 4k) and with ext4 in them (which AFAIK also is 4k-centric by default). Please refer to the first post to see how we have created the ZVOL in question and which properties we already varied.

For the purpose of controlling (minimizing) what's in the ARC cache, I think that O_DIRECT would have the same effect as primarycache=metadata. Could you use that instead?

We may be wrong, but we are quite sure that we have a problem with writing, not with reading. By using O_DIRECT, we mainly want to make sure that we bypass the O/S page cache when writing. That seems to work, because with O_DIRECT, changing page cache parameters like /proc/sys/vm/dirty_bytes does not influence the behavior any more.

Without O_DIRECT, they do, and the behavior is catastrophic with default values. I just verified it again: For testing, I started a VM which is backed by a ZVOL, and then started the dd command described above, but without oflag=direct. That produced very high throughput at the beginning (something like 1 GB/s) which continually decreased. After a minute or so, the VM freaked out so much that a client which I had connected to it lost the connection. I then interrupted dd and watched the disks running under full load for more than a minute (150 MB/s on average).

The reason clearly is the behavior @IvanVolosyuk described: First, the page cache gets filled at high speed. When it is full, which takes a while with default settings given our 128 GB of RAM, it starts to write out the data to the ZFS write cache with maximum throughput, and ZFS can do nothing else than write the data to the ZVOL / the VDEV as fast as possible. We don't know why other ZVOLs starve while this happens, but we have a solution:

When we add oflag=direct to the dd command again, that problem does not occur any more.

Therefore, I believe that primarycache=metadata would not change anything with respect to that problem, because it relates to reading, not to writing (please correct me if I am wrong). Apart from that, it is exactly the setting which we started with when creating the ZVOL in question (please see the first post for the other properties), because we believe that this is the right setting for ZVOLs which later should act as VM storage. As described earlier, we also have tested none and all instead of metadata, but this didn't have any effect during our tests.

O_DIRECT clearly is a key point in solving our problem. Too bad that we started our tests with rsync which does not provide an option for O_DIRECT. We needed to switch to dd to test the influence of that flag and saw a drastic change in behavior.

And by the way, now the tunables you have described in your famous article about your ZFS I/O scheduler work as intended. Without O_DIRECT, they barely influenced the behavior and did not help solve the problem, except for one (zfs_dirty_data_max), as described earlier.

Binarus commented 3 years ago

@sempervictus Thank you very much again! Well explained ...

[...] My theory on this is that ZFS isn't communicating load properly back to Linux consumers (possibly because it doesnt expect ZVOLs to be as slow as they are), and those consumers submit more requests than will be handled. [...]

This was our impression as well. However, there seems to be an additional problem: writes to the ZVOLs are not scheduled fairly. Writing to one ZVOL can make other ZVOLs starve. Even if consumers write too fast (because they are misled regarding the capabilities of the storage system), there should still be a fair distribution across all ZVOLs. Probably ZFS should ensure this, but obviously doesn't, at least as long as the normal page cache is active.

Whatsoever, using O_DIRECT improves the situation drastically. We will now move on to tests with multiple concurrent write streams to multiple ZVOLs and see if we can make one of them starve again despite using O_DIRECT. After that, we will look whether we can improve the situation via cgroups.

Binarus commented 3 years ago

@prgwiz Thank you very much for your help!

Regarding zfs_prefetch_disable: Setting it to 1 or 0 didn't make a difference, probably because this parameter is about reading, while we have a writing problem.

Regarding zfs_nocacheflush: We are a bit reluctant with this, because it acts system-wide (well, ZFS-wide, but in our case that means system-wide). Therefore, we would like to keep it turned off and didn't test it. However, we believe that we tested the same effect specifically for the ZVOL in question by setting sync=disabled on that ZVOL. But this didn't change anything either, probably because our current test copies large data files or continuous streams, which causes a lot of async writes and nearly no sync writes.
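
For reference, the per-ZVOL test setting was simply the following (reverting to the default afterwards):

zfs set sync=disabled rpool/zvol-test
# ... run the copy / dd tests ...
zfs set sync=standard rpool/zvol-test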

zfs_arc_min didn't solve the problem. Making it too small made things worse; making it too big did not improve things. We also don't want to make zfs_arc_max as high as yours, because when we put that server into production, the VMs will consume 96 GB (of 128 GB), and thus we have only 32 GB left for everything else (the normal host page cache, the ARC, the ZFS write cache and the host system itself).

We still have to test zfs_arc_meta_limit_percent. However, this again is about reading, not writing, so it probably won't help solve our problem.

IvanVolosyuk commented 3 years ago

Regarding zfs_nocacheflush: We are a bit reluctant with this, because it acts system-wide (well, ZFS-wide, but in our case that means system-wide). Therefore, we would like to keep it turned off and didn't test it. However, we believe that we tested the same effect specifically for the ZVOL in question by setting sync=disabled on that ZVOL.

zfs_nocacheflush is a completely different setting and much more dangerous than sync=disabled. sync=disabled makes ZFS ignore fsync() in the sense that data is confirmed as written before it is actually persisted to media. Because ZFS is transactional storage, ordering of writes is enough for consistency purposes - no reordering of writes will happen, which is a weak form of the fsync() guarantee.

zfs_nocacheflush, in contrast, disables the cache flushes the ZFS implementation issues to the drives internally. There is then no guarantee that a drive will not re-order writes across multiple ZFS TXGs. This can potentially cause data corruption, unlike the usage of sync=disabled. This setting can actually make a difference for you, I think.

As for QEMU, I'm happily using raw image files on datasets with a 128k record size, with the scsi-hd driver and if=none, cache=none, aio=threads, discard=unmap. I also noticed that ZFS causes quite a lot of lag for realtime-priority QEMU. The only solution I found is a preemptive kernel.
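
For illustration, a sketch of what I mean (IDs, paths and the controller wiring are placeholders, not my exact command line):

# raw image file on a dataset with the (default) 128k record size
zfs set recordsize=128k tank/vm
qemu-system-x86_64 ... \
  -device virtio-scsi-pci,id=scsi0 \
  -drive file=/tank/vm/disk0.raw,format=raw,if=none,id=disk0,cache=none,aio=threads,discard=unmap \
  -device scsi-hd,drive=disk0,bus=scsi0.0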

Binarus commented 3 years ago

@filip-paczynski Thank you very much for your long and detailed post! It is much appreciated. A few remarks:

[...] manually compiled, with License: GPL set in the META file to circumvent any potential perf. disadvantages for non-GPL modules [...]

We would never have come up with that idea ourselves, because we were strongly assuming that we would have read about that subject somewhere. However, we will try this for sure.

We'd like to test without compression for the following reason: Our use case (at a later stage) is running VMs on ZVOLs, where the VMs' storage will be encrypted from within the VMs themselves (e.g. a Windows VM with Veracrypt installed in the VM itself). That means that the data on the ZVOL won't be compressible.

We have chosen 4k as block size for the ZVOL (instead of 16k), because we are not so much concerned about swap (in fact, we will even deactivate swap completely on most of our VMs). Instead, we thought that it might be a good idea to make the block size match the cluster size (as it is called in NTFS). In the VMs, there will be NTFS or ext4 file systems; NTFS's cluster size is 4k by default (by changing it, you lose a bunch of useful features), and AFAIK, ext4 is 4k-friendly as well (but I don't know for sure yet).

Regarding SLOG, I am becoming unsure. Everybody in this thread talks about it, although we have reported that setting sync=disabled on the ZVOL in question did not change anything. Did we miss something? How can an SLOG help when O_SYNC is ignored on that ZVOL?

In zpool create cmd I see partitions being used instead of whole drives. Historically, it was heavily suggested to use whole drives, due to performance reasons. I do not know if this advice still holds.

I personally believe that this advice was never due to technical reasons. A while ago, I read somewhere that ZFS should be able to enable or disable the write cache of the disks, and that this would only be possible if it had the disks exclusively. However, this can't be that much of a problem, because with hdparm and friends, viewing the current policy or enabling or disabling the cache is a matter of seconds.
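
For example (/dev/sdX being any of the pool disks):

# show the drive's current write-cache setting
hdparm -W /dev/sdX
# enable or disable the drive's write cache
hdparm -W1 /dev/sdX
hdparm -W0 /dev/sdX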

The true reason for the advice may be that it prevents users from creating misaligned partitions, which would have massive performance impacts. Given that, the advice is still valid. However, we had no choice but to create the partitions manually, because this is a system which has the root file system on ZFS, i.e. which boots from ZFS, and which boots via EFI, not legacy. I can't see how to achieve that when giving the whole disks to ZFS. Of course, we have verified more than once that our partitions are aligned properly.

Regarding repairing: We have had such setups for a while. Therefore, we have prepared a USB key which boots a live Linux containing all the necessary tools and kernel modules. That is, we can attach that stick to our servers, boot from it, and immediately have access to the ZFS system (unless, of course, it has actually been destroyed or hardware is damaged). We will further improve the situation by permanently installing an SSD with a full-blown Linux system (including ZFS tools and modules) in future servers; that SSD will have an ext4 file system. Used SSDs with 128 GB are quite cheap on eBay :-)

Always try to match block size of ZVOL to a block size of FS on ZVOL [...]

This is the reason why we chose 4k as volblocksize.

libata.force=noncq scsi_mod.use_blk_mq=0

Cool, thank you very much. We will thoroughly look into these kernel parameters.

Regarding the other parameters you suggest (again, thank you very much):

We have already set zfs_txg_timeout=60; otherwise, we would see a write load of several GB/hour as soon as one Windows VM runs (without any workload in the VM). We have also reduced zfs_arc_max so that it matches the free memory which will be left when we run the VMs as intended (at a later stage). The ...prefetch... options relate to reading, not writing, don't they? In that case, we couldn't expect an improvement.
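
For reference, such module parameters can also be changed at runtime for quick testing before persisting them via modprobe.d, e.g.:

echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
cat /sys/module/zfs/parameters/zfs_txg_timeout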

We are still in the process of tuning the scheduler options - thanks for publishing yours, which are a good starting point for spinning disks. Unfortunately, we haven't understood the ...gap... options yet, and are currently researching them. The same goes for spl_taskq_thread_bind and zfs_abd_scatter_enabled. We are also unsure about zvol_threads: I believe the thread pool was removed somewhere in 2016, and that this parameter has not had any effect since. On our system, with OpenZFS 2.0.0, it isn't even writable any more (in /sys/...).

filip-paczynski commented 3 years ago

@sempervictus Thanks for the advice on ABD, I will try and test it again (haven't tested it since 0.7.x), and also scsi_mod.use_blk_mq - it might've been set a long time ago, and the reason might've been Xen-related issues.

I fully agree regarding GPL-bonanza... disabling features even in minor kernel versions, it's just crazy ;)

ahrens commented 3 years ago

@Binarus

normal ZFS datasets AFAIK still do not honor O_DIRECT, while ZVOLs do ... we'd like to have O_DIRECT supported

I was responding to this. It sounds like you chose to use ZVOLS instead of ZFS datasets because you can use O_DIRECT with zvols, but you can't use O_DIRECT with zfs filesystems. I agree that when you are using ext4 (or directly accessing /dev/zvol/...?), O_DIRECT would help due to bypassing the page cache. However, write() system calls to ZFS filesystems don't use the page cache, so O_DIRECT would have no impact on the page cache in that case. Therefore, implementing O_DIRECT for ZFS filesystems (i.e. https://github.com/openzfs/zfs/pull/10018) wouldn't help.

filip-paczynski commented 3 years ago

@Binarus Glad to be of some help :)

A few remarks from me:

Our use case (at a later stage) is running VMs on ZVOLs, where the VMs' storage will be encrypted from within the VMs

I get it; however, does Veracrypt initially fill the whole drive with random bits? If not, then You have zeroes, which can be compressed via zle (it compresses only that).

Regarding swaps: I get that it's best to avoid it when possible. I only wrote about it to emphasize that one should avoid 4K blocks.

In the VMs, there will be NTFS or ext4 file systems; NTFS's cluster size is 4k by default (by changing it, you lose a bunch of useful features), and AFAIK, ext4 is 4k-friendly as well (but I don't know for sure yet).

Yes, ext4 is 4k-friendly and indeed NTFS defaults to 4k, however:

Regarding SLOG, I am becoming unsure. Everybody in this thread talks about it, although we have reported that setting sync=disabled on the ZVOL in question did not change anything. Did we miss something? How can an SLOG help when O_SYNC is ignored on that ZVOL?

Partitions vs drives: I have no horse in that race. For me using drives is not a problem, haven't tested partitions vs drives - ever. Let's drop this subject.

Repairing: It's cool to have a pre-prepared rescue OS, I agree. My reservations about ZFS-on-root are mostly due to my history with ZFS - I started at 0.6 or even 0.5 (not sure). Back then ZFS-on-root was considered unstable and difficult to get to work/upgrade. Let's drop this subject.

scsi_mod.use_blk_mq=0

This might in fact be counter-productive, see @sempervictus comment above.

zfs_txg_timeout=60

This is always a balance and workload-specific. For me the rule of thumb is this: I want the data in a txg to take at most 1-2 secs to flush.
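
A rough back-of-the-envelope example (the throughput figure is only illustrative): if the pool sustains about 150 MB/s of writes, a 1-2 second flush target corresponds to roughly 150-300 MB of dirty data, so I would keep zfs_dirty_data_max in that ballpark rather than at its (much larger) default:

# illustrative only - derive the value from the measured sustained write throughput
echo $((300 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max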

The ...prefetch... options relate to reading, not writing, don't they? In that case, we couldn't expect an improvement.

Yes, but if it's mostly VM usage, then the VMs run OSes which have their own prefetch strategies, so I think one should avoid prefetch on top of prefetch. Also, my stats have shown very little effectiveness of prefetching, which is even more important if one has L2ARC (prefetch reads might be turned into writes to L2ARC).

Unfortunately, we haven't understood the ...gap... options yet, and are currently researching them.

My (possibly incorrect) understanding is based on the assumption that it is cheaper to read/write X bytes in "one" disk op than in several, which might be intertwined with reads/writes to entirely different parts of the disk. So, for example, a 1 MB read is not a problem in sequence, but if You split it into 32K pieces and introduce disk-head moves to entirely different parts of the platter in between, You will get worse performance. This might be an entirely wrong way of thinking though ;)

We are also unsure about zvol_threads: I believe the thread pool has been removed somewhere in 2016, and that this parameter does not effect anything since. On our system, and with OpenZFS 2.0.0, it even isn't writable any more (in /sys/...).

sempervictus commented 3 years ago

The sync=disabled thing is not OK; block devices need to respect sync I/O, even if doing so into a write-ahead log (SLOG) underpinning the logical block device. Consider a production environment running (sorry to say this) Oracle RAC atop ZVOLs exported via iSCSI - the bloody things are bit-wise dependent on synchrony and will vomit blood while setting piles of cash alight for every minute they're not running correctly (given the use cases for which such monstrosities are deployed). As far as fragmentation goes - a SLOG (or SLOG metaslabs) is needed so that the short-lived sub-slab-sized allocations from those sync writes (preempting the regularly scheduled write transaction which would bundle several of those sub-slab-sized writes together) don't end up getting temporarily written into some metaslab and removed shortly after, causing holes to be created. Do that long enough and your VDEV looks like a cheesecloth, with the allocator seeking for pieces of space here and there across all of the holes created. Defrag would be nice, but hopefully the new allocation classes just "solve this for all future pools".

filip-paczynski commented 3 years ago

@sempervictus Thanks for the in-depth explanation. The metaslab issues might get really severe; I think I saw a video on this. It can be almost 'catastrophic' - waiting several seconds for a single write, and so on (while IOing like crazy on subsequent metaslabs).

From what I gather, You seem to know a great deal about ZFS internals - can I ask You for input on block sizes (is 4K bad, and why?), and the _gap_ module parameters?

sempervictus commented 3 years ago

I'm no sage, just been in the trenches with this stuff for a few years. @ahrens and @behlendorf are the ones who know what's going on :). The 512>4096 bit is related to the space waste in RAIDZ from ashift=12 and iSCSI block sizing - we've seen oddities with ashift=12 pools and volblocksize=512 ZVOLs (absurd performance degradations over time), which has so many allocators and coalescing buffers in the way that I'm pretty lost as to even where that problem occurs.

prgwiz commented 3 years ago

These settings made a significant difference for us in a very I/O-intense database situation on an all-SSD dataset: vm.dirty_background_bytes=134217728, vm.dirty_bytes=1073741824
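
For example, persisted via sysctl (the file name is arbitrary):

# /etc/sysctl.d/90-dirty.conf
vm.dirty_background_bytes = 134217728
vm.dirty_bytes = 1073741824

or set once for testing with sysctl -w.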

MichaelHierweck commented 3 years ago

Something seems to be horribly broken. (See also #7603).

dd to ZFS file => speed 160 MBit/s, load 2.
dd to ZVOL => speed 60 MBit/s, load 40.
dd to loopback device over a ZFS file (fake block device) => speed 60 MBit/s, load 2.

(Debian 10.6, Linux 5.8, zfs 0.8.5)
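
For anyone who wants to reproduce a similar comparison, a rough sketch (paths, sizes and device names are placeholders, not my exact commands):

# 1) dd to a file on a ZFS dataset
dd if=/dev/zero of=/tank/test/file.bin bs=1M count=10000 conv=fdatasync status=progress
# 2) dd directly to a ZVOL
dd if=/dev/zero of=/dev/zvol/tank/vol0 bs=1M count=10000 conv=fdatasync status=progress
# 3) dd to a loop device backed by a file on the same dataset
truncate -s 10G /tank/test/loopfile.bin
losetup --find --show /tank/test/loopfile.bin   # prints the loop device, e.g. /dev/loop0
dd if=/dev/zero of=/dev/loop0 bs=1M count=10000 conv=fdatasync status=progress
losetup -d /dev/loop0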

sempervictus commented 3 years ago

@MichaelHierweck - yup, sounds about right in terms of CPU abuse and speeds on 0.8.x without a bunch of patching.

For some historical context of ZVOL issues (just the ones i remember):

  1. https://github.com/openzfs/zfs/issues/4265
  2. https://github.com/openzfs/zfs/issues/4097
  3. https://github.com/openzfs/zfs/issues/4464
  4. https://github.com/openzfs/zfs/issues/4512
  5. https://github.com/openzfs/zfs/issues/4880
  6. https://github.com/openzfs/zfs/issues/11455

filip-paczynski commented 3 years ago

@MichaelHierweck I mostly agree. However, my problems mainly center around read/random-read performance. Since I use a SLOG, writes are not a problem, for me at least (note: I run a fairly small setup, a dozen Xen VMs per server, mostly read-centric).

I'm currently trying to work out whether zfetch/prefetch works at all for zvols - so far I think it does not. To mitigate these problems I'm using SLOG and L2ARC (persistent since 2.0.x)

PS: Perhaps this discussion should be moved to mailing lists... I'm not sure if github issues are a proper venue for this kind of dialog.

MichaelHierweck commented 3 years ago

Back in 2019 I worked on a (scientific) paper proposing a simple, redundant storage architecture for virtualization built on top of ZVOL plus DRBD plus QEMU. I did a lot of benchmarks with different settings (mirrored stripe, RAIDZ2, SLOG on/off), backing devices (NVMe, 2.5"/3.5" rotational drives) and workloads. Over the last days I re-inspected my test systems (Linux 4.9 and ZFS 0.7.12) and found that they perform (very) much better than our production systems (Linux 5.4 and ZFS 0.8.6): less overhead with ZVOLs, especially less CPU consumption, and the number of active processes stays in reasonable ranges. Even I/O concurrency between different VMs (ZVOLs) on the same zpool is handled better in the ZFS 0.7 series.

Should we discuss this on the developers mailing list? (Are the ZFS developers interested in getting ZVOLs improved or are ZVOLs out of scope of current ZFS development?)

However, I would like to point out a detail: with 0.8.6 or 2.0.1, even low I/O activity leads to a massive spawn of zvol processes. That does not seem to be the case with 0.7.2. Can someone explain what changed between 0.7 and 0.8?

sempervictus commented 3 years ago

@MichaelHierweck - spawning makes me think you were using the dynamic taskq function, which is a pretty well known performance bottleneck for ZVOLs. All of our iSCSI hosts are spl_taskq_thread_dynamic=0 with a 2:1 ratio of zfs.zvol_threads to host threads for this reason (see the sketch at the end of this comment). I'm not sure why mailing lists keep getting mentioned - it's an outmoded mechanism; even GH is pretty sparse when it comes to project management functions (they've gotten better, but it's no Redmine/OpenProject).

There seems to be no appetite to redo ZVOLs - it's a large effort, requiring many hours of a skilled developer with in-kernel performance analysis and data extraction skills, a software architect, reviewer time... which is hard to find and expensive to execute. With NVMEoF/Weka/etc coming online in a major way, the performance expectations of block media are going to go through the roof this year. ZFS block devices are already laughably slow compared to their backing vdevs, and that gap will only grow. Vestigial status is on the horizon for this function without some commercial entity putting up effort like Datto did with Tom's time for ZFS crypto.

At the very least, we would need someone to spend a lot of time building debug kernels which aren't too "debuggy" (so the tracing components don't mask the bugs), run benchmarks of ZVOL operations under various conditions, compile the data and render flame graphs or the like, and do analysis to identify hotspots, so we know where the problems arise (and then figure out why). Alternatively, a clean-sheet implementation is needed, which is a metric f-tonne of architecture work and problematic, as ZFS is an evolved ecosystem which creates its own constraints. I'd be willing to entertain a bounty for either a proper performance analysis and streamlining effort or a clean slate for ZVOLs (since their consumer interfaces are silly-simple right now, this is actually feasible IMO). We would probably ask other commercial entities in the space to help out (we're a rather small shop - tailored security and infra), but it might be handy if anyone's reading and has the relevant skillset to quote and execute such an effort.
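
To spell the taskq settings mentioned above out as module options (the thread count is just an example for a 16-thread host):

# /etc/modprobe.d/zfs.conf
options spl spl_taskq_thread_dynamic=0
# 2:1 ratio of zvol_threads to host threads
options zfs zvol_threads=32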

beren12 commented 3 years ago

I always give a few GB from each special vdev SSD to a SLOG, but it would be neat if that was done automatically, or something like that.

Binarus commented 2 years ago

At first, a happy new year, and thanks again for bringing us ZFS on Linux!

Sorry for reviving this old thread, but we're re-visiting the problems described above and are still not happy with the situation. We could improve things a little bit by adding a second VDEV (again consisting of a mirror of 2 x 4 TB spinners), though.

However, it's still hard to find a good write-up of the relationships between ZFS's caches and the page cache. So I hope I may ask some more very basic questions which hopefully are not too stupid (I got the impression that so far mainly contributors and experts with in-depth knowledge have participated in this thread :-)).

Question 1

However, write() system calls to ZFS filesystems don't use the page cache, so O_DIRECT would have no impact on the page cache in that case. Therefore, implementing O_DIRECT for ZFS filesystems (i.e. #10018) wouldn't help.

I guess that not having understood this completely is one of the obstacles which keeps us away from success, so could you please elaborate a little bit?

You wrote that write() calls to ZFS bypass the page cache. But then we would like to understand why dd to a ZVOL behaved totally differently depending on whether we added oflag=direct or not, and, when not, why /proc/sys/vm/dirty_bytes (which is a parameter of the page cache) also changed the behavior drastically. We have documented the tests in this thread in our first post from 2021-01-05.

dd uses write() when writing to a ZVOL device, doesn't it? If this is true, why does it involve the page cache although it shouldn't?

Additionally, what exactly does the new feature described in #10018 implement with respect to writing given that write() bypasses the page cache anyway? Is it about bypassing the ZFS write cache (as opposed to the page cache)?

Question 2

Until now, I was thinking that O_DIRECT already was supported on ZVOLs, but not on datasets. I have come to this opinion from posts like the following: https://forum.proxmox.com/threads/zvol-vs-image-on-top-of-dataset.48022/post-225696

Now I am completely worried due to the issue / PR you linked (#10018). I have just read the whole page (of course without understanding all of it) and couldn't find a statement that it relates (only) to normal datasets. On the other hand, according to our tests (see above), O_DIRECT is honored for ZVOLs even in older versions of ZFS.

So could you please briefly clarify for non-experts what the new feature is about? Is it about supporting O_DIRECT for normal datasets, or is it just that we can now turn it on or off for ZVOLs (where it always was active until now, according to our tests)? Does the new feature behave the same for normal datasets and ZVOLs?

Question 3

The next important question of course is which is the first official release which will incorporate that feature. I guess we'll immediately test it, because we eventually can (depending on the answer to the previous question) switch from ZVOLs to file-based VM storage which may drastically improve performance.

Question 4

Finally, at the beginning of our tests (and our learning), we were thinking that O_DIRECT on ZVOLs would bypass the ZFS write cache (which is independent from the O/S write cache) as well. But we were obviously wrong with that. Maybe there is a ZFS tunable to turn off the ZFS write cache, but we couldn't spot it. The reason for bringing this up (again, I may be silly and naive):

When we have an O/S on bare metal, the O/S manages a disk cache (read and write), and there is no further cache layer between the O/S disk cache and the physical storage (let's neglect the fact that most disks have hardware caches for the moment). O/Ss like Windows are optimized for this situation; they assume that they have exclusive access to the storage hardware.

But when running a VM on ZFS (file-based or ZVOL-based), there are at least two caches (at least for writing): The cache which is managed in the VM by the guest O/S, and the ZFS write cache (plus some buffering the VM software probably does, but let's neglect this either).

Isn't it a disadvantage to have two caches chained, and if yes, how to circumvent it? Does the new feature mentioned above have anything to do with it?

Best regards, and thank you very much in advance,

Binarus

IvanVolosyuk commented 2 years ago

I'm no expert on ZFS internals, but my understanding is the following: when writing from a VM you have:

  1. write caching on guest
  2. write caching on linux block layer for zvol
  3. write caching on ZFS
  4. write caching by drive's firmware

I am not an expert and might be wrong, but if you use O_DIRECT in the VM you can bypass (2). If you limit /proc/sys/vm/dirty_bytes, you put a limit on (2). That smooths the load on (3), and the bandwidth management in (3) works better. Otherwise (2) can grow a lot and fill the host memory with dirty data. It will look like your writes are very fast at the beginning and bottleneck later when free memory is exhausted. Try monitoring cat /proc/meminfo | grep Dirty. That means that if you back your VM with a plain file instead of a zvol, you should get the same effect as using a zvol with O_DIRECT. The O_DIRECT flag is not supported by ZFS, but it is supported by the Linux block layer on top of the zvol and, as I said, affects (2). For a VM backed by a file you should just have (1), (2) and (3).
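
A quick way to watch this while testing (the limit is just an example value):

sysctl -w vm.dirty_bytes=1073741824
watch -n 1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'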

Sorry if I'm saying something obvious, simplistic or not exactly correct ;)

Codelica commented 2 years ago

Not that I'm adding much to the conversation here considering the wizard level insight in some of the posts above, but I'll put it out there anyway. Being new to ZFS, zvol write performance was the first snag I ran into.

Basically I was unable to figure out why I was looking at 2x+ write amplification using zvol+ext4 vs ext4 on a raw partition. Not that I was expecting them to be the same, but 2x+ seemed extreme and no combination of volblocksize on the zfs side and blocksize on the ext4 side seemed to help, even when feeding it pretty benign synthetic data. Trying zvol+xfs instead did lower the write amplification considerably, but even then, under heavy write loads both the xfs and ext4 formatted zvols would hit periods where they were almost completely non-responsive. While I imagine there are ways to help reduce(?) that issue, I shy away from things that take too much tweaking, as at the end of the day storage is only one aspect of our system (and we are a small team). So we've been avoiding zvols entirely at this point, even though it would be very nice to use in a few situations.

Considering the rising popularity of ZFS based systems like Proxmox, TrueNAS Scale, etc, I guess I'm a little surprised more people aren't running into zvol performance issues. Perhaps they aren't looking under the hood or pushing things too hard, but it seems like eventually it will need some attention. As the concept of a zvol is really very very attractive IMO.

devZer0 commented 2 years ago

Considering the rising popularity of ZFS based systems like Proxmox, TrueNAS Scale, etc, I guess I'm a little surprised more people aren't running into zvol performance issues

I always wondered about this too. I switched from the Proxmox default to qcow2 on an ordinary ZFS dataset a long time ago and I'm running fine with it.

behlendorf commented 2 years ago

There are some improvements for zvol performance being worked on in PR #13148. Any feedback or test results with/without the PR for your target workload would be welcome.

sempervictus commented 2 years ago

#13148 helps with some of the queuing issues around ZVOLs, but unfortunately does not address metaslab searches for free blocks once a ZVOL has been filled and erased. The async DMU work was very promising, but is unfortunately sitting still at the moment, with no one having the time and capability to move it forward (it is highly non-trivial).

DemiMarie commented 2 years ago

The async DMU thing was very promising, but unfortunately currently sitting still with no one having time+capability to move on it (highly non-trivial).

How big a performance win was it?