openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZVOL write IO merging not sufficient #8472

Open samuelxhu opened 5 years ago

samuelxhu commented 5 years ago

System information

Type                  Version/Name
Distribution Name     ZFS on Linux
Distribution Version  CentOS 7
Linux Kernel          3.10
Architecture          x86
ZFS Version           0.6.5.x, 0.7.x, 0.8.x
SPL Version           0.6.5.x, 0.7.x, 0.8.x

Describe the problem you're observing

Before 0.6.5.x, e.g. in 0.6.3-1.3 or 0.6.4.2, ZoL used the standard Linux block device layer for ZVOLs, so one could use a scheduler such as deadline to merge incoming IO requests. Even with the simplest noop scheduler, contiguous IO requests could still be merged if they were sequential.

Things changed from 0.6.5.x on: @ryao rewrote the ZVOL block layer and disabled request merging at the ZVOL level, on the grounds that the DMU does IO merging. However, it seems that DMU IO merging either does not work properly or is not sufficient from a performance point of view.

The problem is as follows. A ZVOL has a volblocksize setting, and in many cases, e.g. for hosting VMs, it is set to 32KB or so. When IO requests are smaller than the volblocksize, read-modify-writes (RMW) occur, leading to performance degradation. A scheduler such as deadline is capable of sorting and merging IO requests, thus reducing the chance of RMW.

Describe how to reproduce the problem

Create a not-so-big ZVOL with a volblocksize of 32KB and use FIO to issue a single sequential write workload with a 4KB block size. After a while (once the ZVOL is filled with some data), either "iostat -mx 1 10" or "zpool iostat 1 10" will show a lot of read-modify-writes. Note that at the beginning of the writes there will be little or no RMW, because the ZVOL is almost empty and ZFS can intelligently skip reading zeros.

In contrast, if FIO issues a sequential write workload with a block size of 32KB, 64KB, or even larger, there is no RMW no matter how long the workload runs.
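For illustration, a reproduction sketch along the lines described above; pool and volume names are placeholders, adjust sizes to your environment:

zfs create -V 20G -o volblocksize=32k tank/testvol
fio --name=seq4k --filename=/dev/zvol/tank/testvol --rw=write --bs=4k --direct=1 --ioengine=libaio --iodepth=1 --numjobs=1 --time_based --runtime=300
# in a second terminal, watch for reads appearing alongside the writes (the RMW signature):
zpool iostat tank 1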

Apparently the IO merging logic for ZVOLs is not working properly. Either re-enabling the block device scheduler choice of deadline or noop, or fixing the broken IO merging logic in the DMU, should resolve this performance issue.

Include any warning/errors/backtraces from the system logs

samuelxhu commented 5 years ago

ZVOL currently does not even support the noop scheduler.

samuelxhu commented 5 years ago

The default value of nomerges is 2; I will try setting it to 0, re-test the cases, and report back soon.

Today I can confirm that setting nomerges to 0 has no actual effect.
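For reference, the knob in question is a per-queue sysfs attribute; a minimal sketch, assuming the zvol is exposed as /dev/zd0:

cat /sys/block/zd0/queue/nomerges      # 2 = no merging, 1 = only simple merges, 0 = merging allowed
echo 0 > /sys/block/zd0/queue/nomerges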

samuelxhu commented 5 years ago

Can somebody who is familiar with the ZFS DMU code investigate the IO merging logic inside the DMU a bit? Perhaps a better solution can be found there.

I just wonder why IO merging in the DMU is not working in this simple case (a single thread of consecutive 4KB writes).

shodanshok commented 5 years ago

@samuelxhu @kpande From how I understand it, the problem is reproducible even without a zvol: if you overwrite a large-recordsize (ie: 128k) file with 4k writes, you will encounter heavy read/modify/write. The problem does not seem related to the aggregator not doing its work; rather, it stems from the fact that on a partial-recordsize write, the entire record must be copied into memory. For example:
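A minimal sketch of that test, assuming a pool named tank with the default 128K recordsize (a more detailed run of the same pattern appears later in this thread):

zfs create tank/test
dd if=/dev/urandom of=/tank/test/test.img bs=1M count=1024       # populate with full records
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
# `zpool iostat 1` in another terminal shows reads issued alongside the writes (r/m/w)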

So, the r/m/w behavior really seems intrinsically tied to the ARC/checksumming, rather than to the aggregator not doing its work.

However, in older ZFS versions (<= 0.6.4), zvols were somewhat immune from this problem. This stems from the fact that, unless doing direct I/O, zvols did not bypass the standard Linux pagecache. In the example above, running dd if=/dev/random of=/dev/zd0 bs=4k count=1024 will place all new data into the pagecache, rather than into ZFS's own ARC. It is at this point, before "passing down" the writes to the ARC, that the Linux kernel has a chance to coalesce all these 4k writes into bigger ones (up to 512K by default). If it succeeds, the ARC will only see 128K+ sized requests, which cause no r/m/w. This, however, is not without contraindications: double caching all data in the pagecache leads to much higher pressure on the ARC, causing lower hit rates and higher CPU load. Bypassing the pagecache with direct I/O will instead cause r/m/w.

On ZFS >= 0.6.5, the zvol code was changed to skip some of the previous Linux "canned" block layer code, simplifying the I/O stack and bypassing the I/O scheduler entirely (side note: in recent Linux kernels, none is not a noop alias anymore; rather, it really means no scheduler is in use. I also tried setting nomerges to 0, with no changes in I/O speed or behavior). This increased performance for the common case (zvol with direct I/O), but prevented any merging in the pagecache.

For what it is worth, I feel the current behavior is the right one: in my opinion, zvols should not behave too differently from datasets. That said, this precludes a possible optimization (ie: using the pagecache as a sort of "first stage" buffer where merging can be done before sending anything to ZFS).

samuelxhu commented 5 years ago

Sorry, I disagree that the current behavior of ZVOL is the right one. There are many use cases in which a zvol should behave like a normal block device, e.g. as backend storage for FC and iSCSI, hosting VMs, etc. In those use cases, a scheduler such as deadline/noop can merge smaller requests into bigger ones, thereby reducing the likelihood of RMWs.

And using a scheduler to merge requests does not impose a big burden on memory usage!

samuelxhu commented 5 years ago

@kpande Just to confirm that setting /sys/devices/virtual/block/zdXXX/queue/nomerges to 0 does not cause contiguous IO requests to merge. It seems all kinds of IO merging are, unfortunately, disabled by the current implementation.

Ryao's original intention was to avoid double merging and let the DMU do the IO merging. It is mysterious that the DMU does not do the correct merging either.

shodanshok commented 5 years ago

@samuelxhu I think the rationale for the current behavior is that you should avoid double caching by using direct I/O to the zvols; in this case, the additional merging done by the pagecache is skipped anyway, so it is better to also skip any additional processing done by the I/O scheduler. Anyway, @ryao can surely answer you in a more detailed/correct form.

The key point is that it is not the DMU failing to merge requests; it is actually doing I/O merging. You are asking for an additional buffer to "pre-merge" multiple write requests before passing them to the "real" ZFS code, in order to avoid read amplification. While I understand your request, I think this is currently out of scope, and quite different from how ZFS is expected to work.

samuelxhu commented 5 years ago

@shodanshok ZVOL has been widely used as a block device since its beginning, for example as a backend for FC and iSCSI, for hosting VMs, and even stacked with md, drbd, flashcache, and rbd block devices. Therefore it is extremely important to keep ZVOL behaving like a "normal" block device, supporting a scheduler such as noop/deadline to merge incoming IO requests. By the way, having standard scheduler behavior has nothing to do with double caching.

@kpande Using a smaller volblocksize, such as volblocksize=4k, may help reduce RMWs without the need for IO request merging; however, this is far from ideal: with 4KB blocks, it effectively prevents ZFS compression and the use of ZFS RAIDZ. Furthermore, an extremely small volblocksize has a negative impact on throughput. It is widely reported that, for hosting VMs, a volblocksize of 32KB is a better choice in practice.
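For illustration, the kind of zvol being discussed could be created like this (pool name, volume name, and size are placeholders):

zfs create -V 200G -o volblocksize=32k -o compression=lz4 tank/vm-lun
# exported over FC/iSCSI; 32K keeps lz4 compression useful on RAIDZ while limiting per-request overhead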

On Mon, Mar 4, 2019 at 3:54 AM kpande notifications@github.com wrote:

just use a smaller volblocksize and be aware of raidz overhead considerations if you are not using mirrors. using native 512b storage (some NVMe, some datacentre HDD up to 4TB) and ashift=9 will allow compression to work with volblocksize=4k.


shodanshok commented 5 years ago

@samuelxhu but they are normal block devices; only the scheduler code was bypassed to improve performance in the common case. I have no problem understanding what you say and why, but please be aware you are describing a pretty narrow use case/optimization: contiguous, non-direct 4k writes to a zvol are the only case where pagecache merging would be useful. If random I/O is issued, merging is not useful. If direct I/O is used, merging is again not useful.

So, while I am not against the change you suggest, please be aware of its narrow scope in real world workloads.

samuelxhu commented 5 years ago

@kpande I have over 20 ZFS storage boxes serving as FC/iSCSI backends, all using a 32KB volblocksize. We run different workloads on them and found that 32KB volblocksize strikes the best balance between IOPS and throughput. I have several friends running ZVOLs for VMware who recommend 32KB as well. Therefore IO request merging and sorting at the ZVOL level can effectively reduce RMW. @shodanshok Adding a scheduler layer to ZVOL will not cost much memory or CPU, but it will enable stacking ZVOL with many other Linux block devices, embracing a much broader scope of use.

samuelxhu commented 5 years ago

Let me describe another ZVOL use case which requires normal block device behavior with a valid scheduler: one or more application servers use an FC or iSCSI LUN backed by a ZVOL; the servers use a server-side SSD cache, such as Flashcache or bcache, to reduce latency and accelerate application IO. Either flashcache or bcache will issue small but contiguous 4KB IO requests to the backend, anticipating that the backend block device will sort and merge those contiguous IO requests.

In the above case, any other block device, including HDD, SSD, RAID, or a virtual block device, will have no performance issues. But with a zvol in its current implementation, one will see significant performance degradation due to excessive and unnecessary RMWs.

richardelling commented 5 years ago

In general, it is unlikely that merging will benefit overall performance. However, concurrency is important and has changed during the 0.7 evolution. Unfortunately, AFAIK, there is no comprehensive study on how to tune the concurrency. See https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zvol_threads

Also, there are discussions in https://github.com/zfsonlinux/zfs/issues/7834 regarding the performance changes over time, especially with the introduction of the write and DVA throttles. If you have data to add, please add it there.

samuelxhu commented 5 years ago

Why is using ZVOLs as backend block devices for iSCSI/FC LUNs not a common use case? Don't be narrow-minded, it is very common. This is the typical use case in which ZVOL should have its own scheduler, for at least two purposes: 1) to stay compatible with the Linux block device model (extremely important for block device stacking), as the applications expect the backend storage ZVOL to do IO merging and sorting; 2) to reduce the chance of the notorious RMWs, in particular for non-4KB ZVOL volblocksizes.

I do not really understand why ZVOL should be different from a normal block device. For those who use ZVOL with a 4KB volblocksize only, setting the scheduler to noop/deadline only costs a few CPU cycles, but IO merging has a big potential to reduce the chance of RMWs for non-4KB ZVOL volblocksizes.

Pity on me, I run more than a hundred FC/iSCSI ZFS ZVOL storage boxes with a volblocksize of 32KB or even bigger, for sensible reasons; the missing scheduler in 0.7.x causes us pain from excessive RMWs and thus performance degradation, preventing us from upgrading (from 0.6.4.2) to any later version.

We would like to sponsor a fund to support somebody who can make a patch restoring the scheduler feature for ZVOL in 0.7.x. Anyone who is interested, please contact me at samuel.xhu@gmail.com. The patch may or may not be accepted by the ZFS maintainers, but we would like to pay for the work.

samuelxhu commented 5 years ago

@kpande Thanks a lot for pointing out the related previous commits; I will have a careful look at them and try to find a temporary remedy for the excessive RMWs.

I notice that previous zvol performance testing focused primarily on 4KB or 8KB ZVOLs; perhaps that is what rendered the RMW issue less visible and caused it to be ignored by many eyes.

Let me explain a bit why a larger-blocksize ZVOL still makes sense and should not be ignored: 1) it enables the use of LZ4 compression together with RAIDZ(1/2/3) for storage space efficiency; 2) it strikes a balance between IOPS and throughput, and 32KB seems to be good for VM workloads since it is neither too big nor too small; 3) we have server-side flash caches (flashcache, bcache, EnhanceIO, etc.) on all application servers, which absorb random 4KB writes and then issue contiguous (semi-sequential) IO requests of 4KB or other small sizes, anticipating that the backend block devices (iSCSI/FC ZVOLs) will do the IO merging/sorting.

In my humble opinion, eliminating the scheduler code from ZVOL really causes RMW pain for non-4KB ZVOLs, perhaps not for everyone, but at least for some ZFS fans.

samuelxhu commented 5 years ago

@kpande It is interesting to notice that some people are complaining about performance degradation due to commit 37f9dac as well in https://github.com/zfsonlinux/zfs/issues/4512

Maybe it is just a coincidence, maybe not.

Commit 37f9dac may perform well for zvols with direct I/O, but there are many other use cases which suffer from performance degradation due to the missing scheduler behavior (merging and sorting IO requests).

shodanshok commented 5 years ago

It seems https://github.com/zfsonlinux/zfs/issues/361 basically covers the problem explained here.

Rather than using the pagecache (with its double-caching and increased memory pressure on ARC), I would suggest creating a small (~1M), front "write buffer" to coalesce writes before sending them to ARC.

@behlendorf @ryao any chance of implementing something similar?

samuelxhu commented 5 years ago

@shodanshok good finding!

Indeed #361 deals essentially with the same RMW issue as here. It came up in 2011, at which time ZFS practitioners could at least use the deadline/noop scheduler (before 0.6.5.x) to reduce the chance of RMWs. In #4512, a few ZFS users complained about significant write amplification right after the scheduler was removed, but for unknown reasons the RMWs were not paid attention to.

Given so much evidence, it seems to be the right time to make a serious effort to solve this RMW issue for ZVOL. We volunteer to take responsibility for testing, and if needed, funding sponsorship of up to 5K USD (from Horeb Data AG, Switzerland) is possible for the code developer (if multiple developers are involved, behlendorf please divide).

samuelxhu commented 5 years ago

@kpande Only for database workloads do we have aligned IO on ZVOLs, and unfortunately I do not observe significant performance improvement after 0.6.5.x. The reason might be that I universally run ZFS boxes with high-end CPUs and plenty of DRAM (256GB or above), so saving a few CPU cycles does not have a material impact on IO performance. (The bottleneck is definitely the HDDs, not CPU cycles or memory bandwidth.)

Most of our workloads are not aligned IO, such as hosting VMs and FC/iSCSI backed by ZVOLs, where the frontend applications generate mixed workloads of all kinds. Our engineering team currently focuses on fighting RMWs, and I think either #361 or #4512 should already show sufficient evidence of the issue.

Until ZVOLs have an effective IO merging facility, we plan to write a shim-layer block device sitting in front of ZFS to enable IO request sorting and merging and reduce the occurrence of RMWs.

behlendorf commented 5 years ago

@samuelxhu one thing I'd suggest trying first is to increase the dbuf cache size. This small cache sits in front of the compressed ARC and contains an LRU of the most recently used uncompressed buffers. By increasing its size you may be able to mitigate some of the RMW penalty you're seeing. You'll need to increase the dbuf_cache_max_bytes module option.
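For example, a hedged sketch of raising that tunable (the 256 MiB figure is purely illustrative, not a recommendation):

echo $((256*1024*1024)) > /sys/module/zfs/parameters/dbuf_cache_max_bytes
# or persistently, via /etc/modprobe.d/zfs.conf:
#   options zfs dbuf_cache_max_bytes=268435456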

Until ZVOLs have an effective IO merging facility, we plan to write a shim-layer block device sitting in front of ZFS to enable IO request sorting and merging and reduce the occurrence of RMWs.

You might find you can use one of Linux's many existing dm devices for this layer.

Improving the performance of volumes across a wide variety of workloads is something we're interested in, but haven't had the time to work on. If you're interested, rather than implementing your own shim layer I'd be happy to discuss a design for doing the merging in the zvol implementation. As mentioned above, the current code depends on the DMU to do the heavy lifting regarding merging. However, for volumes there's nothing preventing us from doing our own merging. Even just front/back merging, or being aware of the volume's internal alignment, might yield significant gains.

richardelling commented 5 years ago

In order to merge you need two queues: active and waiting. With the request-based scheme there is one queue with depth=zvol_threads. In other words, we'd have to pause I/Os before they become active. This is another reason why I believe merging is not the solution to the observed problem.

shodanshok commented 5 years ago

@richardelling From my tests, it seems that DMU merging at writeout time is working properly. What kills the performance of smaller-than-recordsize writes (ie: 4k on a 128K recordsize/volblocksize), for both zvols and regular datasets, is the read part of the r/m/w behavior. Basically, when a small write (ie: 4k) is buffered by the ARC, the whole 128K record has to be brought into memory, irrespective of whether later writes overlap (and completely account for) the whole record.

Hence my idea of a "front buffer" which accepts small writes as they are (irrespective of the underlying recordsize) and, after having accumulated/merged some data (say, 1 MB), writes them out via the normal ARC buffering/flushing scheme. This would emulate what the pagecache does for regular block devices, without the added memory pressure of a real pagecache (which cannot be limited in any way, if I remember correctly).

I have no idea whether this can be implemented without lowering ZFS's excellent resilience, or how difficult it would be, of course.

samuelxhu commented 5 years ago

@behlendorf thanks a lot for the suggestions. It looks like front merging can easily be turned on by reverting commit 5731140, but extensive IO sorting/merging inside ZVOL/DMU may take more effort. I may not be capable of doing much of the coding myself, but I would like to contribute testing or help in other ways as much as possible.

zviratko commented 5 years ago

Just to chime in - we use ZFS heavily with VM workloads and there is a huge tradeoff between using a 128KiB volblocksize or smaller. Higher volblocksizes actually perform much better up to the point where throughput is saturated, while smaller volblocksizes almost always perform worse but don't cause throughput problems. And I found it quite difficult to actually predict/benchmark this behaviour because it works very differently on new unfragmented pools, new ZVOLs (no overwrites), different layers of caching (I am absolutely certain that the Linux pagecache still does something with ZFS, as I'm seeing misses that never hit the drives), and various caching problems (ZFS doesn't seem to cache everything it should or could in ARC).

This all makes it very hard to compare the performance of ZFS/ZVOLs to any other block device, it makes it hard to tune, and it makes it extremely hard to compete with "dumb" solutions like mdraid when performance is all over the place.

If there is any possibility to improve merging to avoid throughput saturation then please investigate it. The other solution (to the problems I am seeing in my environment) is to fix performance issues with smaller volblocksizes, but I guess that will be much more difficult, and I have seen it discussed elsewhere multiple times already (like ZFS not being able to use vdev queues efficiently when those vdevs are fast, like NVMe, where I have rarely seen a queue size >1).

janetcampbell commented 5 years ago

We did a lot of experimentation with ZVOLs here and I'd like to offer a few suggestions.

  1. RMW can come from above you as well as from within ZFS. Depending on what parameters you're using on your filesystem and what you set for your block device, you can end up with either the VM subsystem or user land thinking that you have a large minimum IO size, and they will try to pull in data from you before they write out.

With zvols, always always always blktrace them as you're setting up to see what is going on. We found that some filesystem options (large XFS allocsize=) could provoke RMW from the pager when things were being flushed out. If you blktrace and see reads for a block coming in before the writes do, you are in this situation.

  2. Proper setup is essential and "proper" is a matter of perspective. Usually it's best to configure a filesystem as though it was on a RAID stripe either the size of the volblocksize, or half that size. The reason you might choose a smaller size is if you are on a pool with no SLOG and you want all FIO writes to the zvol to go to ZIL blocks instead of indirect sync, as large block zvols do with full-block writes. Or, you may want to refactor your data into larger chunks for efficiency or synchronization purposes.

  3. Poor inbound IO merge. It's best to configure a filesystem on a zvol to expose a large preferred IO size to applications, allowing FIO to come through in big chunks.

  4. Always use primarycache=all.

  5. If you use XFS on zvols, use a separate 4K volblocksize ZVOL for the XFS filesystem journal. This can be small; 100MB is more than enough. This keeps the constant flushing that XFS does out of your primary ZVOL, and allows things to aggregate much more effectively.

Here's an example:

zfs create -V 1g -o volblocksize=128k tank/xfs
zfs create -V 100m -o volblocksize=4k tank/xfsjournal

mkfs.xfs -s size=4096 -d sw=1,su=131072 -m crc=0 -l logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs
mount -o largeio,discard,noatime,logbsize=256K,logbufs=8 /dev/zvol/tank/xfs /somewhere

largeio + large stripe unit + separate XFS journal has been the winning combination for us.

Hope this helps.

samuelxhu commented 5 years ago

Very good points. Thanks a lot

Samuel


janetcampbell commented 5 years ago

Just to chime in - we use ZFS heavily with VM workloads and there is a huge tradeoff between using a 128KiB volblocksize or smaller. Higher volblocksizes actually perform much better up to a point when throughput is saturated, while smaller volblocksizes almost always perform worse, but don't cause throughput problems.

A little gem I came up with that I haven't seen elsewhere...

Large zvols cause more TxG commit activity. The big danger from this is RMW reads, which can stomp on other IO that's going around.

Measure TxG commit speed. Open the ZIO throttle. Then, set zfs_sync_taskq_batch_pct=1 and do a TxG commit. Raise it slowly until TxG commit speed is a little slower than it was before the test. This will rate-limit the TxG commit and the RMW reads that come off of it, and can also help I/O aggregation. I came up with this approach when I developed a remote backup system that went to block devices on the far side of a WAN.

With this you can run long intervals between commits and carry plenty of dirty data, which helps reduce RMW. Once you set the sync taskq, turn the ZIO throttle on and adjust it to just before where it starts to have an effect. This will match these two parameters to the natural flow of the system. At this point you can usually turn aggregation way up and drop the number of async writers some.

Oh, and make sure your dirty data write throttle is calibrated correctly and has enough room to work. ndirty should stabilize in the middle of its range during high throughput workloads.
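For reference, a rough sketch of where that tunable lives; whether a runtime change takes effect immediately or only on the next pool import depends on the ZFS version, so verify against your build:

cat /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct        # default is 75 (percent)
echo 1 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct   # start low, then raise step by step
# keep raising until TxG commit speed is just slightly slower than it was before the test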

We mostly use 128K-256K zvols. They work very well and beat out ZPL mounts for MongoDB performance for us. Performance is more consistent than ZPL mounts provided you're good to them (don't do indirect sync writes with a small to moderate block size zvol unless you don't care about read performance).

janetcampbell commented 5 years ago

I realized there are a lot of comments here that are coming from the wrong place on RMW reads and how ZFS handles data going into the DMU. Unless it is in the midst of a TxG commit, ZFS will not issue RMW reads for partial-blocksize writes unless they are indirect sync writes, and you can't get a partial-block indirect sync write on a ZVOL due to how zvol_immediate_write_size is handled. Normally the txg commit handles all RMW reads, when necessary, at the start of the commit, and none happen between commits.

The RMW reads people are bothered by are actually coming from the Linux kernel, in fs/buffer.c. Here's a long-winded explanation of why, and how to fix it (easy with ZVOLs):

https://github.com/zfsonlinux/zfs/issues/8590

With a 4k superblock inode size you can run a ZVOL with a huge volblocksize, txg commit once a minute, and handle tiny writes without problem. Zero RMW if all the pieces of the block show up before TxG commit.

Hope this helps.

shodanshok commented 5 years ago

@janetcampbell while I agree that a reasonably sized recordsize is key to extracting good read performance, especially from rotating media, I think you are missing the fact that RMW can and will happen very early in the write process, as early as when the write buffer is accepted into the DMU. Let me give a practical example:

# create a 128K recordsize test pool
[root@singularity ~]# zfs create tank/test
[root@singularity ~]# zfs get recordsize tank/test
NAME       PROPERTY    VALUE    SOURCE
tank/test  recordsize  128K     default

# create a 1GB test file and drop caches
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.94275 s, 217 MB/s
[root@singularity ~]# sync
[root@singularity ~]# echo 3 > /proc/sys/vm/drop_caches

# rewrite some sequential 4k blocks
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 1.05854 s, 4.0 MB/s

# on another terminal, monitor disk io - rmw happens
[root@singularity ~]# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        92.0G  1.72T      0      4  2.16K   494K
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      3      0   511K      0
tank        92.0G  1.72T     27      0  3.50M      0
tank        92.0G  1.72T      0    169      0  9.35M
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0

# retry the same *without* dropping the cache
[root@singularity ~]# dd if=/dev/urandom of=/tank/test/test.img bs=4k count=1024 conv=notrunc,nocreat
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0306379 s, 137 MB/s

# no rmw happens
[root@singularity ~]# zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        92.0G  1.72T      0      4  3.07K   489K
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0      0      0      0
tank        92.0G  1.72T      0     61      0  7.63M
tank        92.0G  1.72T      0    104      0  1.78M

Please note how, on the first 4k write test, rmw (with synchronous reads) happens as soon as the write buffers are accepted into the DMU (this is reflected by the very low dd throughput). This happens even though dd, being sequential, completely overwrites the affected zfs records. In other words, we don't really have a merging problem here; rather, we see io amplification due to rmw. Merging at writeout time is working correctly.

The second identical write test, which is done without dropping the cache, avoids the rmw part (especially its synchronous read part) and shows much higher write performance. Again, merging at write time is working correctly.

This is, in my opinion, the key reason why people say ZFS needs tons of memory to have good performance: because the penalty is so heavy, reducing the R part of rmw by using a very large ARC can be extremely important. It should be noted that L2ARC works very well in this scenario, and this is the main reason why I often use a cache device even for workloads with a low L2ARC hit rate.

sempervictus commented 5 years ago

You can enable merging at runtime on ZVOLs; IIRC that patch was in part trying to deal with Oracle RAC/iSCSI/ZFS. Enabling merges (clearing NOMERGES) and upping nr_requests helps with some workloads, but doesn't do as much as one would think in my testing. You can try the same via:

diff --git a/include/zfs/linux/blkdev_compat.h b/include/zfs/linux/blkdev_compat.h
index c8cdf38ef4fe..ad3e9537d5b3 100644
--- a/include/zfs/linux/blkdev_compat.h
+++ b/include/zfs/linux/blkdev_compat.h
@@ -632,4 +632,11 @@ blk_generic_end_io_acct(struct request_queue *q, int rw,
 #endif
 }

+static inline void blk_update_nr_requests(struct request_queue *q, unsigned int nr)
+{
+        spin_lock_irq(q->queue_lock);
+        q->nr_requests = nr;
+        spin_unlock_irq(q->queue_lock);
+}
+
 #endif /* _ZFS_BLKDEV_H */
diff --git a/fs/zfs/zfs/zvol.c b/fs/zfs/zfs/zvol.c
index 6eb926cee6ef..05823a24ce5b 100644
--- a/fs/zfs/zfs/zvol.c
+++ b/fs/zfs/zfs/zvol.c
@@ -1666,8 +1666,12 @@ zvol_alloc(dev_t dev, const char *name)
        /* Limit read-ahead to a single page to prevent over-prefetching. */
        blk_queue_set_read_ahead(zv->zv_queue, 1);

+
+       /* Set deeper IO queue for modern zpools: default is 128, SSDs easily do > 512*/
+       blk_update_nr_requests(zv->zv_queue, 1024);
+
        /* Disable write merging in favor of the ZIO pipeline. */
-       blk_queue_flag_set(QUEUE_FLAG_NOMERGES, zv->zv_queue);
+       // blk_queue_flag_set(QUEUE_FLAG_NOMERGES, zv->zv_queue);

        zv->zv_disk = alloc_disk(ZVOL_MINORS);
        if (zv->zv_disk == NULL)
Temtaime commented 4 years ago

Any news? This issue is annoying. I'm seeing a queue depth > 5K for my SSD and sometimes it starts to produce errors:

[ 2111.023567] sd 11:0:0:0: [sdd] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 2111.023570] sd 11:0:0:0: [sdd] tag#12 CDB: Write(10) 2a 00 2b 0e ea ea 00 00 03 00
[ 2111.023573] blk_update_request: I/O error, dev sdd, sector 722397930 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0

It's not a problem with the SSD, I think. Just a slow model; SMART is OK and it works flawlessly under other workloads.

sempervictus commented 4 years ago

We basically gave up on ZVOLs for anything performance-oriented a couple of years ago. After the 0.6.5 thing, they became unusable for long-running workloads due to the degrading and inconsistent performance profile. A block device needs to work as a block device: same semantics as any other SG interface. The problem is that there is no FOSS commercial driver for the work needed to overhaul the entire block device implementation, and where there is, there is an incentive to keep it in-house (cloud vendors and such). SCST loopback volumes atop files in ZFS datasets are a decent option; the same goes for LIO. Overall, performance in ZFS does not appear to be a first-order concern - we still have hardcoded values for HDD semantics all over, while storage has now moved past SSD into direct PCI access... Maybe we need a Kickstarter to fund work on zvol performance?

samuelxhu commented 4 years ago

A zvol on ZFS 0.8.x is usable and stable as long as spl_taskq_thread_dynamic is disabled. Otherwise one may see sporadic system freezes without any obvious clue under long or heavy workloads.
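For reference, a minimal way to pin that setting across reboots (the parameter belongs to the spl module in 0.7.x/0.8.x):

# /etc/modprobe.d/spl.conf
options spl spl_taskq_thread_dynamic=0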

The root cause of the ZVOL performance regression dates back to the ZVOL rework in 0.6.5, where normal block device behavior (the deadline scheduler was unfortunately removed) is no longer supported; IO requests are therefore not effectively merged before being sent to the ZFS ZIO, in many cases causing excessive read-modify-writes for small block sizes.

It would be the right time to revert the ZVOL rework of 0.6.5, in particular to re-enable the deadline scheduler on the ZVOL block device. This would solve 80% of the ZVOL performance woes.


samuelxhu commented 4 years ago

... in many cases causing excessive read-modify-writes (for small block sizes) if the IO size is smaller than (and not aligned with) the ZVOL volblocksize.


Temtaime commented 4 years ago

@samuelxhu @sempervictus Thanks very much for your replies. BTW: my spl_taskq_thread_dynamic is set to 1 and ZFS is 0.8.2. So I should override it to zero?

samuelxhu commented 4 years ago

From my own experience on 0.7.x, you definitely need to set spl_taskq_thread_dynamic to 0. So on 0.8.x I continue to keep spl_taskq_thread_dynamic at 0.

I am not sure whether setting spl_taskq_thread_dynamic to 1 works well on 0.8.x or not.


behlendorf commented 4 years ago

@samuelxhu if you can recommend better default values for most systems based on your performance investigation, that would be helpful. Both decreasing the default number of zvol_threads and removing TASKQ_DYNAMIC sound like they would be beneficial for most workloads. If you're comfortable opening a PR that would be great; otherwise I'm happy to do so.

Tackling improving the zvol merging is clearly a bigger chunk of work. A lot has changed in the Linux block layer since that older ZFS release, so it's not as straightforward as reverting the changes. But with a bit of development work it's definitely possible to perform more IO merging.

samuelxhu commented 4 years ago

@behlendorf Actually the problem I encountered with 0.7.x is even more serious than I described. With 0.7.12/13, I first encountered sporadic freezing without any useful kernel messages; I found some clues on the web, changed spl_taskq_thread_dynamic to 0, and it seemed to work well (with volblocksize=32KB) under heavy write workloads. Later, when I tried smaller volblocksizes such as 4KB/8KB/16KB in order to reduce excessive read-modify-writes, the sporadic freezing occurred again. It looks like the smaller the volblocksize, the more likely the freezing is to happen. The good news is that when I switched to 0.8.x (with spl_taskq_thread_dynamic set to 0), the problem disappeared (honestly, I did not try 0.8.x with spl_taskq_thread_dynamic equal to 1). That is why I have not filed a PR yet. I think 0.8.x has done a good job of fixing the relevant deadlock, by chance or intentionally.

I have kept a close eye on the performance complaints about ZFS ZVOL, and it occurred to me that most complaints are related to the deadline scheduler being removed from the ZVOL block device since 0.6.5, which fundamentally disabled IO merging in front of the ZFS ARC. Unfortunately I am not a professional kernel developer who could handle this, but I will be more than happy to do extensive testing and/or provide some financial support if someone takes on this work. BTW, thank you so much for making ZFS such a great project!

Samuel


sempervictus commented 4 years ago

Disabling the dynamic task queue bit and setting zvol threads to 2-8x the number of cores (varies by system and workload) has been our standard kernel command line invocation since 0.7, so we haven't tried dynamic taskqs on 0.8. We also try to ensure a dedicated SLOG for these setups (even if all-SSD) to help deal with sync IO requests on systems with multiple consumers. That said, the long-term degradation of volume performance with writes and subsequent fragmentation, especially with automated snapshot schedules, is its own problem. Without something like the mythical bp rewrite, a zvol can't be defragmented, nor does it truly discard/trim old data which has been deleted by the consumer, as far as I understand.
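For what it's worth, a sketch of that kind of kernel command line fragment (the thread count is only an example, not a recommendation):

# appended to the kernel command line (module.parameter=value form)
spl.spl_taskq_thread_dynamic=0 zfs.zvol_threads=32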

behlendorf commented 4 years ago

To get the ball rolling I've opened PR #9874 which adjusts the current defaults to close to what was recommended above. Dynamic task queues are unconditionally disabled, and one thread per cpu is created by default. Any performance testing would be appreciated to confirm the new defaults are an improvement.

samuelxhu commented 4 years ago

@sempervictus I cannot agree with you more on the measures taken, such as disabling spl_taskq_thread_dynamic, limiting zvol threads to 8, and always having a decent SSD as SLOG.

ZFS ZVOL is definitely not perfect in its current form, but there is no other software (except for NetApp WAFL) which can provide hardware failure detection/separation and data protection comparable to ZFS. From my experience, ZVOL fragmentation normally will not cause much performance degradation as long as the pool is not over 90% full.

I have not done extensive tests of the discard/trim (of old data) behavior on 0.8.x, which is supposed to work well. I would much appreciate others sharing relevant experience.

scineram commented 4 years ago

The SpectraLogic DMU rework that Matt Macy is attempting to upstream again could be the solution to this.

DemiMarie commented 3 years ago

Can this be worked around by using a device-mapper linear layer on top of the zvol?

shodanshok commented 3 years ago

@DemiMarie based on my tests, no: the overlaying device mapper will not expose any IO scheduler, negating early IO merging. That said, the real performance killer is the sync read IOs needed for a partial record update. To somewhat mitigate that, you can use ZVOLs while avoiding O_DIRECT file IO (ie: using the Linux pagecache as an upper, coalescing buffer); however, this means double caching and possibly some bad (performance-wise) interaction between the pagecache and the ARC.

DemiMarie commented 3 years ago


The use-case I am interested in is using ZFS in QubesOS, which means that the zvols are being exposed over the Xen PV disk protocol. Not sure what the best answer is there. Is @sempervictus’s patch a solution?

filip-paczynski commented 3 years ago

@DemiMarie I might be entirely wrong about this, but in your use case:

liyimeng commented 2 years ago

The SpectraLogic DMU rework that Matt Macy is attempting to upstream again could be the solution to this.

Is this done? Where is the PR?

sempervictus commented 2 years ago

@liyimeng it's not done - I revived the PR (#12166) a while back but have zero time to do free work right now. Please feel free to complete the merge and validation work on that, though.

tonyhutter commented 2 years ago

I would encourage all zvol users to test drive my block multi-queue PR here: #12664 . You could see pretty big performance improvements with it, depending on your workload.
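For anyone testing it, the new code path in that PR is gated behind a module parameter; the name below is taken from the PR and should be verified against the build you are running:

# /etc/modprobe.d/zfs.conf
options zfs zvol_use_blk_mq=1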

mailinglists35 commented 2 years ago

lazy person asking: how far in time is that PR from being merged into main?

sempervictus commented 2 years ago

@mailinglists35 - hard to tell; even otherwise-complete PRs sometimes hang out in the queue for a while as other things are implemented in master. It's in the testing phase though, so closer to it than otherwise :).