tonyhutter opened this issue 3 years ago
I did three different flamegraph runs and saw roughly the same results. Here's one of the runs:
(Note: for some reason these SVGs aren't interactive when I click on the links, but if I download them and run `firefox odirect3.svg` it works.)
One thing that stood out was `zvol_write()`. On non-O_DIRECT, `zvol_write()` takes ~50% of the CPU, but on O_DIRECT it takes about 20%. If you click on `zvol_write()`, you'll see that on non-O_DIRECT most of the time is spent in `dbuf_find()`, whereas on O_DIRECT `dbuf_find()` is only a tiny fraction of the time. I'm continuing to investigate.
Not sure if this helps, but below are some of /proc/spl/kstat/zfs/dbufstats
. Rows with zeros are omitted. The numbers were relatively consistent across three runs.
Stat | non-O_DIRECT | O_DIRECT | non-O_DIRECT ÷ O_DIRECT |
---|---|---|---|
cache_count | 18 | 9 | 2.00 |
cache_size_bytes | 2129920 | 720896 | 2.95 |
cache_size_bytes_max | 1874608128 | 2152464384 | 0.87 |
cache_target_bytes | 1873661094 | 2105699456 | 0.89 |
cache_lowater_bytes | 1686294985 | 1895129511 | 0.89 |
cache_hiwater_bytes | 2061027203 | 2316269401 | 0.89 |
cache_total_evicts | 100634 | 100706 | 1.00 |
cache_level_0 | 18 | 9 | 2.00 |
cache_level_0_bytes | 2129920 | 720896 | 2.95 |
hash_hits | 27092688 | 324774 | 83.42 |
hash_misses | 196310 | 273225 | 0.72 |
hash_collisions | 30046 | 5270 | 5.70 |
hash_elements | 63 | 43 | 1.47 |
hash_elements_max | 3617 | 5628 | 0.64 |
hash_chains | 0 | 0 | 0.00 |
hash_chain_max | 5 | 2 | 2.50 |
hash_insert_race | 693056 | 13503 | 51.33 |
metadata_cache_count | 13 | 13 | 1.00 |
metadata_cache_size_bytes | 209408 | 208896 | 1.00 |
metadata_cache_size_bytes_max | 209408 | 208896 | 1.00 |
metadata_cache_overflow | 0 | 0 | 0.00 |
I think I know what's going on here.
TL;DR: O_DIRECT bypasses the kernel page cache, giving better performance on zvols. Use O_DIRECT in your application, or use the `nocache` command (https://manpages.debian.org/jessie/nocache/nocache.1.en.html) to get around it. Just make sure you don't do any non-512B (or 4K?) aligned accesses, though.
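For what it's worth, here's a minimal sketch of the kind of aligned O_DIRECT write I mean (illustrative only - the device path is made up, and whether 512B or 4K alignment is required depends on the zvol's logical block size):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	const size_t len = 1 << 20;             /* 1 MiB, a multiple of 4K */
	void *buf;

	/* O_DIRECT wants the buffer, offset, and length block-aligned. */
	if (posix_memalign(&buf, 4096, len) != 0)
		return (1);

	int fd = open("/dev/zvol/mypool/myvol", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return (1);

	ssize_t w = pwrite(fd, buf, len, 0);    /* aligned offset and length */

	close(fd);
	free(buf);
	return (w == (ssize_t)len ? 0 : 1);
}
```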
Long answer:
I was noticing `kworker` and `zvol` threads fighting it out in `top` while running my non-O_DIRECT `dd` tests. I ran `echo l > /proc/sysrq-trigger` to see what was going on, and would see traces like this:
[178916.746423] [<ffffffffc0bbbb49>] zvol_request+0x269/0x370 [zfs]
[178916.753230] [<ffffffff95d70727>] generic_make_request+0x177/0x3b0
[178916.760226] [<ffffffff95d709d0>] submit_bio+0x70/0x150
[178916.766149] [<ffffffff95ca0665>] ? bio_alloc_bioset+0x115/0x310
[178916.772953] [<ffffffff95c9bdd7>] _submit_bh+0x127/0x160
[178916.778985] [<ffffffff95c9c092>] __block_write_full_page+0x172/0x3b0
[178916.786275] [<ffffffff95ca2270>] ? set_init_blocksize+0x90/0x90
[178916.793081] [<ffffffff95ca2270>] ? set_init_blocksize+0x90/0x90
[178916.799887] [<ffffffff95c9c4e8>] block_write_full_page+0xf8/0x120
[178916.806886] [<ffffffff95ca29f8>] blkdev_writepage+0x18/0x20
[178916.813304] [<ffffffff95bd4f79>] __writepage+0x19/0x50
[178916.819237] [<ffffffff95bd5d2a>] write_cache_pages+0x24a/0x4b0
[178916.825944] [<ffffffff95bd4f60>] ? global_dirtyable_memory+0x70/0x70
[178916.833236] [<ffffffff95bd5fdd>] generic_writepages+0x4d/0x80
[178916.839848] [<ffffffff95ca29be>] blkdev_writepages+0xe/0x10
[178916.846267] [<ffffffff95bd6e21>] do_writepages+0x21/0x50
When I ran the same tests with O_DIRECT, I would see `kworker` running much less. I also didn't see any write_cache_pages() calls in the traces. That told me the kernel was correctly bypassing the page cache with O_DIRECT, which is exactly the intent of the flag.
Just to be sure it wasn't ZFS behavior, I also explicitly cleared O_DIRECT on any zvol write:
diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 8229bc9..e3c2e44 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -260,6 +260,9 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)
 	while (n > 0) {
 		ssize_t nbytes = MIN(n, zfs_vnops_read_chunk_size -
 		    P2PHASE(zfs_uio_offset(uio), zfs_vnops_read_chunk_size));
+
+		ioflag = ioflag & ~O_DIRECT;
+
 #ifdef UIO_NOCOPY
 		if (zfs_uio_segflg(uio) == UIO_NOCOPY)
 			error = mappedread_sf(zp, nbytes, uio);
... but got the same performance results I saw earlier.
Note that ZFS will always use the ARC in both the non-O_DIRECT and O_DIRECT cases (although this may change with https://github.com/openzfs/zfs/pull/10018). Mounting zfs with `-o dax` has also been discussed as a possible workaround to get a sort of "mount-level O_DIRECT" (https://github.com/openzfs/zfs/issues/9986).
Closing issue.
I had an interesting conversation with @behlendorf about this issue. He thought that the non-O_DIRECT case might be submitting tons of tiny, one-page BIO requests to the zvol, whereas the O_DIRECT case would pass the larger, uncached BIO directly. He further explained that the ZFS module takes all BIOs and sticks them on a single task queue, where they are then processed by a pool of worker threads. He speculated that the single task queue may be getting overwhelmed.
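To make the "single task queue" point concrete, here's a rough sketch of the dispatch pattern @behlendorf described (illustrative only, not the actual zvol_request() code; `zvol_tq` and the helper names are made up):

```c
/* One module-wide taskq services every zvol BIO. */
static taskq_t *zvol_tq;

static void
zvol_do_one_bio(void *arg)
{
	struct bio *bio = arg;

	/* ...translate this one BIO into DMU reads/writes... */
	bio_endio(bio);
}

static void
zvol_submit_bio_sketch(struct bio *bio)
{
	/*
	 * One dispatch per BIO, no matter how small it is. With
	 * millions of 4K BIOs, the shared queue becomes the bottleneck.
	 */
	(void) taskq_dispatch(zvol_tq, zvol_do_one_bio, bio, TQ_SLEEP);
}
```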
I re-ran my earlier tests and confirmed this was the case. My test created 10 volumes and then did parallel `dd` writes to them with `bs=1GB`. In the non-O_DIRECT case, zvol_request() would get ~25 million BIOs, each one a single 4K page. In the O_DIRECT case, I would see ~200K BIOs of 512K each.
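(Rough sanity check on those numbers, my arithmetic: ~25 million × 4 KiB ≈ 100 GB and ~200K × 512 KiB ≈ 100 GB, so both cases wrote roughly the same amount of data, just chopped into very differently sized BIOs.)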
I thought I could just switch out the IO scheduler to aggregate the BIOs, but no other schedulers were listed as options for the zvol:
$ cat /sys/class/block/zd0/queue/scheduler
none
... contrast that to:
$ cat /sys/class/block/sda/queue/scheduler
noop [deadline] cfq
I'm looking into ways to make this more efficient.
That might be on me - in earlier attempts to fix the ZVOL issue, I had set NOMERGES on it:
commit 5731140eaf4aaf2526a8bfdbfe250195842e79eb
Author: RageLtMan <rageltman [at] sempervictus>
Date: Sat Mar 18 00:51:36 2017 -0400
Disable write merging on ZVOLs
The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.
Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.
Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index 7590ed3160..d0f7b9912b 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -1468,6 +1468,9 @@ zvol_alloc(dev_t dev, const char *name)
 	blk_queue_make_request(zv->zv_queue, zvol_request);
 	blk_queue_set_write_cache(zv->zv_queue, B_TRUE, B_TRUE);
 
+	/* Disable write merging in favor of the ZIO pipeline. */
+	queue_flag_set(QUEUE_FLAG_NOMERGES, zv->zv_queue);
+
 	zv->zv_disk = alloc_disk(ZVOL_MINORS);
 	if (zv->zv_disk == NULL)
 		goto out_queue;
Doesn't BTRFS produce a volume type of some sort? How are they handling all of this?
@sempervictus yea I tried commenting out those lines, but performance was unchanged :disappointed: I don't know if btrfs does volumes or not.
All - I've opened a PR that improves zvol performance by using the kernel's blk-mq interface: #12664
However, there's still a lot of performance left on the table. The issue is that with O_DIRECT, the kernel gives you a `struct request` with a single, big BIO:
request
\- BIO 1 (512KB)
In the non-O_DIRECT case, it gives you a request with lots of little individual BIOs:
request
\_ BIO 1 (4K)
\_ BIO 2 (4K)
\_ BIO 3 (4K)
...
\_ BIO 30 (4K)
The zvol driver really wants big BIOs.
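If you want to see this for yourself, here's an illustrative-only helper (not from the PR) that walks the BIOs the driver receives in a request; under O_DIRECT you'd see one big BIO, and in the buffered case a pile of 4K ones:

```c
#include <linux/blkdev.h>

static void
zvol_dump_request(struct request *rq)
{
	struct bio *bio;
	unsigned int nbios = 0;

	/* Walk every BIO that blk-mq merged into this request. */
	__rq_for_each_bio(bio, rq)
		pr_info("bio %u: %u bytes\n", nbios++, bio->bi_iter.bi_size);

	pr_info("request total: %u bios, %u bytes\n", nbios, blk_rq_bytes(rq));
}
```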
Each BIO has to do a `dbuf_find()`. Each `dbuf_find()` hashes the BIO offset/size to a dbuf and locks the dbuf (along with temporarily locking the hash bucket). So if you have a bunch of small writes, they're all going to hash to the same dbuf, causing tons of lock contention. You can see how much time `dbuf_find()` takes up in `zvol_write()` by just looking at the flamegraph for the non-O_DIRECT case above.
There are a couple of ways to fix this. One way is to treat the BIOs in the request as a group rather than processing them individually. That way we would only have to acquire one lock for a group of contiguous small BIOs (probably at a higher level of the zvol stack). The other option is to try to make a new Frankenstein BIO that merges all the smaller BIOs. I'm planning on looking into this further and coming up with a fix once #12664 is merged.
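As a rough sketch of the first option (treat the request as a group), the idea is to use the merged extent that blk-mq already hands us instead of going BIO-by-BIO. This is hand-waving, not real ZFS code; `dmu_write_by_request()` is a hypothetical helper:

```c
static void
zvol_write_request_sketch(zvol_state_t *zv, struct request *rq)
{
	uint64_t off = blk_rq_pos(rq) << 9;	/* starting sector -> bytes */
	uint64_t len = blk_rq_bytes(rq);	/* spans all merged BIOs */

	/*
	 * One DMU write covering the whole request touches each dbuf
	 * once, instead of doing a dbuf_find() + lock per 4K BIO.
	 */
	dmu_write_by_request(zv, off, len, rq);	/* hypothetical */
}
```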
I'm also changing the title of this issue since the problem isn't dRAID specific.
> $ cat /sys/class/block/zd0/queue/scheduler
> none
>
> ... contrast that to:
>
> $ cat /sys/class/block/sda/queue/scheduler
> noop [deadline] cfq
>
> I'm looking into ways to make this more efficient.
In the past ZVOLs exposed all available IO schedulers, but with a subsequent code refactoring only `none` was left.
For a similar discussion, you can read here. In short:
On ZFS >= 0.6.5, the zvol code was changed to skip some of the previous Linux "canned" block layer code, simplifying the I/O stack and bypassing the I/O scheduler entirely (side note: in recent Linux kernels, none is not a noop alias anymore; rather, it really means no scheduler is in use. I also tried setting nomerges to 0, with no changes in I/O speed or behavior). This increased performance for the common case (zvol with direct I/O), but prevented any merging in the pagecache.
Many users missed the old behavior, I should say.
> Many users missed the old behavior, I should say.
@shodanshok my blk-mq PR #12664 actually restores the `scheduler` sysfs options, although they're a little different (`[mq-deadline] kyber bfq none`, I believe). When I tested with them, they all pretty much performed the same, except one that performed substantially worse (it was either kyber or bfq - I can't remember). None of that is surprising, as those schedulers are just going to re-jigger the BIOs contained within the `struct request`, rather than condense them into larger BIOs (which is what we want).
@tonyhutter - IIRC, there was write-merging code prior to the gutting during 0.6(.4?). If we have Linux scheduling back, does that mean we have Linux write-merging back, and then does that also mean we're doing redundant work on the Linux-half and the DMU/ZIO pipeline-half again (de-duplication of which was at least part of the intent of said gutting)?
@sempervictus when it comes to "write-merging", the devil is in the details.
Under blk-mq, the kernel will "write-merge" multiple BIOs into one `struct request`. This write merging doesn't necessarily help us, since we end up passing individual BIOs to the zvol layer and not the `struct request`, but it doesn't hurt us either ("mq-deadline" and "none" perform the same). Ideally we would want the kernel to merge multiple small BIOs into big BIOs, but that's not what it does. We could alter the zvol code to operate on `struct request`s instead of individual BIOs (see https://github.com/openzfs/zfs/issues/12483#issuecomment-948859202), which could drastically reduce our dbuf hash locking and increase performance. If we did that, then we could also benefit from the kernel's block scheduler, since it could be smarter about aggregating more BIOs into a single `struct request`.
It's possible that on earlier, zfs-0.6.x-era kernels, the kernel merged BIOs into bigger BIOs, but I don't know if that's actually true or not.
@tonyhutter: thanks for the clarification. It sounds like the dbuf hash locking bit is part of the problem with ZVOLs getting slower over time, so if we could have that change in this scope, it might help push out an otherwise ancient and evil problem with "filled" ZVOLs.
@sempervictus yea, I plan to look into the dbuf hash locking as a follow-on PR once I get this and some other PRs checked in.
Thank you sir, looking forward to it. If there are no planned follow-on commits to this branch, I'll get it into our testing for QA.
System information
Describe the problem you're observing
One of our users was testing zvols on dRAID with a large number of disks, and reported much better performance when using O_DIRECT than without it (using multiple parallel `dd` writes). That seemed like an odd result, but my tests confirmed it as well. I was seeing around 1.7 GByte/s without O_DIRECT vs 2.9 GByte/s with O_DIRECT (on a 46-disk draid2:8d:46c:0s-0 pool, 12 Gbit SAS enclosure). The user reported seeing possible lock contention in the non-O_DIRECT case:
Describe how to reproduce the problem
Please adjust to your setup:
Include any warning/errors/backtraces from the system logs
Results: