tonyhutter opened this issue 3 years ago
I did three different flamegraph runs and saw roughly the same results. Here's one of the runs:
(Note: for some reason these SVGs aren't interactive when I click on the links, but if I download them and run `firefox odirect3.svg` it works.)
One thing that stood out was `zvol_write()`. On non-O_DIRECT, `zvol_write()` takes ~50% of the CPU, but on O_DIRECT it takes about 20%. If you click on `zvol_write()`, you'll see that on non-O_DIRECT most of the time is spent in `dbuf_find()`, whereas on O_DIRECT `dbuf_find()` is only a tiny fraction of the time. I'm continuing to investigate.
Not sure if this helps, but below are some of /proc/spl/kstat/zfs/dbufstats
. Rows with zeros are omitted. The numbers were relatively consistent across three runs.
Stat | non-O_DIRECT | O_DIRECT | non-O_DIRECT ÷ O_DIRECT |
---|---|---|---|
cache_count | 18 | 9 | 2.00 |
cache_size_bytes | 2129920 | 720896 | 2.95 |
cache_size_bytes_max | 1874608128 | 2152464384 | 0.87 |
cache_target_bytes | 1873661094 | 2105699456 | 0.89 |
cache_lowater_bytes | 1686294985 | 1895129511 | 0.89 |
cache_hiwater_bytes | 2061027203 | 2316269401 | 0.89 |
cache_total_evicts | 100634 | 100706 | 1.00 |
cache_level_0 | 18 | 9 | 2.00 |
cache_level_0_bytes | 2129920 | 720896 | 2.95 |
hash_hits | 27092688 | 324774 | 83.42 |
hash_misses | 196310 | 273225 | 0.72 |
hash_collisions | 30046 | 5270 | 5.70 |
hash_elements | 63 | 43 | 1.47 |
hash_elements_max | 3617 | 5628 | 0.64 |
hash_chains | 0 | 0 | 0.00 |
hash_chain_max | 5 | 2 | 2.50 |
hash_insert_race | 693056 | 13503 | 51.33 |
metadata_cache_count | 13 | 13 | 1.00 |
metadata_cache_size_bytes | 209408 | 208896 | 1.00 |
metadata_cache_size_bytes_max | 209408 | 208896 | 1.00 |
metadata_cache_overflow | 0 | 0 | 0.00 |
I think I know what's going on here.
TL;DR: O_DIRECT bypasses the kernel page cache, giving better performance on zvols. Use O_DIRECT in your application, or use the `nocache` command (https://manpages.debian.org/jessie/nocache/nocache.1.en.html) to get around it. Just make sure you don't do any non-512B (or 4K?) aligned accesses, though.
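For what it's worth, here's a minimal sketch of the kind of aligned O_DIRECT write I mean (illustrative only - the device path is made up, and whether 512B or 4K alignment is required depends on the zvol's logical block size):

```c
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	const size_t len = 1 << 20;             /* 1 MiB, a multiple of 4K */
	void *buf;

	/* O_DIRECT wants the buffer, offset, and length block-aligned. */
	if (posix_memalign(&buf, 4096, len) != 0)
		return (1);

	int fd = open("/dev/zvol/mypool/myvol", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return (1);

	ssize_t w = pwrite(fd, buf, len, 0);    /* aligned offset and length */

	close(fd);
	free(buf);
	return (w == (ssize_t)len ? 0 : 1);
}
```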
Long answer:
I was noticing `kworker` and `zvol` threads fighting it out in `top` while running my non-O_DIRECT `dd` tests. I ran `echo l > /proc/sysrq-trigger` to see what was going on, and would see traces like this:
[178916.746423] [<ffffffffc0bbbb49>] zvol_request+0x269/0x370 [zfs]
[178916.753230] [<ffffffff95d70727>] generic_make_request+0x177/0x3b0
[178916.760226] [<ffffffff95d709d0>] submit_bio+0x70/0x150
[178916.766149] [<ffffffff95ca0665>] ? bio_alloc_bioset+0x115/0x310
[178916.772953] [<ffffffff95c9bdd7>] _submit_bh+0x127/0x160
[178916.778985] [<ffffffff95c9c092>] __block_write_full_page+0x172/0x3b0
[178916.786275] [<ffffffff95ca2270>] ? set_init_blocksize+0x90/0x90
[178916.793081] [<ffffffff95ca2270>] ? set_init_blocksize+0x90/0x90
[178916.799887] [<ffffffff95c9c4e8>] block_write_full_page+0xf8/0x120
[178916.806886] [<ffffffff95ca29f8>] blkdev_writepage+0x18/0x20
[178916.813304] [<ffffffff95bd4f79>] __writepage+0x19/0x50
[178916.819237] [<ffffffff95bd5d2a>] write_cache_pages+0x24a/0x4b0
[178916.825944] [<ffffffff95bd4f60>] ? global_dirtyable_memory+0x70/0x70
[178916.833236] [<ffffffff95bd5fdd>] generic_writepages+0x4d/0x80
[178916.839848] [<ffffffff95ca29be>] blkdev_writepages+0xe/0x10
[178916.846267] [<ffffffff95bd6e21>] do_writepages+0x21/0x50
When I ran the same tests with O_DIRECT, I would see `kworker` running much less. I also didn't see any write_cache_pages() calls in the traces. That told me the kernel was correctly bypassing the page cache with O_DIRECT, which is exactly the intent of the flag.
Just to be sure it wasn't ZFS behavior, I also explicitly cleared O_DIRECT on any zvol write:
diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 8229bc9..e3c2e44 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -260,6 +260,9 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr)
 	while (n > 0) {
 		ssize_t nbytes = MIN(n, zfs_vnops_read_chunk_size -
 		    P2PHASE(zfs_uio_offset(uio), zfs_vnops_read_chunk_size));
+
+		ioflag = ioflag & ~O_DIRECT;
+
 #ifdef UIO_NOCOPY
 		if (zfs_uio_segflg(uio) == UIO_NOCOPY)
 			error = mappedread_sf(zp, nbytes, uio);
... but got the same performance results I saw earlier.
Note that ZFS will always use the ARC in both the non-O_DIRECT and O_DIRECT cases (although this may change with https://github.com/openzfs/zfs/pull/10018). Mounting zfs with `-o dax` has also been discussed as a possible workaround to get a sort of "mount-level O_DIRECT" (https://github.com/openzfs/zfs/issues/9986).
Closing issue.
I had an interesting conversation with @behlendorf about this issue. He thought that the non-O_DIRECT case might be submitting tons of tiny, one-page BIO requests to the zvol, whereas the O_DIRECT case would pass the larger, uncached BIO directly. He further explained that the ZFS module takes all BIOs and sticks them on a single task queue, where they are then processed by a pool of worker threads. He speculated that the single task queue may be getting overwhelmed.
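To make the "single task queue" point concrete, here's a rough sketch of the dispatch pattern @behlendorf described (illustrative only, not the actual zvol_request() code; `zvol_tq` and the helper names are made up):

```c
/* One module-wide taskq services every zvol BIO. */
static taskq_t *zvol_tq;

static void
zvol_do_one_bio(void *arg)
{
	struct bio *bio = arg;

	/* ...translate this one BIO into DMU reads/writes... */
	bio_endio(bio);
}

static void
zvol_submit_bio_sketch(struct bio *bio)
{
	/*
	 * One dispatch per BIO, no matter how small it is. With
	 * millions of 4K BIOs, the shared queue becomes the bottleneck.
	 */
	(void) taskq_dispatch(zvol_tq, zvol_do_one_bio, bio, TQ_SLEEP);
}
```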
I re-ran my earlier tests and confirmed this was the case. My test created 10 volumes and then did parallel `dd` writes to them with `bs=1GB`. In the non-O_DIRECT case, zvol_request() would get ~25 million BIOs, each one a single 4K page. In the O_DIRECT case, I would see ~200K BIOs of 512K each.
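(Rough sanity check on those numbers, my arithmetic: ~25 million × 4 KiB ≈ 100 GB and ~200K × 512 KiB ≈ 100 GB, so both cases wrote roughly the same amount of data, just chopped into very differently sized BIOs.)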
I thought I could just switch out the IO scheduler to aggregate the BIOs, but no other schedulers were listed as options for the zvol:
$ cat /sys/class/block/zd0/queue/scheduler
none
... contrast that to:
$ cat /sys/class/block/sda/queue/scheduler
noop [deadline] cfq
I'm looking into ways to make this more efficient.
That might be on me - in earlier attempts to fix the ZVOL issue, I had set NOMERGES on it:
commit 5731140eaf4aaf2526a8bfdbfe250195842e79eb
Author: RageLtMan <rageltman [at] sempervictus>
Date: Sat Mar 18 00:51:36 2017 -0400
Disable write merging on ZVOLs
The current ZVOL implementation does not explicitly set merge
options on ZVOL device queues, which results in the default merge
behavior.
Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the
ZIO pipeline to do its work.
Initial benchmarks (tiotest with no O_DIRECT) show random write
performance going up almost 3X on 8K ZVOLs, even after significant
rewrites of the logical space allocation.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Issue #5902
diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index 7590ed3160..d0f7b9912b 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -1468,6 +1468,9 @@ zvol_alloc(dev_t dev, const char *name)
 	blk_queue_make_request(zv->zv_queue, zvol_request);
 	blk_queue_set_write_cache(zv->zv_queue, B_TRUE, B_TRUE);
 
+	/* Disable write merging in favor of the ZIO pipeline. */
+	queue_flag_set(QUEUE_FLAG_NOMERGES, zv->zv_queue);
+
 	zv->zv_disk = alloc_disk(ZVOL_MINORS);
 	if (zv->zv_disk == NULL)
 		goto out_queue;
Doesn't BTRFS produce a volume type of some sort? How are they handling all of this?
@sempervictus yea I tried commenting out those lines, but performance was unchanged :disappointed: I don't know if btrfs does volumes or not.
All - I've opened a PR that improves zvol performance by using the kernel's blk-mq interface: #12664
However, there's still a lot of performance left on the table. The issue is that with O_DIRECT, the kernel gives you a `struct request` with a single, big BIO:
request
\- BIO 1 (512KB)
In the non-O_DIRECT case, it gives you a request with lots of little individual BIOs:
request
\_ BIO 1 (4K)
\_ BIO 2 (4K)
\_ BIO 3 (4K)
...
\_ BIO 30 (4K)
The zvol driver really wants big BIOs.
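If you want to see this for yourself, here's an illustrative-only helper (not from the PR) that walks the BIOs the driver receives in a request; under O_DIRECT you'd see one big BIO, and in the buffered case a pile of 4K ones:

```c
#include <linux/blkdev.h>

static void
zvol_dump_request(struct request *rq)
{
	struct bio *bio;
	unsigned int nbios = 0;

	/* Walk every BIO that blk-mq merged into this request. */
	__rq_for_each_bio(bio, rq)
		pr_info("bio %u: %u bytes\n", nbios++, bio->bi_iter.bi_size);

	pr_info("request total: %u bios, %u bytes\n", nbios, blk_rq_bytes(rq));
}
```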
Each BIO has to do a `dbuf_find()`. Each `dbuf_find()` hashes the BIO offset/size to a dbuf and locks the dbuf (along with temporarily locking the hash bucket). So if you have a bunch of small writes, they're all going to hash to the same dbuf, causing tons of lock contention. You can see how much time `dbuf_find()` takes up in `zvol_write()` by just looking at the flamegraph for the non-O_DIRECT case above.
There are a couple of ways to fix this. One way is to treat the BIOs in the request as a group rather than processing them individually. That way we would only have to acquire one lock for a group of contiguous small BIOs (probably at a higher level of the zvol stack). The other option is to try to make a new Frankenstein BIO that merges all the smaller BIOs. I'm planning on looking into this further and coming up with a fix once #12664 is merged.
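As a rough sketch of the first option (treat the request as a group), the idea is to use the merged extent that blk-mq already hands us instead of going BIO-by-BIO. This is hand-waving, not real ZFS code; `dmu_write_by_request()` is a hypothetical helper:

```c
static void
zvol_write_request_sketch(zvol_state_t *zv, struct request *rq)
{
	uint64_t off = blk_rq_pos(rq) << 9;	/* starting sector -> bytes */
	uint64_t len = blk_rq_bytes(rq);	/* spans all merged BIOs */

	/*
	 * One DMU write covering the whole request touches each dbuf
	 * once, instead of doing a dbuf_find() + lock per 4K BIO.
	 */
	dmu_write_by_request(zv, off, len, rq);	/* hypothetical */
}
```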
I'm also changing the title of this issue since the problem isn't dRAID specific.
> $ cat /sys/class/block/zd0/queue/scheduler
> none
>
> ... contrast that to:
>
> $ cat /sys/class/block/sda/queue/scheduler
> noop [deadline] cfq
>
> I'm looking into ways to make this more efficient.
In the past ZVOLs exposed all available IO schedulers, but with a subsequent code refactoring only `none` was left.
For a similar discussion, you can read here. In short:
On ZFS >= 0.6.5, the zvol code was changed to skip some of the previous Linux "canned" block layer code, simplifying the I/O stack and bypassing the I/O scheduler entirely (side note: in recent Linux kernels, none is not a noop alias anymore; rather, it really means no scheduler is in use. I also tried setting nomerges to 0, with no changes in I/O speed or behavior). This increased performance for the common case (zvol with direct I/O), but prevented any merging in the pagecache.
Many users missed the old behavior, I should say.
> Many users missed the old behavior, I should say.
@shodanshok my blk-mq PR #12664 actually restores the `scheduler` sysfs options, although they're a little different (`[mq-deadline] kyber bfq none`, I believe). When I tested with them, they all pretty much performed the same, except one that performed substantially worse (it was either kyber or bfq - I can't remember). None of that is surprising, as those schedulers are just going to re-jigger the BIOs contained within the `struct request`, rather than condense them into larger BIOs (which is what we want).
@tonyhutter - IIRC, there was write-merging code prior to the gutting during 0.6(.4?). If we have Linux scheduling back, does that mean we have Linux write-merging back, and then does that also mean we're doing redundant work on the Linux-half and the DMU/ZIO pipeline-half again (de-duplication of which was at least part of the intent of said gutting)?
@sempervictus when it comes to "write-merging", the devil is in the details.
Under blk-mq, the kernel will "write-merge" multiple BIOs into one `struct request`. This write merging doesn't necessarily help us, since we end up passing individual BIOs to the zvol layer and not the `struct request`, but it doesn't hurt us either ("mq-deadline" and "none" perform the same). Ideally we would want the kernel to merge multiple small BIOs into big BIOs, but that's not what it does. We could alter the zvol code to operate on `struct request`s instead of individual BIOs (see https://github.com/openzfs/zfs/issues/12483#issuecomment-948859202), which could drastically reduce our dbuf hash locking and increase performance. If we did that, then we could also benefit from the kernel's block scheduler, since it could be smarter about aggregating more BIOs into a single `struct request`.
It's possible that on earlier, zfs-0.6.x-era kernels, the kernel merged BIOs into bigger BIOs, but I don't know if that's actually true or not.
@tonyhutter: thanks for the clarification. It sounds like the dbuf hash locking bit is part of the problem with ZVOLs getting slower over time, so if we could have that change in this scope, it might help push out an otherwise ancient and evil problem with "filled" ZVOLs.
@sempervictus yea, I plan to look into the dbuf hash locking as a follow-on PR once I get this and some other PRs checked in.
Thank you sir, looking forward to it. If there are no planned follow-on commits to this branch, I'll get it into our testing for QA.
System information
Describe the problem you're observing
One of our users was testing zvols on dRAID with a large number of disks, and reported much better performance when using O_DIRECT than without it (using multiple parallel `dd` writes). That seemed like an odd result, but my tests confirmed it as well. I was seeing around 1.7 GByte/s without O_DIRECT vs 2.9 GByte/s with O_DIRECT (on a 46-disk draid2:8d:46c:0s-0 pool, 12 Gbit SAS enclosure). The user reported seeing possible lock contention in the non-O_DIRECT case:
Describe how to reproduce the problem
Please adjust to your setup:
Include any warning/errors/backtraces from the system logs
Results: