openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZVOL write IO merging not sufficient #8472

Open samuelxhu opened 5 years ago

samuelxhu commented 5 years ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | ZFS on Linux |
| Distribution Version | CentOS 7 |
| Linux Kernel | 3.10 |
| Architecture | x86 |
| ZFS Version | 0.6.5.X, 0.7.X, 0.8.X |
| SPL Version | 0.6.5.X, 0.7.X, 0.8.X |

Describe the problem you're observing

Before 0.6.5.X (e.g. 0.6.3-1.3 or 0.6.4.2), ZoL exposed ZVOLs through the standard Linux block device layer, so one could pick an IO scheduler (deadline or another) to merge incoming IO requests. Even with the simplest noop scheduler, contiguous IO requests were still merged if they were sequential.
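For reference, a minimal sketch of how a scheduler could be selected on a ZVOL block device in those older releases (the device node `zd0` is just an example; on 0.6.5 and later the ZVOL queue no longer honors this):

```sh
# List the schedulers offered by the ZVOL's request queue (zd0 is an example device).
cat /sys/block/zd0/queue/scheduler

# Pre-0.6.5 the ZVOL behaved like a normal request-based block device, so
# deadline (or noop) could be selected to sort and merge incoming requests.
echo deadline > /sys/block/zd0/queue/scheduler
```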

Things changed from 0.6.5.X on: @ryao rewrote the ZVOL block layer and disabled request merging at the ZVOL level, on the grounds that the DMU already does IO merging. However, it seems that DMU IO merging either does not work properly or is not sufficient from a performance point of view.

The problem is as follows. A ZVOL has a volblocksize setting, and in many cases (e.g. for hosting VMs) it is set to 32KB or so. When IO requests are smaller than the volblocksize, read-modify-writes (RMW) occur, degrading performance. A scheduler such as deadline is capable of sorting and merging IO requests, thus reducing the chance of RMW.

Describe how to reproduce the problem

Create a not-so-big ZVOL with a volblocksize of 32KB and use FIO to issue a single sequential write workload with a 4KB block size. After a while (once the ZVOL is filled with some data), either "iostat -mx 1 10" or "zpool iostat 1 10" will show a lot of read-modify-writes. Note that at the beginning of the run there will be little or no RMW, because the ZVOL is almost empty and ZFS can intelligently skip reading zeros.

In contrast, if FIO issues a sequential write workload with a block size of 32KB, 64KB, or larger, there is no RMW no matter how long the workload runs.
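A minimal sketch of this reproduction, assuming a pool named `tank` and a ZVOL named `tank/vol` (names and sizes are only examples):

```sh
# Create a 10 GiB ZVOL with a 32 KiB volblocksize (pool/volume names are examples).
zfs create -V 10G -o volblocksize=32k tank/vol

# Sequential 4 KiB writes: once the ZVOL holds data, every sub-volblocksize
# write forces ZFS to read the surrounding 32 KiB record before rewriting it.
fio --name=seqwrite-4k --filename=/dev/zvol/tank/vol \
    --rw=write --bs=4k --ioengine=psync --numjobs=1 \
    --runtime=300 --time_based

# Watch for reads appearing during the pure-write workload (the RMW symptom).
iostat -mx 1 10
zpool iostat tank 1 10

# Repeat with --bs=32k (or larger): no reads should be issued.
```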

Apparently the IO merging logic for ZVOLs is not working properly. Either re-enabling the block device scheduler choice (deadline or noop) or fixing the broken IO merging logic in the DMU should resolve this performance issue.

Include any warning/errors/backtraces from the system logs

DemiMarie commented 2 years ago

Would it be possible to use the kernel’s write IO merging layer?

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

DemiMarie commented 1 year ago

Still an issue

tonyhutter commented 1 year ago

Long story short:

  1. Use O_DIRECT if possible, or
  2. Try enabling block multiqueue: https://github.com/openzfs/zfs/pull/13148

See my "zvol performance" slides: https://drive.google.com/file/d/1smEcZULZ6ni6XHkbubmeuI1vZj37naED/view?usp=sharing

sempervictus commented 1 year ago

@tonyhutter - thanks for the slide deck. How do those graphs look after a ZVOL has taken enough writes to have allocated all blocks at least once before freeing? The performance degradation over use is a very unpleasant effect in real world scenarios which benchmarks don't capture.

MichaelHierweck commented 1 year ago

Will #13148 be merged into 2.2?

tonyhutter commented 1 year ago

@sempervictus unfortunately I didn't test that. All tests were on relatively new zvols.

@MichaelHierweck that's the plan, but I'm still undecided about whether blk-mq should be enabled by default. It's not a clear performance winner in all use cases. Currently blk-mq is not enabled by default in the master branch.

shodanshok commented 1 year ago

@tonyhutter thanks for the slides.

However, I am not sure that using O_DIRECT would be enough to avoid all performance degradation, such as the one described by @sempervictus and detailed here: https://github.com/openzfs/zfs/issues/8472#issuecomment-480737181 (where I am not using O_DIRECT but, by writing to a plain ZFS file, I still avoid the Linux pagecache).

In these cases the real performance killer is the r/m/w caused by reading the to-be-overwritten block. I think the solution is the (non-merged) async DMU patch, or am I missing something?

sempervictus commented 1 year ago

The async-DMU patch isn't just unmerged, it's rather unfinished. Getting it across the finish line likely requires a fair chunk of dedicated time from one of the senior folks familiar with a wide swath of the code, or a lot of time from a skilled C developer who would need to figure out all of the structures, functions, and locking semantics that effort touches.

DemiMarie commented 1 year ago

How much time do you think it would take, @sempervictus?

sempervictus commented 1 year ago

@DemiMarie - a professional C/C++ developer colleague took a cursory pass, and since he's not from ZFS-land, the complexity of handling async dispatch in C was compounded by the complexity of ZFS itself, which anyone from outside the core team would need to learn along the way. He found a few bugs in the async pieces just on that first pass - he's quite good :smile:, but that likely indicates we're not at first down with 20 yards to go... We figured at least several weeks of professional effort (not cheap), but had no way to get an upper bound on that (the consideration at the time was to hire him to do this, but folks like that aren't exactly dripping with spare time). So I can't answer that question to anyone's satisfaction for budgeting/hiring/etc. The level of effort to get into that PR is above the amount of free time I have, or will have in the next year, given we just spun up a totally unrelated development venture; the level of effort to take it to completion is unknown. I don't have tens of thousands of dollars lying around right now to put toward such work, so it's on the back burner of my list until I can either hire/delegate a hired resource, contract one, or have the time and headspace to spend on it myself.

sempervictus commented 1 year ago

I think the best chance we have at assessing the level of effort for the async DMU piece is to ask one of the heavy-hitting OpenZFS members to slot a review aimed at ascertaining the requirements into their workflow over the next X months. I've fallen off somewhat in my awareness of the goings-on around here in the last year or so (not that I don't love y'all, just that infosec is somewhat of an all-consuming function these days); but off the top of my head, @ryao, @behlendorf, or @tonyhutter have the breadth of code awareness required to be able to partition and delegate efforts for such a review. Personally, I wouldn't be surprised if (given appropriate effort) this becomes the new "ABD patch set" during R&D :smile:

bghira commented 1 year ago

@tuxoko or @dweeezil or @pcd1193182 might be able to take a look? (sorry for volunteering you :) )

DemiMarie commented 1 year ago

I wonder if stackless coroutines could help. Those can be implemented in C with some macro abuse. If this were part of the Linux kernel I would suggest using Rust, which supports async programming natively.

sempervictus commented 1 year ago

I mostly live in Rust these days, but until the GCC front-end is done, I'm not a big fan of mixing ABIs like that.

shodanshok commented 1 year ago

@DemiMarie the main point of the async DMU patch should be to avoid the synchronous read of the to-be-overwritten records, rather than issuing an async callback. In other words, I understand it as a means of deferring some writes to avoid useless reads. Or am I wrong?

DemiMarie commented 1 year ago

> @DemiMarie the main point of the async DMU patch should be to avoid the synchronous read of the to-be-overwritten records, rather than issuing an async callback. In other words, I understand it as a means of deferring some writes to avoid useless reads. Or am I wrong?

I honestly am not sure what the async DMU patch actually does, but async C code is a known pain-point in general.