openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS big write performance hit upgrading from 2.1.4 to 2.1.5 or 2.1.6 #14009

Open ppwaskie opened 2 years ago

ppwaskie commented 2 years ago

System information

Type | Version/Name
Distribution Name | Gentoo
Distribution Version | Rolling
Kernel Version | 5.19.14-gentoo-x86_64, 5.15.72-gentoo-x86_64
Architecture | x86_64
OpenZFS Version | 2.1.6 or 2.1.5

Describe the problem you're observing

I've been running ZFS 2.1.4 for quite some time on my main ZFS array, a RAIDz3 pool with a very large dataset (85TB online). On Gentoo, I can only run a 5.15.x or older kernel with this ZFS version; moving to a 5.18 or 5.19 kernel means upgrading to ZFS 2.1.6 so it will build against the newer kernel. When I do that, write performance drops from 100-150 MB/sec on 5.15 with ZFS 2.1.4 (tested with emerge -a =sys-kernel/gentoo-sources-5.10.144) to about 100 kB/sec on 5.19.14 with ZFS 2.1.6.

I've tried ZFS 2.1.5 and 2.1.6 with a 5.15.72 kernel, and had the exact same performance regression.

The big issue is that ZFS 2.1.4 has now been removed from the main Gentoo tree after an emerge --sync, so I can't revert my installed 2.1.6.

Describe how to reproduce the problem

Upgrade an existing host to ZFS 2.1.5 or 2.1.6, write a large package made up of many small files (e.g. a Linux kernel source tree), and observe write performance drop by roughly three orders of magnitude (here, from 100-150 MB/sec to about 100 kB/sec).
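
A rough, repeatable way to exercise the same small-file write pattern (a sketch; the dataset mountpoint and tarball path are just examples) is to time the unpack of a kernel source tree, which writes tens of thousands of small files:

# cd /tank/mydataset
# time tar xf /tmp/linux-5.19.tar.xz

On the affected versions this should run dramatically slower than on 2.1.4.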

Include any warning/errors/backtraces from the system logs

I see nothing indicating anything is going wrong. Nothing in dmesg, nothing in syslogs, and zpool status is clean.

Rebooting into a 5.15 kernel with ZFS 2.1.4 on the exact same array returns the expected performance.

ryao commented 2 years ago

Would you try ZFS master via the 9999 ebuild and see if the issue is present there too?

As long as you do not run a zpool upgrade $pool command, it should be safe to go to ZFS master and then back to 2.1.4.
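
For reference, a minimal sketch of pulling in the live (9999) ebuilds on Gentoo, assuming the usual /etc/portage layout (the file name and exact atoms may differ on your system):

# echo 'sys-fs/zfs **'      >> /etc/portage/package.accept_keywords/zfs
# echo 'sys-fs/zfs-kmod **' >> /etc/portage/package.accept_keywords/zfs
# emerge -av1 =sys-fs/zfs-kmod-9999 =sys-fs/zfs-9999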

satarsa commented 2 years ago

The big issue is that ZFS 2.1.4 has now been removed from the main Gentoo tree after an emerge --sync, so I can't revert my installed 2.1.6.

Actually, you can. You could clone the official Gentoo repo from https://gitweb.gentoo.org/repo/gentoo.git/ as your local repo, check it out at a commit where zfs-kmod-2.1.4-r1 had not yet been dropped (I believe that would be 33344d7dd6b44bd93c17485d77d60c0e25ef71ee), and locally mask everything >=2.1.5.
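
A rough sketch of that approach (the clone URL is the one mentioned above; the local repo path and mask file name are just examples):

# git clone https://gitweb.gentoo.org/repo/gentoo.git/ /var/db/repos/gentoo-pinned
# git -C /var/db/repos/gentoo-pinned checkout 33344d7dd6b44bd93c17485d77d60c0e25ef71ee
# echo '>=sys-fs/zfs-2.1.5'      >> /etc/portage/package.mask/zfs
# echo '>=sys-fs/zfs-kmod-2.1.5' >> /etc/portage/package.mask/zfs

Then point Portage at the pinned checkout via an entry in /etc/portage/repos.conf and re-emerge zfs/zfs-kmod from it.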

ppwaskie commented 2 years ago

@satarsa thanks for that. And @ryao I connected with one of the Gentoo maintainers for ZFS offline, and he provided me with some instructions on how to use the 9999 ebuild along with bisecting between 2.1.4 and 2.1.5. I’m happy to try and find the commit where the perf regression showed up, at least for my ZFS setup.
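
For anyone following along, the bisect itself would look roughly like this (a sketch; the build-and-install step depends on how you deploy the modules, e.g. via the 9999 ebuild pointed at a local checkout):

# git clone https://github.com/openzfs/zfs.git && cd zfs
# git bisect start
# git bisect bad zfs-2.1.5       # first release tag showing the slowdown
# git bisect good zfs-2.1.4      # last known-good release tag
# build/install the checked-out tree, re-run the write test, then mark it:
# git bisect good    # or: git bisect bad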

I honestly didn't expect this to get so much activity so soon after I opened the issue! I'm currently not at home where this server is, but I'll try to run some of these bisect ops while I'm away this week. Worst case, I can get this nailed down this coming weekend.

All of the support is greatly appreciated!!

ryao commented 2 years ago

I did not expect you to bisect it, but if you do, that would be awesome. I should be able to figure this out quickly if you identify the bad patch through a bisect.

scineram commented 2 years ago

@ryao From the release notes, only #13405 looks like it could really impact general performance.

ppwaskie commented 2 years ago

I haven't started bisecting yet, but here is more info on the system/setup where I'm seeing this issue:

I do have many cores in the system. In that RAIDz3 pool I have many datasets carved out, with about 31TB used in total. Most of it is video streaming content for Plex, so not lots of tiny files.

I hope to have more info once I can coordinate with home and bisect on the live system.

ppwaskie commented 2 years ago

Apologies for the delay on this. I was finally able to get some time on the box and bisect this.

This is the offending commit that is killing write performance on my system:

9f6943504aec36f897f814fb7ae5987425436b11 is the first bad commit
commit 9f6943504aec36f897f814fb7ae5987425436b11
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue Nov 30 10:38:09 2021 -0800

    Default to zfs_dmu_offset_next_sync=1

    Strict hole reporting was previously disabled by default as a
    performance optimization.  However, this has lead to confusion
    over the expected behavior and a variety of workarounds being
    adopted by consumers of ZFS.  Change the default behavior to
    always report holes and force the TXG sync.

    Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Upstream-commit: 05b3eb6d232009db247882a39d518e7282630753
    Ref: #13261
    Closes #12746

 man/man4/zfs.4   |  8 ++++----
 module/zfs/dmu.c | 12 ++++++++----
 2 files changed, 12 insertions(+), 8 deletions(-)

I've taken this a step further: while still running the build that contains this patch, I turned off that tunable:

# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

I then re-tested immediately afterward, and the issue went away: write performance went from about 100 kB/sec back up to 150 MB/sec (roughly three orders of magnitude).

UPDATE: I went ahead and built the 2.1.6 ebuilds, and confirmed I still had this issue. I then turned off the same tunable, and the performance issue went away.
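
If the workaround needs to survive reboots, the usual approach would be a module option instead of the runtime echo; a sketch (the file name under /etc/modprobe.d is arbitrary):

# /etc/modprobe.d/zfs.conf
options zfs zfs_dmu_offset_next_sync=0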

Hope this helps inform how to deal with this upstream.

ryao commented 2 years ago

Nice find.

rincebrain commented 2 years ago

I should warn you: turning that off means files that are actually sparse will sometimes be treated as dense if their contents haven't synced out yet, IIRC, so if that's a use case you care about, you may be sad.

Of course, when you're handing data to ZFS with compression on, it will recover the sparseness one way or another; it's just a question of whether you unnecessarily copied some zeroes only to throw them away. So if this works for you, great. Just be aware that it adds I/O overhead if you come looking for performance bottlenecks again.
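
If you want to observe that behavior directly, one way (assuming xfs_io from xfsprogs is available; the paths are examples) is to ask where the first hole is reported on a freshly dirtied file:

# dd if=/dev/urandom of=/tank/ds/f bs=1M count=1     # 1 MiB of real data
# truncate -s 1G /tank/ds/f                          # extend the file; the tail is a hole
# xfs_io -c "seek -h 0" /tank/ds/f                   # SEEK_HOLE from offset 0

With zfs_dmu_offset_next_sync=1 the hole should be reported right after the written data; with it set to 0 and the file still dirty, the whole file may be reported as data until the next TXG sync.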

amotin commented 2 years ago

It doesn't seem great to allow a regular unprivileged user to force, or depend on, pool TXG commits. There should be a better solution.

amotin commented 2 years ago

I think that, at the very least, the code could be optimized not to consider committing a TXG at all when the file is below a certain size, especially when it is below one block, since such a file cannot contain holes unless it is one big hole. If I understood correctly and the workload is updating a Linux source tree, then I would guess that most or many of the source files fit within one block.

thesamesam commented 1 year ago

See also https://github.com/openzfs/zfs/issues/14512 and https://github.com/openzfs/zfs/issues/14594. https://github.com/openzfs/zfs/pull/13368 may or may not help.

stale[bot] commented 8 months ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.