Given that PVE ships the ZFS modules as part of the kernel, different module versions also mean different kernel versions, so the regression might in fact be coming from that end (different behaviour w.r.t. scheduling, CPU exploit mitigations like Downfall, etc.). To rule out any side-effects like that, you could compile the newer ZFS version for the older kernel and then compare the two.
I also think that the I/O scheduler or the kernel itself may have a great impact.
In the Proxmox 5.x kernels, the I/O scheduler was "none", whereas the 6.x PVE kernels now use "mq-deadline".
You can check if the kernel is the culprit quite easily by installing a 6.2 opt-in kernel on the old installation (`apt install pve-kernel-6.2`), or you can simply change the I/O scheduler and re-test on the new platform.
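A quick sketch of checking and switching the scheduler (device name is an example; adjust for your disks, and run as root):

```sh
# show the available schedulers; the bracketed entry is the active one
cat /sys/block/sda/queue/scheduler

# switch to "none" at runtime (not persistent across reboots)
echo none > /sys/block/sda/queue/scheduler
```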
I'm betting on #15245/#15223 here.
I'd go check `grep . /sys/module/icp/parameters/icp_*` for a missing `avx` option as a heuristic on each kernel rev to see if that's happening.
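On a host where AVX is usable by the ICP module, that grep should show an `avx` entry among the available GCM implementations; the output is roughly along these lines (the exact implementation lists vary by CPU and ZFS build, so treat this as illustrative):

```sh
grep . /sys/module/icp/parameters/icp_*
# /sys/module/icp/parameters/icp_aes_impl:cycle [fastest] generic x86_64 aesni
# /sys/module/icp/parameters/icp_gcm_impl:cycle [fastest] avx generic pclmulqdq
```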
Thanks for the responses all. I did try setting the scheduler to none but didn't see any major impacts.
rincebrain wrote:
> I'm betting on #15245/#15223 here.
> I'd go check `grep . /sys/module/icp/parameters/icp_*` for a missing `avx` option as a heuristic on each kernel rev to see if that's happening.
Thank you @rincebrain, very helpful. `icp_gcm_impl:cycle` was missing `avx` in kernel 6.2.16-10-pve.
Booting 6.2.16-1-pve resolved that, and some quick checks suggest it might have resolved the encryption performance issues. I'll do thorough testing next week to verify/disqualify. There might be some slog performance issues remaining; I'll do some more research and perhaps open a separate issue for that.
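For anyone who needs to stay on the known-good kernel in the meantime, recent Proxmox versions can pin a boot kernel; a sketch (verify the exact version string from the list first):

```sh
# list installed kernels, then pin the known-good one for future boots
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.2.16-1-pve
```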
Now that is strange, since the patch has been ported to 6.1.50 but not to any 6.2 kernels. If Proxmox had backported the patch themselves, I would expect it to be included in the newest 6.2 variant, not in an old one like 6.2.16-1?
I have opened a discussion on the Proxmox forum about this: https://forum.proxmox.com/threads/slow-zfs-encryption-will-we-get-a-fix-for-avx-avx2-not-being-selected.133681/
The thing someone mentioned at the end of that thread was cherry-picking the fix for what broke this: gregkh/linux@d8f9a9cfdcd31290cb8b720746458cb110301c68.
6.2 doesn't seem to normally have torvalds/linux@2c66ca3949dc701da7f4c9407f2140ae425683a5 or torvalds/linux@b81fac906a8f, so maybe something else broke it.
Edit: Looks like PVE pulls Ubuntu's kernels, and Ubuntu's 6.2.0-30.30 pulled in torvalds/linux@b81fac906a8f, but has still never pulled torvalds/linux@2c66ca3949dc701da7f4c9407f2140ae425683a5. So getting Ubuntu to pull in that patch might be the fastest way to get Proxmox to fix it.
I am unsure if they pull Ubuntu only at the start of a new PVE version. From the looks of it, I guess they add only specific patches later on for incremental versions: https://git.proxmox.com/?p=pve-kernel.git;a=summary
FWIW: 6.2.16-6-pve (which really is 6.2.16-5-pve) does fix the problem as well.
It looks like they just add patches on top and keep synced with Ubuntu's tree, based on their submodule of it not being frozen at an early version.
FWIW, the next version of Proxmox kernels (6.2.16-14) will contain the cherry-picked fix (already confirmed to fix the regression, but currently still in internal testing):
https://git.proxmox.com/?p=pve-kernel.git;a=commit;h=9ba0dde971e6153a12f94e9c7a7337355ab3d0ed
also already reported on the Ubuntu side, so should be fixed there at some point in the near future as well: https://bugs.launchpad.net/bugs/2034745
I'm closing this issue.
If anyone close to kernel development could raise an issue about regression tests for this kind of problem (why wasn't the issue detected automatically?), that would be neat. One would think automated testing and/or CI would catch something this small but with such a large impact.
Topic: Significant performance degradation/regression with aes-256-gcm between zfs 2.1-pve vs. 2.1.12-pve1
System information
Describe the problem you're observing
It would appear I have discovered a performance degradation or regression with OpenZFS datasets using aes-256-gcm encryption between zfs 2.1-pve and zfs 2.1.12-pve1.
In addition, it seems that my zpools with slog (Intel 900p) amplify the degradation 😑 which is really counter-intuitive.
I guess it makes sense to shout out to @behlendorf, @ryao, @tcaputi and @ahrens for their attention and triage.
@sempervictus maybe you'd like to take a look too?
Describe how to reproduce the problem
See the attached/included `fio` benchmarks and results comparing zfs 2.1-pve in 2022 vs. zfs 2.1.12-pve1 in 2023.
Include any warning/errors/backtraces from the system logs
I don't have any warnings/errors/backtraces to share at this time. The system and kernel logs appear to be clean.
Foreword / background
First and foremost, thank you to all the authors of, and contributors to, OpenZFS. It's a brilliant bit of software engineering that I use daily with great benefits. Serious data life quality improvements and gains!
To make this fairly large amount of content a little less laborious to consume, you might enjoy listening to Aurelios Instagram Reels Pack: https://on.soundcloud.com/zch5w.
This one time I added an slog device to a zpool...
So it's September 2023... I was adding an slog device to a SAS mirror pool and was verifying that the `sync=always` setting was working as expected, i.e. better sync write IO performance than without the slog. After adding the slog, performance dropped significantly when testing with `sync=always`. I was confused and went back to look at older benchmarks from 2022 on a SATA zpool, and this is when I discovered something was off.

I did some research and this issue covers the main findings. At first I thought something was off with my system, and maybe there still is something wrong. There might be something specific about Proxmox or some bad cfg somewhere, but I've not been able to put my finger on it.
I need more brains and feedback on this issue.
After removing the slog from the SAS zpool and testing `encryption=off` vs. `encryption=aes-256-gcm`, I was shocked to see the delta. Then, re-testing with the slog, I was really shocked!

FWIW, a little background on me: I've been using OpenZFS for some years (since ~2015) and have been studying ZFS performance topics in detail for a while. I'm not an expert but have some XP. I do try to take the time to ensure I'm not misreporting an issue because of my setup/system (i.e. something wrong on my end or something I've overlooked). By way of example, #14346, which I researched in 2022 and wrote up and published in Jan 2023. I also understand (and have experienced) most of what's going on with zvol performance issues as per #11407 and have contributed there too.
The system spec
This is my home-lab / data vault. I guess it would be classified as an entry level enterprise storage chassis, at least back at its date of manufacture (DOM) in 2017.
The slog device: Intel 900p PCIe 3.0 x4 card
TL;DR: outside of OpenZFS, the slog device is behaving as per the manufacturer's published specifications; the `fio` XFS baseline tests between 2022 and 2023 are nearly identical. This would suggest things outside of ZFS are OK on the system.

The first thing I'd like to share is a non-ZFS `fio` benchmark between the mentioned kernel versions; the left benchmark was performed July 2022 and the right Sep 2023. This illustrates that the Intel SSD Optane 900p 280GB PCIe 3.0 x4, NVMe (SSDPED1D280GA) is performing as per the manufacturer's published specifications, and that the underlying hardware and kernel are unlikely to be some kind of problem or bottleneck, at least for XFS!

The Intel 900p is my slog vdev. I typically create a 16GiB partition and then add the partition to a given zpool where I have a use case for higher performance `sync=always` workloads. For example:
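A sketch of that workflow, with hypothetical pool and device names (the by-id path and partition number are illustrative, not my exact setup):

```sh
# carve a 16GiB partition on the Optane, then attach it as a log vdev
sgdisk -n 1:0:+16G /dev/disk/by-id/nvme-INTEL_SSDPED1D280GA_XXXX
zpool add tank log /dev/disk/by-id/nvme-INTEL_SSDPED1D280GA_XXXX-part1
```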
slog device `fio` baseline with XFS

The benchmark issues a lot of small 4k random synchronous IO. This pushes the Intel 900p to its limits. The async engine is used to issue many IOs concurrently (`ioengine=libaio`), and `fdatasync=1` tells fio to issue synchronous write IO (for data blocks): 16 processes each issuing 64 in-flight operations to 16 GiB files (`iodepth=64 numjobs=16 filesize=16G`) on an XFS filesystem.

First, `fio` writes out (preallocates or lays out) the 16x16GiB files with pseudo-random data, ~256GiB in total. This is so the read portion of the test has pre-generated random data to read.

Over the 5 minute test, ~573GiB of data is read and ~65GiB of data is written concurrently: ~147 million issued reads and ~16 million issued writes. The test is configured for 90% read and 10% write (`rwmixread=90`). The Intel 900p is able to perform ~491k read IOPS and ~1920MiB/s read throughput, and concurrently ~55k write IOPS and ~213MiB/s write throughput. The newer kernel performed marginally better.

Summary: the Intel 900p is performing per the manufacturer's spec, and can easily perform ±500k 4k random read OR synchronous write IOPS, and achieve ±2000MiB/s read OR synchronous write throughput with the 4k block size.
On this system the Intel 900p also does well under concurrent read/write workloads, e.g. with a 50/50 read/write mix the NVMe can perform ±255k read AND write IOPS and ±997MiB/s read AND write throughput concurrently.
[screenshot: 2022 vs. 2023 XFS baseline tests]
[screenshot: 50/50 read/write mix XFS baseline test from 2022]
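For reference, a sketch of an `fio` invocation matching the parameters described above (the original runs may have used a job file and further options; `--directory` and `--runtime` here are assumptions):

```sh
fio --name=xfs-baseline --directory=/mnt/xfs-test \
    --ioengine=libaio --rw=randrw --rwmixread=90 --bs=4k \
    --iodepth=64 --numjobs=16 --filesize=16G \
    --fdatasync=1 --runtime=300 --time_based --group_reporting
```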
A few notes on the following `fio` tests

- Unless otherwise stated, the `fio` tests are performed with ARC disabled (`primarycache=none`) in order to keep ARC out of the picture.
- The following `fio` tests are not as aggressive as the XFS NVMe tests above; that would be overkill and just flood the IO subsystem for the spindle disks.
- `ashift=12` is used on all zpools.
- The 2022 tests used `fio-3.25` and the 2023 tests used `fio-3.33`. Given that the XFS `fio` test results between these versions were nearly identical, I would say it's unlikely that `fio` has a performance or logic regression, but it's not impossible.
- In 2022 the OpenZFS datasets were using `compression=on checksum=on`, which would have been lz4 and fletcher4 respectively. In 2023 the OpenZFS datasets were using `compression=zstd checksum=edonr`. I don't expect those differences to account for the deltas/degradation I've experienced (see the property check sketch after this list).
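A quick way to confirm the effective properties on both systems before comparing runs (pool and dataset names are placeholders):

```sh
# confirm the dataset properties that differ between the 2022 and 2023 runs
zfs get -o name,property,value compression,checksum,primarycache,sync,encryption tank/fio-test

# confirm the pool was created with ashift=12
zpool get ashift tank
```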
SATA single SMR disk zpool - 2022 zfs 2.1-pve - no slog vs. slog
The purpose of these 2022 `fio` tests was to measure the performance gains of adding the 900p slog to the zpool.

[screenshot: left is 2022 results without slog vs. right 2022 results with slog]

Both sets of tests were configured to use `primarycache=none`, `sync=always` and `encryption=aes-256-gcm`. The `fio` tests start with `randwrite` 4k, 128k, 1M, followed by `write` (sequential).
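A sketch of how such a test dataset could be created (names and the keyformat are hypothetical; my actual datasets predate these tests):

```sh
# create an encrypted test dataset with the properties used in these runs
zfs create -o primarycache=none -o sync=always \
    -o encryption=aes-256-gcm -o keyformat=passphrase \
    tank/fio-test
```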
Observations
Well, it's fairly clear to see that in 2022, for synchronous write IO, the Intel 900p slog with OpenZFS 2.1-pve provided some substantial gains! Have a look at the right-hand diff; you'll see I've calculated some basic deltas under each test.
For example, the 4k sync=always randwrite with slog saw a 153,025% increase in IOPS and a 147,971% increase in BW. Also very impressive: the 128k sync=always randwrite with the slog saw a 28,792% increase in IOPS and a 225,324% increase in BW. >400MiB/s on a single 5200 rpm spindle SMR pool!
SATA single SMR disk zpool with slog 2022 vs. 2023
The purpose of this test was to measure the difference between the 2022 and 2023 results and to highlight the degradation/regression.

[screenshot: left is 2022 results (OpenZFS 2.1-pve) vs. right 2023 results (OpenZFS 2.1.12-pve1)]
Observations
I've added some yellow highlights to make the relevant deltas obvious. Take for example the 128k sync=always randwrite test... ~89% decrease in IOPS with OpenZFS 2.1.12-pve1, and the same for BW. 😪 My words would be: disastrous! 🤯💥
OK. Let's move away from the SMR disks and on to enterprise CMR SAS drives. The following tests were conducted on a SAS zpool with a mirror vdev.
SAS mirror CMR pool w/o slog encryption=off vs. aes-256-gcm

The purpose of this test was to measure the difference between `encryption=off` vs. `encryption=aes-256-gcm`. No slog, and these tests were run on my current 6.2.16-10-pve kernel and OpenZFS 2.1.12-pve1.

[screenshot: left is off vs. right aes-256-gcm]

Observations
Here is a look at some of the netgraphs for a randwrite 128k; the left hand side was `encryption=aes-256-gcm` and the right hand side was `encryption=off`. This was for 1 of the 2 SAS mirror disks.

💡 Note how with `encryption=off` the IO subsystems were able to write much larger, variable-size IO to the physical device and subsequently achieve better performance. With `encryption=aes-256-gcm` the IO size was smaller and constant: more IOPS on the physical disk(s), but less IO bandwidth in the overall `fio` result.

SAS mirror CMR pool with slog encryption=off vs. aes-256-gcm
The purpose of this test was to measure the difference between `encryption=off` vs. `encryption=aes-256-gcm`. This time with slog, and these tests were run on my current 6.2.16-10-pve kernel and OpenZFS 2.1.12-pve1.

[screenshot: left is off vs. right aes-256-gcm]

Observations
My conclusions thus far

On my system...

- Something has changed between these ZFS versions related to `encryption=aes-256-gcm`, i.e. I never noticed this performance degradation in the past.
- The `fio` test results for the SAS CMR pool vs. the SATA SMR pool, both with slog, are nearly identical, which is counter-intuitive. Surely, given the SAS CMR physical devices are faster than the SATA SMR physical devices, one would expect the SAS pool to perform better.
- Comparing `fio` results without slog vs. with slog, only the 4k tests were faster with the slog; the 128k and 1M tests were slower with the slog, which is counter-intuitive.
- I have observed via `zpool iostat -v 1` and `iostat -ctdmx 1`, during the ZFS `fio` tests, that read and write amplification are occurring in varying degrees. That is to say, `fio` issues 4k IO but the IO subsystems modify the IO size the physical devices are reading/writing. I'm not sure to what extent this amplification relates to the degradation; it seems to be worse when the slog is being used, or on tests where the Intel 900p is the main pool physical data drive.

I'd be interested to read comments on my findings and learn if someone else can reproduce these issues with or without slog in the picture.
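For anyone trying to reproduce the amplification observation, the two monitors can simply run in separate terminals alongside the `fio` test; a minimal sketch:

```sh
# terminal 1: per-vdev ZFS IO statistics, refreshed every second
zpool iostat -v 1

# terminal 2: kernel block-layer statistics for the physical devices
iostat -ctdmx 1
```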
I welcome critique on what I've shared here. I'm happy to answer questions, share further details of my system/zfs cfg (it should be close to default), try suggestions, and do more testing.
Cheers
Kyle
Appendices
So as not to distract from the main issue, but to provide some more insight, here are some appendices.
slog device `fio` XFS baseline vs. ZFS encryption=off

Here is the same fio test as the XFS baseline (left) vs. the Intel 900p as a zpool data vdev (right).
💡 Note: because of out-of-space issues on ZFS, I reduced `filesize=16G` to `1G` for the ZFS test. In theory this shouldn't have a significant impact on the results. What does it change? It means `fio` will read and write the same blocks in the file(s) more frequently during the test. Block contention could be a factor, but my testing didn't highlight this as an issue.

As a side note, `fio` is much slower to preallocate/lay out the files on ZFS. `fio` defaults to `fallocate=native` and it appears to be single threaded. Preallocation is desired to ensure random data is preallocated for the read part of the test.

Observations
- `encryption=off` in the ZFS test; XFS is obviously not encrypted.
- `fio` reports ~288MiB/s BW for ZFS, but when studying the netdata graphs the NVMe drive was actually seeing consistent peaks close to 1024MiB/s. I witnessed this higher IO BW in `zpool iostat` too.

Some netdata graphs for the ZFS test
single threaded openssl performance on the system

aes-256-cbc

I appreciate this is not aes-256-gcm or ccm but rather cbc; `openssl` doesn't support gcm or ccm on the CLI, at least not on Debian bookworm. I also appreciate that OpenZFS has its own implementation of AES. I include this to show what a single thread can compute on this server.

Here is cbc with AES-NI disabled
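The benchmark and its AES-NI-disabled variant were along these lines (the `OPENSSL_ia32cap` mask is the commonly used trick for hiding the AES-NI CPU flag from OpenSSL; treat the exact invocation as an assumption, not my verbatim command):

```sh
# single-threaded AES-256-CBC throughput
openssl speed -evp aes-256-cbc

# same benchmark with the AES-NI instruction set masked out
OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-256-cbc
```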
aes-256-ctr
Here is ctr for comparison:
Here is ctr with AES-NI disabled
single threaded /dev/urandom performance on the system
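A common way to measure this (buffer size and count are arbitrary choices):

```sh
# stream 4GiB from the kernel CSPRNG and let dd report the throughput
dd if=/dev/urandom of=/dev/null bs=1M count=4096 status=progress
```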