openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Significant performance degradation/regression with aes-256-gcm between zfs 2.1-pve vs. 2.1.12-pve1 #15276

Closed kyle0r closed 11 months ago

kyle0r commented 1 year ago

Topic: Significant performance degradation/regression with aes-256-gcm between zfs 2.1-pve vs. 2.1.12-pve1

System information

Type                  Version/Name
Distribution Name     Proxmox (Debian)
Distribution Version  7.1 (Debian 11 bullseye) vs. 8.0.4 (Debian 12 bookworm)
Kernel Version        5.13-pve vs. 6.2.16-10-pve
Architecture          x86_64
OpenZFS Version       zfs 2.1-pve vs. 2.1.12-pve1

Describe the problem you're observing

It would appear I have discovered a performance degradation or regression in OpenZFS datasets using aes-256-gcm encryption between zfs 2.1-pve and 2.1.12-pve1.

In addition, it seems that my zpools with slog (Intel 900p) amplify the degradation 😑 which is really counter-intuitive.

I guess it makes sense to shout out to @behlendorf, @ryao, @tcaputi and @ahrens for their attention and triage.

@sempervictus maybe you'd like to take a look too?

Describe how to reproduce the problem

See the attached/included fio benchmarks and results comparing zfs 2.1-pve in 2022 vs. zfs 2.1.12-pve1 in 2023.

Include any warning/errors/backtraces from the system logs

I don't have any warnings/errors/backtraces to share at this time. The system and kernel logs appear to be clean.

Foreword / background

First and foremost, thank you to all the authors of, and contributors to, OpenZFS. It's a brilliant bit of software engineering that I use daily with great benefit. Serious data-life quality improvements and gains!

To make this fairly large amount of content a little less laborious to consume, you might enjoy listening to Aurelios Instagram Reels Pack: https://on.soundcloud.com/zch5w.

This one time I added an slog device to a zpool...

So it's September 2023... I was adding an slog device to a SAS mirror pool and verifying that the sync=always setting was working as expected, i.e. better sync write IO performance than without the slog. After adding the slog, performance dropped significantly when testing with sync=always. I was confused, so I went back to look at older benchmarks from 2022 on a SATA zpool, and that is when I discovered something was off.

I did some research and this issue covers the main findings. At first I thought something was off with my system, and maybe there still is something wrong. There might be something specific about Proxmox or some bad cfg somewhere, but I've not been able to put my finger on it.

I need more brains and feedback on this issue.

After removing the slog from the SAS zpool, and testing encryption=off vs. encryption=aes-256-gcm I was shocked to see the delta. Then re-testing with slog I was really shocked!

FWIW, a little background on me: I've been using OpenZFS for some years (since ~2015) and have been studying zfs performance topics in detail for a while; I'm not an expert but have some XP. I do try to take the time to ensure I'm not misreporting an issue because of my setup/system (i.e. something wrong on my end or something I've overlooked). By way of example, see #14346, which I researched in 2022 and wrote up and published in Jan 2023. I also understand (and have experienced) most of what's going on with zvol performance issues as per #11407 and have contributed there too.

The system spec

This is my home-lab / data vault. I guess it would be classified as an entry level enterprise storage chassis, at least back at its DOM in 2017.

image

The slog device: Intel 900p PCIe 3.0 x4 card

TL;DR: outside of OpenZFS the slog device is behaving per the manufacturer's published specifications; the fio XFS baseline tests between 2022 and 2023 are nearly identical. This suggests that things outside of ZFS are OK on this system.

The first thing I'd like to share is a non-zfs fio benchmark between the mentioned kernel versions; the left benchmark was performed in July 2022 and the right in Sep 2023. This illustrates that the Intel SSD Optane 900p 280GB PCIe 3.0 x4, NVMe (SSDPED1D280GA) is performing per the manufacturer's published specifications, and that the underlying hardware and kernel are unlikely to be some kind of problem or bottleneck, at least for XFS!

The Intel 900p is my slog vdev. I typically create a 16GiB partition and then add the partition to a given zpool where I have a use case for higher performance sync=always workloads. For example:

zpool add store6 log /dev/disk/by-id/nvme-INTEL_SSDPED1D280GA_P___________280CGN-part1 # (16 GB partition)

# then for datasets where I'd like to take advantage of the sync write IO boost, I use:
zfs set sync=always <dataset>

# for datasets where async IO is OK, I typically use:
zfs set sync=disabled <dataset>
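
To confirm the log vdev is attached and to watch it absorb sync writes during a test, something like the following can be used (store6 as in the example above):

# confirm the log vdev shows up under "logs" in the pool layout
zpool status store6

# watch per-vdev IO during a fio run; with sync=always the log vdev should take the sync writes
zpool iostat -v store6 1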

slog device fio baseline with XFS

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=es --bs=4k --iodepth=64 --numjobs=16 --filesize=16G --fdatasync=1 --readwrite=randrw --rwmixread=90 --runtime=300 --group_reporting --time_based=1

The benchmark issues a lot of small 4k random synchronous IO, which pushes the Intel 900p to its limits. The async engine is used to issue many IOs concurrently (ioengine=libaio), and fdatasync=1 tells fio to issue synchronous write IO (for data blocks): 16 processes each issuing 64 in-flight operations to 16 GiB files (iodepth=64 numjobs=16 filesize=16G) on an XFS filesystem.

First, fio writes out (preallocates or lays out) the 16x16GiB files with pseudo-random data, ~256GiB in total. This is so the read portion of the test has pre-generated random data to read.

Over the 5-minute test, ~573GiB of data is read and ~65GiB is written concurrently: ~147 million issued reads and ~16 million issued writes. The test is configured for 90% read and 10% write (rwmixread=90). The Intel 900p is able to perform ~491k read IOPS and ~1920MiB/s read throughput, and concurrently ~55k write IOPS and ~213MiB/s write throughput. The newer kernel performed marginally better.

Summary: the Intel 900p is performing per the manufacturer's spec, and can easily perform ±500k 4k random read OR synchronous write IOPS, and achieve ±2000MiB/s read OR synchronous write throughput at the 4k block size.

On this system the Intel 900p also does well under concurrent read/write workloads; e.g. with a 50/50 read/write mix the NVMe can perform ±255k read AND write IOPS and ±997MiB/s read AND write throughput concurrently.

Screenshot of 2022 vs. 2023 XFS baseline tests (image)
Screenshot of 50/50 read/write mix XFS baseline test from 2022 (image)

A few notes on the following fio tests

Unless otherwise stated the fio tests are performed with ARC disabled primarycache=none in order to keep ARC out of the picture.

The following fio tests are not as aggressive as the XFS NVMe tests above; that would be overkill and would just flood the IO subsystem of the spindle disks.

ashift=12 is used on all zpools.

The 2022 tests used fio-3.25 and the 2023 tests used fio-3.33. Given that the XFS fio test results between these versions were nearly identical, I would say it's unlikely that fio has a performance or logic regression, but it's not impossible.

In 2022 the OpenZFS datasets were using compression=on checksum=on, which would have been lz4 and fletcher4 respectively. In 2023 the OpenZFS datasets were using compression=zstd checksum=edonr. I don't expect those differences to account for the deltas/degradation I've experienced.
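
For reference, here is a minimal sketch of the dataset settings described in these notes, applied to a hypothetical test dataset store6/fio (the property values come from the notes above; the dataset name and keyformat are assumptions, not my actual layout):

# pool created with ashift=12; 2023-style dataset properties for the encrypted tests
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase -o compression=zstd -o checksum=edonr store6/fio
zfs set primarycache=none store6/fio
zfs set sync=always store6/fio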

SATA single SMR disk zpool - 2022 zfs 2.1-pve - no slog vs. slog

The purpose of these 2022 fio tests was to measure the performance gains of adding the 900p slog to the zpool. Left: 2022 results without slog; right: 2022 results with slog. Both sets of tests were configured to use primarycache=none, sync=always and encryption=aes-256-gcm. The fio tests start with randwrite 4k, 128k, 1M, followed by write (sequential).
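
The exact fio job parameters for these dataset runs aren't reproduced here, so purely as an illustration of the shape of one pass (the directory, iodepth and numjobs values are assumptions, not the original settings):

# one randwrite pass at the 128k block size against the dataset under test
fio --name=randwrite-128k --directory=/pool/dataset/fio --ioengine=libaio --bs=128k --rw=randwrite --iodepth=16 --numjobs=4 --filesize=16G --runtime=300 --time_based=1 --group_reporting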

image

Observations

Well, it's fairly clear to see that in 2022, for synchronous write IO, the Intel 900p slog with OpenZFS 2.1-pve provided some substantial gains! Have a look at the right-hand diff; you'll see I've calculated some basic deltas under each test.

For example, the 4k sync=always randwrite with slog saw a 153,025% increase in IOPS and a 147,971% increase in BW. Also very impressive: the 128k sync=always randwrite with the slog saw a 28,792% increase in IOPS and a 225,324% increase in BW. >400MiB/s on a single 5200 rpm spindle SMR pool!

SATA single SMR disk zpool with slog 2022 vs. 2023

The purpose of this test was to measure the difference between the 2022 and 2023 results and to highlight the degradation/regression. Left: 2022 results (OpenZFS 2.1-pve); right: 2023 results (OpenZFS 2.1.12-pve1).

image

Observations

I've added some yellow highlights to make the relevant deltas obvious. Take for example the 128k sync=always randwrite test... ~89% decrease in IOPS with OpenZFS 2.1.12-pve1, and the same for BW. 😪 My words would be: disastrous! 🤯💥


OK. Let's move away from the SMR disks and onto enterprise CMR SAS drives. The following tests were conducted on a SAS zpool with a mirror vdev.

SAS mirror CMR pool w/o slog encryption=off vs. aes-256-gcm

The purpose of this test was to measure the difference between encryption=off vs. encryption=aes-256-gcm. Left: off; right: aes-256-gcm. No slog, and these tests were run on my current 6.2.16-10-pve kernel and OpenZFS 2.1.12-pve1.

image

Observations

  1. The randwrite tests saw a degradation of 22-26% with aes-256-gcm vs. encryption=off
  2. The sequential write tests saw a degradation of 41-51% with aes-256-gcm vs. encryption=off

Here is a look at some of the netdata graphs for a 128k randwrite; the left-hand side was encryption=aes-256-gcm and the right-hand side was encryption=off. This was for 1 of the 2 SAS mirror disks.

💡 Note how with encryption=off the IO subsystems were able to write much larger, variable-sized IO to the physical device and subsequently achieve better performance. With encryption=aes-256-gcm the IO size was smaller and constant: more IOPS on the physical disk(s) but less IO bandwidth in the overall fio result.

image image

SAS mirror CMR pool with slog encryption=off vs. aes-256-gcm

The purpose of this test was to measure the difference between encryption=off vs. encryption=aes-256-gcm. Left: off; right: aes-256-gcm. This time with slog, and these tests were run on my current 6.2.16-10-pve kernel and OpenZFS 2.1.12-pve1.

image

Observations

  1. The randwrite tests saw a degradation of 45-86% with aes-256-gcm vs. encryption=off
  2. The sequential write tests saw a degradation of 42-87% with aes-256-gcm vs. encryption=off

My conclusions thus far

On my system...

  1. It cannot be ruled out that my system/cfg or I am at fault, but I think I've spent a fair bit of time trying to eliminate that possibility. Hopefully you can see that I'm a detail-oriented person and try to double-check and research before raising issues and making call-outs!
  2. AES-NI seems to be working as expected for the Intel Xeon CPUs (6 core E5-2620 v3). See appendices for some quick sanity checks on that.
  3. In 2022 with OpenZFS 2.1-pve the slog vdev provided the SATA single SMR disk pool a substantial performance boost for sync workloads. The performance impact for the use of aes-256-gcm encryption on the datasets in 2022 appeared to be unremarkable/transparent.
  4. Until now my zpools have always performed at around the manufacturer's published specifications with encryption=aes-256-gcm, i.e. I never noticed this performance degradation in the past.
  5. In 2023 with my upgrade to proxmox 8 using OpenZFS 2.1.12-pve1 datasets encrypted with aes-256-gcm appear to suffer a degradation or regression in performance as highlighted by the testing herein. zpools with slog and sync=always appear to be an amplifier of the issue and not a root cause.
  6. My fio test results for the SAS CMR pool vs. the SATA SMR pool, both with slog, are nearly identical, which is counter-intuitive. Surely, given the SAS CMR physical devices are faster than the SATA SMR physical devices, one would expect the SAS pool to perform better.
  7. When comparing SAS CMR fio results between w/o slog vs. with slog, only the 4k tests were faster with the slog, the 128k and 1M tests were slower with slog which is counter-intuitive.
  8. These outcomes got me wondering whether there are regression tests for these scenarios in the OpenZFS project's build/CI.
  9. It would appear, when watching zpool iostat -v 1 and iostat -ctdmx 1 during the ZFS fio tests, that read and write amplification are being observed in varying degrees (see the sketch after this list). That is to say, fio issues 4k IO but the IO subsystems modify the IO size that the physical devices are reading/writing. I'm not sure to what extent this amplification relates to the degradation; it seems to be worse when the slog is being used, or on tests where the Intel 900p is the main pool physical data drive.
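
For reference, this is roughly how the amplification was observed (pool name is a placeholder; output not reproduced here):

# per-vdev IOPS and bandwidth as seen by ZFS, including the slog vdev
zpool iostat -v <pool> 1
# per physical device stats; the average request size column shows what the disks actually see vs. fio's 4k
iostat -ctdmx 1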

I'd be interested to read comments on my findings and learn if someone else can reproduce these issues with or without slog in the picture.

I welcome critique on what I've shared here. I'm happy to answer questions, share further details of my system/zfs cfg (should be close to default), and try suggestions, and do more testing.

Cheers

Kyle


Appendices

To not distract from the main issue here, but to provide some more insights, here are some appendices.

slog device fio XFS baseline vs. ZFS encryption=off

Here is the same fio test as the XFS baseline (left) vs. the Intel 900p as a zpool data vdev (right).

💡 Note: because of out-of-space issues on ZFS, I reduced filesize=16G to 1G for the ZFS test. In theory this shouldn't have a significant impact on the results. What does it change? It means fio will read and write the same blocks in the file(s) more frequently during the test. Block contention could be a factor, but my testing didn't highlight this as an issue.
As a side note, fio is much slower to preallocate/lay out the files on ZFS. fio defaults to fallocate=native and the layout appears to be single-threaded. Preallocation is desired to ensure random data is pre-generated for the read part of the test.
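
One way to keep that slow layout phase out of the timed run (a workaround suggestion, not part of the original tests) is to lay the files out with a separate multi-job sequential write pass first; a later job with the same --name, --directory and filesize will reuse the existing files:

# lay out 16x1GiB files up front for the "es" job (directory is a placeholder)
fio --name=es --directory=/pool/dataset/fio --rw=write --bs=1M --filesize=1G --numjobs=16 --ioengine=libaio --iodepth=8 --group_reporting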

image

Observations

  1. Keep in mind encryption=off in the ZFS test; XFS is obviously not encrypted.
  2. I don't understand how/why the performance drops so badly on ZFS.
    1. 🤯 ~85% decrease in read IOPS and BW, the same ~85% decrease for writes.
    2. 🔴 For example 501k write IOPS XFS vs. 73.6k ZFS
    3. 🚀 The XFS baseline is ~6.8 times faster than ZFS in this test case
  3. It would appear read-amplification is occurring: fio reports ~288MiB/s BW for ZFS, but when studying the netdata graphs the nvme drive was actually seeing consistent peaks close to 1024MiB/s. I witnessed this higher IO BW in zpool iostat too.
  4. There is likely write-amplification occurring too, but it's harder to interpret from the netdata graphs at a quick glance.

Some netdata graphs for the ZFS test (images)

single threaded openssl performance on the system

aes-256-cbc

I appreciate this is not aes-256-gcm or ccm but rather cbc; the openssl enc command doesn't support gcm or ccm on the CLI, at least not on Debian bookworm. I also appreciate that OpenZFS has its own implementation of AES. I include this to show what a single thread can compute on this server.

root@viper:/sas/data/fio# timeout 10 openssl enc -aes-256-cbc -pass pass:"$PASS" -nosalt -md sha512 -iter 1000000 </dev/zero | pv >/dev/null
3.30GiB 0:00:10 [ 375MiB/s]

Here is cbc with AES-NI disabled

OPENSSL_ia32cap="~0x200000200000000" timeout 10 openssl enc -aes-256-cbc -pass pass:"$PASS" -nosalt -md sha512 -iter 1000000 </dev/zero | pv >/dev/null
1.66GiB 0:00:10 [ 178MiB/s]

aes-256-ctr

Here is ctr for comparison:

root@viper:~# timeout 10 openssl enc -aes-256-ctr -pass pass:"$PASS" -nosalt -md sha512 -iter 1000000 </dev/zero | pv >/dev/null
9.80GiB 0:00:09 [1.17GiB/s]

Here is ctr with AES-NI disabled

OPENSSL_ia32cap="~0x200000200000000" timeout 10 openssl enc -aes-256-ctr -pass pass:"$PASS" -nosalt -md sha512 -iter 1000000 </dev/zero | pv >/dev/null
2.31GiB 0:00:10 [ 269MiB/s]
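
For a GCM-specific single-thread data point, openssl speed can exercise the cipher via the EVP interface even though openssl enc cannot; shown for reference, these were not part of the runs above:

# single-threaded aes-256-gcm throughput via the EVP interface
openssl speed -evp aes-256-gcm

# and again with AES-NI disabled, for comparison
OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp aes-256-gcm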

single threaded /dev/urandom performance on the system

root@viper:/sas/data/fio# timeout 10 pv /dev/urandom >/dev/null
3.15GiB 0:00:09 [ 357MiB/s]
Fabian-Gruenbichler commented 1 year ago

Given that PVE ships the ZFS modules as part of the kernel, different module versions also mean different kernel versions, so the regression might in fact be coming from that end (different behaviour w.r.t. scheduling, CPU exploit mitigations like downfall, etc.). To rule out any side-effects like that, you could compile the newer ZFS version for the older kernel and then compare the two...
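
A rough sketch of that approach, assuming the build dependencies from the OpenZFS developer docs are installed and the older kernel's headers are present (the header path and release tag below are illustrative):

git clone https://github.com/openzfs/zfs && cd zfs
git checkout zfs-2.1.12
sh autogen.sh
./configure --with-linux=/usr/src/linux-headers-5.13.19-6-pve   # point at the older kernel's headers
make -s -j"$(nproc)"
make install && depmod -a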

meyergru commented 1 year ago

I also think that the I/O scheduler or the kernel itself may have a great impact.

In the Proxmox 5.x kernels, the I/O scheduler was "none", whereas the 6.x PVE kernels now use "mq-deadline".

You can check whether the kernel is the culprit quite easily by installing a 6.2 opt-in kernel on the old installation (apt install pve-kernel-6.2), or you can simply change the I/O scheduler and re-test on the new platform.
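
Checking and switching the scheduler for a re-test looks roughly like this (sdX is a placeholder for a disk backing the pool; the change is not persistent across reboots):

# show the available schedulers; the active one is shown in brackets
cat /sys/block/sdX/queue/scheduler

# switch to none for the duration of the test
echo none > /sys/block/sdX/queue/scheduler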

rincebrain commented 1 year ago

I'm betting on #15245/#15223 here.

I'd go check grep . /sys/module/icp/parameters/icp_* for a missing avx option as a heuristic on each kernel rev to see if that's happening.

kyle0r commented 1 year ago

Thanks for the responses all.

I did try setting the scheduler to none but didn't see any major impacts.

rincebrain wrote:

I'm betting on #15245/#15223 here.

I'd go check grep . /sys/module/icp/parameters/icp_* for a missing avx option as a heuristic on each kernel rev to see if that's happening.

Thank you @rincebrain, very helpful. icp_gcm_impl:cycle was missing avx in kernel 6.2.16-10-pve

Booting 6.2.16-1-pve resolved that, and some quick checks suggest it might have resolved the encryption performance issues. I'll do thorough testing next week to verify/disqualify. There might be some slog performance issues remaining - I'll do some more research and perhaps open a separate issue for that.
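
For anyone checking their own system, the heuristic looks roughly like this; the sample output is illustrative rather than captured from this host, and pinning the implementation assumes the parameter is writable on your build:

# list the ICP implementation selectors; a healthy GCM line should include avx
grep . /sys/module/icp/parameters/icp_*
# e.g. /sys/module/icp/parameters/icp_gcm_impl:cycle [fastest] avx generic pclmulqdq   (avx was the entry missing on 6.2.16-10-pve)

# optionally pin GCM to the AVX implementation instead of fastest/cycle
echo avx > /sys/module/icp/parameters/icp_gcm_impl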

meyergru commented 1 year ago

Now that is strange, since the patch has been ported to 6.1.50 but not to any 6.2 kernels. If Proxmox had backported the patch themselves, I would expect it to be included in the newest 6.2 variant, not in an old one like 6.2.16-1?

I have opened a discussion on the Proxmox forum about this: https://forum.proxmox.com/threads/slow-zfs-encryption-will-we-get-a-fix-for-avx-avx2-not-being-selected.133681/

rincebrain commented 1 year ago

The thing someone mentioned at the end of that thread was cherrypicking the fix to what broke this, in gregkh/linux@d8f9a9cfdcd31290cb8b720746458cb110301c68.

6.2 doesn't seem to normally have torvalds/linux@2c66ca3949dc701da7f4c9407f2140ae425683a5 or torvalds/linux@b81fac906a8f, so maybe something else broke it.

e: Looks like PVE pulls Ubuntu's kernels, and Ubuntu's 6.2.0-30.30 pulled in torvalds/linux@b81fac906a8f, but has still never pulled torvalds/linux@2c66ca3949dc701da7f4c9407f2140ae425683a5. So getting Ubuntu to yank in that patch might be the fastest way to get Proxmox to fix it.

meyergru commented 1 year ago

I am unsure if they pull Ubuntu only at the start of a new PVE version. From the looks of it, I guess they add only specific patches later on for incremental versions: https://git.proxmox.com/?p=pve-kernel.git;a=summary

FWIW: 6.2.16-6-pve (which really is 6.2.16-5-pve) does fix the problem as well.

rincebrain commented 1 year ago

It looks like they just add patches on top and keep synced with Ubuntu's tree, based on their submodule of it not being frozen at an early version.

Fabian-Gruenbichler commented 1 year ago

FWIW, the next version of Proxmox kernels (6.2.16-14) will contain the cherry-picked fix (already confirmed to fix the regression, but currently still in internal testing):

https://git.proxmox.com/?p=pve-kernel.git;a=commit;h=9ba0dde971e6153a12f94e9c7a7337355ab3d0ed

also already reported on the Ubuntu side, so should be fixed there at some point in the near future as well: https://bugs.launchpad.net/bugs/2034745

kyle0r commented 11 months ago

I'm closing this issue.

If anyone close to kernel dev could raise an issue about regression tests for this kind of problem (why wasn't the issue detected automatically?), that would be neat. One would think automated testing and/or CI would catch something this small but with such large impacts.