openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Using zfs native encryption considerably lowers I/O throughput when reading data from ARC #13736

Open jkool702 opened 2 years ago

jkool702 commented 2 years ago

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 36
Kernel Version 5.18.15-200.fc36.x86_64
Architecture x86_64
OpenZFS Version 2.1.5

Brief Problem Description

When ZFS is used with ZFS native encryption (aes-256-gcm, maybe other suites too), and recently written data that is cached in the ARC's MRU is read, causing it to be classified as "frequently used" and moved to the MFU, the data is seemingly "deep copied" as the transfer from MRU to MFU happens: the total ARC size grows by the amount of data moved instead of staying constant.

This is in contrast to unencrypted ZFS, in which data moving from the MRU to the MFU is seemingly passed by reference, and the overall ARC size does not increase.

Additionally, reading data from the ARC in ZFS w/ native encryption is much slower (often ~1/3 the speed in my testing) than with unencrypted ZFS. Often, reading data from the ARC cache in ZFS w/ native encryption happens at almost exactly the same speed as writing data, perhaps suggesting that this deep copy may also involve an encryption operation as well.

Testing was done using fio and (temporary) ramdisk-backed ZFS pools. More details below.

Describe the problem you're observing

I have been benchmarking ZFS (using fio and ramdisks) to compare the performance of ZFS (a) unencrypted, (b) with ZFS native encryption, and (c) on top of LUKS, and have found a curious performance issue on fio read operations using ZFS native encryption. The benchmark is in my "zfsEncryption_SpeedTest" repo here. This code creates the ramdisk block devices, puts ZFS on them, runs numerous fio benchmarks, and then tears down the temporary ZFS pools and ramdisks.
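For context, what the script does boils down to roughly the following (a minimal sketch with placeholder device/pool names and sizes, not the script's exact commands):

# ram-backed block device via the brd kernel module (rd_size is in KiB; 16 GiB here is arbitrary)
sudo modprobe brd rd_nr=1 rd_size=$((16 * 1024 * 1024))

# plain (unencrypted) pool on the ramdisk; "testpool" is a placeholder name
sudo zpool create testpool /dev/ram0

# native-encryption variant of the same pool (prompts for a passphrase)
#sudo zpool create -O encryption=aes-256-gcm -O keyformat=passphrase testpool /dev/ram0

# one example fio run against the pool (block sizes and read/write mixes vary across the benchmark)
sudo fio --name=enctest --directory=/testpool --size=4G --bs=128k \
    --rw=randrw --rwmixread=70 --ioengine=psync --numjobs=2 --group_reporting

# tear everything back down
sudo zpool destroy testpool
sudo rmmod brd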

Here is an example of the sort of results I am seeing:

# nParallel_fio=2
----------------------------------------------------------------
||---- RESULT SUMMARY FOR A 0% READ / 100% WRITE WORKLOAD ----||
----------------------------------------------------------------

WRITE 

Block Size (KB)             :128         4                                                                         
ZFS (no encryption)         :1985MiB/s  357MiB/s
ZFS (native encryption)     :1872MiB/s  355MiB/s
ZFS (LUKS encryption)       :1831MiB/s  359MiB/s

----------------------------------------------------------------
||---- RESULT SUMMARY FOR A 50% READ / 50% WRITE WORKLOAD ----||
----------------------------------------------------------------

READ 

Block Size (KB)             :128         4                                                                         
ZFS (no encryption)         :1539MiB/s  297MiB/s
ZFS (native encryption)     :1124MiB/s  300MiB/s
ZFS (LUKS encryption)       :1446MiB/s  299MiB/s

----------------------------------------------------------------
----------------------------------------------------------------

WRITE 

Block Size (KB)             :128         4                                                                         
ZFS (no encryption)         :1539MiB/s  297MiB/s
ZFS (native encryption)     :1124MiB/s  300MiB/s
ZFS (LUKS encryption)       :1446MiB/s  299MiB/s

----------------------------------------------------------------
----------------------------------------------------------------

MIXED 

Block Size (KB)             :128         4                                                                         
ZFS (no encryption)         :3079MiB/s  595MiB/s
ZFS (native encryption)     :2247MiB/s  599MiB/s
ZFS (LUKS encryption)       :2892MiB/s  597MiB/s

----------------------------------------------------------------
||---- RESULT SUMMARY FOR A 100% READ / 0% WRITE WORKLOAD ----||
----------------------------------------------------------------

READ 

Block Size (KB)             :128         4                                                                         
ZFS (no encryption)         :7180MiB/s  1943MiB/s
ZFS (native encryption)     :2785MiB/s  1948MiB/s
ZFS (LUKS encryption)       :7201MiB/s  1918MiB/s

It is fairly clear that as the read percentage in the IO mix increases, the unencrypted ZFS and ZFS+LUKS cases do considerably better than the ZFS native encryption case. In trying to figure out why this happens, I noticed something strange about ARC usage. Monitoring ARC/MRU/MFU usage, I found that the fio benchmarks share the following pattern:

  1. as fio "lays out the IO file", both ARC and MRU size increase by the total size of the fio test.
  2. once the actual benchmark starts, whatever amount of data is used for read IO testing gets moved from the MRU to the MFU.

example: on a fio benchmark size of 10G with 70% read / 30% write (meaning 7G reads and 3G writes):

  1. initially (as the file is laid out) both ARC and MRU usage increase to 10G, then
  2. after the test starts and as it progresses, the MRU usage shrinks to 3G while the MFU usage increases to 7G

Now this happens in all benchmarks, BUT

  1. On unencrypted ZFS and on ZFS+LUKS, the total ARC size remains constant as data is moved from the MRU to the MFU. i.e., in the above example ARC size stays at 10G as data is moved from the MRU to the MFU.
  2. On ZFS w/ native encryption, the total ARC size increases as data is moved from the MRU to the MFU. The increase is equal to the amount of data moved. i.e., in the above example ARC size increases to 17G as data is moved from the MRU to the MFU.
  3. Additionally, on ZFS w/ native encryption, during the test, monitors like htop show things getting stuck behind N threads for a period of time, where N is the number of parallel fio tests. I.e., it seems that on every individual fio run (whether run sequentially or in parallel) things get briefly stuck in some single-threaded process.

The result is that read IO is only about 40% as fast with ZFS native encryption as it is with no encryption or using LUKS.

It is worth noting that when running the benchmark with multiple fio runs in parallel, read IO is slightly faster than write IO. However, with a single fio run (using a single large fio file), the IO rate when reading or writing data is almost exactly the same. This, combined with the increase in ARC size when moving data from MRU to MFU, makes me think that instead of moving the data by reference (e.g., passing a pointer to the data) it is doing a deep copy of the data, quite possibly coupled with decrypting and re-encrypting it during the copy. A deep copy plus re-encryption would explain why the read rate doesn't surpass the write rate and why the ARC grows in size. I hope this behavior is not intentional/required for ZFS native encryption, since it puts a pretty massive performance penalty on read IO.

Describe how to reproduce the problem

Run the benchmark (as given in the above section), monitor ARC/MRU/MFU size, and look at the output showing IO transfer rates under various conditions.

Note: ARC/MRU/MFU size is easy to monitor using htop - configure it and add the ZFS stats to the info shown at the top of the htop display.
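If htop isn't set up for that, the same counters can be read straight from the kernel's ARC statistics; a minimal sketch using the field names exported in /proc/spl/kstat/zfs/arcstats:

# print total ARC, MRU, and MFU sizes in GiB
awk '/^(size|mru_size|mfu_size) / { printf "%-10s %8.2f GiB\n", $1, $3 / 2^30 }' \
    /proc/spl/kstat/zfs/arcstats

# or keep it refreshing once per second while the benchmark runs
watch -n1 'grep -E "^(size|mru_size|mfu_size) " /proc/spl/kstat/zfs/arcstats'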

Include any warning/errors/backtraces from the system logs

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

blind-oracle commented 1 year ago

I wonder why nobody has any comments here. I can also confirm that native ZFS encryption falls crazily behind ZFS-over-LUKS.

I have a not-very-high-end CPU (4-core Xeon D-2123IT), but still on the same hardware the ZFS encryption is horribly slow...

hoonlight commented 8 months ago

Hi, is there an update on this bug?

jkool702 commented 7 months ago

It's a shame that none of the ZFS devs have commented on this issue over the past ~18 months, though I do understand that there are only a handful of them and there are currently ~1200 issues on GitHub, so I get that some issues (especially those without easy solutions) don't all get the devs' time.

@blind-oracle and @hoonlight - I appreciate knowing that I wasn't the only one who experienced this bug. That said, I've long since moved on to just using ZFS on top of a LUKS volume, which has consistently much better performance. Unless you need the ability to send encrypted ZFS data streams, this is probably your best bet too... considering there has been no activity in 18 months, I wouldn't expect any quick resolution here...

@blind-oracle - It's rather interesting that your speeds for "large 100% read workloads" were, proportionally, almost identical to what I saw, despite a considerably different test setup (4-core CPU, data on disk, I/O done with dd --vs-- 14-core CPU, data on a tmpfs ramdisk, I/O done with fio):

                ZFS ON LUKS     ZFS NATIVE ENCRYPTION
YOUR RESULTS:   720 MB/s        220 MB/s
MY RESULTS:     7201 MB/s       2785 MB/s

2 data points isn't much, but having such a similar slowdown despite very different hardware, storage medium, and i/o program, combined with no response from the ZFS devs, sort of makes me think that this isn't really a bug so much as it is an "unfortunate result of how ZFS native encryption is implemented". However, that is purely speculation on my part.

DNS commented 4 months ago

https://github.com/openzfs/zfs/issues/16016#issuecomment-2061076647

@jkool702 Do you use an external enclosure? Have you tested this on VMware/VirtualBox with a virtual drive (using an internal SSD)?

Different hardware can have very different performance outcomes, as @ChordMankey mentioned in another issue.

blind-oracle commented 4 months ago

@DNS I don't think it can be enclosure-related, since the results change when switching between native/LUKS encryption on the same hardware.

And no, I don't use enclosures, just 8 directly-connected SATA drives.

DNS commented 4 months ago

@blind-oracle As I wrote earlier, different hardware will yield different outcomes (including your own computer's SATA controller). Test this on VMware/VirtualBox with a virtual drive (using an internal SSD or RAM drive). Also use well-known benchmarking tools, not your own handwritten benchmarking script.

A valid issue must be reproducible by the developers; otherwise the issue will be closed as non-reproducible.

DNS commented 4 months ago

Running ZFS on Linux is significantly slower than on FreeBSD. You might want to switch to FreeBSD.

FreeBSD 14 (ZFS encryption): 2.26 GB/s
Debian 12 (ZFS encryption): 307 MB/s
Debian 12 (ZFS no encryption): 451 MB/s

Tested on VMware with an NVMe virtual drive (RAM drive, DDR3 8GB x4, max ~5GB/s).

(Screenshots attached: FreeBSD 14 and Debian 12.5 KDE benchmark results.)

DNS commented 4 months ago

https://github.com/openzfs/zfs/issues/9910 https://github.com/openzfs/zfs/issues/7896

DNS commented 4 months ago

Apparently ZFS on Debian sets compression=off by default. When compression is turned on, it improves performance, but it is still slower than FreeBSD.

FreeBSD 14 (ZFS encryption+compression): 2.26 GB/s
Debian 12 (ZFS encryption+compression): 1.4 GB/s
Debian 12 (ZFS no encryption+compression): 1.4 GB/s

ikozhukhov commented 4 months ago

Apparently ZFS on Debian sets compression=off by default. When compression is turned on, it improves performance, but it is still slower than FreeBSD.

FreeBSD 14 (ZFS encryption+compression): 2.26 GB/s
Debian 12 (ZFS encryption+compression): 1.4 GB/s
Debian 12 (ZFS no encryption+compression): 1.4 GB/s

Could you please provide the list of tests for testing on DilOS? Also, you can download http://apt.dilos.org/isos/dilos-4.0.0/dilos-4.0.0.13-amd64.iso, try running your tests, and provide results/feedback.

gmelikov commented 4 months ago

@DNS It looks like you're testing different ZFS releases. Please check the version (zfs -V) and compare all dataset properties. OpenZFS releases 2.1 and 2.2 had some property defaults changed, and had performance optimizations too.
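For anyone redoing the comparison, those checks amount to something like this (the dataset name is a placeholder):

# userland and kernel-module versions
zfs version

# every property in effect on the dataset under test, along with where its value comes from
zfs get -s local,default,inherited all testpool/testds

# a few properties worth comparing explicitly between the two systems
zfs get compression,encryption,recordsize,atime,sync testpool/testds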

jkool702 commented 4 months ago

#16016 (comment)

@jkool702 Do you use an external enclosure? Have you tested this on VMware/VirtualBox with a virtual drive (using an internal SSD)?

Different hardware can have very different performance outcomes, as @ChordMankey mentioned in another issue.

You should take a look at my testing script that I linked in the original issue report. It first creates a ram-backed block device - preferably with the brd kernel module, but if that's unavailable then a loop device backed by a file on a tmpfs. It then sets up a brand-new ZFS pool using that block device (or, for ZFS on LUKS, cryptsetup uses that block device and then ZFS uses the LUKS block device). Finally, the actual I/O performance testing/benchmarking is done using fio.
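For reference, the ZFS-on-LUKS leg amounts to roughly this (a sketch with placeholder names and default cryptsetup settings, not the script's exact invocation):

# format the ram-backed device as a LUKS volume and open it
sudo cryptsetup luksFormat /dev/ram0
sudo cryptsetup open /dev/ram0 ramcrypt

# build an unencrypted ZFS pool on top of the LUKS mapping instead of the raw device
sudo zpool create testpool /dev/mapper/ramcrypt

# teardown after the fio runs
sudo zpool destroy testpool
sudo cryptsetup close ramcrypt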

I set it up this way explicitly because, as you said:

Different hardware can have very different performance outcomes

By using a freshly created, blank, ram-backed ZFS pool running on bare metal (not virtualized), you ensure that disk I/O isn't limiting you, virtualization isn't slowing you down, and a whole host of possible issues from using an existing ZFS pool aren't present. This seemed (to me at least) like the best way to minimize the effect of using different hardware.

EDIT: Also, even though it isn't really relevant to the benchmark results, I'll answer your question: no, I'm not using an enclosure. The main system drive is an NVMe drive with a ZFS pool on it, and there is a second ZFS pool on 10 NAS HDDs in raidz2. All 10 HDDs are in the case and attached via SATA3 directly to the motherboard.

tschoening81 commented 2 weeks ago

I've migrated ~5 TiB of data from an unencrypted dataset into an encrypted one. The files are various system backups: lots of small files like mails, a few larger files like VMs, etc. All of those files are backed up using rsync into an additional zpool, which was encrypted right from the start. The data itself was migrated using zfs send [...] | zfs receive -x encryption [...], and all datasets use the same algorithm, aes-256-gcm.
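In other words, the migration followed roughly this pattern (dataset names here are placeholders, not my actual ones):

# snapshot the unencrypted source and send it into a child of the already-encrypted parent;
# -x encryption keeps the encryption property out of the receive so the new dataset
# inherits encryption (aes-256-gcm) from that parent
zfs snapshot srcpool/data@migrate
zfs send srcpool/data@migrate | zfs receive -x encryption backuppool/enc/data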

I'm fairly sure I haven't changed anything else: the hardware is the same, the files are the same, there's no bottleneck in the CPU, and disk access in iostat etc. looks roughly like before. That is, it's mostly reading at a pretty low throughput of ~3 MiB/s, and most of the time iostat doesn't show even the source disks at their limits. I guess the file iteration of rsync itself just slows things down for the many individual files I have.

Though, before the migration the backup completed in ~3 hours; now it takes ~7 hours. Might that be related to what you have observed? I'm somewhat surprised that reading from encrypted datasets seems to have slowed things down that much without any obvious bottleneck like the CPU.