openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ICP: Implement Larger Encrypt/Decrypt Chunks for Non-AVX Processors #10846

Open kimono-koans opened 4 years ago

kimono-koans commented 4 years ago

@AttilaFueloep has indicated that the original question's premise, a belief that slow performance on non-AVX processors was due to a non-accelerated GHASH, was mistaken. @AttilaFueloep indicates the more likely cause of the low performance is the FPU handling cost, which may be resolved by processing larger chunks on non-AVX processors, as is already done on processors with the AVX instructions; see his comments below.

This thread is now a request to implement such chunking.

Above edited 9.10.2020

On my machine, which has AES-NI but not the AVX extensions, accelerated GCM encryption is still dog slow. Intel's docs seem to indicate that SSE GHASH acceleration is possible, may provide a performance benefit similar to AVX-accelerated GHASH (see https://github.com/openzfs/zfs/pull/9749), and has code already available; find gcm_sse.asm at page 14:

https://www.intel.com/content/dam/www/public/us/en/documents/software-support/enabling-high-performance-gcm.pdf

Any possibility of implementing? Thank you.

InsanePrawn commented 4 years ago

The link [6] in the PDF seems to be dead, so I couldn't even find out whether the provided implementation is a 'proof of concept/educational, do not use for actual security' or supposedly an audited, secure one.

The last GCM improvements were taken from OpenSSL AFAIK, so if they carry an optimized routine for this hardware, the chances of it finding its way into OpenZFS are substantially higher.

kimono-koans commented 4 years ago

I appreciate your reply. FYI, I think this may be what we would be looking for: https://boringssl.googlesource.com/boringssl/+/refs/heads/master/crypto/fipsmodule/modes/asm/ghash-x86.pl

It indicates a more than 10x improvement. Thank you.

PrivatePuffin commented 4 years ago

@electricboogie May I inquire why you didn't use the "question" or "feature request" forms? This is clearly a question or a feature request (which one depending on personal opinion, but that's why we offer both options).

(Asking because of #10833 and feedback on #10779 )

kimono-koans commented 4 years ago

I was ignorant of those forms. Do you have a suggestion about how best to handle this now that we are where we are, so that my feature request is well received? Thank you. Appreciate your feedback.

InsanePrawn commented 4 years ago

OpenSSL seems to also carry that, at least judging by the header comment. https://github.com/openssl/openssl/blob/master/crypto/modes/asm/ghash-x86_64.pl

Maybe @AttilaFueloep can comment on or even port this 🙃 I bet a number of Intel (Pentium|Celeron) J owners would be thankful.

PrivatePuffin commented 4 years ago

@electricboogie No problem.

@behlendorf Please label this "Type: Feature" :) (it's also a feature worth looking into)

kimono-koans commented 4 years ago

FYI, might be Google/BoringSSL only, but see also: https://boringssl.googlesource.com/boringssl/+/refs/heads/master/crypto/fipsmodule/modes/asm/ghash-ssse3-x86_64.pl

AttilaFueloep commented 4 years ago

Unfortunately it's more than GHASH; you'd need an SSE equivalent of aesni_gcm_[en|de]crypt(). This routine is a complete AES-GCM implementation in assembler (an AES-NI-CTR+GHASH stitch, as the comment says) and requires AVX. I skimmed over the OpenSSL sources but to no avail.

Without that, processing larger chunks of encryption data would roughly double performance by reducing the FPU state handling overhead and wouldn't be that hard to implement. But I'm afraid I don't have the capacity to make this happen anytime soon. If someone wants to tackle this, I'm happy to help.

As a side note, I'm wondering what kind of CPU would have AES-NI but no AVX? Does it support MOVBE and PCLMULQDQ?

kimono-koans commented 4 years ago

Thank you for your response.

I have an Intel J4205. CPUID indicates it has the MOVBE and PCLMULQDQ instructions. Pleased to provide full cpuid output if needed.

I shouldn't pretend I know that the non-accelerated GHASH is the root of the problem; however, fio and OpenSSL benchmarks seem to indicate there is some performance to be gained somewhere:

$ dmesg | grep gcm
[    7.708031] SSE version of gcm_enc/dec engaged.

$ openssl speed -evp aes-128-gcm
...
OpenSSL 1.1.1f  31 Mar 2020
...
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-gcm     202107.48k   538017.41k   880800.68k  1039108.78k  1085216.09k  1089465.00k

$ fio --direct=1 --name=read --ioengine=libaio --rw=read --bs=128k --size=512m --numjobs=8 --iodepth=1 --group_reporting
read: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
...
fio-3.16
...
   READ: bw=124MiB/s (131MB/s), 124MiB/s-124MiB/s (131MB/s-131MB/s), io=4096MiB (4295MB), run=32903-32903msec

$ hdparm -t /dev/sda

/dev/sda:
 Timing buffered disk reads: 1368 MB in  3.00 seconds = 455.71 MB/sec

kimono-koans commented 4 years ago

Excuse me. My processor does have the PCLMULDQ but not the VPCLMULQDQ instruction.

$ cpuid | grep -i pcl
      PCLMULDQ instruction                   = true
      VPCLMULQDQ instruction                 = false
      PCLMULDQ instruction                   = true
      VPCLMULQDQ instruction                 = false
      PCLMULDQ instruction                   = true
      VPCLMULQDQ instruction                 = false
      PCLMULDQ instruction                   = true
      VPCLMULQDQ instruction                 = false

kimono-koans commented 4 years ago

@AttilaFueloep Excuse my (profound) ignorance, but my read of your explanation as to why an AES-NI-CTR+GHASH stitch is necessary is that the ZFS modules have to do as much as possible while they hold the FPU (because using the kernel SIMD interfaces is out of the question). Wouldn't it be possible just to ifdef an SSE-accelerated GHASH into the current stitch borrowed from the OpenSSL asm module, or from a closely related fork? I'm sure that all sounds easier than it is.

See: https://boringssl.googlesource.com/boringssl/+/refs/heads/master/crypto/fipsmodule/modes/asm/ghash-x86.pl

And see: https://github.com/openssl/openssl/blob/master/crypto/modes/asm/ghash-x86_64.pl

Understand your point that not a lot of enterprise workloads are running on CPUs like this. However, I'm willing to gamble that a huge number of home NAS systems have the same or similar CPUs.

Appreciate your guidance.

AttilaFueloep commented 4 years ago

Crypto in ZFS uses the ICP kernel module, which is a port of the Illumos crypto framework and is CDDL-licensed. Therefore it can't use the GPL-only Linux kernel SIMD interfaces. The AES-GCM implementation is written in C and operates on single GCM blocks. Encryption and GHASH calculation are accelerated by using the AES-NI and PCLMULQDQ SIMD instructions. Since AES-GCM requires one AES and one GHASH calculation per block, this results in two FPU saves and two FPU restores per 16-byte block. This, of course, is a massive overhead, slowing down the operation. Please see the PR description in #9749.
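
To make the cost concrete, here is a minimal C sketch of the per-block pattern described above. It is illustrative only: kfpu_begin()/kfpu_end() correspond to the FPU save/restore visible in the perf output later in this thread, while icp_aes_encrypt_block() and icp_ghash_block() are hypothetical placeholders, not the actual ICP routines.

#include <stddef.h>
#include <stdint.h>

/* Placeholder prototypes standing in for the real SPL/ICP routines. */
extern void kfpu_begin(void);
extern void kfpu_end(void);
extern void icp_aes_encrypt_block(void *key, const uint8_t *in, uint8_t *out);
extern void icp_ghash_block(void *ctx, const uint8_t *block);

/*
 * Per-block processing (sketch, assumes len is a multiple of 16):
 * every 16-byte GCM block pays for two FPU save/restore pairs, one
 * around the AES-NI step and one around the PCLMULQDQ GHASH update.
 */
static void
gcm_process_per_block(void *key, void *ghash, const uint8_t *in,
    uint8_t *out, size_t len)
{
    for (size_t off = 0; off + 16 <= len; off += 16) {
        kfpu_begin();                       /* save FPU state */
        icp_aes_encrypt_block(key, in + off, out + off);
        kfpu_end();                         /* restore FPU state */

        kfpu_begin();                       /* save it again ... */
        icp_ghash_block(ghash, out + off);
        kfpu_end();                         /* ... and restore again */
    }
}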

The easiest way to improve performance in the non-AVX case would be to reduce this overhead by implementing the chunking described in the PR for the non-AVX cases too. This would roughly double the performance (tested while developing #9749). If you compare e.g. gcm_mode_encrypt_contiguous_blocks() against gcm_mode_encrypt_contiguous_blocks_avx() you'll see the chunking implemented at line 1182. I could do this but unfortunately right now I have a number of higher priority tasks in my queue.
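
For contrast, here is a rough sketch of the chunked approach the AVX path takes, which is what this issue asks to extend to the non-AVX implementations. It reuses the placeholder prototypes from the sketch above, and the 32 KiB chunk size merely mirrors the AVX default mentioned in #9749; none of this is the actual OpenZFS code.

/* Reuses the placeholder prototypes from the previous sketch. */
#define GCM_CHUNK_SIZE (32 * 1024)      /* mirrors the AVX default */

/*
 * Chunked processing (sketch, assumes len is a multiple of 16): the
 * FPU state is saved and restored once per chunk instead of twice per
 * 16-byte block, amortizing the overhead over up to 32 KiB of data
 * while still bounding how long preemption stays disabled.
 */
static void
gcm_process_chunked(void *key, void *ghash, const uint8_t *in,
    uint8_t *out, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t chunk = len - done;
        if (chunk > GCM_CHUNK_SIZE)
            chunk = GCM_CHUNK_SIZE;

        kfpu_begin();                       /* one save per chunk */
        for (size_t off = 0; off + 16 <= chunk; off += 16) {
            icp_aes_encrypt_block(key, in + done + off,
                out + done + off);
            icp_ghash_block(ghash, out + done + off);
        }
        kfpu_end();                         /* one restore per chunk */

        done += chunk;
    }
}

The chunk size is a trade-off: larger chunks shave off less and less overhead while keeping interrupts and preemption disabled for longer, which is exactly the tuning discussed around gcm_avx_chunk_size further down in this thread.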

Beyond that there is some optimization margin in the GHASH implementation too but I'd guess it wouldn't be that massive. To improve performance further one would need an SSE assembler version of aesni-gcm-x86_64.S but since the combination of AES-NI and SSE is quite uncommon I'd doubt that such code exists. If someone knows of such code I'd appreciate any pointer.

Wouldn't it be possible just to ifdef in an SSE accelerated GHASH into the current stitch

If you look at the source you'll realize that it heavily utilizes VEX instructions (the ones starting with the letter v) and therefore requires AVX to run anyhow. So unfortunately this is not possible.

I agree that having fast encryption for a broader range of architectures would be a good thing to have, but as it currently stands this would require implementation of one of the above options.

My processor does have the PCLMULDQ but not the VPCLMULQDQ instruction.

Yes, VEX instructions require AVX. Could you post here the output of cat /sys/module/icp/parameters/icp_aes_impl and cat /sys/module/icp/parameters/icp_gcm_impl please?

$ dmesg | grep gcm
[    7.708031] SSE version of gcm_enc/dec engaged.

That message is from the Linux kernel crypto code, which we can't use due to licence issues (GPL vs. CDDL).

kimono-koans commented 4 years ago

Appreciate your response.

cat /sys/module/icp/parameters/icp_aes_impl
cycle fastest generic x86_64 [aesni]

cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq

AttilaFueloep commented 4 years ago

So you are using hardware acceleration for both AES and GHASH, and are paying the high FPU handling price. IIRC saving the FPU state on Goldmont is twice as expensive as on Ivy Bridge, therefore I'd expect "chunked encryption" to at least double your throughput.

kimono-koans commented 3 years ago

@AttilaFueloep I really do appreciate this free education. I understand now that the bottleneck is the high FPU handling cost, and I understand as well if you don't have the bandwidth to implement this right now, but would you mind if I reframe this question and your comments as an open feature request, according to @Ornias1993's form?

Thank you.

PrivatePuffin commented 3 years ago

You can tag and ask behlendorf to just re-label this as "Type: Feature" ;)

kimono-koans commented 3 years ago

Thank you @Ornias1993. @behlendorf Could you re-label this question as "Type: Feature"? I would suggest a new title, given what @AttilaFueloep has told us re: the likely root of this performance issue. Perhaps "ICP: Implement Larger Encrypt/Decrypt Chunks for Non-AVX Processors"? Of course, I would defer to @AttilaFueloep or you others who have been so helpful.

PrivatePuffin commented 3 years ago

@electricboogie Feel free to rename the title and intro-text yourself! :) Thanks for your great ideas though :)

kimono-koans commented 3 years ago

FYI, perf top output on 0.8.3 also indicates that @AttilaFueloep is correct that FPU save and restore overhead during a fio run is the core issue. Pleased to provide additional info if needed. Thanks!

Samples: 327K of event 'cycles', 4000 Hz, Event count (approx.): 114513049547 lost: 0/0 drop: 0/65143
Overhead  Shared Object  Symbol
  15.12%  [kernel]       [k] kfpu_restore_xsave.constprop.0
  15.08%  [kernel]       [k] kfpu_restore_xsave.constprop.0
  12.15%  [kernel]       [k] kfpu_save_xsave.constprop.0
  12.05%  [kernel]       [k] kfpu_save_xsave.constprop.0
   5.27%  [kernel]       [k] aes_encrypt_intel
   4.58%  [kernel]       [k] aes_xor_block
   4.15%  [kernel]       [k] gcm_mul_pclmulqdq
   2.81%  [kernel]       [k] aes_encrypt_block
   2.38%  [kernel]       [k] aes_aesni_encrypt

AttilaFueloep commented 3 years ago

This weekend I found some time to look more into this. Given the fact that even the newest Intel Atoms (Tremont) do not support AVX but do support AES-NI and SSE, the home NAS use case is a valid point to consider.

Looking around, I think I've found some suitable code to use and I'll try to come up with something once time permits. This may take a while though, and I can't give any ETA. Once done, it should perform comparably to the AVX implementation. As soon as I have something to test, I'll let you know.

Thanks for bringing this up.

Rain commented 3 years ago

I was curious about this as well so I did some testing on a system with an Intel Celeron N3060 I've been playing around with. I definitely think there is more performance that can be squeezed out of the "lower-end" non-AVX-equipped CPUs. As @AttilaFueloep mentioned, even current Intel low-power CPUs lack AVX, so it seems reasonable to invest a bit of time to at least determine if something can be done to improve performance and/or how hard it would be to implement.

I'd be happy to do any additional testing and/or test any patches if needed!

Relevant Test System Specifications:

CPU: Intel Celeron N3060
Memory: 8GB DDR3L
Kernel: 4.19.0-11-amd64 (most recent kernel in Debian Buster at time of testing)
OpenZFS Version: 0.8.4 (latest in buster-backports at time of testing)
ZFS Encryption: aes-128-gcm
ZFS Parameters: sync=standard, compression=off, recordsize=128K (performance was similar with sync=disabled)

$ cat /sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 aesni
$ cat /sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] generic pclmulqdq

OpenSSL benchmark (aes-128-gcm)

$ openssl speed -evp aes-128-gcm
Doing aes-128-gcm for 3s on 16 size blocks: 21128961 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 64 size blocks: 10614883 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 256 size blocks: 3805791 aes-128-gcm's in 2.99s
Doing aes-128-gcm for 3s on 1024 size blocks: 1077630 aes-128-gcm's in 3.00s
Doing aes-128-gcm for 3s on 8192 size blocks: 138814 aes-128-gcm's in 2.99s
Doing aes-128-gcm for 3s on 16384 size blocks: 69491 aes-128-gcm's in 2.98s
OpenSSL 1.1.1d  10 Sep 2019
built on: Mon Apr 20 20:23:01 2020 UTC
options:bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-8Ocme2/openssl-1.1.1d=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAESNI_ASM -DVPAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-gcm     112687.79k   226450.84k   325846.99k   367831.04k   380322.50k   382060.59

Sequential Write Performance

$ dd if=/dev/zero of=./zero.000 bs=1M count=16K
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 435.618 s, 39.4 MB/s

I also compiled the Kernel with the FPU begin/end functions re-exported and tested again. I used the current, stock Debian kernel source; the only change was the FPU function exports, otherwise identical to the previous test. As expected, performance was better, though I was surprised by how much (about 80%!).

Sequential Write Performance (With Kernel FPU Exported)

$ dd if=/dev/zero of=/testpool/temp/zero.000 bs=1M count=16K
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 236.754 s, 72.6 MB/s

AttilaFueloep commented 3 years ago

As mentioned above, I'm working on a prototype which I expect to perform at least as well as OpenSSL (380 MB/s). I'll let you know once there is something to test. As an added benefit, once the SSE stuff is working, it's straightforward to add support for avx2, avx512 and avx512-vaes as well.

I also compiled the Kernel with the FPU begin/end functions re-exported and tested again. I used the current, stock Debian kernel source; the only change was the FPU function exports, otherwise identical to the previous test. As expected, performance was better, though I was surprised by how much (about 80%!).

Yes, that resembles the 100% I observed. If you can use the kernel FPU functions the FPU state is only saved on context switches which has essentially the same effect as processing larger chunks while disabling preemption.

scineram commented 3 years ago

@AttilaFueloep How about ensuring the default gcm_avx_chunk_size is at least SPA_OLD_MAXBLOCKSIZE (after rounding)? Would that allow all data blocks with the default recordsize, metadata blocks (+indirect blocks? Are they authenticated?), and ZIL blocks to be encrypted in one go?

AttilaFueloep commented 3 years ago

I don't think that's useful for two reasons.

First, while using the FPU we disable interrupts and preemption to make sure the FPU regs won't get clobbered in between. To avoid starving the system, we should minimize the time we stay in this state, so the smaller the chunk size, the better.

And second, there are diminishing returns with increasing chunk size. Let's do a rough estimate: two calls per 16 bytes results in a 100% overhead, so one call per 32 KiB produces 0.025% overhead, which is negligible already. Retrospectively I think that the chosen value of 32k is already quite large and a value of 16 KiB or 8 KiB would've been a better choice. I'm planning to refine the default value after running some benchmarks.

oberien commented 3 years ago

Coming from #11525, is there any rough time estimate (weeks, a few months, half a year, …) in which you will be able to tackle the issue for SSE4.2?

Is there a way I could help with the implementation?

AttilaFueloep commented 3 years ago

@oberien I scheduled some time in February for this task. ETA depends on how smooth this will go but I think end of February or March is a realistic time frame. Thanks for your offer, you could help testing once I've a prototype. Somewhat unrelated, what board are you using? The Atom C3538 looks promising.

AttilaFueloep commented 3 years ago

@oberien BTW, meanwhile you can roughly double performance temporarily by using the Linux kernel_fpu_{begin,end}() functions. (https://github.com/openzfs/zfs/issues/10846#issuecomment-706538703) This can be done either by recompiling the kernel or the ZFS module itself. I never did this myself but a search should bring up ways to do it. As long as you don't distribute your work this should be legal (IANAL).
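
For anyone curious what that workaround amounts to, here is a hedged sketch of the idea as it appears in OpenZFS's Linux SIMD compatibility layer; the exact macro name and header may differ between versions, so treat this as an illustration rather than a quote of the source.

/*
 * Illustrative only. When the kernel (re-)exports kernel_fpu_begin()/
 * kernel_fpu_end(), the ZFS wrappers can simply delegate to them; the
 * kernel then takes care of the FPU state lazily (effectively once per
 * context switch) instead of ZFS issuing an explicit XSAVE/XRSTOR
 * around every crypto call.
 */
#if defined(KERNEL_EXPORTS_X86_FPU)
#define kfpu_begin()    kernel_fpu_begin()
#define kfpu_end()      kernel_fpu_end()
#else
/*
 * Fallback: ZFS saves and restores the FPU registers itself (the
 * kfpu_save_xsave()/kfpu_restore_xsave() symbols visible in the perf
 * output above), which is the per-call overhead this issue is about.
 */
#endif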

oberien commented 3 years ago

I think end of February or March is a realistic time frame

That's pretty fast, great to hear! Then I won't need to use ZFS on LUKS but instead can wait a little and use the slow(er) ZFS native encryption for now.

you could help testing once I've a prototype

I will absolutely do that!

what board are you using? The Atom C3538 looks promising.

I'm using a NETGEAR ReadyNAS RN426, which comes with a custom PCB for the RN426/RN428. It uses the Atom C3538 as its CPU. As it also comes with a fully functional serial console, I could access the BIOS and install a custom Debian, allowing me to get ZFS (which isn't possible with the stock OS, which comes without the linux-headers).

you can roughly double performance temporarily by using the Linux kernel_fpu_{begin,end}() functions.

Thanks for the suggestion, I'll try that.

oberien commented 3 years ago

you can roughly double performance temporarily by using the Linux kernel_fpu_{begin,end}()

I can confirm that this does work perfectly fine for write performance. I'm now getting the following speeds (same commands and setup as used in #11525):

write: 10737418240 bytes (11 GB, 10 GiB) copied, 50.4842 s, 213 MB/s
read:  10737418240 bytes (11 GB, 10 GiB) copied, 230.778 s, 46.5 MB/s
read2: 10737418240 bytes (11 GB, 10 GiB) copied, 183.007 s, 58.7 MB/s

The read didn't really improve performance by much. For read2 I turned off atime. I guess it somewhat doubled the performance of the read from ≈35 to ≈60MB/s for ZFS v2.0.1-3, but that's still way lower than the original bad performance I got with ZFS v0.8.6-1 (90MB/s). I'm not really sure what's up with that loss in performance in the newer ZFS version. While reading, zpool iostat -v 1 jumps from 30MB/s up to 170MB/s and back randomly, keeping the high speeds only for a few seconds before dropping back down. It's not CPU-bound though. When it loads data with 30MB/s, my CPU is at ≈20% on all 4 cores. On 170MB/s it's at ≈80% on all 4 cores, which could indicate an upper bound on the decryption performance with the current implementation. However, I don't know why it would always drop back down to 30MB/s and stay there for most of the time.

AttilaFueloep commented 3 years ago

@oberien

I'm using a NETGEAR ReadyNAS RN426

Thanks for sharing, looks like an easy way to build a custom NAS.

The read didn't really improve performance by much.

There were a couple of issues describing sawtooth-like performance drops recently, none of them limited to reads though (IIRC). Since the CPU stays below 20% in the slow phases, I think this issue is unrelated to encryption performance. Do you see any blocked tasks in your logs? Do you see the same behaviour on unencrypted datasets? What happens if you remove the L2ARC from the equation? The better performance with atime=off is expected since this avoids writing to the pool to update the access time on every read.

oberien commented 3 years ago

Do you see any blocked tasks in your logs?

I'm not sure what you mean. How can I do that?

What happens if you remove the L2ARC from the equation?

That completely fixes the problem. I'm having read speeds of 200MB/s now as expected with a CPU usage of 100% when reading 10GB of data.

| test | read speed | CPU |
| --- | --- | --- |
| with encryption | | |
| read without cache with atime=off | 201MB/s | 88% |
| read without cache with atime=on | 197MB/s | 85% |
| read with empty cache with atime=off | 40-180MB/s, avg=112MB/s | 20%-90%, one core 100% whenever performance wasn't 180MB/s |
| read with 0GB SSD L2ARC with atime=on | 40-180MB/s, avg=105MB/s | 20-90%, one core 100% whenever performance wasn't 180MB/s |
| read with 3GB SSD L2ARC | avg=92.8MB/s | |
| read with 6GB SSD L2ARC | avg=130MB/s | |
| read with 9.5GB SSD L2ARC | avg=192MB/s | |
| read after new creation and write (3GB SSD L2ARC) | 30MB-50MB, avg=44MB/s | 30% |
| read after new creation and write (104MB tmpfs L2ARC) | avg=36MB/s | 25% |
| without encryption | | |
| read without cache with atime=off | 338MB/s | |
| read without cache with atime=on | 329MB/s | |
| read with 0GB SSD L2ARC with atime=on | 325MB/s | |
| read with 1.8GB SSD L2ARC | 335MB/s | |
| read after new creation and write (2.5GB L2ARC) | 85.3MB/s | |

I see two different problems here:

  1. It appears as though the read performance from the HDDs drops whenever the write throughput of the SSD rises when using ZFS native encryption. When ZFS writes to the SSD-Cache with 60MB/s, the HDD read throughput is as low as 2MB/s. This hypothesis is also reinforced by the fact that the read performance improves as the L2ARC contained more of the file, even though the cache wasn't read from according to zpool iostat -v 1. Without encryption on, the read throughput stays constant, no matter how much is written to the L2ARC. My guess here would be that the encryption of the data being written to the L2ARC somehow severely negatively impacts / restricts reading from the HDD. Also the maximum SSD-cache-write throughput I've seen with encryption on was 60MB/s, averaging around 30-40MB/s. Without encryption the SSD-Cache is written to with consistently 79MB/s.
  2. There is some very weird behaviour going on with the following workload using an L2ARC no matter if encryption is used or not:
    zfs create -o mountpoint=/foo tank/foo
    dd if=/dev/zero of=/foo/fo bs=1M count=10k
    zpool export tank
    zpool import tank
    dd if=/foo/fo of=/dev/null bs=1M

    The write has full performance. However, the read is very slow compared to an empty / no L2ARC. This only happens after writing with the L2ARC enabled and the new feature of ZFS reusing the L2ARC on export/import. If I drop the L2ARC and recreate it, or just not have an L2ARC, the throughput is high as expected. It seems as though the cache entries created when writing a file slow down reading the file afterwards. However, cache entries created when reading a file don't negatively impact read throughput afterwards.

My cheap SSD, which I use as L2ARC for testing, only supports ≈75MB/s unbuffered write throughput. However, I am able to reproduce (2) with encryption even when using an in-memory cache (truncate -s 120G /dev/shm/cache), even though barely anything is written to the cache (only 250MB after the whole 10GB write). I wasn't able to reproduce issue (1), as barely anything is being written to the in-memory L2ARC (both if I use it directly as file, and if I create a loop-device from the file and use the loop-device as cache).

Should I open new issues for both of those issues, or at least the second one?

AttilaFueloep commented 3 years ago

Since I'm not using L2ARC, I'm afraid I can't be of much help here. Maybe @gamanakis has more insight, especially regarding the persistent L2ARC case. Just some general remarks: AFAIK writes bypass the L2ARC entirely, so this would explain why they are not affected. Using an L2ARC device with lower throughput than the main pool seems to not make much sense. 3GB of L2ARC seems rather small, even taking your 4GB of memory into account.

gamanakis commented 3 years ago

I am late to the discussion, but I recommend the following: switch away from using dd for any benchmarks. Use fio in randread mode; this is the appropriate way to test L2ARC performance (assuming the default of l2arc_noprefetch=1, otherwise read mode). Use an SSD/NVMe/in-memory device faster than the actual vdevs of the pool. Limit zfs_arc_max so that it is smaller than the L2ARC size, otherwise the majority of reads will come from ARC. The file size for fio in this case should ideally be larger than the ARC.

Yes writes bypass the L2ARC as L2ARC has its own logic of caching data from ARC.

A solid example would be:

l2arc_noprefetch = 0
zfs_arc_max = 1GB
L2ARC size = 16GB
fio --name=test --size=10G --readwrite=read --directory=/test --time_based --runtime=100000

oberien commented 3 years ago

Thanks for your quick responses. I'm not testing the performance of the L2ARC here and currently don't want to test it. Currently I'm testing the workload of fully sequential reads and writes. The problem here is that just having an L2ARC while running my tests significantly impacts the performance of sequential reads and writes.

3GB of L2ARC seems rather small

3GB SSD L2ARC meant that I'm using a 120GB SSD as L2ARC and it was filled with 3GB at that time before I started the test.

4GB of memory

My machine has 32GB of RAM.

Using a L2ARC device with lower throughput than the main pool seems to not make much sense.

That's why I included the test with an in-memory L2ARC to show that throughput doesn't matter for this issue.

AttilaFueloep commented 3 years ago

3GB SSD L2ARC meant that ...

Sorry, misunderstood.

My machine has 32GB of RAM.

Ok, so you can update the RAM on the NETGEAR, nice.

That's why I included the test with an in-memory L2ARC

Right, that was just meant as an obvious note.

jumbi77 commented 3 years ago

@AttilaFueloep I don't want to bother you, but I want to nicely ask for a status update. Did you make any progress? Many thanks in advance for all your effort! I am really looking forward to your improvement.

AttilaFueloep commented 3 years ago

@jumbi77 I really appreciate your encouragement. Yes, unfortunately I'm quite a bit behind schedule right now. Things tend to go haywire in these strange days. The current status is that the nasm integration is finished and I'm currently fighting some bugs with the asm integration. So I'd like to ask for another month or maybe two to finish things up.

Sorry for the delay responding BTW, been away from a keyboard for quite a while.

misuzu commented 3 years ago

Any update on this? It's sad that native encryption is unusable on consumer-grade NAS hardware.

kimono-koans commented 3 years ago

FYI, there seems to be a CDDL-licensed C implementation of GHASH accelerated by non-vectorized PCLMULQDQ in the illumos kernel. See: https://github.com/illumos/illumos-gate/blob/master/usr/src/common/crypto/modes/gcm.c#L164

katagia commented 3 years ago

I'm running a NAS with a Pentium G5400 CPU (ECC, AES-NI but not AVX).

It's running perfectly with TrueNAS (FreeBSD, zfs 2.0.4-3, native encryption). There is almost no CPU load; performance is limited by the hard disks and the network interface.

With the same hardware and the same pool, the performance is very bad when I use a Linux-based system, e.g. Ubuntu. The CPU load is at the limit.

Did I understand this thread right that the performance issues are caused by license problems? Is there any chance to get the same performance in FreeBSD and Linux when using native encryption?

Rain commented 3 years ago

@katagia I believe you're correct; the G5400, similar to the CPUs we've discussed above, lacks the instructions that are required for proper performance in the current AES-GCM implementation used in OpenZFS on Linux.

You can regain a significant amount of performance by compiling the kernel with the relevant FPU functions re-exported, though I don't think it will perform quite as well as it does on FreeBSD (until a better, more performant GCM implementation is written/adapted for OpenZFS on Linux). This patch may be helpful: https://github.com/NixOS/nixpkgs/blob/693c7cd0f7e6ce6ff7c6210ac1857712dac4cad5/pkgs/os-specific/linux/kernel/export_kernel_fpu_functions_5_3.patch

AttilaFueloep commented 2 years ago

First of all, sorry to everybody. I was totally swamped the last ~4 months. Things are getting better now.

@katagia

It's running perfectly with TrueNAS (FreeBSD, zfs 2.0.4-3, native encryption). There is almost no CPU load; performance is limited by the hard disks and the network interface

So now that is interesting. I had a quick look at the FreeBSD crypto sources and couldn't find many differences from the crypto code used by OpenZFS (ICP). I'd expect encrypted ZFS on FreeBSD to perform comparably to Linux with the patches @Rain mentioned applied. Maybe someone with knowledge of the FreeBSD crypto API could enlighten us here? Is there any code which does GCM processing (COUNTER, GMULT, AES) in one big SIMD (SSE/AVX) accelerated assembler routine?

katagia commented 2 years ago

@AttilaFueloep I compared the performance with a standard Ubuntu system. I didn't apply the patches mentioned here. Sorry if I was not precise enough.

AttilaFueloep commented 2 years ago

Well, I think it was me who wasn't precise enough. The mentioned patches will approximately halve CPU usage. So if you are CPU bound on Ubuntu and have "almost no CPU load" on FreeBSD I wouldn't expect the patches to bring up Ubuntu performance to the FreeBSD level, meaning FreeBSD crypto performs much better than the ICP one. Of course without trying we can't be sure.

AdamLantos commented 1 year ago

Hi @AttilaFueloep, is there an update on this work? I just recently switched to TrueNAS Scale 22.12 on a C3558 system, and the encryption performance is pretty low. I was running VMs from encrypted zvols (on a pair of NVMe drives), and booting the guests was using up all the CPU on the host, and general guest IO was pretty sluggish. I ran a few fio tests and the maximum I was able to get was ~200MB/s with 100% CPU utilization on the host.

TrueNAS 22.12 has a patch applied (https://github.com/truenas/zfs/pull/95) which implements the suggested FPU patch from this issue, but even with that, the CPU cores are always pegged with IO. For now I switched to unencrypted zvols, and things are significantly better. I'm wondering if there is still planned work to support non-AVX processors better.

AttilaFueloep commented 1 year ago

@AdamLantos Thanks for the reminder. I started the effort since I was planning to get myself an Atom-based NAS. Unfortunately, due to the pandemic-induced supply chain problems, it turned out to be nearly impossible to get adequate hardware back then. So the project went down in priority on my todo list.

Starting this week I can devote more time to zfs work, so I dusted off my SSE work-tree and will hopefully come up with a draft PR in a couple of weeks. Let's see how it goes.

AdamLantos commented 1 year ago

@AttilaFueloep - that's great news!

I just did some rough, unscientific testing. FreeBSD seems to have a ~6x speed advantage (650MB/s vs 100MB/s) on my Atom C3000 platform while reading from encrypted datasets, when compared apples to apples.

When I'm using the TrueNAS host, which should have the FPU patch applied, the speed goes from 100MB/s to 230MB/s, which is in line with the expected 2x improvement as discussed above. It is still 3x slower than FreeBSD 13.1.

Setup 1

TrueNAS Scale 22.12 (Debian-based) running on bare metal on a C3558 4-core system. ARC limited to 300MB. WD SN750SE NVMe drive.

root@truenas[~]# dd if=/mnt/main/shares/media/test.tar of=/dev/null bs=1M status=progress
6709837824 bytes (6.7 GB, 6.2 GiB) copied, 29 s, 231 MB/s
6541+1 records in
6541+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 29.6507 s, 231 MB/s

Setup 2

Guest VM on the same machine running FreeBSD 13.1. 4 virtual cores, 512MB of VM memory. WD SN750 NVMe drive with PCI passthrough.

CPU utilization was ~70% across all physical cores; there seems to be some IO wait involved.

root@labfreebsd:~ # dd if=/main/encrypted/test.tar of=/dev/null bs=1M status=progress
  6592397312 bytes (6592 MB, 6287 MiB) transferred 10.005s, 659 MB/s
6541+1 records in
6541+1 records out
6858762240 bytes transferred in 10.322330 secs (664458708 bytes/sec)

Setup 3

Guest VM on the same machine running Debian Bullseye and zfs-dkms. 4 virtual cores, 512MB of VM memory. WD SN750 NVMe drive with PCI passthrough (same ZFS pool and encrypted dataset as setup 2 above).

CPU utilization at 100% across all physical cores.

root@labdebian:~# dd if=/main/encrypted/test.tar of=/dev/null bs=1M status=progress
6811549696 bytes (6.8 GB, 6.3 GiB) copied, 68 s, 100 MB/s
6541+1 records in
6541+1 records out
6858762240 bytes (6.9 GB, 6.4 GiB) copied, 68.3982 s, 100 MB/s

AttilaFueloep commented 1 year ago

Yes, this matches expectations. The combined AES + GCM assembly boosts real-life performance roughly by a factor of ten IIRC, and once you're not CPU limited any more, encryption performance no longer determines I/O throughput. This is a bit simplified but you get the idea.

Fun fact, the SSE 4.1 assembly I tested yesterday seems to be even faster than the existing AVX2 implementation (outside ZFS, that is).

ChrisWitt commented 1 year ago

I would be very happy if this performance fix works out. I am in the small NAS, or rather small Intel CPU NAS, situation.

Supermicro Inc. A2SDi-2C-HLN4F-B With Intel(R) Atom(TM) CPU C3338 @ 1.50GHz

I chose it as a low-power device to handle all the disks I would need.

I chose ZFS native encryption for now, hoping the performance can be improved down the line. I get ~30MB/s with encryption enabled. Writing to an unencrypted volume via SSH I get ~90MB/s, so I most likely hit the network cap.

To me the filesystem-integrated encryption of ZFS sounds like the way to go, with scrubs and backup sends being possible on the encrypted blocks.

Just so you know, there are people out there who want this and greatly appreciate the effort all the contributors put into this.

I am a Java developer but was unable to follow the hints on how to patch my kernel in the meantime. And I also fear messing with encryption and corrupting my data when I do something stupid :)

To sum it up: Thank You for what you have done! And hoping for news ;)

AttilaFueloep commented 1 year ago

The implementation is done and I'm currently debugging the mess I coded. This can take an undefined time (as a developer you know that) but hopefully I can come up with a PR in a week or two.