openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

tragic latency/performance of SSD pool (with encryption) #10363

Closed etosan closed 3 years ago

etosan commented 4 years ago

System information

Type                  Version/Name
Distribution Name     Void Linux
Distribution Version  N/A
Linux Kernel          5.4.36
Architecture          x86_64
ZFS Version           0.8.4-1
SPL Version           0.8.4-1

Describe the problem you're observing

When copying/moving data between two natively encrypted ZFS datasets, machine behavior becomes tragic: I get roughly 100-500 ms stutter everywhere (browser, terminal), and the machine essentially becomes unable to play 720p video concurrently with the copy/move.

This is a maxed-out ThinkPad W540 with 32 GB of RAM.

If the fan is set to "auto" (through the kernel's builtin thinkpad_acpi driver), at a certain point the CPUs overheat and kernel throttling kicks in (until assisted by a manual cooldown):

[156447.376965] mce: CPU1: Core temperature above threshold, cpu clock throttled (total events = 632944)
[156447.376967] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 644755)
[156447.376968] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 644755)
[156447.376969] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 644755)
[156447.376972] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 644755)
...
[159842.694721] mce: CPU0: Package temperature/speed normal
[159842.694721] mce: CPU2: Package temperature/speed normal
[159842.694722] mce: CPU3: Package temperature/speed normal
[159842.694725] mce: CPU1: Package temperature/speed normal

Manually setting the fan to maximum speed alleviates the overheating, but the stutters remain. It is essentially impossible to type at a 20 ms repeat rate in Xorg.
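(Roughly how the fan is forced through the thinkpad_acpi interface; a sketch, assuming the module is loaded with fan_control=1:)

: echo level full-speed > /proc/acpi/ibm/fan   # pin fan at maximum
: echo level auto > /proc/acpi/ibm/fan         # return to automatic control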

The pool is composed of two partitioned 4 TB Samsung SSDs (partitioning is obviously needed for the boot and swap partitions; exact models can be provided on demand).

The majority of CPU time is consumed by several z_wr_iss kernel tasks, and spl_dynamic_tas is constantly stuck in D state.

CPU load distribution as reported by htop:

Behavior is the same whether I use mv/cp or tar|tar. With tar|mbuffer -m512m|tar, I see the mbuffer read pipe clock in somewhere around 1-50 MB/s, while the write pipe can easily sustain 100-300 MB/s (at the cost of the stuttering, I guess).
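For clarity, the pipeline looks roughly like this (source/target paths are placeholders):

: tar -C /pool/src -cf - . | mbuffer -m 512M | tar -C /pool/dst -xf -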

Internet radio stream playback is unaffected, so this looks like a disk-I/O vs. input-I/O problem (is it?).

As soon as the copy/move operation finishes, the machine returns to stellar performance. Similarly, heavy compiles do not cause these issues despite high CPU loads. More curious is that ~90% of CPU time is spent in z_wr_iss.

I don't mind if the machine feels slower, but stuttering at this level is simply unacceptable; it is impossible to work with the box interactively, and something is clearly wrong somewhere.

Before the SSD upgrade, the machine used 2x2 TB spindles and I never experienced such a problem.

If there are any knobs I can try turning I will gladly do so.

Describe how to reproduce the problem

Create two datasets with two different encryption roots (and thus keys) and copy or move a massive amount of data between them (60-100 GB plus). At a certain point the machine becomes unusable.
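For example (pool name and key handling are placeholders):

: zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/enc-a
: zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/enc-b
: cp -a /tank/enc-a/bigdata /tank/enc-b/   # 60-100 GB plus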

Include any warning/errors/backtraces from the system logs

Besides what is provided in the observation (CPU overheat) section, I see no problem reported anywhere in the system. If more information is needed, just request it.

Besides compression and encryption, everything else is left to ZFS autotuning. The ZIL is in-pool (no separate SLOG device) and L2ARC is disabled. ARC is uncapped, but I plan to cap it at 16 GB soon.
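(A sketch of the planned ARC cap, via the usual module parameter; 16 GiB given in bytes:)

: echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# or persistently, in /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=17179869184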

Disk schedulers are set to none:

: cat /sys/block/sd{a,b}/queue/scheduler 
[none] mq-deadline kyber bfq 
[none] mq-deadline kyber bfq 
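(Set at runtime with something like the following; a udev rule would make it persistent:)

: echo none > /sys/block/sda/queue/scheduler
: echo none > /sys/block/sdb/queue/scheduler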

Disk subsystem layout:

NAME                                            MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda                                               8:0    0  3.7T  0 disk  
├─sda1                                            8:1    0    4G  0 part  /boot/esp/b
├─sda2                                            8:2    0   48G  0 part  
│ └─md127                                         9:127  0   48G  0 raid1 
│   └─luks-********-****-****-****-************ 254:0    0   48G  0 crypt [SWAP]
└─sda3                                            8:3    0  3.6T  0 part  
sdb                                               8:16   0  3.7T  0 disk  
├─sdb1                                            8:17   0    4G  0 part  /boot/esp/a
├─sdb2                                            8:18   0   48G  0 part  
│ └─md127                                         9:127  0   48G  0 raid1 
│   └─luks-********-****-****-****-************ 254:0    0   48G  0 crypt [SWAP]
└─sdb3                                            8:19   0  3.6T  0 part  

Pool layout:

: zpool status
  pool: zmona
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:58:08 with 0 errors on Thu May  7 17:48:47 2020
config:

    NAME                                      STATE     READ WRITE CKSUM
    zmona                                     ONLINE       0     0     0
      mirror-0                                ONLINE       0     0     0
        ********-****-****-****-************  ONLINE       0     0     0
        ********-****-****-****-************  ONLINE       0     0     0

Dataset layout:

: zfs list
NAME                                                       USED  AVAIL     REFER  MOUNTPOINT
zmona                                                     2.49T  1000G      320K  none
zmona/BOOT
zmona/BOOT/void                                            265M  1000G      265M  /boot
...
zmona/ROOT                                                20.9G  1000G      320K  none
zmona/ROOT/void                                           20.9G  1000G     20.9G  /
...
# simply too many fses to list (and essentially unimportant)

Keep in mind /boot here is used just for staging; it is periodically refreshed to the EFI system partitions on both disks (sda1, sdb1).
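The refresh is essentially a sync of the staging /boot onto each ESP mount; something like this sketch (the exact invocation is illustrative, mount points per the lsblk output above):

: rsync -a --delete /boot/ /boot/esp/a/
: rsync -a --delete /boot/ /boot/esp/b/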

AttilaFueloep commented 4 years ago

What encryption algorithm are you using on the affected datasets, AES-CCM or AES-GCM? You can find out by running zfs get encryption <dataset>.

Could you run perf top while encountering the issue? That would show where the load is being generated.
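Something along these lines should do (a sketch; flags may vary slightly by perf version):

: perf top -g                      # live per-function view with call graphs
: perf record -a -g -- sleep 30    # or capture ~30 s system-wide during the stutter
: perf report --stdio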

etosan commented 4 years ago

Both datasets, source and destination, are using the default method (I think I never messed with that property):

: zfs get encryption zmona/HOMES/eto      
NAME             PROPERTY    VALUE        SOURCE
zmona/HOMES/eto  encryption  aes-256-gcm  -

target:

: zfs get encryption zmona/HOMES/eto/downloads/*/new   
NAME                             PROPERTY    VALUE        SOURCE
zmona/HOMES/eto/downloads/*/new  encryption  aes-256-gcm  -

Hmm, the perf output is interesting. I have yet to learn to use it, but you are saying it can spit out the backtrace/callsite in z_wr_iss where it is spending most of its time?

etosan commented 4 years ago

I cannot do tests at the moment, as they are quite heavy (space- and time-wise) and this is also my main work machine currently, but in a day or two I will rerun them and report perf output for the affected threads.

Meanwhile I would like to know whether aes-256-gcm was a bad choice.

kneutron commented 4 years ago

Meanwhile I would like to know whether aes-256-gcm was a bad choice.

--Before employing encryption, a good sysadmin rule is "don't use it where it's not needed; if needed, benchmark it first." IMHO, unless you're a business user with a corporate laptop or a multi-terabyte environment with business secrets, whole-environment encryption is overkill and makes things more fragile. Home users shouldn't really "need" it -- it would seem to make much more sense to set up a password-protected dedicated area to hide your $secrets in, unless you don't want people to even boot your machine without the password. And remember, you can fully encrypt a VM pretty easily (at least with VirtualBox and VMware).

--Regardless of MHO, there is a good hint in this thread on how to check encryption / raidz parity speeds on Linux:

https://www.reddit.com/r/zfs/comments/clq88s/have_there_been_any_benchmarks_yet_on_the_speed/

--You could also try LUKS benchmark:

https://unix.stackexchange.com/questions/254017/how-to-interpret-cryptsetup-benchmark-results

See also - REF: https://news.ycombinator.com/item?id=14591452

REF: https://www.cyberciti.biz/faq/how-to-find-out-aes-ni-advanced-encryption-enabled-on-linux-system/

--I'll probably try writing a bash script to test the various ZFS at-rest encryption methods, but in the meantime you can try openssl speed aes to run some basic non-ZFS-specific benchmarks.
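For example (the -evp form exercises the accelerated cipher path):

: openssl speed aes
: openssl speed -evp aes-256-gcm
: cryptsetup benchmark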

kneutron commented 4 years ago

--In an effort to be useful, I put together a bash script that creates ZFS datasets with all the currently available encryption types (it depends on an up-to-date man page and should adapt to future encryption types). It has a few features -- such as auto-testing openssl / cryptsetup encryption speeds and optionally copying an ISO file to a ramdisk for later timed copying to each encrypted dataset -- and it cleans up after itself if you ask it nicely.

--It is HIGHLY recommended to review/edit the script before running it, as there are a few EDITMEs scattered in it that make some assumptions.

--ZFS devs, please feel free to adapt my script and include it in the ZFS test suite if it is deemed worthy.

https://pastebin.com/w5WbF1gb

--Sample output (from the logfile)

Tue May 26 19:16:35 CDT 2020 - mintLatitude-E6540 - BEGIN ZFS encryption speed tests
o Kernel: 5.3.0-46-generic
o Zpool version: zfs-0.8.4-1 zfs-kmod-0.8.4-1
[ 19.600250] ZFS: Loaded module v0.8.4-1, ZFS pool version 5000, ZFS filesystem version 5
o CPU detected: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
o Number of CPU cores/hyperthreads: 8
o Actual number of cpu cores : 4
o CPU supports AES acceleration (blank=NO): aes
o Crypto modules loaded: aesni_intel, aes_x86_64, cmac, crc32_pclmul, crct10dif_pclmul, cryptd, ecdh_generic, ghash_clmulni_intel, kernel, md4, serpent_avx2, serpent_avx_x86_64, serpent_generic, serpent_sse2_x86_64, twofish_avx_x86_64, twofish_generic, twofish_x86_64, twofish_x86_64_3way
o Check for openssl / run benchmarks
OpenSSL 1.1.1  11 Sep 2018
built on: Tue Nov 12 16:58:35 2019 UTC
options: bn(64,64) rc4(16x,int) des(int) aes(partial) blowfish(ptr)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -fdebug-prefix-map=/build/openssl-kxN_24/openssl-1.1.1=. -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_CPUID_OBJ -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DKECCAK1600_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DX25519_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc    137681.24k   161742.81k   165453.23k   166719.49k   167583.74k   167417.17k
aes-192 cbc    124583.80k   135018.45k   137080.83k   138062.85k   138483.03k   138177.19k
aes-256 cbc    107935.74k   115796.42k   117538.13k   117877.42k   118390.78k   118450.86k
o Check for cryptsetup - please wait, this will take a few

Tests are approximate using memory only (no storage IO).

PBKDF2-sha1       1519675 iterations per second for 256-bit key
PBKDF2-sha256     1702233 iterations per second for 256-bit key
PBKDF2-sha512     1328993 iterations per second for 256-bit key
PBKDF2-ripemd160  1107260 iterations per second for 256-bit key
PBKDF2-whirlpool   788403 iterations per second for 256-bit key
argon2i     6 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)
argon2id    6 iterations, 1048576 memory, 4 parallel threads (CPUs) for 256-bit key (requested 2000 ms time)

Algorithm | Key | Encryption | Decryption

    aes-cbc   128b   712.6 MiB/s  2873.1 MiB/s
serpent-cbc   128b    93.5 MiB/s   599.9 MiB/s
twofish-cbc   128b   191.6 MiB/s   377.2 MiB/s
    aes-cbc   256b   526.8 MiB/s  2242.1 MiB/s
serpent-cbc   256b    93.8 MiB/s   600.2 MiB/s
twofish-cbc   256b   190.9 MiB/s   379.6 MiB/s
    aes-xts   256b  1885.6 MiB/s  1914.3 MiB/s
serpent-xts   256b   600.5 MiB/s   580.0 MiB/s
twofish-xts   256b   374.0 MiB/s   373.2 MiB/s
    aes-xts   512b  1561.3 MiB/s  1554.9 MiB/s
serpent-xts   512b   598.7 MiB/s   579.0 MiB/s
twofish-xts   512b   374.8 MiB/s   373.7 MiB/s

o PREP - Copy ~1GB iso file to ramdisk if not already there
o Supported ZFS encryption types (per 'man zfs'):
aes-128-ccm, aes-192-ccm, aes-256-ccm, aes-128-gcm, aes-192-gcm, aes-256-gcm
o Creating zmsata480 dataset for aes-128-ccm
NAME                        PROPERTY     VALUE                              SOURCE
zmsata480/Test-aes-128-ccm  encryption   aes-128-ccm                        -
zmsata480/Test-aes-128-ccm  keylocation  file:///root/zek-testencr-zfs.key  local
o Creating zmsata480 dataset for aes-192-ccm
NAME                        PROPERTY     VALUE                              SOURCE
zmsata480/Test-aes-192-ccm  encryption   aes-192-ccm                        -
zmsata480/Test-aes-192-ccm  keylocation  file:///root/zek-testencr-zfs.key  local
o Creating zmsata480 dataset for aes-256-ccm
NAME                        PROPERTY     VALUE                              SOURCE
zmsata480/Test-aes-256-ccm  encryption   aes-256-ccm                        -
zmsata480/Test-aes-256-ccm  keylocation  file:///root/zek-testencr-zfs.key  local
o Creating zmsata480 dataset for aes-128-gcm
NAME                        PROPERTY     VALUE                              SOURCE
zmsata480/Test-aes-128-gcm  encryption   aes-128-gcm                        -
zmsata480/Test-aes-128-gcm  keylocation  file:///root/zek-testencr-zfs.key  local
o Creating zmsata480 dataset for aes-192-gcm
NAME                        PROPERTY     VALUE                              SOURCE
zmsata480/Test-aes-192-gcm  encryption   aes-192-gcm                        -
zmsata480/Test-aes-192-gcm  keylocation  file:///root/zek-testencr-zfs.key  local
o Creating zmsata480 dataset for aes-256-gcm
NAME                        PROPERTY     VALUE                              SOURCE
zmsata480/Test-aes-256-gcm  encryption   aes-256-gcm                        -
zmsata480/Test-aes-256-gcm  keylocation  file:///root/zek-testencr-zfs.key  local
Tue May 26 19:18:06 CDT 2020 ./zfs-test-encryption-speeds.sh - mintLatitude-E6540 - Ready for testing
./zfs-test-encryption-speeds.sh - Cleanup called
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes-128-ccm
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes-128-gcm
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes-192-ccm
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes-192-gcm
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes-256-ccm
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes-256-gcm
./zfs-test-encryption-speeds.sh - Tue May 26 19:23:45 CDT 2020 - Cleanup called
./zfs-test-encryption-speeds.sh - Destroying encryption test dataset: Test-aes*

etosan commented 4 years ago

Before employing encryption, a good sysadmin rule is "don't use it where it's not needed; if needed, benchmark it first." IMHO, unless you're a business user with a corporate laptop or a multi-terabyte environment with business secrets, whole-environment encryption is overkill and makes things more fragile. Home users shouldn't really "need" it -- it would seem to make much more sense to set up a password-protected dedicated area to hide your $secrets in, unless you don't want people to even boot your machine without the password.

I admit, my screw-up: I had no performance "problems" with LUKS, so I assumed ZFS native encryption would fare at least comparably.

Besides the VM/experiments sub-datasets (where transient VM disks live), everything, including the md-mirrored swap (I use it for hibernation), is encrypted.

I am both a corporate employee and a consultant on the side, and the majority of my data belongs (or is otherwise related) to clients, so I must have most datasets encrypted. Full-disk/full-pool encryption in my case is warranted: it protects my clients' data at rest should the notebook get stolen.

I can decrypt my downloads dataset, of course, but I will still have the same problem when moving clients' testing data around other datasets.

Second thing is: I have been using ZFS since 32-bit single-core machines, and I don't mind "slowness" in exchange for the features I get from ZFS; I am patient in that way.

Unfortunately I never experienced this kind of stuttering with ZFS even on (clearly) underequipped 32-bit machines (of course we are not talking about video streaming there, that was just a simple test, but the inability to type is pretty problematic).


I will try to get to tests this evening.

EDIT: typos

kneutron commented 4 years ago

--Yah, I remember trying ZFS on 32-bit single-CPU machines - and learning to stay the hell away from that and go with 64-bit + at least 4 GB of RAM ;-)

--Sorry about the stuttering issues, hopefully my script might be able to help you with testing? Please let me know if you find it helpful and if there are any bugs or feature requests.

--It might also be helpful to open up a few different xterms and have the following running while you're testing:

o top -d 5
o iostat -k 5               # sysstat package
o zpool iostat -y -T d -v 5

tonyhutter commented 4 years ago

zpool iostat -l, zpool iostat -r, and zpool iostat -w might also help you.
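For example:

: zpool iostat -l 5    # per-vdev latency columns, every 5 s
: zpool iostat -r 5    # request size histograms
: zpool iostat -w 5    # total wait histograms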

kneutron commented 4 years ago

Any updates on this?

AttilaFueloep commented 4 years ago

I'm afraid that with the information given so far I can't be of much help here, sorry. I failed to reproduce this locally.

@etosan It would also be good to see the output of the following commands:

cat /sys/module/icp/parameters/icp_aes_impl
cat /sys/module/icp/parameters/icp_gcm_impl
cat /sys/module/icp/parameters/icp_gcm_avx_chunk_size
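(These parameters can also be switched at runtime to compare implementations; reading one lists the available values with the active one in brackets. A sketch; the accepted values depend on the build and CPU:)

: cat /sys/module/icp/parameters/icp_gcm_impl      # lists implementations, active one bracketed
: echo generic > /sys/module/icp/parameters/icp_gcm_impl   # force one for testing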

EDIT: typo

AttilaFueloep commented 4 years ago

@etosan Sorry, just realized that I left your questions unanswered.

Meanwhile I would like to know whether aes-256-gcm was a bad choice.

Hopefully not; it should be the fastest choice, it has been the default cipher since 0.8.4, and it is widely used elsewhere. I use it myself.

Hmm, the perf output is interesting. I have yet to learn to use it, but you are saying it can spit out the backtrace/callsite in z_wr_iss where it is spending most of its time?

Think of it as top at the function-call level.

etosan commented 4 years ago

I had too much $work to look deeper into this.

For now all I can say is that, after the initial moves, I have been moving less than 30 GB at a time between datasets encrypted under different encryption roots.

Those amounts cause neither latency nor any other issues and are rather fast, so I presume even the output of the counters would be useless.

I will try to create a situation with bigger data moves in the following weeks.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

dm17 commented 1 month ago

What did you end up doing, @etosan? This is a fairly ideal setup for many. I've seen encrypted ZFS mirrors on E-series ThinkPads work out alright, but they didn't have swap at all.

etosan commented 1 month ago

@dm17 in the end, over time I stopped receiving and moving around so much client "testing" data, so the problem "went away". I am a heavy dataset user:

: zfs list | wc -l
283

Once most of the stuff I need settled into its own dataset "buckets", I stopped doing big cross-dataset data transfers and learned to live with it.

Also, my list of clients stabilized a great deal, so once I had most of the stuff needed for cooperation in place, data exchange trickled down a lot.

Currently I am migrating to a new machine in the meantime, but encryption is proving quite problematic there as well :D (encryptionroot= and its inheritance is slightly ... weird ... almost borked).

This newer machine also has much higher specs than my old daily driver, so I will eventually try to run the tests there, and if it goes well, that will probably completely invalidate this issue for me, I am afraid (and I hope :D).