openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Terrible scrub performance with encrypted ZFS filesystems #15885

Open davidcpgray opened 6 months ago

davidcpgray commented 6 months ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version 12.5
Kernel Version 6.1.0-17
Architecture amd64
OpenZFS Version 2.1.11-1

Describe the problem you're observing

A zpool scrub operation on a pool with mounted encrypted filesystems exhibits extremely high I/O load on the drives but a very poor read data rate and negligible progress through the scrub operation.

Describe how to reproduce the problem

zpool scrub poolname

The system has a 4-disk raidz1 pool of 16TB Western Digital Red Pro HDDs. All filesystems in the pool were created with native ZFS encryption.
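
For context, a rough sketch of how such a layout is typically created (pool, dataset, and device names here are placeholders, not the actual ones from this system):

zpool create poolname raidz1 disk1 disk2 disk3 disk4
# encryption=on currently defaults to aes-256-gcm; the passphrase is prompted for at creation
zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt poolname/data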

Performance of scrub operations initiated on the pool with the encryption key loaded and the filesystems mounted is extremely poor. 'iostat -xmc 2' reports ~300-350 r/s, 3-5 rMB/s, and 100 %util (per drive).

I have left the scrub operation running for 12 hours or more and it makes hardly any progress, with 'zpool status' reporting 'no estimated completion time'.

If, however, the pool is exported, then immediately re-imported without loading the encryption key, and the scrub is re-run, then the 'expected' performance is observed: 250-500 r/s, 170-240 rMB/s, 95-100 %util (per drive).

The key metric here is rMB/s, which is effectively at or near the maximum sequential data rate for these drives.

Once the scrub is running with the expected 'good' performance characteristics, the encryption key can be re-loaded and the filesystems re-mounted; the scrub then continues with good performance and eventually completes in ~20 hours, which is broadly as expected for this system and these drives.

Both 'good' and 'poor' performance modes are readily reproducible by importing or exporting the pool; no system reboots are required.
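
As a sketch of the sequence that flips between the two modes (again assuming a pool named 'poolname' with passphrase-encrypted datasets):

# 'poor' mode: import with keys loaded, filesystems mounted, then scrub
zpool import -l poolname            # -l loads encryption keys during import
zpool scrub poolname                # crawls at ~3-5 rMB/s per drive

# 'good' mode: export, re-import without loading keys, then scrub
zpool export poolname
zpool import poolname               # no -l, so keys stay unloaded and encrypted datasets stay unmounted
zpool scrub poolname                # runs at ~170-240 rMB/s per drive

# keys and mounts can then be restored while the scrub keeps its pace
zfs load-key -r poolname
zfs mount -a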

Include any warning/errors/backtraces from the system logs

Nothing reported in system logs / dmesg.

This behaviour has been observed with multiple ZFS versions since around 2.0.3, but I only recently discovered the correlation between encrypted filesystems being mounted/unmounted and the scrub performance.

Happy to provide additional information on request.

rincebrain commented 6 months ago

My guess, without more performance data, would be that some workload runs while the dataset(s) are unlocked (quota metadata recalculation, for example) and is drowning the pool in IOs; since scrub has the lowest priority of any IO class, it will keep getting starved as long as something else is issuing a lot of IO.

But it can't do that recalculation while the dataset is locked, since some of the data it needs for it is encrypted, I believe.

Depending on what kind of IO it's mostly doing, that would be my guess, absent more data.
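
One way to sanity-check that (a sketch; 'poolname' is a placeholder) is to watch the per-class queue activity while the scrub is in its slow mode and see whether scrub IO or some other class dominates:

zpool iostat -q poolname 2          # pending/active ops per queue: sync, async, scrub, trim
zpool iostat -r poolname 2          # request size histograms per IO class
zpool iostat -w poolname 2          # latency histograms per IO class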

(My other guess would be that your system config is one where the SIMD acceleration of various stuff isn't working for some or all of it...you could go check /sys/module/icp/parameters/icp_aes_impl, /sys/module/icp/parameters/icp_gcm_impl, /proc/spl/kstat/zfs/fletcher_4_bench, and /proc/spl/kstat/zfs/vdev_raidz_bench for more data.)

davidcpgray commented 6 months ago

Hi,

Thanks for having a look at this...

This system is very lightly loaded most of the time, including when the scrub operations run. So the only way the pool could be 'drowning in IOs' is if the scrub operation itself is generating that IO. And if the scrub were generating enough IO to kill performance, then surely this would also happen on regular non-encrypted systems. So I don't believe this to be the case.

Very happy to generate more performance data, but would need some guidance on what might be required/useful here.

/sys/module/icp/parameters/icp_aes_impl
cycle [fastest] generic x86_64 aesni 

/sys/module/icp/parameters/icp_gcm_impl
cycle [fastest] avx generic pclmulqdq 

/proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 2836382540 298647963538696
implementation   native         byteswap       
scalar           8617482592     6724286558     
superscalar      7476357776     8467565223     
superscalar4     8298708095     7951912804     
sse2             17322514911    10947136176    
ssse3            18113077177    14824031285    
avx2             30054657319    23233022400    
fastest          avx2           avx2           

/proc/spl/kstat/zfs/vdev_raidz_bench
18 0 0x01 -1 0 3138481885 298647964084633
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr         
original         644998740       372316431       150949561       1727026378      332465367       53011442        148503162       30191008        30194757        20960306        
scalar           1955625227      545753670       237305602       1944392248      660812362       486452663       349016606       263140101       179607347       138426858       
sse2             3119871353      1435585630      754741384       3370304972      1197325623      977363501       603857033       552382761       335089470       148852669       
ssse3            3225892195      1435188628      755202070       3362852444      1912160887      1467897009      1125390665      948306439       676437394       530878920       
avx2             5878558585      2449914268      1342233423      5653674308      3674654552      2961701604      1962847190      1709563836      1266078086      987526682       
fastest          avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            avx2            
rincebrain commented 6 months ago

Well, my general suggestion then would be to look at the output of, say, 'mpstat -P ALL 1' for 30-60s while it's scrubbing in the poor-performance mode, and the same when it's running well, and see whether it's spending most of its time in %sys or %iowait when it's running poorly, compared to baseline.

If it's all in %sys, then look at perf top or generate a FlameGraph to see where it's spending that time. If it's not, that's a slightly more complicated question.
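
As a sketch of that data collection (file names are just examples, and the FlameGraph scripts are assumed to be on PATH):

# 60 one-second samples in each mode, for comparison
mpstat -P ALL 1 60 > mpstat-scrub-slow.txt
mpstat -P ALL 1 60 > mpstat-scrub-fast.txt

# if the time is mostly %sys, see where the kernel is spending it
perf top -g

# or record ~30s system-wide and turn it into a FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > scrub-slow.svg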