openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346

Closed. aerusso closed this issue 4 years ago

aerusso commented 5 years ago

System information

I'm duplicating Debian bug report 940932 here. Because of the severity of that report (it claims data corruption), I'm posting it directly before confirming with the original poster. If this is inappropriate, I apologize; please close this issue.

Type                  Version/Name
Distribution Name     Debian
Distribution Version  stable
Linux Kernel          4.19.67
Architecture          amd64 (Ryzen 5 2600X and Ryzen 5 2600 on X470 GAMING PLUS (MS-7B79), BIOS version 7B79vAC)
ZFS Version           zfs-linux/0.8.1-4~bpo10+1

Describe the problem you're observing

Rounding-error failures in the mprime torture test that go away when /sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl are set to scalar.

Describe how to reproduce the problem

Quoting the bug report:

Recently I have noticed some instability on one of my machines. The mprime (https://www.mersenne.org/download/) torture tests would occasionally show errors like

"FATAL ERROR: Rounding was 0.5, expected less than 0.4 Hardware failure detected, consult stress.txt file."

Random commands would occasionally segfault.

While trying to narrow down the problem I replaced the PSU, the RAM, and the CPU. Multiple hour-long runs of memtest86 did not show any problems.

Finally, I was able to narrow down reads from ZFS volumes as the trigger for the instability. Scrubbing the volume would cause mprime to error out especially quickly.

As a workaround, I switched SIMD acceleration off by writing "scalar" to

/sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl

and that made the system stable again.
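
In command form, the reported workaround is the following (the modprobe.d part is my own suggestion for making it persistent across reboots, assuming the parameters can also be set as module options; it is untested):

echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl

cat > /etc/modprobe.d/zfs-scalar.conf <<'EOF'
options zfs zfs_vdev_raidz_impl=scalar
options zcommon zfs_fletcher_4_impl=scalar
EOF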

Include any warning/errors/backtraces from the system logs

mprime:

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
shartge commented 5 years ago

I've cherry-picked the 4 patches from the simd branch on top of Debian's 0.8.2-2 package: https://salsa.debian.org/zfsonlinux-team/zfs/commit/9031b0db41ef0e2675d5a88f076bf001f1ea86f1, which makes it convenient for Debian users to run the test. @happyaron
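
For anyone wanting to repeat this on Debian, the rebuild is roughly the standard source-package procedure (a sketch; the patch directory and file names are placeholders for the four cherry-picked commits):

apt-get source zfs-linux        # needs a deb-src entry for buster-backports
cd zfs-linux-0.8.2
cp /path/to/simd-patches/*.patch debian/patches/
# append the four patch file names to debian/patches/series, then rebuild
dpkg-buildpackage -us -uc -b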

Unfortunately I am still able to replicate the issue, currently testing with Linux debian-buster 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux:

Module version:

ZFS: Loaded module v0.8.2-3~bpo10+1, ZFS pool version 5000, ZFS filesystem version 5

Parameters:

zcommon/parameters/zfs_fletcher_4_impl:[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2 avx512f 
zfs/parameters/zfs_vdev_raidz_impl:cycle [fastest] original scalar sse2 ssse3 avx2 avx512f avx512bw 

Proof that the patches have been applied:

dpkg-source: warning: extracting unsigned source package (zfs-linux_0.8.2-3~bpo10+1.dsc)
dpkg-source: info: extracting zfs-linux in zfs-linux-0.8.2
dpkg-source: info: unpacking zfs-linux_0.8.2.orig.tar.gz
dpkg-source: info: unpacking zfs-linux_0.8.2-3~bpo10+1.debian.tar.xz
dpkg-source: info: using patch list from debian/patches/series
dpkg-source: info: applying 0001-Prevent-manual-builds-in-the-DKMS-source.patch
dpkg-source: info: applying 0002-Check-for-META-and-DCH-consistency-in-autoconf.patch
dpkg-source: info: applying 0003-relocate-zvol_wait.patch
dpkg-source: info: applying enable-zed.patch
dpkg-source: info: applying 1004-zed-service-bindir.patch
dpkg-source: info: applying init-debian-openrc-workaround.patch
dpkg-source: info: applying 3100-remove-libzfs-module-timeout.patch
dpkg-source: info: applying force-verbose-rules.patch
dpkg-source: info: applying Linux-5.0-compat-SIMD-compatibility.patch
dpkg-source: info: applying Fix-CONFIG_X86_DEBUG_FPU-build-failure.patch
dpkg-source: info: applying Enable-SIMD-for-encryption.patch
dpkg-source: info: applying linux-compat-SIMD-save-restore.patch

mprime output:

[Worker #3 Oct 5 15:05] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #4 Oct 5 15:05] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 5 15:05] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #2 Oct 5 15:05] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 5 15:06] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #2 Oct 5 15:06] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 5 15:06] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 5 15:06] FATAL ERROR: Rounding was 0.4988735581, expected less than 0.4
[Worker #3 Oct 5 15:06] Hardware failure detected, consult stress.txt file.
[Worker #3 Oct 5 15:06] Torture Test completed 1 tests in 1 minutes - 1 errors, 0 warnings.
[Worker #3 Oct 5 15:06] Worker stopped.
[Worker #4 Oct 5 15:07] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.

Test script to generate I/O:

#!/bin/bash
# continuously rewrite a ~32 GiB test file on the ZFS pool using direct I/O
set -x
while true; do
        time dd if=/dev/zero bs=16M count=2000 status=progress oflag=direct of=/backup/testdata.dat
done
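
I run this script in one terminal while the torture test runs in a second one, roughly like this (the mprime binary location depends on the install):

./zfs-write-test.sh   # terminal 1: continuous direct-I/O writes to the pool
mprime -t             # terminal 2: torture test that validates its own FPU results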
shartge commented 5 years ago

The same is also true for Linux debian-buster 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux.

Same module as above, same test script, same error:

[Worker #4 Oct 5 15:18] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 5 15:18] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 5 15:18] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #2 Oct 5 15:18] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 5 15:19] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 5 15:19] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #2 Oct 5 15:19] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #4 Oct 5 15:19] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #4 Oct 5 15:20] FATAL ERROR: Rounding was 16679048.97, expected less than 0.4
[Worker #4 Oct 5 15:20] Hardware failure detected, consult stress.txt file.
[Worker #4 Oct 5 15:20] Torture Test completed 1 tests in 1 minutes - 1 errors, 0 warnings.
[Worker #4 Oct 5 15:20] Worker stopped.

I will now test the native branch from https://github.com/behlendorf/zfs/tree/zfs-0.8.2-simd to rule out that anything else in the Debian packages is causing this.

shartge commented 5 years ago

Confirmed: the same happens with the native modules from https://github.com/behlendorf/zfs/tree/zfs-0.8.2-simd on Linux debian-buster 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64. I will retest this on 4.19, but I don't expect the result to change.

shartge commented 5 years ago

And, as expected, Linux debian-buster 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux also shows the same error as above.

So, https://github.com/zfsonlinux/zfs/pull/9406 doesn't fix the problem (for me).

Edit: To be precise: The changes on 0.8.2 don't fix the problem for me.

shartge commented 5 years ago

CPU info for this test:


Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping:            0
CPU MHz:             3092.734
BogoMIPS:            6185.46
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities

Yes, this test runs in a VM, but that should not matter; ZFS should be usable in any environment.

shartge commented 5 years ago

And out of curiosity I tested https://github.com/behlendorf/zfs/tree/issue-9346 directly:

ZFS: Loaded module v0.8.0-307_gb7be6169c, ZFS pool version 5000, ZFS filesystem version 5, on top of Linux debian-buster 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux, and it fails for me as well.

shartge commented 5 years ago

I think I see the problem here.

You are all using zfs scrub pool to reproduce the error, and I too am unable to reproduce the problem with reads alone, whether generated by a scrub or by a direct read via dd from the test file.

But with writes the problem still happens; see my test script from https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-538648709.
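
For comparison, the read-only variant looks roughly like this (a sketch; it assumes the test file from the write script above already exists):

#!/bin/bash
# read the previously written test file back, bypassing the page cache
set -x
while true; do
        time dd if=/backup/testdata.dat of=/dev/null bs=16M status=progress iflag=direct
done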

Fabian-Gruenbichler commented 5 years ago

I think I see the problem here.

You are all using zfs scrub pool to reproduce the error, and I too am unable to reproduce the problem with reads alone, whether generated by a scrub or by a direct read via dd from the test file.

But with writes the problem still happens; see my test script from #9346 (comment).

I cannot reproduce this here. Can you please verify that this is not an issue with your test setup? E.g., try testing unpatched 0.8.2 (which has no SIMD support with recent kernels and thus cannot be affected by this issue).

shartge commented 5 years ago

Yes, the setup is correct.

For the "native" tests I used the steps from https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-538574678 to checkout from git and compile the modules.

You can see in the snippet from the kernel log what module version got loaded: ZFS: Loaded module v0.8.0-307_gb7be6169c, ZFS pool version 5000, ZFS filesystem version 5

Edit: This is from the test for https://github.com/behlendorf/zfs/tree/issue-9346, i.e. not the ported-to-0.8.2 patches but the patches for the master branch.
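
For reference, those steps boil down to the usual in-tree build and module reload (a sketch from memory; configure options omitted):

git clone https://github.com/behlendorf/zfs.git
cd zfs
git checkout issue-9346
sh autogen.sh && ./configure && make -j"$(nproc)"
./cmd/zpool/zpool export backup   # export the pool before swapping modules
./scripts/zfs.sh -u               # unload the currently loaded ZFS modules
./scripts/zfs.sh                  # load the freshly built ones
./cmd/zpool/zpool import backup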

shartge commented 5 years ago

Edit: This test has been done on 4.19.

Here is another test from the zfs-0.8.2-simd branch:

root@debian-buster:~/git/zfs# git log | head -n 6
commit 89fbd51fece10f1c9d1b8502aa5d843e67d7e48e
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Oct 3 00:03:20 2019 +0000

    Linux 4.14, 4.19, 5.0+ compat: SIMD save/restore

After ./configure and make:

root@debian-buster:~/git/zfs# ./cmd/zpool/zpool export backup
root@debian-buster:~/git/zfs# ./scripts/zfs.sh -u
root@debian-buster:~/git/zfs# ./scripts/zfs.sh
root@debian-buster:~/git/zfs# ./cmd/zpool/zpool import backup

dmesg -T reports:

[Mon Oct  7 07:54:55 2019] ZFS: Unloaded module v0.8.0-307_gb7be6169c
[Mon Oct  7 07:55:21 2019] ZFS: Loaded module v0.8.2-1, ZFS pool version 5000, ZFS filesystem version 5

I start my test script in one terminal:

root@debian-buster:~/git/zfs# zfs-write-test.sh 
+ true
+ dd if=/dev/zero bs=16M count=2000 status=progress oflag=direct of=/backup/testdata.dat
33353105408 bytes (33 GB, 31 GiB) copied, 42 s, 794 MB/s 
2000+0 records in
2000+0 records out
33554432000 bytes (34 GB, 31 GiB) copied, 42.2444 s, 794 MB/s

real    0m42.322s
user    0m0.017s
sys     0m17.482s
+ true
+ dd if=/dev/zero bs=16M count=2000 status=progress oflag=direct of=/backup/testdata.dat
33067892736 bytes (33 GB, 31 GiB) copied, 43 s, 769 MB/s 
2000+0 records in
2000+0 records out
33554432000 bytes (34 GB, 31 GiB) copied, 43.6985 s, 768 MB/s

real    0m43.791s
user    0m0.012s
sys     0m20.082s
+ true
+ dd if=/dev/zero bs=16M count=2000 status=progress oflag=direct of=/backup/testdata.dat
33487323136 bytes (33 GB, 31 GiB) copied, 44 s, 761 MB/s 
2000+0 records in
2000+0 records out
33554432000 bytes (34 GB, 31 GiB) copied, 44.1489 s, 760 MB/s

real    0m44.300s
user    0m0.032s
sys     0m19.434s
[... and so forth ...]

And mprime -t in another:

[Worker #2 Oct 7 07:56] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 7 07:56] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 7 07:56] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #4 Oct 7 07:56] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 7 07:57] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #2 Oct 7 07:57] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 7 07:57] Test 2, 12400 Lucas-Lehmer in-place iterations of M20971521 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 7 07:57] FATAL ERROR: Rounding was 2.459313877e+10, expected less than 0.4
[Worker #3 Oct 7 07:57] Hardware failure detected, consult stress.txt file.
[Worker #3 Oct 7 07:57] Torture Test completed 1 tests in 1 minutes - 1 errors, 0 warnings.
[Worker #3 Oct 7 07:57] Worker stopped.
shartge commented 5 years ago

I also let mprime -t run on that system without any ZFS I/O, to check for non-ZFS-related issues, and it ran for hours without any problems. (Other than my monitoring screaming at me because of the sudden high CPU usage.)

Fabian-Gruenbichler commented 5 years ago

Could you please still do the test I asked you to do? You are the only one seeing this behaviour so far, AFAICT; I am just trying to narrow down where it could potentially come from.

shartge commented 5 years ago

Ah, you mean to use the code-base which does not use any SIMD instructions, i.e. the basic 0.8.2 branch?

Fabian-Gruenbichler commented 5 years ago

Ah, you mean to use the code-base which does not use any SIMD instructions, i.e. the basic 0.8.2 branch?

yes :)

shartge commented 5 years ago

I'm running the test on 4.19 right now:

root@debian-buster:~/git/zfs-0.8-release# git log | head -n 6
commit 1222e921c9e3d8f5c693f196435be4604a1187c0
Author: Tony Hutter <hutter2@llnl.gov>
Date:   Fri Aug 23 15:52:32 2019 -0700

    Tag zfs-0.8.2
root@debian-buster:~/t# grep . /sys/module/z*/parameters/*impl
/sys/module/zcommon/parameters/zfs_fletcher_4_impl:[fastest] scalar superscalar superscalar4 
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:[fastest] original scalar 

No SIMD in use.

mprime is up to "Test 6", with no errors so far. I will let it run to completion, which will take another ~10 minutes, and will report the final result here.

But with SIMD enabled the tests never got that far; they always errored out in "Test 2" at the latest.

shartge commented 5 years ago

As expected, the first 11 mprime tests completed without error:

[Worker #2 Oct 7 08:47] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #4 Oct 7 08:47] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #3 Oct 7 08:47] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[Worker #1 Oct 7 08:47] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using AVX-512 FFT length 1152K, Pass1=192, Pass2=6K, clm=4.
[...]
[Worker #4 Oct 7 09:02] Self-test 1152K passed!
[Worker #2 Oct 7 09:02] Self-test 1152K passed!
[Worker #3 Oct 7 09:02] Self-test 1152K passed!
[Worker #1 Oct 7 09:02] Self-test 1152K passed!

Under full I/O load:


              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
backup      20.9G  58.6G     23  4.58K  95.9K  1.05G
  raidz1    20.9G  58.6G     23  4.58K  95.9K  1.05G
    sdb         -      -      4  1.14K  20.0K   270M
    sdc         -      -      7  1.13K  32.0K   268M
    sdd         -      -      6  1.15K  26.0K   271M
    sde         -      -      4  1.15K  18.0K   270M
----------  -----  -----  -----  -----  -----  -----
shartge commented 5 years ago

And to add another datapoint:

I loaded the 0.8.2+SIMD modules, but disabled the usage of any SIMD instructions:

root@debian-buster:~/t# grep . /sys/module/z*/parameters/*impl
/sys/module/zcommon/parameters/zfs_fletcher_4_impl:[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2 avx512f 
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle [fastest] original scalar sse2 ssse3 avx2 avx512f avx512bw 
root@debian-buster:~/t# echo "scalar" > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
root@debian-buster:~/t# echo "scalar" > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
root@debian-buster:~/t# grep . /sys/module/z*/parameters/*impl
/sys/module/zcommon/parameters/zfs_fletcher_4_impl:fastest [scalar] superscalar superscalar4 sse2 ssse3 avx2 avx512f 
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle fastest original [scalar] sse2 ssse3 avx2 avx512f avx512bw 

And reran my tests. No errors, as expected.

And as soon as I set both parameters back to "fastest", mprime errors out again.

vstax commented 5 years ago

And as soon as I set both parameters back to "fastest", mprime errors out again.

Could it be something related to AVX-512 support? Most people here don't have CPUs that support that feature. If there is some issue with saving/restoring specifically the AVX-512 registers (and AVX-512 is picked as the "fastest" implementation), others cannot observe the problem that you have.

I wonder if you can reproduce this issue when setting implementations not to "scalar", but to "avx2"? Same question with "ssse3".
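
Pinning a specific implementation should just be a matter of writing its name into the same two parameters, e.g. (untested on my side):

echo avx2 > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
# and the same with "ssse3" for the second run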

shartge commented 5 years ago

I was thinking about that possibility as well. This is a very new CPU, so it is quite possible that others are not able to reproduce this because of it. I will test the other SIMD variants and report back.

vstax commented 5 years ago

I will test the other SIMD variants and report back.

Thanks. No need to go lower than SSSE3, I think; no one uses those implementations in practice :) But what might be worth trying is disabling AVX-512 support in Prime95, so that it uses "FMA3 FFT" (which means AVX2) rather than "AVX-512 FFT". I don't know if it would make a difference, but it seems like a good idea when you are using the AVX2 code in ZFS.

You can do that by putting "CpuSupportsAVX512F=0" in local.txt (https://www.tomshardware.com/reviews/stress-test-cpu-pc-guide,5461-2.html)

Fabian-Gruenbichler commented 5 years ago

I managed to reproduce this with AVX2 on one of our testlab machines (baremetal, using both the backported PR and the PR as is, with stock Debian Buster 4.19 kernel), so it seems there is still something broken with the PR.

shartge commented 5 years ago

And the reports for the other SIMD variants are in, with mprime still using AVX2.

This still creates errors:

/sys/module/zcommon/parameters/zfs_fletcher_4_impl:fastest scalar superscalar superscalar4 sse2 ssse3 [avx2] avx512f 
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle fastest original scalar sse2 ssse3 [avx2] avx512f avx512bw 

This also still creates errors:

/sys/module/zcommon/parameters/zfs_fletcher_4_impl:fastest scalar superscalar superscalar4 sse2 [ssse3] avx2 avx512f 
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle fastest original scalar sse2 [ssse3] avx2 avx512f avx512bw 

This also creates errors:

/sys/module/zcommon/parameters/zfs_fletcher_4_impl:fastest scalar superscalar superscalar4 [sse2] ssse3 avx2 avx512f 
/sys/module/zfs/parameters/zfs_vdev_raidz_impl:cycle fastest original scalar [sse2] ssse3 avx2 avx512f avx512bw 
shartge commented 5 years ago

But what might be worth trying is disabling AVX-512 support in Prime95, so that it uses "FMA3 FFT" (which means AVX2) rather than "AVX-512 FFT". I don't know if it would make a difference, but it seems like a good idea when you are using the AVX2 code in ZFS.

You can do that by putting "CpuSupportsAVX512F=0" in local.txt (https://www.tomshardware.com/reviews/stress-test-cpu-pc-guide,5461-2.html)

Switching mprime to AVX2/FMA3 FFT while keeping ZFS at "fastest" (i.e. AVX-512) also creates errors.

Switching ZFS to AVX2 while keeping mprime also at AVX2 creates errors, too.

And finally, setting ZFS to "ssse3" and keeping mprime at AVX2 still creates errors.

But @Fabian-Gruenbichler was able to reproduce this, so I can finally stop doubting myself.

shartge commented 5 years ago

Interesting observation:

If I keep ZFS at fastest (AVX-512) and configure mprime not to use any SIMD instructions newer than SSE2, I am no longer able to reproduce the problem.

For local.txt:

CpuSupportsAVX512F=0
CpuSupportsAVX2=0
CpuSupportsFMA4=0
CpuSupportsFMA3=0
CpuSupportsAVX=0

And mprime passes all three self-tests:


[Worker #1 Oct 7 12:43] Test 1, 3100 Lucas-Lehmer iterations of M21871519 using type-2 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Oct 7 12:44] Test 2, 3100 Lucas-Lehmer in-place iterations of M20971521 using FFT length 1120K, Pass1=448, Pass2=2560, clm=4.
[Worker #1 Oct 7 12:44] Test 3, 3100 Lucas-Lehmer iterations of M20971519 using type-2 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Oct 7 12:45] Test 4, 4000 Lucas-Lehmer in-place iterations of M19922945 using FFT length 1120K, Pass1=448, Pass2=2560, clm=4.
[Worker #1 Oct 7 12:46] Self-test 1120K passed!
[Worker #1 Oct 7 12:46] Test 1, 1600000 Lucas-Lehmer in-place iterations of M83839 using FFT length 4K.
[Worker #1 Oct 7 12:47] Test 2, 1600000 Lucas-Lehmer in-place iterations of M82031 using FFT length 4K.
[Worker #1 Oct 7 12:48] Test 3, 1600000 Lucas-Lehmer in-place iterations of M79745 using FFT length 4K.
[Worker #1 Oct 7 12:48] Test 4, 1600000 Lucas-Lehmer in-place iterations of M77455 using FFT length 4K.
[Worker #1 Oct 7 12:49] Self-test 4K passed!
[Worker #1 Oct 7 12:49] Test 1, 1120000 Lucas-Lehmer in-place iterations of M107519 using FFT length 5K.
[Worker #1 Oct 7 12:50] Test 2, 1120000 Lucas-Lehmer in-place iterations of M106497 using FFT length 5K.
[Worker #1 Oct 7 12:51] Test 3, 1120000 Lucas-Lehmer in-place iterations of M104447 using FFT length 5K.
[Worker #1 Oct 7 12:51] Test 4, 1120000 Lucas-Lehmer in-place iterations of M102401 using FFT length 5K.
[Worker #1 Oct 7 12:52] Self-test 5K passed!

As soon as I enable anything above SSE2, starting with AVX, the errors return.

vstax commented 5 years ago

If I keep ZFS at fastest (AVX-512) and configure mprime not to use any SIMD instructions newer than SSE2, I am no longer able to reproduce the problem.

In SSE modes the XMM registers are used, which are the lower half of the AVX (YMM) registers (or the lowest quarter of the AVX-512 ZMM registers). Since this issue seems to be about saving/restoring registers when switching threads, using only the lower part of a register technically shouldn't change anything. If Prime95 is actually using SSE2 instructions, that is...

But maybe, just maybe (I'm really speculating here), the kernel actually does save/restore the SSE (XMM) registers, so the problem does not appear when Prime95 is only using XMM registers. It would be the upper part of the YMM registers that causes the problem; that is, only the SSE registers get saved/restored instead of the whole 256-bit AVX ones. I don't know if this is possible :) Just thought I'd share the idea.

EDIT: This could happen if the FXSAVE instruction, which is called explicitly by https://github.com/zfsonlinux/zfs/pull/9406, works as expected, but the kernel's XSAVE handling doesn't work or isn't invoked correctly for some reason.
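
One quick sanity check (purely diagnostic, nothing more) would be to look at which XSAVE state components the kernel enabled at boot and which SIMD/XSAVE flags the CPU exposes:

dmesg | grep -i 'x86/fpu'   # e.g. "Supporting XSAVE feature 0x004: 'AVX registers'"
grep -o -w -e xsave -e xsaveopt -e xsavec -e xsaves -e avx -e avx2 -e avx512f /proc/cpuinfo | sort -u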

vorsich-tiger commented 5 years ago

I'd like to throw in a few words just before any final "fix" is committed and routed forward: there has been (or, depending on the source, there may still be) a major bug in kernel 5.2 when running KVM VMs. I think it might be worth re-evaluating whether the currently suggested patch series over-reacts unnecessarily to non-existent ZFS problems on 5.2+ kernels. I.e., any 5.2 host running a VM that indicates a ZFS bug might just deliver a false positive. I guess it is best to re-run any positive test on a VM hosted on a non-5.2 kernel, just to be sure.

https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/ i.e. https://bugzilla.kernel.org/show_bug.cgi?id=204209 https://lkml.org/lkml/2019/7/17/758 etc.

Fabian-Gruenbichler commented 5 years ago

On October 25, 2019 1:38 am, vorsich-tiger wrote:

I'd like to throw in a few words just before any final "fix" is committed and routed forward: there has been (or, depending on the source, there may still be) a major bug in kernel 5.2 when running KVM VMs. I think it might be worth re-evaluating whether the currently suggested patch series over-reacts unnecessarily to non-existent ZFS problems on 5.2+ kernels. I.e., any 5.2 host running a VM that indicates a ZFS bug might just deliver a false positive. I guess it is best to re-run any positive test on a VM hosted on a non-5.2 kernel, just to be sure.

https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/ i.e. https://bugzilla.kernel.org/show_bug.cgi?id=204209 https://lkml.org/lkml/2019/7/17/758 etc.

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

Fabian-Gruenbichler commented 5 years ago

On October 25, 2019 7:41 am, Fabian Grünbichler wrote:

On October 25, 2019 1:38 am, vorsich-tiger wrote:

I'd like to throw in a few words just before any final "fix" is committed and routed forward: there has been (or, depending on the source, there may still be) a major bug in kernel 5.2 when running KVM VMs. I think it might be worth re-evaluating whether the currently suggested patch series over-reacts unnecessarily to non-existent ZFS problems on 5.2+ kernels. I.e., any 5.2 host running a VM that indicates a ZFS bug might just deliver a false positive. I guess it is best to re-run any positive test on a VM hosted on a non-5.2 kernel, just to be sure.

https://www.reddit.com/r/VFIO/comments/cgqk6p/kernel_52_kvm_bug/ i.e. https://bugzilla.kernel.org/show_bug.cgi?id=204209 https://lkml.org/lkml/2019/7/17/758 etc.

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

And the fixed 5.2/5.3 kernels are affected as well (that KVM FPU fix is contained in 5.2.5 and in all released 5.3 versions).

vorsich-tiger commented 5 years ago

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

@Fabian-Gruenbichler, I'm not sure you got the central point(s) I wanted to make.

1. I wanted to get everybody on the same page regarding the fact that not only ZFS might be "disturbing" the SIMD processing subsystem in the kernel; there is a potential that other kernel code might also be broken, and the reference I gave shows that this was actually true.
2. I am not questioning potentially required ZFS SIMD fixes for kernel versions below 5.2.
3. It is my impression that the developers took quite some time to establish certain assumptions that should be safe to make for kernels starting with 5.2. Within the initial comments of this issue I see developers' statements which assume ZFS SIMD for 5.2+ is not broken. I merely wanted to raise awareness that tests indicating the opposite should be re-evaluated with the info from that reddit post in mind, i.e. just maybe ZFS SIMD for 5.2+ is really not broken.

shartge commented 5 years ago

i.e. just maybe ZFS SIMD for 5.2+ is really not broken.

Negative on that. The same problem can be reproduced on baremetal hosts running 5.2+, with no KVM involved.

Fabian-Gruenbichler commented 5 years ago

the issue also occurs on baremetal hosts that have no VMs running whatsoever, and on kernels earlier than 5.2.

@Fabian-Gruenbichler, I'm not sure you got the central point(s) I wanted to make.

1. I wanted to get everybody on the same page regarding the fact that not only ZFS might be "disturbing" the SIMD processing subsystem in the kernel; there is a potential that other kernel code might also be broken, and the reference I gave shows that this was actually true.
2. I am not questioning potentially required ZFS SIMD fixes for kernel versions below 5.2.
3. It is my impression that the developers took quite some time to establish certain assumptions that should be safe to make for kernels starting with 5.2. Within the initial comments of this issue I see developers' statements which assume ZFS SIMD for 5.2+ is not broken. I merely wanted to raise awareness that tests indicating the opposite should be re-evaluated with the info from that reddit post in mind, i.e. just maybe ZFS SIMD for 5.2+ is really not broken.

I did not misunderstand your post. I am one of the devs who triaged this bug initially, analyzed the old code, verified a workaround on our downstream side, and reviewed the now merged fix :wink:

see the detailed testing report (on baremetal!) over at https://github.com/zfsonlinux/zfs/pull/9406#issuecomment-539956625

The approach that was used for 5.2 was in theory sound for < 5.2, but not workable there for GPL/license reasons. It was broken for 5.2+ though, as was the approach for < 5.2 on < 5.2 kernels. The only thing that really worked was the kernel-only solution (and a combination of the 5.2+ approach with helper backports on < 5.2 kernels).

In other words, it was broken all around, irrespective of other FPU-related breakage in some 5.2 versions.

behlendorf commented 5 years ago

PR #9515 contains a 0.8 backport of the fix applied to master.

shartge commented 5 years ago

PR #9515 contains a 0.8 backport of the fix applied to master.

I will be able to test this on my systems tomorrow GMT morning.

shartge commented 5 years ago

I applied PR https://github.com/zfsonlinux/zfs/pull/9515 on top of the zfs-0.8-release branch on my test VM and on one physical system. Both ran first for 4 hours on Debian's 5.2.0-bpo kernel and then for another 5 hours on Debian's 4.19 kernel, and I could no longer reproduce https://github.com/zfsonlinux/zfs/issues/9346.

From my point of view this looks very promising.
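
If anyone else wants to repeat this, fetching the PR on top of the release branch is roughly (a sketch; rebuild and reload the modules as described earlier):

git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout zfs-0.8-release
git fetch origin pull/9515/head:pr-9515
git merge pr-9515   # apply the backported SIMD fix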

mvrhov commented 4 years ago

Is this by any chance the same bug the Go authors found? https://bugzilla.kernel.org/show_bug.cgi?id=205663#c2

behlendorf commented 4 years ago

@mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.

Fabian-Gruenbichler commented 4 years ago

On December 24, 2019 7:08 pm, Brian Behlendorf wrote:

@mvrhov thanks for pointing out the upstream issue. That wasn't the core issue here, but it may have further confused the situation when trying to debug this.

I saw that while triaging (I think I even linked it in one of the issues as a possible culprit?) but quickly ruled it out. It might have affected some user reports, though, if they (or their distro) used the affected kernel and gcc versions.

behlendorf commented 4 years ago

Closing. The SIMD patches have been included in the 0.8.3 release.