openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Reads from ZFS volumes cause system instability when SIMD acceleration is enabled #9346

Closed aerusso closed 4 years ago

aerusso commented 5 years ago

System information

I'm duplicating Debian bug report 940932. Because of the severity of the bug report (it claims data corruption), I'm posting it here directly before trying to confirm with the original poster. If this is inappropriate, I apologize; please close the bug report.

Type Version/Name
Distribution Name Debian
Distribution Version stable
Linux Kernel 4.19.67
Architecture amd64 (Ryzen 5 2600X and Ryzen 5 2600 on X470 GAMING PLUS (MS-7B79) BIOS version: 7B79vAC)
ZFS Version zfs-linux/0.8.1-4~bpo10+1

Describe the problem you're observing

mprime's torture test reports rounding errors; the failures go away when /sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl are set to scalar.

Describe how to reproduce the problem

Quoting the bug report:

Recently I have noticed some instability on one of my machines. The mprime (https://www.mersenne.org/download/) Torture Tests would occasionally show errors like

"FATAL ERROR: Rounding was 0.5, expected less than 0.4 Hardware failure detected, consult stress.txt file."

Random commands would occasionally segfault.

While trying to narrow down the problem I replaced the PSU, RAM, and the CPU. Multiple hour-long runs of memtest86 did not show any problems.

Finally I was able to narrow down reads from ZFS volumes as the trigger for the instability. Scrubbing the volume would cause mprime to error out especially quickly.

As a workaround I switched the SIMD acceleration off by writing "scalar" to

/sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl

and that made the system stable again.
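
Spelled out as shell commands (run as root; this is the runtime-only form — the persistent kernel-parameter variant is summarized later in this thread):

echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl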

Include any warning/errors/backtraces from the system logs

mprime:

FATAL ERROR: Rounding was 0.5, expected less than 0.4
Hardware failure detected, consult stress.txt file.
rincebrain commented 5 years ago

We spent a bit of time going back and forth on IRC about this, and it seems that only the scalar setting makes the problem go away.

alex-gh commented 5 years ago

An update from the original thread:

A quick update:

I have booted up the Debian live USB on another machine and was able to reproduce this bug with it.

The machine had the Ryzen 5 2600 CPU (the one I swapped with the machine on which I originally found the problem).

The Mainboard is: ASUS PRIME B350-PLUS BIOS Version: 5216

Output of uname -a: Linux debian 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux

Output of zfs --version: zfs-0.8.1-4~bpo10+1 zfs-kmod-0.8.1-4~bpo10+1

Also here are the steps I'm taking to reproduce the problem:

  • Start mprime for linux 64 bit
  • Select Torture Test
  • Choose 12 torture test threads for the Ryzen 5 (the default setting)
  • Select Test (2) Small FFT
  • All other settings are set to default settings
  • Run the test
  • Read data from ZFS by either reading a large file or starting a scrub (RAIDZ scrubs are especially effective)

Within a few seconds you should see mprime reporting errors.

behlendorf commented 5 years ago

@aerusso thank you for bringing this to our attention. The reported symptoms are consistent with what we'd expect if the FPU registers were somehow not being restored. We'll see if we can reproduce the issue locally using the 4.19 kernel and the provided test case. Would it be possible to try and reproduce the issue using a 5.2 or newer kernel?
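
A minimal sketch of that failure mode (illustrative only; naming follows ZFS's include/linux/simd_x86.h, and the actual code appears in the POC diff further down this thread):

/* On the affected 4.19/5.0/5.1 compat path, the sequence was roughly: */
kfpu_begin();  /* preempt_disable(); local_irq_disable(); nothing saved */
/* ... kernel SIMD checksum/raidz work clobbers the XMM/YMM registers ... */
kfpu_end();    /* local_irq_enable(); preempt_enable(); nothing restored */
/*
 * Because the interrupted task's FPU state was never preserved, a user
 * thread resuming on this CPU can see whatever values the kernel left
 * in the vector registers: hence mprime's rounding errors and the
 * random segfaults.
 */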

rincebrain commented 5 years ago

Horrifyingly, I can reproduce this in a Debian buster VM on my Intel Xeon-D.

I'm going to guess, since reports of this being on fire haven't otherwise trickled in, that there might be a mismerge in Debian, or a missing follow-up patch?

alex-gh commented 5 years ago

I did a test with a Manjaro live USB and I could not reproduce this behaviour.

Kernel: 5.2.11-1-MANJARO
ZFS package: archzfs/zfs-dkms-git 2019.09.18.r5411.gafc8f0a6f-1

ggzengel commented 5 years ago

I can reproduce it with kernel 4.19 and stress-ng too. I get more than 5 errors per minute.

With kernel 5.2 there are no errors.

root# zpool scrub zpool1
root# stress-ng --vecmath 9 --fp-error 9 -vvv --verify --timeout 3600
stress-ng: debug: [20635] 32 processors online, 32 processors configured
stress-ng: info:  [20635] dispatching hogs: 9 vecmath, 9 fp-error
stress-ng: debug: [20635] cache allocate: default cache size: 20480K
<snip>
stress-ng: fail:  [22426] stress-ng-fp-error: exp(DBL_MAX) return was 1.000000 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
stress-ng: fail:  [22426] stress-ng-fp-error: exp(-1000000.0) return was 1.000000 (expected 0.000000), errno=0 (expected 34), excepts=0 (expected 16)
stress-ng: fail:  [22389] stress-ng-fp-error: log(0.0) return was 51472868343212123638854435100661726861789564087474337372834924821256607581904275443789550923204262543290261262543297927616110435675714711004645013184740565747574812535257726048857959524537318313055909029913182014561534585350486375714439359868335816704.000000 (expected -0.000000), errno=34 (expected 34), excepts=4 (expected 4)
stress-ng: fail:  [22426] stress-ng-fp-error: exp(DBL_MAX) return was 0.000000 (expected inf), errno=0 (expected 34), excepts=8 (expected 8)
stress-ng: fail:  [22407] stress-ng-fp-error: exp(-1000000.0) return was -304425543965041899037761188749362776730427289735837064756329392319501601366578319214648354685850550352787929416219211679117562590779680584744448269412872882932591437212235151179776.000000 (expected 0.000000), errno=0 (expected 34), excepts=16 (expected 16)
stress-ng: fail:  [22397] stress-ng-fp-error: exp(DBL_MAX) return was 1.000315 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  32
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2399.755
BogoMIPS:            4800.04
Hypervisor vendor:   Xen
Virtualization type: none
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-31
Flags:               fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault intel_ppin ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt

# uname -a
Linux server2 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux
ThomasLamprecht commented 4 years ago

Can confirm this on 5.0 too. It seems that the SIMD patch's assumption that, on 5.0 and 5.1 kernels, disabling preemption and local IRQs is enough, is wrong:

For the 5.0 and 5.1 kernels disabling preemption and local interrupts is sufficient to allow the FPU to be used. All non-kernel threads will restore the preserved user FPU state. -- commit message of commit e5db31349484e5e859c7a942eb15b98d68ce5b4d

If one checks out the kernel_fpu_{begin,end} methods from the 5.0 kernel, we can see that those also save the registers. I can fix this issue by doing so, but my approach was really cumbersome, as the "copy_kernel_to_xregs_err", "copy_kernel_to_fxregs_err" and "copy_kernel_to_fregs_err" methods are not available, only those without "_err"; but as those use the GPL-only symbol "ex_handler_fprestore", I cannot use them here.

So for my POC fix I ensured that kfpu_begin() always saves the FPU registers and kfpu_end() always restores them; to do so I just ported over the functionality of those methods from the 5.3 kernel (a not-quite-minimal, hacky change, as a POC fix to show the issue):

diff --git a/include/linux/simd_x86.h b/include/linux/simd_x86.h
index 5f243e0cc..08504ba92 100644
--- a/include/linux/simd_x86.h
+++ b/include/linux/simd_x86.h
@@ -179,7 +180,6 @@ kfpu_begin(void)
        preempt_disable();
        local_irq_disable();

-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
        /*
         * The current FPU registers need to be preserved by kfpu_begin()
         * and restored by kfpu_end().  This is required because we can
@@ -188,32 +188,51 @@ kfpu_begin(void)
         * context switch.
         */
        copy_fpregs_to_fpstate(&current->thread.fpu);
-#elif defined(HAVE_KERNEL_FPU_INITIALIZED)
        /*
         * There is no need to preserve and restore the FPU registers.
         * They will always be restored from the task's stored FPU state
         * when switching contexts.
         */
        WARN_ON_ONCE(current->thread.fpu.initialized == 0);
-#endif
 }
+#ifndef kernel_insn_err
+#define kernel_insn_err(insn, output, input...)                                \
+({                                                                     \
+       int err;                                                        \
+       asm volatile("1:" #insn "\n\t"                                  \
+                    "2:\n"                                             \
+                    ".section .fixup,\"ax\"\n"                         \
+                    "3:  movl $-1,%[err]\n"                            \
+                    "    jmp  2b\n"                                    \
+                    ".previous\n"                                      \
+                    _ASM_EXTABLE(1b, 3b)                               \
+                    : [err] "=r" (err), output                         \
+                    : "0"(0), input);                                  \
+       err;                                                            \
+})
+#endif
+

 static inline void
 kfpu_end(void)
 {
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
        union fpregs_state *state = &current->thread.fpu.state;
-       int error;
+       int err = 0;

        if (use_xsave()) {
-               error = copy_kernel_to_xregs_err(&state->xsave, -1);
+               u32 lmask = -1;
+               u32 hmask = -1;
+               XSTATE_OP(XRSTOR, &state->xsave, lmask, hmask, err);
        } else if (use_fxsr()) {
-               error = copy_kernel_to_fxregs_err(&state->fxsave);
+               struct fxregs_state *fx = &state->fxsave;
+               if (IS_ENABLED(CONFIG_X86_32))
+                       err = kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+               else
+                       err = kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
        } else {
-               error = copy_kernel_to_fregs_err(&state->fsave);
+               copy_kernel_to_fregs(&state->fsave);
        }
-       WARN_ON_ONCE(error);
-#endif
+       WARN_ON_ONCE(err);

        local_irq_enable();
        preempt_enable();

Related to the removal of the SIMD patch in the (future) 0.8.2 release #9161

shartge commented 4 years ago

With kernel 5.2 there are no errors.

I can reproduce this with mprime -t on Debian Buster running 5.2.9-2~bpo10+1 and zfs-dkms 0.8.1-4~bpo10+1 after ~1 minute of runtime:

[Worker #1 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #6 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #7 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #8 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #5 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #3 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #2 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:43] FATAL ERROR: Rounding was 4.029914356e+80, expected less than 0.4
[Worker #4 Sep 25 13:43] Hardware failure detected, consult stress.txt file.
[Worker #4 Sep 25 13:43] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #4 Sep 25 13:43] Worker stopped.
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Stepping:            2
CPU MHz:             1201.117
CPU max MHz:         3600.0000
CPU min MHz:         1200.0000
BogoMIPS:            6999.89
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            10240K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
behlendorf commented 4 years ago

@GamerSource thanks for digging into this; that matches my understanding of the issue. What I don't quite understand yet is why this wasn't observed during the initial patch testing. It may have been due to my specific kernel configuration. Regardless, I agree the fix here is going to need to save and restore the registers, similar to the 5.2+ support.

@shartge are you absolutely sure you were running with a 5.2-based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.

shartge commented 4 years ago

@shartge are you absolutely sure you were running with a 5.2-based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.

I am 100% sure, as this kernel (5.2.9-2~bpo10+1) was the only kernel installed on that system at that moment.

Also the version I copy-pasted was directly from uname -a.

Edit: Interesting bit: I was not able to reproduce this with stress-ng, as @ggzengel was, but mprime triggered it right away.

Edit²: Here is the line stress-ng logged via syslog:

Sep 25 13:35:46 storage-01 stress-ng: system: 'storage-01' Linux 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64

This was 10 minutes before my first comment. I let stress-ng run for ~6 minutes with a scrub running at the same time. When that did not show any failures, I retested with mprime -t at 13:42, which hit the problem immediately at 13:43.

Edit³: I also checked that the hardware is fine, of course: without ZFS, mprime -t ran for 2 hours without any errors.

behlendorf commented 4 years ago

@shartge would you mind checking the dkms build directory to verify that HAVE_KERNEL_TIF_NEED_FPU_LOAD was defined in the zfs_config.h file?

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1
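
On a dkms install, the generated header typically sits in the module build tree under /var/lib/dkms (the exact path depends on the packaged module version), e.g.:

grep -B1 HAVE_KERNEL_TIF_NEED_FPU_LOAD /var/lib/dkms/zfs/0.8.1/build/zfs_config.h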
shartge commented 4 years ago

I will, but it will have to wait until tomorrow, because right now I have reverted the system back to 4.19 and 0.7.2 and I have to wait until the backup window has finished. See https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535153689

shartge commented 4 years ago

Scratch that, I don't need that specific system to test the build, I can just use any Debian Buster system for that, for example any of my test VMs.

Using Linux debian-buster 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux and zfs-dkms 0.8.1-4~bpo10+1 I get:

/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1

I am attaching the whole file in case it may be helpful. zfs_config.h.txt

ggzengel commented 4 years ago

@shartge I had to reduce the CPUs to 18 for stress-ng because the scrub was pausing while all 32 CPUs were in use. I use n/2+2 CPUs because I have a NUMA system with 2 nodes.

shartge commented 4 years ago

I now did a real test with the VM I used for the compile test in https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535153689 and I am able to reproduce the bug very quickly.

Using a 4-disk RAIDZ and dd if=/dev/zero of=testdata.dat bs=16M while running mprime -t at the same time quickly results in

[Worker #4 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #3 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #2 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:34] FATAL ERROR: Rounding was 1944592149, expected less than 0.4
[Worker #1 Sep 26 07:34] Hardware failure detected, consult stress.txt file.
[Worker #1 Sep 26 07:34] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #1 Sep 26 07:34] Worker stopped.

CPU for this system is

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping:            0
CPU MHz:             3092.734
BogoMIPS:            6185.46
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt arat md_clear flush_l1d arch_capabilities

Kernel and ZFS version can be found in https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535153689

ggzengel commented 4 years ago

This is a VMware guest. Does VMware have special FPU/IRQ handling inside the kernel, or does it have a bug?

shartge commented 4 years ago

This should not matter, as I can reproduce the same problem on 2 physical systems.

But because both of them are production storage systems, it is easier for me to do this in a VM, as long as it shows the same behaviour.

ggzengel commented 4 years ago

The worst thing is that inside KVM the VMs get FPU errors too, even though they don't use ZFS. I started a Debian live CD inside Proxmox, installed stress-ng, and got a lot of errors when I started a ZFS scrub on the host.

shartge commented 4 years ago

This does not happen with VMware ESX for me. I've been running mprime -t in my test VM since 07:00 today and have not seen a single error.

Only when ZFS is active and I put I/O load on it do the FPU errors start to occur.

The same also happened for me with the two physical systems I used to test this.

ggzengel commented 4 years ago

@shartge Are you using ZFS on the VMware host?

shartge commented 4 years ago

No!

I just quickly created a test VM to test the compilation of the module without needing to use and interrupt my production storage systems.

And I also tried to reproduce this issue here in a VM instead of on a physical host, which, as I have shown, I was successful in doing.

But, again: The error is reproducible on normal hardware with 5.2 and 0.8.1. (Using a VM is just more convenient.)

ggzengel commented 4 years ago

Summary:

  1. This happens only with ZFS 0.8.x.
  2. FPU errors always occur with kernels 4.19 - 5.1.
  3. It shouldn't happen with kernel 5.2, but there are exceptions:
     3.1. @shartge gets FPU errors even with kernel 5.2 too.
     3.2. @alex-gh and I didn't get errors with kernel 5.2.
  4. I get FPU errors inside a KVM VM with ZFS 0.8.x and kernel 5.0 running on the host side (Proxmox). There is no ZFS code inside the VM.
  5. The workaround is:
     5.1. Run:
          echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
          echo scalar > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
     5.2. For persistence, add zfs.zfs_vdev_raidz_impl=scalar zcommon.zfs_fletcher_4_impl=scalar to the kernel parameters (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on Debian, then run update-grub).
shartge commented 4 years ago

@shartge gets FPU errors even with kernel 5.2 too

Note that this is with the code patched by Debian for both the Kernel and ZFS. I have yet to try the vanilla ZFS code with 5.2.

It could very well be that the inclusion of https://github.com/zfsonlinux/zfs/commit/e5db31349484e5e859c7a942eb15b98d68ce5b4d by Debian causes this.

ggzengel commented 4 years ago

With Buster and 5.2 I don't get the FPU errors, but it's a Xen dom0: https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-534873377

shartge commented 4 years ago

Who knows what Xen does with the FPU state in the Dom0. It could be a false negative as well.

ggzengel commented 4 years ago

Now I checked it with 5.2 and without Xen. No FPU errors.


# cat /etc/apt/sources.list | grep -vE "^$|^#"
deb http://deb.debian.org/debian/ buster main non-free contrib
deb http://security.debian.org/debian-security buster/updates main contrib non-free
deb http://deb.debian.org/debian/ buster-updates main contrib non-free
deb http://deb.debian.org/debian/ buster-backports main contrib non-free

# dkms status
zfs, 0.8.1, 4.19.0-6-amd64, x86_64: installed
zfs, 0.8.1, 5.2.0-0.bpo.2-amd64, x86_64: installed

# uname -a
Linux xenserver2.donner14.private 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux

# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping:            2
CPU MHz:             2599.803
CPU max MHz:         3200.0000
CPU min MHz:         1200.0000
BogoMIPS:            4799.63
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d

# modinfo zfs
filename:       /lib/modules/5.2.0-0.bpo.2-amd64/updates/dkms/zfs.ko
version:        0.8.1-4~bpo10+1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
alias:          devname:zfs
alias:          char-major-10-249
srcversion:     FA9BDA7077DD9A40222C4B8
depends:        spl,znvpair,icp,zlua,zunicode,zcommon,zavl
retpoline:      Y
name:           zfs

# apt list | grep zfs | grep installed
libzfs2linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfs-dkms/buster-backports,now 0.8.1-4~bpo10+1 all [installed,automatic]
zfs-zed/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfsutils-linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed]
behlendorf commented 4 years ago

I'm working on a fix for this.

shartge commented 4 years ago

Now I checked it with 5.2 and without Xen. No FPU errors.

Did you try with mprime -t or just stress-ng? I find that I have trouble reliably reproducing this with stress-ng, but mprime hits it in the first minute or faster.

shartge commented 4 years ago

I'm working on a fix for this.

Ah, did you identify a reason for the problem? That is very good to hear/read, indeed!

ThomasLamprecht commented 4 years ago

I'm working on a fix for this.

Ah, did you identify a reason for the problem? That is very good to hear/read, indeed!

Yes, in this thread ;) see: https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-534984486 https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535133283

ggzengel commented 4 years ago

@shartge

Did you try with mprime -t or just stress-ng? I find that I have trouble reliably reproducing this with stress-ng, but mprime hits it in the first minute or faster.

I get more than 5 errors in the first minute with stress-ng.

I use stress-ng --fp-error $n -vvv --verify --timeout 3600 with $n = num_cpu/2+2 or $n = num_cpu-2; both show errors very fast. Once I got 20 errors in 45 seconds while scrubbing at more than 1.5 GB/s. But nothing with kernel 5.2 in one hour.
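
In other words, something like this (num_cpu here is just the nproc output):

n=$(( $(nproc) / 2 + 2 ))   # or: n=$(( $(nproc) - 2 ))
stress-ng --fp-error "$n" -vvv --verify --timeout 3600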

I guess the load with mprime is lower and ZFS gets more time for I/O. How many threads does mprime start?

ggzengel commented 4 years ago

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940932#20

Debian has removed SIMD and released 0.8.2 in unstable:

* Disable linux-5.0-simd-compat.patch due to its incorrect assumption that
  may lead to system instability when SIMD is enabled. (Closes: #940932)
  See also #9346

shartge commented 4 years ago

I guess the load with mprime is lower and ZFS gets more time for I/O. How many threads does mprime start?

mprime -t starts one thread per logical CPU by default.

And while I can also reproduce the problem while scrubbing, normal I/O, for example just dd if=somebigfile of=/dev/null bs=16M, triggers the problem even faster for me, possibly because of the scrub's self-tuning, which lowers its throughput under higher CPU usage.

vstax commented 4 years ago

I guess the load with mprime is lower and ZFS gets more time for I/O. How many threads does mprime start?

The difference may come from the fact that prime95 automatically renices itself to priority 10, so that normal-priority tasks (and some ZFS kernel threads that aren't running at -20 priority) get the CPU more easily. stress-ng does not renice itself unless extra options are specified.

shartge commented 4 years ago

The difference may come from the fact that prime95 automatically renices itself to priority 10, so that normal-priority tasks (and some ZFS kernel threads that aren't running at -20 priority) get the CPU more easily. stress-ng does not renice itself unless extra options are specified.

At least for mprime -t (aka test mode) this is not true. v298b6 runs at niceness 0 for me.

vstax commented 4 years ago

At least for mprime -t (aka test mode) this is not true. v298b6 runs at niceness 0 for me.

You are right, my bad. It only happens when launching normally then selecting "torture test". Hmm. Interesting.

shartge commented 4 years ago

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940932#20

Debian has removed SIMD and released 0.8.2 in unstable:

* Disable linux-5.0-simd-compat.patch due to its incorrect assumption that
  may lead to system instability when SIMD is enabled. (Closes: #940932)
  See also #9346

Just for completeness:

0.8.2-1 does not show the symptoms discussed here, because it disables the use of the FPU on unmodified kernels, falling back to scalar math only.

And on a modified kernel the FPU is used again, but protected by the __kernel_fpu_* functions, which also causes no problems.
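
For anyone wanting to check their own system: reading the same module parameters lists the available implementations, with the active selection in brackets (exact output varies by CPU and ZFS version), e.g.:

# cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2
# cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cycle [fastest] original scalar sse2 ssse3 avx2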

ipkpjersi commented 4 years ago

I cannot replicate this issue with kernel 4.15 on Ubuntu 18.04, but I can replicate it with kernel 5.0 on Ubuntu 18.04. I thought I was going crazy and that the kernel was unstable due to a regression or something; it's interesting to finally find out what was causing my issue. My workaround for now is just sticking with kernel 4.15.

ThomasLamprecht commented 4 years ago

FWIW and FYI, our distribution decided to backport the newer copy_kernel_to_*_err helpers [0] and then effectively made the 5.0 and 5.2 code paths regarding save/restore the same [1]. Could not reproduce the issue here with that anymore.

As we will soon move to a 5.3-based kernel anyway, we should not need to keep this patch too long.

Fabian-Gruenbichler commented 4 years ago

@behlendorf I am not sure whether this is worth an additional/separate issue, but while analyzing the code paths causing this issue together with @GamerSource, we noticed that even the 5.2+ behaviour of using copy_kernel_to_*_err is strictly speaking wrong. Compared to the (GPL-only) variants without _err, they delegate handling of FPU errors/exceptions to the caller. Similarly to the existing call sites of copy_kernel_to_*_err, the calls in ZoL probably need to (at least) invalidate the FPU state and reset it to the initial one iff the return value is non-zero. Unfortunately, it looks like all the helpers for that error handling are again (transitively) GPL-only, since they use the non-_err variants of copy_kernel_to_* and the GPL-only fpstate_init.

See ex_handler_fprestore in arch/x86/mm/extable.c (which is what the non-_err variants use to handle FPU exceptions in the restore path) for the implications of the current, non-error-handling implementation:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/mm/extable.c?h=linux-5.3.y#n96

/*
 * Handler for when we fail to restore a task's FPU state.  We should never get
 * here because the FPU state of a task using the FPU (task->thread.fpu.state)
 * should always be valid.  However, past bugs have allowed userspace to set
 * reserved bits in the XSAVE area using PTRACE_SETREGSET or sys_rt_sigreturn().
 * These caused XRSTOR to fail when switching to the task, leaking the FPU
 * registers of the task previously executing on the CPU.  Mitigate this class
 * of vulnerability by restoring from the initial state (essentially, zeroing
 * out all the FPU registers) if we can't restore from the task's FPU state.
 */

So this is probably not something that needs to be fixed right now, but something to keep in mind if you rework the whole FPU handling. IMHO ZoL will need to re-implement more of the surrounding helpers (probably at least: which instructions to use, preserve, restore) - who knows how long the semi-correct usage of the _err variants stays possible for CDDL modules.
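
A rough sketch of what that caller-side handling would have to look like (illustration only, not actual ZoL code; fpstate_init() and copy_kernel_to_fpregs() are real kernel helpers, but GPL-only, which is exactly the problem described above):

static inline void
kfpu_end(void)
{
        /* assumes the <asm/fpu/internal.h> context used by simd_x86.h */
        union fpregs_state *state = &current->thread.fpu.state;
        int error;

        if (use_xsave())
                error = copy_kernel_to_xregs_err(&state->xsave, -1);
        else if (use_fxsr())
                error = copy_kernel_to_fxregs_err(&state->fxsave);
        else
                error = copy_kernel_to_fregs_err(&state->fsave);

        if (WARN_ON_ONCE(error)) {
                /*
                 * The restore faulted, so the registers now hold unknown
                 * values.  Mirror ex_handler_fprestore(): reset the task's
                 * FPU state to the initial one and load that, instead of
                 * leaking stale register contents.  Both helpers below are
                 * GPL-only, so ZoL cannot actually call them today.
                 */
                fpstate_init(&current->thread.fpu.state);
                copy_kernel_to_fpregs(&current->thread.fpu.state);
        }

        local_irq_enable();
        preempt_enable();
}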

0xFelix commented 4 years ago

Could this have possibly led to silent data corruption? Or would it only cause checksumming to fail?

behlendorf commented 4 years ago

@0xFelix it will not lead to internal filesystem corruption. What may happen is that user-space code using the FPU gets incorrect values, leading to stability or other problems in that user-space code.

Fabian-Gruenbichler commented 4 years ago

On October 2, 2019 10:45 pm, Felix wrote:

Could this have possibly led to silent data corruption? Or would it only cause checksumming to fail?

Absolutely. Just not in ZFS, but in anything in userspace that used the FPU in parallel with ZFS computing checksums (or anything else, for that matter). Corruption might be silent and undetectable, silent but detectable (e.g. checksums that now fail verification, but with potentially good data), or noisy (like the original reports here in this issue).

graham00 commented 4 years ago

My apologies if this is too basic of a question here, but do the SIMD complications in general only impact raidz arrangements, or could raid10 vdev arrangements also be impacted (or replication)?

I've spent a few hours reading here and there trying to find the answer to that, but I'm still not certain.

ThomasLamprecht commented 4 years ago

My test only triggered this issue when doing a scrub concurrently. But the bulk of the testing was done on a RAID10, with only ~1 hour on a RAIDZ1; the latter should be problematic too, IIUC, but I could not trigger the issue there with the stress-ng test at all. As said, limited testing was done in this regard, so take it with a grain of salt.

But scrubs definitely made this issue show up very quickly, independent of RAID mode. So, IMHO, all (userspace) data operations done during a scrub, with this patch applied and under a problematic kernel, need to be rechecked. This means every compressed archive, for example.

behlendorf commented 4 years ago

I've opened PR #9406 with the fix for this issue. It passes all of the testing I've been able to throw at it, but it would still be helpful to confirm the fix on a wider variety of hardware. If anyone is willing to help with the testing, it would be appreciated. As described in the first comment, running the mprime -t test concurrently with a scrub is a great way to stress-test the PR.

shartge commented 4 years ago

Can I just patch the 0.8.2 currently in Debian with https://github.com/zfsonlinux/zfs/pull/9406/commits/b7be6169c1702ea79498309228676744762d139b and test the result, or do I need other changes from zfsonlinux:master to do this reliably?

shartge commented 4 years ago

To answer my own question (sorry for not testing before commenting): no, https://github.com/zfsonlinux/zfs/commit/b7be6169c1702ea79498309228676744762d139b does not apply cleanly to 0.8.2.

behlendorf commented 4 years ago

To help facilitate testing, I've created a branch in my repository, based off 0.8.2, which applies only the needed SIMD patches. An easy way to test the fix is to clone it, build it, load the new kmods directly from the build tree, and run mprime -t. Alternatively, you can of course build and install packages.

# Clone it and checkout the zfs-0.8.2-simd branch.
git clone https://github.com/behlendorf/zfs.git --branch zfs-0.8.2-simd

# Build it in tree.
cd zfs
sh autogen.sh
./configure
make -j$(nproc)

# Export the pool and unload the system provided zfs modules.
sudo ./cmd/zpool/zpool export <mypool>
sudo ./scripts/zfs.sh -u

# Load the freshly built kmods, import your pool, start the scrub.
sudo ./scripts/zfs.sh
sudo ./cmd/zpool/zpool import <mypool>
sudo ./cmd/zpool/zpool scrub <mypool>

# Download and launch mprime
mkdir ../mprime
cd ../mprime
wget http://www.mersenne.org/ftp_root/gimps/p95v298b6.linux64.tar.gz
tar -xf p95v298b6.linux64.tar.gz
./mprime -t

https://github.com/behlendorf/zfs/tree/zfs-0.8.2-simd

cdluminate commented 4 years ago

I've cherry picked the 4 patches to the simd branch on top of debian's 0.8.2-2 package: https://salsa.debian.org/zfsonlinux-team/zfs/commit/9031b0db41ef0e2675d5a88f076bf001f1ea86f1 which would be convenient for Debian users to do the test. @happyaron