aerusso closed this issue 4 years ago
We spent a bit of time going back and forth on IRC about this, and it seems that only the scalar setting makes the problem go away.
An update from the original thread:
A quick update:
I have booted up the Debian live USB on another machine and was able to reproduce this bug with it.
The machine had the Ryzen 5 2600 CPU (the one I swapped with the machine I have originally found the problem on).
The Mainboard is: ASUS PRIME B350-PLUS BIOS Version: 5216
Output of uname -a: Linux debian 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux
Output of zfs --version: zfs-0.8.1-4~bpo10+1 zfs-kmod-0.8.1-4~bpo10+1
Also here are the steps I'm taking to reproduce the problem:
- Start mprime for Linux, 64-bit
- Select Torture Test
- Choose 12 torture test threads in the case of a Ryzen 5 (default setting)
- Select Test (2) Small FFT
- All other settings are set to default settings
- Run the test
- Read data from ZFS by either reading a large file or starting a scrub (raidz scrubs are especially effective)
Within a few seconds you should see mprime reporting errors.
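For convenience, the list above can be sketched as a tiny shell helper. Everything specific here is an assumption on my part (a pool named tank, a large file at /tank/bigfile, mprime in the current directory), and with DRY_RUN=1, the default, the commands are only printed rather than executed:

```shell
#!/bin/sh
# Sketch of the reproduction recipe above. Assumptions (not from the report):
# a pool named "tank", a large file on it, and mprime in the current directory.
# With DRY_RUN=1 (the default here) commands are only printed, never executed.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

repro() {
    pool=${1:-tank}
    run ./mprime -t                                 # torture test; in real use run this in a second terminal
    run zpool scrub "$pool"                         # raidz scrubs are especially effective
    run dd if="/$pool/bigfile" of=/dev/null bs=16M  # or simply read a large file
}
```

In real use the mprime torture test and the ZFS read load must run concurrently for the errors to appear.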
@aerusso thank you for bringing this to our attention. The reported symptoms are consistent with what we'd expect if the FPU registers were somehow not being restored. We'll see if we can reproduce the issue locally using the 4.19 kernel and the provided test case. Would it be possible to try to reproduce the issue using a 5.2 or newer kernel?
Horrifyingly, I can reproduce this in a Debian buster VM on my Intel Xeon-D.
I'm going to guess, since reports of this being on fire haven't otherwise trickled in, that there might be a mismerge in Debian, or a missing follow-up patch?
I did a test with a Manjaro live USB and I could not reproduce this behaviour.
Kernel: 5.2.11-1-MANJARO ZFS package: archzfs/zfs-dkms-git 2019.09.18.r5411.gafc8f0a6f-1
I can reproduce it with kernel 4.19 and stress-ng too. I get more than 5 errors per minute.
With kernel 5.2 there are no errors.
root# zpool scrub zpool1
root# stress-ng --vecmath 9 --fp-error 9 -vvv --verify --timeout 3600
stress-ng: debug: [20635] 32 processors online, 32 processors configured
stress-ng: info: [20635] dispatching hogs: 9 vecmath, 9 fp-error
stress-ng: debug: [20635] cache allocate: default cache size: 20480K
<snip>
stress-ng: fail: [22426] stress-ng-fp-error: exp(DBL_MAX) return was 1.000000 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
stress-ng: fail: [22426] stress-ng-fp-error: exp(-1000000.0) return was 1.000000 (expected 0.000000), errno=0 (expected 34), excepts=0 (expected 16)
stress-ng: fail: [22389] stress-ng-fp-error: log(0.0) return was 51472868343212123638854435100661726861789564087474337372834924821256607581904275443789550923204262543290261262543297927616110435675714711004645013184740565747574812535257726048857959524537318313055909029913182014561534585350486375714439359868335816704.000000 (expected -0.000000), errno=34 (expected 34), excepts=4 (expected 4)
stress-ng: fail: [22426] stress-ng-fp-error: exp(DBL_MAX) return was 0.000000 (expected inf), errno=0 (expected 34), excepts=8 (expected 8)
stress-ng: fail: [22407] stress-ng-fp-error: exp(-1000000.0) return was -304425543965041899037761188749362776730427289735837064756329392319501601366578319214648354685850550352787929416219211679117562590779680584744448269412872882932591437212235151179776.000000 (expected 0.000000), errno=0 (expected 34), excepts=16 (expected 16)
stress-ng: fail: [22397] stress-ng-fp-error: exp(DBL_MAX) return was 1.000315 (expected inf), errno=0 (expected 34), excepts=0 (expected 8)
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 32
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2399.755
BogoMIPS: 4800.04
Hypervisor vendor: Xen
Virtualization type: none
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-31
Flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault intel_ppin ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt
# uname -a
Linux server2 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux
Can confirm this on 5.0 too. It seems that the assumption from the SIMD patch, that with 5.0 and 5.1 kernels disabling preemption and local IRQs is enough, is wrong:
For the 5.0 and 5.1 kernels disabling preemption and local interrupts is sufficient to allow the FPU to be used. All non-kernel threads will restore the preserved user FPU state. -- commit message of commit e5db31349484e5e859c7a942eb15b98d68ce5b4d
If one checks the kernel_fpu_{begin,end} methods from the 5.0 kernel, we can see that they also save the registers. I can fix this issue by doing the same, but my approach was rather cumbersome, as the copy_kernel_to_xregs_err, copy_kernel_to_fxregs_err and copy_kernel_to_fregs_err methods are not available, only those without _err, and as those use the GPL-only symbol ex_handler_fprestore I cannot use them here.
So for my POC fix I ensured that on begin we always save the FP registers, and on end we always restore them; to do so I just copied over the functionality of those methods from the 5.3 kernel. (Not a minimal change; a hacky POC to show the issue.)
diff --git a/include/linux/simd_x86.h b/include/linux/simd_x86.h
index 5f243e0cc..08504ba92 100644
--- a/include/linux/simd_x86.h
+++ b/include/linux/simd_x86.h
@@ -179,7 +180,6 @@ kfpu_begin(void)
preempt_disable();
local_irq_disable();
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
/*
* The current FPU registers need to be preserved by kfpu_begin()
* and restored by kfpu_end(). This is required because we can
@@ -188,32 +188,51 @@ kfpu_begin(void)
* context switch.
*/
copy_fpregs_to_fpstate(&current->thread.fpu);
-#elif defined(HAVE_KERNEL_FPU_INITIALIZED)
/*
* There is no need to preserve and restore the FPU registers.
* They will always be restored from the task's stored FPU state
* when switching contexts.
*/
WARN_ON_ONCE(current->thread.fpu.initialized == 0);
-#endif
}
+#ifndef kernel_insn_err
+#define kernel_insn_err(insn, output, input...) \
+({ \
+ int err; \
+ asm volatile("1:" #insn "\n\t" \
+ "2:\n" \
+ ".section .fixup,\"ax\"\n" \
+ "3: movl $-1,%[err]\n" \
+ " jmp 2b\n" \
+ ".previous\n" \
+ _ASM_EXTABLE(1b, 3b) \
+ : [err] "=r" (err), output \
+ : "0"(0), input); \
+ err; \
+})
+#endif
+
static inline void
kfpu_end(void)
{
-#if defined(HAVE_KERNEL_TIF_NEED_FPU_LOAD)
union fpregs_state *state = &current->thread.fpu.state;
- int error;
+ int err = 0;
if (use_xsave()) {
- error = copy_kernel_to_xregs_err(&state->xsave, -1);
+ u32 lmask = -1;
+ u32 hmask = -1;
+ XSTATE_OP(XRSTOR, &state->xsave, lmask, hmask, err);
} else if (use_fxsr()) {
- error = copy_kernel_to_fxregs_err(&state->fxsave);
+ struct fxregs_state *fx = &state->fxsave;
+ if (IS_ENABLED(CONFIG_X86_32))
+ err = kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+ else
+ err = kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
} else {
- error = copy_kernel_to_fregs_err(&state->fsave);
+ copy_kernel_to_fregs(&state->fsave);
}
- WARN_ON_ONCE(error);
-#endif
+ WARN_ON_ONCE(err);
local_irq_enable();
preempt_enable();
Related to the removal of the SIMD patch in the (future) 0.8.2 release: #9161
With kernel 5.2 there are no errors.
I can reproduce this with mprime -t
on Debian Buster running 5.2.9-2~bpo10+1
and zfs-dkms 0.8.1-4~bpo10+1
after ~1 minute of runtime:
[Worker #1 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #6 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #7 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #8 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #5 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #3 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #2 Sep 25 13:42] Test 1, 76000 Lucas-Lehmer in-place iterations of M4501145 using FMA3 FFT length 224K, Pass1=896, Pass2=256, clm=2.
[Worker #4 Sep 25 13:43] FATAL ERROR: Rounding was 4.029914356e+80, expected less than 0.4
[Worker #4 Sep 25 13:43] Hardware failure detected, consult stress.txt file.
[Worker #4 Sep 25 13:43] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #4 Sep 25 13:43] Worker stopped.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
Stepping: 2
CPU MHz: 1201.117
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 6999.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 10240K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
@GamerSource thanks for digging into this, that matches my understanding of the issue. What I don't quite understand yet is why this wasn't observed during the initial patch testing. It may have been due to my specific kernel configuration. Regardless, I agree the fix here is going to need to save and restore the registers, similar to the 5.2+ support.
@shartge are you absolutely sure you were running with an 5.2 based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.
@shartge are you absolutely sure you were running with an 5.2 based kernel? Only systems running a 4.14 LTS, 4.19 LTS, 5.0, or 5.1 kernel with a patched version of 0.8.1 should be impacted by this.
I am 100% sure, as this kernel, 5.2.9-2~bpo10+1, was the only kernel installed on that system at that moment. Also, the version I copy-pasted was directly from uname -a.
Edit: Interesting bit: I was not able to reproduce this with stress-ng, as @ggzengel was, but mprime triggered it right away.
Edit²: Here is the line stress-ng logged via syslog:
Sep 25 13:35:46 storage-01 stress-ng: system: 'storage-01' Linux 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64
This was 10 minutes before my first comment. I let stress-ng run for ~6 minutes with a scrub running at the same time. When that did not show any failures, I retested with mprime -t at 13:42, which immediately hit the problem at 13:43.
Edit³: I also checked that the hardware is fine, of course. Without ZFS, mprime -t ran for 2 hours without any errors.
@shartge would you mind checking the dkms build directory to verify that HAVE_KERNEL_TIF_NEED_FPU_LOAD was defined in the zfs_config.h file?
/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1
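One way to check this is sketched below. The dkms build path varies by ZFS version and kernel (something like /var/lib/dkms/zfs/<version>/build/zfs_config.h), so the path is an assumption you'll need to adjust; the helper just reports whether the flag is defined in a given header:

```shell
#!/bin/sh
# Sketch: report whether HAVE_KERNEL_TIF_NEED_FPU_LOAD is defined in a
# given zfs_config.h. The dkms path is an assumption; adjust to your system,
# e.g. /var/lib/dkms/zfs/<version>/build/zfs_config.h
have_tif_need_fpu_load() {
    grep -q '^#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1' "$1" 2>/dev/null \
        && echo defined || echo missing
}
```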
I will, but it will have to wait until tomorrow, because right now I have reverted the system back to 4.19 and 0.7.2 and I have to wait until the backup window has finished. See https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535153689
Scratch that, I don't need that specific system to test the build, I can just use any Debian Buster system for that, for example any of my test VMs.
Using Linux debian-buster 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux and zfs-dkms 0.8.1-4~bpo10+1, I get:
/* kernel TIF_NEED_FPU_LOAD exists */
#define HAVE_KERNEL_TIF_NEED_FPU_LOAD 1
I am attaching the whole file in case it may be helpful. zfs_config.h.txt
@shartge I had to reduce the CPU count to 18 for stress-ng because the scrub was pausing while it was using all 32 CPUs. I use n/2+2 CPUs because I have a NUMA system with 2 nodes.
I now did a real test with the VM I used for the compile test in https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535153689 and I am able to reproduce the bug very quickly.
Using a 4-disk RAIDZ and dd if=/dev/zero of=testdata.dat bs=16M while running mprime -t at the same time quickly results in:
[Worker #4 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #3 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #2 Sep 26 07:33] Test 1, 12400 Lucas-Lehmer in-place iterations of M21871519 using FMA3 FFT length 1120K, Pass1=448, Pass2=2560, clm=2.
[Worker #1 Sep 26 07:34] FATAL ERROR: Rounding was 1944592149, expected less than 0.4
[Worker #1 Sep 26 07:34] Hardware failure detected, consult stress.txt file.
[Worker #1 Sep 26 07:34] Torture Test completed 0 tests in 0 minutes - 1 errors, 0 warnings.
[Worker #1 Sep 26 07:34] Worker stopped.
CPU for this system is
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping: 0
CPU MHz: 3092.734
BogoMIPS: 6185.46
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt arat md_clear flush_l1d arch_capabilities
Kernel and ZFS version can be found in https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535153689
This is a VMware guest. Does VMware have special FPU/IRQ handling inside the kernel, or is there a bug?
This should not matter, as I can reproduce the same problem on 2 physical systems. But because both of them are production storage systems, it is easier for me to do this in a VM, as long as it shows the same behaviour.
The worst thing is that inside KVM the VMs get FPU errors too, even though they don't use ZFS. I started a Debian live CD inside Proxmox, installed stress-ng, and got a lot of errors when I started a ZFS scrub on the host.
This does not happen with VMware ESX for me. I've been running mprime -t in my test VM since 07:00 today and have not gotten a single error.
Only when I have ZFS active and put I/O load on it do the FPU errors start to occur.
The same also happened for me with the two physical systems I used to test this.
@shartge Are you using ZFS on VMware host?
No!
I just quickly created a test VM to test the compilation of the module without the need to use and interrupt my production storage systems. And I also tried to reproduce this issue in a VM instead of on a physical host, which, as I have shown, I was successful in doing.
But, again: The error is reproducible on normal hardware with 5.2 and 0.8.1. (Using a VM is just more convenient.)
Summary:
@shartge gets FPU errors even with kernel 5.2 too
Note that this is with the code patched by Debian, for both the kernel and ZFS. I have yet to try the vanilla ZFS code with 5.2.
It could very well be that the inclusion of https://github.com/zfsonlinux/zfs/commit/e5db31349484e5e859c7a942eb15b98d68ce5b4d by Debian causes this.
With Buster and 5.2 I don't get the FPU errors, but it's a dom0 under Xen: https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-534873377
Who knows what Xen does with the FPU state in the dom0. It could be a false negative as well.
Now I checked it with 5.2 and without Xen. No FPU errors.
# cat /etc/apt/sources.list | grep -vE "^$|^#"
deb http://deb.debian.org/debian/ buster main non-free contrib
deb http://security.debian.org/debian-security buster/updates main contrib non-free
deb http://deb.debian.org/debian/ buster-updates main contrib non-free
deb http://deb.debian.org/debian/ buster-backports main contrib non-free
# dkms status
zfs, 0.8.1, 4.19.0-6-amd64, x86_64: installed
zfs, 0.8.1, 5.2.0-0.bpo.2-amd64, x86_64: installed
# uname -a
Linux xenserver2.donner14.private 5.2.0-0.bpo.2-amd64 #1 SMP Debian 5.2.9-2~bpo10+1 (2019-08-25) x86_64 GNU/Linux
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2599.803
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4799.63
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
# modinfo zfs
filename: /lib/modules/5.2.0-0.bpo.2-amd64/updates/dkms/zfs.ko
version: 0.8.1-4~bpo10+1
license: CDDL
author: OpenZFS on Linux
description: ZFS
alias: devname:zfs
alias: char-major-10-249
srcversion: FA9BDA7077DD9A40222C4B8
depends: spl,znvpair,icp,zlua,zunicode,zcommon,zavl
retpoline: Y
name: zfs
# apt list | grep zfs | grep installed
libzfs2linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfs-dkms/buster-backports,now 0.8.1-4~bpo10+1 all [installed,automatic]
zfs-zed/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed,automatic]
zfsutils-linux/buster-backports,now 0.8.1-4~bpo10+1 amd64 [installed]
I'm working on a fix for this.
Now I checked it with 5.2 and without Xen. No FPU errors.
Did you try with mprime -t or just stress-ng? I find that I have trouble reliably reproducing this with stress-ng, but mprime hits it in the first minute or faster.
I'm working on a fix for this.
Ah, did you identify a reason for the problem? That is very good to hear/read, indeed!
I'm working on a fix for this.
Ah, did you identify a reason for the problem? That is very good to hear/read, indeed!
Yes, in this thread ;) see: https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-534984486 https://github.com/zfsonlinux/zfs/issues/9346#issuecomment-535133283
@shartge
Did you try with mprime -t or just stress-ng? I find that I have trouble reliably reproducing this with stress-ng, but mprime hits it in the first minute or faster.
I get more than 5 errors in the first minute with stress-ng.
I use: stress-ng --fp-error $n -vvv --verify --timeout 3600 with $n=num_cpu/2+2 or $n=num_cpu-2; both show errors very quickly. Once I got 20 errors in 45 seconds while scrubbing at more than 1.5 GB/s. But nothing with kernel 5.2 in one hour.
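For reference, the two worker-count heuristics can be written as small helpers (the function names are mine, not stress-ng options):

```shell
#!/bin/sh
# Sketch of the two worker-count heuristics mentioned above.
# workers_numa: n/2+2, used on a 2-node NUMA box so the scrub is not starved.
# workers_max:  n-2, leaves a couple of CPUs free for ZFS I/O threads.
workers_numa() { echo $(( $1 / 2 + 2 )); }
workers_max()  { echo $(( $1 - 2 )); }

# usage (assumed invocation, matching the command above):
#   stress-ng --fp-error "$(workers_numa "$(nproc)")" -vvv --verify --timeout 3600
```

On the 32-CPU system above, workers_numa yields the 18 workers mentioned earlier.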
I guess the load with mprime is lower and ZFS gets more time for I/O. How many threads does mprime start?
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940932#20
Debian has removed SIMD and released 0.8.2 in unstable.
I guess the load with mprime is lower and ZFS gets more time for I/O. How many threads does mprime start?
mprime -t starts one thread per logical CPU by default.
And while I can also reproduce the problem while scrubbing, doing normal I/O, for example just dd if=somebigfile of=/dev/null bs=16M, triggers the problem faster for me. Possibly because of the self-tuning effect of the scrub, which lowers its throughput in case of higher CPU usage.
I guess the load with mprime is lower and ZFS gets more time for I/O. How many threads does mprime start?
The difference may come from the fact that prime95 automatically renices itself to "10" priority, so that normal priority tasks (and some of ZFS kernel threads that aren't running at -20 priority) get CPU more easily. Stress-ng does not renice itself unless extra options are specified.
The difference may come from the fact that prime95 automatically renices itself to "10" priority, so that normal priority tasks (and some of ZFS kernel threads that aren't running at -20 priority) get CPU more easily. Stress-ng does not renice itself unless extra options are specified.
At least for mprime -t (aka test mode) this is not true. v298b6 runs at niceness 0 for me.
At least for mprime -t (aka test mode) this is not true. v298b6 runs at niceness 0 for me.
You are right, my bad. It only happens when launching normally then selecting "torture test". Hmm. Interesting.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940932#20
Debian has removed SIMD and released 0.8.2 in unstable:
* Disable linux-5.0-simd-compat.patch due to its incorrect assumption that may lead to system instability when SIMD is enabled. (Closes: #940932) See also #9346
Just for completeness:
0.8.2-1 does not show the symptoms discussed here, because it disables the use of the FPU on unmodified kernels, using only the scalar math.
And on a modified kernel the FPU is used again, but protected by the __kernel_fpu_* functions, so it also causes no problems.
I cannot replicate this issue with kernel 4.15 on Ubuntu 18.04, but I can replicate it with kernel 5.0 on Ubuntu 18.04. I thought I was going crazy and that the kernel was unstable due to a regression or something; it's interesting to finally find out what was causing my issue. My workaround for now is just sticking with kernel 4.15.
FWIW and FYI, our distribution decided to backport the newer copy_kernel_to_*_err helpers [0] and then effectively made the 5.0 and 5.2 code paths regarding save/restore the same [1]. I could not reproduce the issue here with that anymore.
As we will soon move to a 5.3-based kernel anyway, we should not need to keep this patch too long.
@behlendorf I am not sure whether this is worth an additional/separate issue, but while analyzing the code paths causing this issue together with @GamerSource we noticed that even the 5.2+ behaviour of using copy_kernel_to_*_err is, strictly speaking, wrong. Compared to the (GPL-only) variants without _err, they delegate handling of errors/exceptions by the FPU to the caller. Similarly to the existing call sites of copy_kernel_to_*_err, the calls in ZoL probably need to (at least) invalidate the FPU state/reset it to the initial one iff the return value is non-zero. Unfortunately, it looks like all the helpers for that error handling are again (transitively) GPL-only, since they use the non-_err variants of copy_kernel_to_* and the GPL-only fpstate_init.
See arch/x86/mm/extable.c, ex_handler_fprestore (which is what the non-_err variants use to handle FPU exceptions in the restore path), for the implications of the current implementation's lack of error handling:
/*
* Handler for when we fail to restore a task's FPU state. We should never get
* here because the FPU state of a task using the FPU (task->thread.fpu.state)
* should always be valid. However, past bugs have allowed userspace to set
* reserved bits in the XSAVE area using PTRACE_SETREGSET or sys_rt_sigreturn().
* These caused XRSTOR to fail when switching to the task, leaking the FPU
* registers of the task previously executing on the CPU. Mitigate this class
* of vulnerability by restoring from the initial state (essentially, zeroing
* out all the FPU registers) if we can't restore from the task's FPU state.
*/
So probably not something that needs to be fixed right now, but something to keep in mind if you rework the whole FPU handling. IMHO ZoL will need to re-implement more of the surrounding helpers (probably at least: which instructions to use, preserve, restore); who knows how long the semi-correct usage of the _err variants stays possible for CDDL modules.
Could this have possibly led to silent data corruption? Or would it only cause checksumming to fail?
@0xFelix it will not lead to internal filesystem corruption. What may happen is that user-space code using the FPU gets incorrect values, leading to stability or other problems in that user-space code.
On October 2, 2019 10:45 pm, Felix wrote:
Could this have possibly led to silent data corruption? Or would it only cause checksumming to fail?
Absolutely, just not in ZFS, but in anything in userspace that used the FPU in parallel to compute checksums (or anything else, for that matter). Corruption might be silent and undetectable, silent but detectable (e.g. checksums that now fail verification, but over potentially good data), or noisy (like the original reports here in this issue).
My apologies if this is too basic of a question here, but do the SIMD complications in general only impact raidz arrangements, or could raid10 vdev arrangements also be impacted (or replication)?
I've spent a few hours reading here and there trying to find the answer to that, but I'm still not certain.
My testing only allowed triggering this issue when doing a scrub concurrently. But most of the testing was done on a RAID10 and only ~1 hour on a RAIDZ1; the latter should be problematic too, IIUC, but I could not trigger the issue with the stress-ng test at all. As said, only limited testing was done in this regard, so take it with a grain of salt.
But scrubs definitely made this issue show very quickly, independent of the RAID mode. So, IMHO, all (userspace) data operations done during a scrub with this patch applied under a problematic kernel need to be rechecked. This means every compressed archive, for example.
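Such a recheck can be partially scripted. A minimal sketch for gzip archives follows; the directory argument is an assumption, and note this only catches corruption that breaks the gzip CRC, not silently bad data inside a still-valid archive:

```shell
#!/bin/sh
# Sketch: verify every .gz file under a directory with gzip's built-in
# integrity test (-t) and list any that fail.
check_gzips() {
    dir=${1:-.}
    find "$dir" -name '*.gz' | while read -r f; do
        gzip -t "$f" 2>/dev/null || echo "CORRUPT: $f"
    done
}
```

The same idea applies to other formats with built-in verification, e.g. xz -t or unzip -t.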
I've opened PR #9406 with the fix for this issue. It passes all of the testing I've been able to throw at it, but it would still be helpful to confirm the fix on a wider variety of hardware. If anyone is willing to help with the testing, it would be appreciated. As described in the first comment, running the mprime -t test concurrently with a scrub is a great way to stress test the PR.
Can I just patch 0.8.2 currently in Debian with https://github.com/zfsonlinux/zfs/pull/9406/commits/b7be6169c1702ea79498309228676744762d139b and test the result or do I need other changes from zfsonlinux:master to do this reliably?
To answer my own question (sorry for not testing before commenting): no, https://github.com/zfsonlinux/zfs/commit/b7be6169c1702ea79498309228676744762d139b does not apply cleanly to 0.8.2.
To help facilitate testing I've created a branch in my repository based off 0.8.2 which applies only the needed SIMD patches. An easy way to test the fix is to clone it, build it, load the new kmods directly from the build tree, and run mprime -t. Alternately, you can of course build and install packages.
# Clone it and checkout the zfs-0.8.2-simd branch.
git clone https://github.com/behlendorf/zfs.git --branch zfs-0.8.2-simd
# Build it in tree.
cd zfs
sh autogen.sh
./configure
make -j$(nproc)
# Export the pool and unload the system provided zfs modules.
sudo ./cmd/zpool/zpool export <mypool>
sudo ./scripts/zfs.sh -u
# Load the freshly built kmods, import your pool, start the scrub.
sudo ./scripts/zfs.sh
sudo ./cmd/zpool/zpool import <mypool>
sudo ./cmd/zpool/zpool scrub <mypool>
# Download and launch mprime
mkdir ../mprime
cd ../mprime
wget http://www.mersenne.org/ftp_root/gimps/p95v298b6.linux64.tar.gz
tar -xf p95v298b6.linux64.tar.gz
./mprime -t
I've cherry-picked the 4 patches to the simd branch on top of Debian's 0.8.2-2 package: https://salsa.debian.org/zfsonlinux-team/zfs/commit/9031b0db41ef0e2675d5a88f076bf001f1ea86f1 which should make it convenient for Debian users to do the test. @happyaron
System information
I'm duplicating Debian bug report 940932. Because of the severity of the bug report (claims data corruption), I'm directly posting it here before trying to confirm with the original poster. If this is inappropriate, I apologize, and please close the bug report.
Describe the problem you're observing
Rounding error failure in the mprime torture test that goes away when /sys/module/zfs/parameters/zfs_vdev_raidz_impl and /sys/module/zcommon/parameters/zfs_fletcher_4_impl are set to scalar.
Describe how to reproduce the problem
Quoting the bug report:
Include any warning/errors/backtraces from the system logs
mprime: