refresh-bio / SPLASH


SIGILL Error on some linux kernels [BUG] #2

Closed alexdhill closed 1 year ago

alexdhill commented 1 year ago

Context

Operating system: Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-67-generic x86_64)

Expected Behavior

Downloading the precompiled binaries or building from source should yield executables that print the usage when executed.

Current Behavior

Downloaded or built executables throw SIGILL [Illegal instruction: (core dumped)] when called. Nomad gives error 'cannot find version number for satc'.

I have downloaded the precompiled binaries and also cloned the source to build nomad, and in both cases all the compiled executables (satc, satc_merge, satc_dump, etc.) throw SIGILL errors [Illegal instruction: (core dumped)].

Lowering the optimization level from -O3 yields working builds of most of the executables, but satc_merge and sig_anch still throw errors.

Reproducing the issue

  1. Launch a VM using kernel version 5.15.0-67.

  2. Either (a) git clone NOMAD or R-NOMAD into the VM, or (b) download the binaries into the VM.

  3. Enter the VM and verify with uname -r that the kernel is Linux 5.15.0-67-generic x86_64.

  4. Execute nomad or any of the executables.

Potential Problem/Solution

I have recently tried running on another Ubuntu system (Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-60-generic x86_64)), and the files that cannot be run on my primary server run there with no issues. It appears that NOMAD runs correctly with kernel version 5.15.0-60 but not with version 5.15.0-67.

marekkokot commented 1 year ago

Hi,

thanks for reporting. It seems the problem is related to the -mavx flag; at least on my Ubuntu VM, removing this flag solves the issue. Could you try this on your machine and let me know? I think we will need to remove this flag in the next release. Could you also please send me the output of lscpu on your machine?

alexdhill commented 1 year ago

Removing -mavx worked: all executables now build and run on our system.

lscpu returns:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 44 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
CPU family: 6
Model: 46
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
Stepping: 6
BogoMIPS: 4522.12
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm pti ssbd ibrs ibpb stibp dtherm ida flush_l1d
Caches (sum of all):
  L1d: 1 MiB (32 instances)
  L1i: 1 MiB (32 instances)
  L2: 8 MiB (32 instances)
  L3: 96 MiB (4 instances)
NUMA:
  NUMA node(s): 4
  NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
  NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
  NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62
  NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
Vulnerabilities:
  Itlb multihit: KVM: Mitigation: VMX unsupported
  L1tf: Mitigation; PTE Inversion
  Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
  Meltdown: Mitigation; PTI
  Mmio stale data: Unknown: No mitigations
  Retbleed: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds: Not affected
  Tsx async abort: Not affected
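The Flags line in this lscpu output has no avx entry, which would be consistent with binaries built with -mavx raising SIGILL on this CPU. A minimal Python sketch of checking for that is below; it is not part of NOMAD or SPLASH. The helper names cpu_flags and supports are hypothetical, and REPORTED_FLAGS is only an excerpt of the flags pasted above.

```python
def cpu_flags(text):
    """Return the CPU feature flags found in lscpu or /proc/cpuinfo text.

    Hypothetical helper: looks for the first line whose key is "flags"
    (case-insensitive) and splits its value on whitespace.
    """
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            return set(value.split())
    return set()


def supports(text, feature):
    """True if the exact feature name (e.g. "avx") is listed in the flags."""
    return feature in cpu_flags(text)


# Excerpt of the Flags line reported in this thread (not the full list).
REPORTED_FLAGS = "Flags: fpu vme de pse sse sse2 ssse3 sse4_1 sse4_2 popcnt lahf_lm"

if __name__ == "__main__":
    for feature in ("sse4_2", "avx"):
        state = "present" if supports(REPORTED_FLAGS, feature) else "missing"
        print(f"{feature}: {state}")
```

On a Linux machine the same check can be pointed at the live system by passing the contents of /proc/cpuinfo, for example supports(open("/proc/cpuinfo").read(), "avx"). Matching whole flag names in a set avoids confusing avx with avx2 or avx512f.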

marekkokot commented 1 year ago

Great, thanks. We will need to test on our servers if it is relevant for performance, and if not, just remove this flag in the next release. Thank you again.