Wrong SIMD support detection: Illegal instruction caused by the 'vmovups' instruction

yurivict commented 3 years ago

Relevant section of the CMake output:

-- Seaching for SSE...
-- Performing Test DETECTED_SSE_42
-- Performing Test DETECTED_SSE_42 - Success
-- Performing Test DETECTED_SSE_41
-- Performing Test DETECTED_SSE_41 - Success
-- Performing Test DETECTED_SSE_30
-- Performing Test DETECTED_SSE_30 - Success
-- Performing Test DETECTED_SSE_20
-- Performing Test DETECTED_SSE_20 - Success
-- Performing Test DETECTED_SSE_10
-- Performing Test DETECTED_SSE_10 - Success
--   Found SSE 4.2 extensions, using flags:  -msse4.2 -mfpmath=sse
-- Searching for AVX...
-- Performing Test DETECTED_AVX_20
-- Performing Test DETECTED_AVX_20 - Success
-- Performing Test DETECTED_AVX_10
-- Performing Test DETECTED_AVX_10 - Success
--   Found AVX 2.0 extensions, using flags:  -mavx2
-- Searching for FMA...
-- Performing Test DETECTED_FMA
-- Performing Test DETECTED_FMA - Success
--   Found FMA extensions, using flags:  -mfma
-- Searching for NEON...
-- Performing Test DETECTED_NEON
-- Performing Test DETECTED_NEON - Failed
--   No NEON support found
-- C++ compiler flags: -O2 -pipe -fno-omit-frame-pointer -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing -fno-omit-frame-pointer  -isystem /usr/local/include -std=c++14 -pthread -fopenmp=libomp  -msse4.2 -mfpmath=sse  -mavx2  -mfma 
-- C compile flags:    -O2 -pipe -fno-omit-frame-pointer  -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing -pthread -fopenmp=libomp  -msse4.2 -mfpmath=sse  -mavx2  -mfma

CPU identification:

$ cpuid
 eax in    eax      ebx      ecx      edx
00000000 0000000b 756e6547 6c65746e 49656e69
00000001 000106a5 02100800 0098e3bd bfebfbff
00000002 55035a01 00f0b2e4 00000000 09ca212c
00000003 00000000 00000000 00000000 00000000
00000004 1c004121 01c0003f 0000003f 00000000
00000005 00000040 00000040 00000003 00001120
00000006 00000003 00000002 00000001 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000 00000000
00000009 00000000 00000000 00000000 00000000
0000000a 07300403 00000044 00000000 00000603
0000000b 00000001 00000002 00000100 00000002
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 28100800
80000002 65746e49 2952286c 726f4320 4d542865
80000003 37692029 55504320 20202020 20202020
80000004 30333920 20402020 30382e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 01006040 00000000
80000007 00000000 00000000 00000000 00000100
80000008 00003024 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 11

Intel-specific functions:
Version 000106a5:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 26 - 
Stepping 5
Reserved 0

Extended brand string: "Intel(R) Core(TM) i7 CPU         930  @ 2.80GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 2
Hyper threading siblings: 16

Feature flags set 1 (CPUID.01H:EDX): bfebfbff:
FPU    Floating Point Unit
VME    Virtual 8086 Mode Enhancements
DE     Debugging Extensions
PSE    Page Size Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MTRR   Memory Type Range Registers
PGE    PTE Global Flag
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH  CFLUSH instruction
DS     Debug store
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading
TM     Thermal monitor
31     Pending Break Enable

Feature flags set 2 (CPUID.01H:ECX): 0098e3bd:
SSE3     SSE3 extensions
DTES64   64-bit debug store
MONITOR  MONITOR/MWAIT instructions
DS-CPL   CPL Qualified Debug Store
VMX      Virtual Machine Extensions
EST      Enhanced Intel SpeedStep Technology
TM2      Thermal Monitor 2
SSSE3    Supplemental Streaming SIMD Extension 3
CX16     CMPXCHG16B
xTPR     Send Task Priority messages
PDCM     Perfmon and debug capability
SSE4.1   Streaming SIMD Extension 4.1
SSE4.2   Streaming SIMD Extension 4.2
POPCNT   POPCNT instruction

Extended feature flags set 1 (CPUID.80000001H:EDX): 28100800
SYSCALL   SYSCALL/SYSRET instructions
XD-bit    Execution Disable bit
RDTSCP    RDTSCP and IA32_TSC_AUX are available
EM64T     Intel Extended Memory 64 Technology

Extended feature flags set 2 (CPUID.80000001H:ECX): 00000001
LAHF      LAHF/SAHF available in IA-32e mode

Old-styled TLB and cache info:
5a: Data TLB: 2MB or 4MB pages, 4-way set associative, 32 entries
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: Instruction TLB: 2MB or 4MB pages, fully assoc., 7 entries
e4: 3rd-level cache: 8MB, 16-way set associative, 64-byte line size
b2: Instruction TLB: 4-KB Pages, 4-way set associative, 64 entries
f0: 64-byte prefetching
2c: 1st-level data cache: 32-KB, 8-way set associative, 64-byte line size
21: 256-KB L2 (MLC), 8-way set associative, 64 byte line size
ca: Shared 2nd-level TLB: 4-KB Pages, 4-way set associative, 512 entries
09: 1st-level instruction cache: 32KB, 4-way set assoc, 64 byte line size

Processor serial: 0001-06A5-0000-0000-0000-0000

Deterministic Cache Parameters:
index=0: eax=1c004121 ebx=01c0003f ecx=0000003f edx=00000000
> Data cache, level 1, self initializing
> 64 sets, 8 ways, 1 partitions, line size 64
> full size 32768 bytes
> shared between up to 2 threads
> NB this package has up to 8 threads
index=1: eax=1c004122 ebx=00c0003f ecx=0000007f edx=00000000
> Instruction cache, level 1, self initializing
> 128 sets, 4 ways, 1 partitions, line size 64
> full size 32768 bytes
> shared between up to 2 threads
index=2: eax=1c004143 ebx=01c0003f ecx=000001ff edx=00000000
> Unified cache, level 2, self initializing
> 512 sets, 8 ways, 1 partitions, line size 64
> full size 262144 bytes
> shared between up to 2 threads
index=3: eax=1c03c163 ebx=03c0003f ecx=00001fff edx=00000002
> Unified cache, level 3, self initializing
> 8192 sets, 16 ways, 1 partitions, line size 64
> full size 8388608 bytes
> shared between up to 16 threads
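
For readers who want to cross-check the dump above against the flags CMake selected, here is a minimal Python sketch that decodes the relevant feature bits. The two register values are copied from the dump (leaf 00000001 ECX and leaf 00000007 EBX). The bit positions are the ones documented for the CPUID instruction in the Intel SDM. The names `decode`, `LEAF1_ECX_BITS` and `LEAF7_EBX_BITS` are illustrative only, and the sketch assumes the leaf 7 line of the dump corresponds to sub-leaf 0.

```python
# Register values copied from the cpuid dump above.
LEAF1_ECX = 0x0098E3BD  # CPUID.01H:ECX
LEAF7_EBX = 0x00000000  # CPUID.07H:EBX (assumed to be sub-leaf 0)

# Bit positions per the Intel SDM description of the CPUID instruction.
LEAF1_ECX_BITS = {
    "SSE3": 0,
    "SSSE3": 9,
    "FMA": 12,
    "SSE4.1": 19,
    "SSE4.2": 20,
    "OSXSAVE": 27,
    "AVX": 28,
}
LEAF7_EBX_BITS = {"AVX2": 5}


def decode(value, bits):
    """Return {feature name: True/False} for the given register value."""
    return {name: bool((value >> bit) & 1) for name, bit in bits.items()}


features = {
    **decode(LEAF1_ECX, LEAF1_ECX_BITS),
    **decode(LEAF7_EBX, LEAF7_EBX_BITS),
}

for name, present in features.items():
    print(f"{name:8s} {'yes' if present else 'no'}")
```

With these values the sketch reports SSE3 through SSE4.2 as present and FMA, OSXSAVE, AVX and AVX2 as absent, which is consistent with the feature list printed by cpuid and at odds with the `-mavx2 -mfma` flags in the CMake output.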

rserban commented 3 years ago

I do not have access to test this on a processor that doesn't support AVX at all, but just asked a colleague to test it on a processor that has AVX but not AVX2. It worked as expected.

Could you please post your CMakeOutput.log and CMakeError.log files?

yurivict commented 3 years ago

CMakeError.log CMakeOutput.log

rserban commented 3 years ago

On the one processor I have available for testing this type of issue (Intel Xeon E5-2690 v2), CMake Chrono configuration correctly identifies AVX support but no AVX2 support when using clang 10.0.1 (same as what you use).

Having said that, when using GCC we force testing the host architecture by adding -march=native. We do not do that for clang. I am not sure why it works with my setup but doesn't with yours.

I pushed a modification to also set the native architecture when testing SIMD support with clang in the feature/clang branch. Could you please test that code and let me know if it fixes your issue?

yurivict commented 3 years ago

The latest rev. c4921ab from feature/clang still builds with these SIMD flags: -msse4.2 -mfpmath=sse -mavx2 -mfma.

rserban commented 3 years ago

You are not cross-compiling, right? Could you maybe print a message somewhere in the elseif branch at lines 38-41 of cmake/FindAVX.cmake in the feature/clang branch code and check that it gets executed?

yurivict commented 3 years ago

> You are not cross-compiling, right?

Not cross-compiling.

> Could you maybe print a message somewhere in the elseif branch at lines 38-41 of cmake/FindAVX.cmake in the feature/clang branch code and check that it gets executed?

I prefixed it with (Clang) and it printed this:

-- (Clang) Using CPU native flags for AVX optimization:

yurivict commented 3 years ago

-mavx2 is added elsewhere.

rserban commented 3 years ago

Yes, that is added a few lines below when attempting to build a small test program. Without -march=native, clang compiles that program successfully even if the host does not support AVX2 (this is also the case with GCC), so -mavx2 is used from there on. The question is why -march=native isn't appended to AVX_FLAGS at line 39.
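
The difference between "the compiler accepts the flag" and "the host can execute the result" can be illustrated with a small Python model. This is a simplified sketch and not the logic in cmake/FindAVX.cmake: the `FLAG_REQUIRES` table, the two check functions and the `HOST` set are hypothetical names introduced here, and the host feature set is taken from the cpuid output earlier in this issue.

```python
# Hypothetical table: compiler flag -> CPU feature needed to run its code.
FLAG_REQUIRES = {
    "-msse4.2": "SSE4.2",
    "-mavx": "AVX",
    "-mavx2": "AVX2",
    "-mfma": "FMA",
}


def compile_only_check(flag):
    """Model of a compile-only test: the compiler can emit code for any
    instruction set it knows about, whatever the build host supports."""
    return flag in FLAG_REQUIRES


def host_aware_check(flag, host_features):
    """Model of a test that also takes the host CPU into account."""
    return compile_only_check(flag) and FLAG_REQUIRES[flag] in host_features


# SIMD features reported by the Core i7 930 in this issue (no AVX, AVX2, FMA).
HOST = {"SSE", "SSE2", "SSE3", "SSSE3", "SSE4.1", "SSE4.2"}

candidates = ["-msse4.2", "-mavx2", "-mfma"]
compile_only = [f for f in candidates if compile_only_check(f)]
host_aware = [f for f in candidates if host_aware_check(f, HOST)]

print(compile_only)  # ['-msse4.2', '-mavx2', '-mfma']
print(host_aware)    # ['-msse4.2']
```

The compile-only result matches the flags seen in the reported build, while the host-aware result keeps only the flag this CPU can actually execute.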

thepianoboy commented 2 years ago

I'm curious about the behavior of this issue in release 7.0.3. The SIMD changes that I made in PR #384 removed a bit of redundant checking that might have caused some problems. If it's still a problem, hopefully I can shed a little bit more light on the issue as I merge the changes into develop as well.

projectchrono / chrono

Wrong SIMD support detection: Illegal instruction caused by the 'vmovups' instruction #297