madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Segfault in testmisc test on MacOS/ARM (the same code succeeds on MacOS/x86_64) - disable it as a workaround #838

Closed valassi closed 2 months ago

valassi commented 2 months ago

The CI for PR #832 is now failing the tests on MAC with a segmentation fault, details below.

NB: the code of the failing test has NOT CHANGED since the same test ran successfully on April 15, see also below. In other words, this is a problem with the Mac environment or with the configuration of the Mac CI.

I do not have an interactive Mac, and even if I had one I am not sure that I would be able to reproduce the issue (since one month ago there was no issue).

I will try to check if the compiler changed or anything similar.
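
One way to make a compiler or architecture change visible directly in the test log is to print what the compiler itself reports through its predefined macros. The sketch below is only an illustration and is not code from this repository: the names `targetArch`, `compilerName`, `compilerMajor` and `printBuildInfo` are hypothetical, and only the common clang/gcc macros are handled.

```cpp
#include <iostream>

// Architecture the compiler is targeting, taken from predefined macros.
constexpr const char* targetArch()
{
#if defined(__aarch64__) || defined(__arm64__)
  return "arm64";
#elif defined(__x86_64__) || defined(_M_X64)
  return "x86_64";
#else
  return "unknown";
#endif
}

// Compiler family (Apple clang defines __apple_build_version__).
constexpr const char* compilerName()
{
#if defined(__apple_build_version__)
  return "Apple clang";
#elif defined(__clang__)
  return "clang";
#elif defined(__GNUC__)
  return "gcc";
#else
  return "unknown";
#endif
}

// Compiler major version, or 0 if the compiler is not recognised.
constexpr int compilerMajor()
{
#if defined(__clang__)
  return __clang_major__;
#elif defined(__GNUC__)
  return __GNUC__;
#else
  return 0;
#endif
}

// Hypothetical helper: print one line describing the build.
inline void printBuildInfo( std::ostream& out = std::cout )
{
  out << "INFO: built with " << compilerName() << " " << compilerMajor()
      << " for " << targetArch() << std::endl;
}
```

Calling `printBuildInfo()` at the start of a test executable would put this line next to the test results, so that two CI logs can be compared without going back to the separate `info` step.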

--

Failing job May 15 (https://github.com/madgraph5/madgraph4gpu/pull/832/commits/7a25b556d959b0e92a7e6bf4cdfd99434122ff74) https://github.com/madgraph5/madgraph4gpu/actions/runs/9075595776/job/24936514897?pr=832

Run actions/checkout@v2
Syncing repository: madgraph5/madgraph4gpu
Getting Git version info
Copying '/Users/runner/.gitconfig' to '/Users/runner/work/_temp/46171b91-3bd0-497e-a184-c585acd3a345/.gitconfig'
Temporarily overriding HOME='/Users/runner/work/_temp/46171b91-3bd0-497e-a184-c585acd3a345' before making global git config changes
Adding repository directory to the temporary git global config as a safe directory
/opt/homebrew/bin/git config --global --add safe.directory /Users/runner/work/madgraph4gpu/madgraph4gpu
Deleting the contents of '/Users/runner/work/madgraph4gpu/madgraph4gpu'
Initializing the repository
Disabling automatic garbage collection
Setting up auth
Fetching the repository
Determining the checkout info
Checking out the ref
/opt/homebrew/bin/git log -1 --format='%H'
'a42e271109488aa97224e91c8b6aec83b055c57e'

...

Run make AVX=none OMPFLAGS= FPTYPE=d -C epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_epem_mupmum -f cudacpp.mk info
cudacpp.mk:142: CUDA_HOME was not set: using ""
cudacpp.mk:148: HIP_HOME was not set: using ""
cudacpp.mk:234: CUDA_HOME is not set or is invalid: export CUDA_HOME to compile with cuda
cudacpp.mk:235: HIP_HOME is not set or is invalid: export HIP_HOME to compile with hip
OMPFLAGS=
AVX=none
FPTYPE=d
HELINL=0
HRDCOD=0
HASCURAND=hasNoCurand
HASHIPRAND=hasNoHiprand
Building in BUILDDIR=. for tag=none_d_inl0_hrd0_hasNoCurand_hasNoHiprand (USEBUILDDIR is not set)

Darwin Mac-1715670763535.local arm
machdep.cpu.brand_string: Apple M1 (Virtual)
hw.physicalcpu: 3
hw.logicalcpu: 3

USECCACHE=

GPUCC=

CXX=c++
c++ --version
Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin23.4.0
Thread model: posix
InstalledDir: /Applications/Xcode_15.0.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

FC=gfortran-11
gfortran-11 --version
GNU Fortran (Homebrew GCC 11.4.0) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

...

Run make AVX=none OMPFLAGS= FPTYPE=d -C epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_epem_mupmum -f cudacpp.mk check
cudacpp.mk:142: CUDA_HOME was not set: using ""
cudacpp.mk:148: HIP_HOME was not set: using ""
cudacpp.mk:234: CUDA_HOME is not set or is invalid: export CUDA_HOME to compile with cuda
cudacpp.mk:235: HIP_HOME is not set or is invalid: export HIP_HOME to compile with hip
OMPFLAGS=
AVX=none
FPTYPE=d
HELINL=0
HRDCOD=0
HASCURAND=hasNoCurand
HASHIPRAND=hasNoHiprand
Building in BUILDDIR=. for tag=none_d_inl0_hrd0_hasNoCurand_hasNoHiprand (USEBUILDDIR is not set)
./runTest.exe
INFO: The application does not require the host to support any AVX feature
[==========] Running 3 tests from 3 test suites.
make: *** [runTest] Segmentation fault: 11
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_XXX
[ RUN      ] SIGMA_SM_EPEM_MUPMUM_CPU_XXX.testxxx
[       OK ] SIGMA_SM_EPEM_MUPMUM_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_XXX (0 ms total)

[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_MISC
[ RUN      ] SIGMA_SM_EPEM_MUPMUM_CPU_MISC.testmisc
Error: Process completed with exit code 2.

Successful job April 15 (https://github.com/madgraph5/madgraph4gpu/pull/832/commits/d6986688d70c786816b54b13f38111f556c23af8) https://github.com/madgraph5/madgraph4gpu/actions/runs/8685948894/job/23816454188

Run actions/checkout@v2
Syncing repository: madgraph5/madgraph4gpu
Getting Git version info
Copying '/Users/runner/.gitconfig' to '/Users/runner/work/_temp/a9f42afd-fa92-423f-bea8-59904f629cab/.gitconfig'
Temporarily overriding HOME='/Users/runner/work/_temp/a9f42afd-fa92-423f-bea8-59904f629cab' before making global git config changes
Adding repository directory to the temporary git global config as a safe directory
/usr/local/bin/git config --global --add safe.directory /Users/runner/work/madgraph4gpu/madgraph4gpu
Deleting the contents of '/Users/runner/work/madgraph4gpu/madgraph4gpu'
Initializing the repository
Disabling automatic garbage collection
Setting up auth
Fetching the repository
Determining the checkout info
Checking out the ref
/usr/local/bin/git log -1 --format='%H'
'758acc0cd90af7ecdd4358a23ee44ee103b81f31'

...

Run make AVX=none OMPFLAGS= FPTYPE=d -C epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_epem_mupmum -f cudacpp.mk info
cudacpp.mk:142: CUDA_HOME was not set: using ""
cudacpp.mk:148: HIP_HOME was not set: using ""
cudacpp.mk:234: CUDA_HOME is not set or is invalid: export CUDA_HOME to compile with cuda
cudacpp.mk:235: HIP_HOME is not set or is invalid: export HIP_HOME to compile with hip
OMPFLAGS=
AVX=none
FPTYPE=d
HELINL=0
HRDCOD=0
HASCURAND=hasNoCurand
HASHIPRAND=hasNoHiprand
Building in BUILDDIR=. for tag=none_d_inl0_hrd0_hasNoCurand_hasNoHiprand (USEBUILDDIR is not set)

Darwin Mac-1713169328310.local i386
machdep.cpu.brand_string: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
machdep.cpu.brand: 0
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH MMX FXSR SSE SSE2 SS HTT PBE SSE3 PCLMULQDQ DTES64 DSCPL VMX SSSE3 FMA CX16 TPR SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES VMM PCID XSAVE OSXSAVE SEGLIM64 AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW
hw.physicalcpu: 4
hw.logicalcpu: 4

USECCACHE=

GPUCC=

CXX=c++
c++ --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: x86_64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Applications/Xcode_14.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

FC=gfortran-11
gfortran-11 --version
GNU Fortran (Homebrew GCC 11.4.0) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

...

Run make AVX=none OMPFLAGS= FPTYPE=d -C epochX/cudacpp/ee_mumu.mad/SubProcesses/P1_epem_mupmum -f cudacpp.mk check
cudacpp.mk:142: CUDA_HOME was not set: using ""
cudacpp.mk:148: HIP_HOME was not set: using ""
cudacpp.mk:234: CUDA_HOME is not set or is invalid: export CUDA_HOME to compile with cuda
cudacpp.mk:235: HIP_HOME is not set or is invalid: export HIP_HOME to compile with hip
OMPFLAGS=
AVX=none
FPTYPE=d
HELINL=0
HRDCOD=0
HASCURAND=hasNoCurand
HASHIPRAND=hasNoHiprand
Building in BUILDDIR=. for tag=none_d_inl0_hrd0_hasNoCurand_hasNoHiprand (USEBUILDDIR is not set)
./runTest.exe
INFO: The application does not require the host to support any AVX feature
[==========] Running 3 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_XXX
[ RUN      ] SIGMA_SM_EPEM_MUPMUM_CPU_XXX.testxxx
[       OK ] SIGMA_SM_EPEM_MUPMUM_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_XXX (0 ms total)

[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_MISC
[ RUN      ] SIGMA_SM_EPEM_MUPMUM_CPU_MISC.testmisc
[       OK ] SIGMA_SM_EPEM_MUPMUM_CPU_MISC.testmisc (4 ms)
[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU_MISC (4 ms total)

[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU/MadgraphTest
[ RUN      ] SIGMA_SM_EPEM_MUPMUM_CPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_epem_mupmum.txt
INFO: No Floating Point Exceptions have been reported
[       OK ] SIGMA_SM_EPEM_MUPMUM_CPU/MadgraphTest.CompareMomentaAndME/0 (19 ms)
[----------] 1 test from SIGMA_SM_EPEM_MUPMUM_CPU/MadgraphTest (19 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 3 test suites ran. (24 ms total)
[  PASSED  ] 3 tests.
INFO: The following Floating Point Exceptions have been reported: FE_INVALID

./check.exe --common -p 2 32 2
./fcheck.exe 2 32 2
Avg ME (C++/C++)    = 1.215805e-02
Avg ME (F77/C++)    = 1.2158051820303425E-002
Relative difference = 1.4972001475186286e-07
OK (relative difference <= 2E-4)

valassi commented 2 months ago

The first puzzling thing is that the commit hash printed out in the log does not seem to be the correct one.

Anyway, what may be going on here is that the job used to succeed on Intel Mac GitHub-hosted runners (reported as i386 in the log):

Darwin Mac-1713169328310.local i386
machdep.cpu.brand_string: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz

It now seems to be running on Mac ARM:

Darwin Mac-1715670763535.local arm
machdep.cpu.brand_string: Apple M1 (Virtual)

This is weird because, from what I read, GitHub-hosted runners are only supposed to use x64.

What I will try to do:

valassi commented 2 months ago

As a workaround I have disabled testmisc on Mac. This allows the CI for PR #832 to succeed, so that this can be merged.
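
For illustration only, a test can be disabled on a single platform with a compile-time guard like the one sketched below. This is not necessarily how the workaround was implemented here: `isAppleArm64` and `runUnlessAppleArm64` are hypothetical names, and the guard shown is narrower than "all Macs" because it only matches macOS on ARM. In a GoogleTest-based test, the same check could instead be followed by `GTEST_SKIP()` at the top of the test body.

```cpp
#include <iostream>

// True only when compiling for macOS on ARM (Apple silicon).
constexpr bool isAppleArm64()
{
#if defined( __APPLE__ ) && ( defined( __aarch64__ ) || defined( __arm64__ ) )
  return true;
#else
  return false;
#endif
}

// Hypothetical wrapper: run testBody unless the build targets macOS/ARM.
// Returns true if the body was executed, false if it was skipped.
template<typename F>
bool runUnlessAppleArm64( const char* name, F testBody )
{
  if( isAppleArm64() )
  {
    std::cout << "[  SKIPPED ] " << name << " (disabled on macOS/ARM)" << std::endl;
    return false;
  }
  testBody();
  return true;
}
```

Printing an explicit message when the test is skipped keeps the workaround visible in the CI log, which makes it harder to forget that the underlying segfault still needs a real fix.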

I have opened two new followup issues

valassi commented 2 months ago

OK, the CI for #832 is now succeeding after disabling this test. Closing this issue with a workaround (the real fix should come in #840).