madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

runTest.exe intermittently fails with illegal instructions in the CI #791

Closed valassi closed 10 months ago

valassi commented 10 months ago

In the new CI workflow I am developing, runTest.exe intermittently fails with illegal instructions in the CI.

I suspect this may be because I am reusing the googletest build? Note, if this is the case it would be better to change the current build.TAG infrastructure of googletest to make sure the tag is more representative (are there different compilers on different CI nodes?).

Try to avoid caching googletest builds, as a workaround.

valassi commented 10 months ago

Example https://github.com/valassi/madgraph4gpu/actions/runs/6754091436/job/18361228639

Run .github/workflows/testsuite_oneprocess.sh tput_test gg_tt.mad
Executing .github/workflows/testsuite_oneprocess.sh tput_test gg_tt.mad

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tput_test (gg_tt.mad) starting at Sat Nov  4 09:42:51 UTC 2023
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Current directory is /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad

*******************************************************************************
*** tput-test gg_tt.mad (P1_gg_ttx)
*******************************************************************************

Testing in /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx

Execute build.512y_d_inl0_hrd0/runTest.exe
.github/workflows/testsuite_oneprocess.sh: line 177:  5509 Illegal instruction     (core dumped) $*

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[testsuite_oneprocess.sh] tput_test (gg_tt.mad) finished with status=132 (NOT OK) at Sat Nov  4 09:42:51 UTC 2023
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Error: Process completed with exit code 132.

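For context on the status=132 in the log above: a shell reports 128 plus the signal number when a command is killed by a signal, and signal 4 is SIGILL on Linux, which matches the "Illegal instruction" message. A minimal sketch of decoding such a status follows; `describe_status` is a hypothetical helper name and not part of testsuite_oneprocess.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): turn a shell exit status
# into a readable description, using the 128+N convention for signals.
describe_status() {
  status="$1"
  if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # 'kill -l N' prints the signal name without the SIG prefix
    echo "killed by signal ${sig} (SIG$(kill -l "$sig"))"
  else
    echo "exited with status ${status}"
  fi
}

describe_status 132
```

Something like this could be called with `$?` right after running the test executable, so that the CI log names the signal instead of only the number.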

valassi commented 10 months ago

I think this is fixed. As in tput throughputX.sh, I needed to skip the tests for builds that the CI node's CPU does not support, for instance the 512y tests on no-AVX512 nodes (SKIP 512y which is not supported - no avx512vl in /proc/cpuinfo).
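The kind of check described here can be sketched roughly as follows. This is an illustration under assumptions, not the actual code in the CI scripts: `has_cpu_flag`, `run_or_skip` and the `CPUINFO_FILE` override are hypothetical names, and the sketch only covers the 512y case mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: has_cpu_flag, run_or_skip and CPUINFO_FILE are hypothetical names.

# Succeeds if the flag appears as a whole word in the cpuinfo file.
# A missing or unreadable file is treated like a missing flag.
has_cpu_flag() {
  grep -qw -- "$1" "${CPUINFO_FILE:-/proc/cpuinfo}" 2>/dev/null
}

# Run the command given after the build name, unless the build name
# indicates a 512y build and the CPU does not advertise avx512vl.
run_or_skip() {
  build="$1"; shift
  case "$build" in
    *512y*)
      if ! has_cpu_flag avx512vl; then
        echo "SKIP ${build} which is not supported - no avx512vl in cpuinfo"
        return 0
      fi
      ;;
  esac
  "$@"
}
```

A call such as `run_or_skip build.512y_d_inl0_hrd0 build.512y_d_inl0_hrd0/runTest.exe` would then print a SKIP line on a node without AVX512 instead of dying with an illegal instruction. Returning 0 on a skip is a design choice made for this sketch so that an unsupported build does not fail the CI job.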