gq_ttq HIP tests crash on AMD GPUs at LUMI (only in non-debug builds)

valassi commented 6 months ago

While rerunning the full battery of tests on LUMI, including those on AMD GPUs, I found several crashes (in both tput and tmad tests).

Example in tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:

cmpExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/gcheck.exe --common -p 2 64 2
cmpExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/fgcheck.exe 2 64 2
Memory access fault by GPU node-4 (Agent handle: 0x693a290) on address 0x1460ca129000. Reason: Unknown.

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x146460faf372 in ???
#1  0x146460fae505 in ???
#2  0x14645f4a2dbf in ???
#3  0x14645f4a2d2b in ???
#4  0x14645f4a43e4 in ???
#5  0x146457975b64 in ???
#6  0x146457972b38 in ???
#7  0x146457930496 in ???
#8  0x14645f43c6e9 in ???
#9  0x14645f57049e in ???
#10  0xffffffffffffffff in ???
Avg ME (C++/CUDA)   = 
Avg ME (F77/CUDA)   = 
ERROR! Fortran calculation (F77/CUDA) crashed

and also

runExe /users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/build.none_d_inl0_hrd0/runTest.exe
Memory access fault by GPU node-4 (Agent handle: 0x667850) on address 0x1454f3e09000. Reason: Unknown.

Example in tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt:

*** (3) EXECUTE MADEVENT_CUDA x1 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.none_d_inl0_hrd0/madevent_cuda < /tmp/valassia/input_gqttq_x1_cudacpp > /tmp/valassia/output_gqttq_x1_cudacpp'
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cuda < /tmp/valassia/input_gqttq_x1_cudacpp > /tmp/valassia/output_gqttq_x1_cudacpp' failed
 PDF set = nn23lo1
 alpha_s(Mz)= 0.1300 running at 2 loops.
 alpha_s(Mz)= 0.1300 running at 2 loops.
 Renormalization scale set on event-by-event basis
 Factorization   scale set on event-by-event basis

 getting user params
Enter number of events and max and min iterations: 
 Number of events and iterations         8192           1           1

This is strange and probably difficult to debug because it is specific to HIP and specific to gqttq:

The same gqttq tests succeed on NVIDIA GPUs on itscrd90
All tests but gqttq succeed on AMD GPUs on LUMI

I imagine that in any case this is not a blocker for PR #801. It is probably better to merge PR #801, also so that this code is readily available and can be tested. In any case the HIP stuff in PR #801 works for other physics processes so it is usable at least in some cases.

valassi commented 6 months ago

For reference, the LUMI setup I used which gives these failures is

module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load gcc/12.2.0
export FC=`which gfortran`

valassi commented 6 months ago

The crashes in fgcheck and runTest.exe are in

(gdb) where
#0  0x0000155553542d2b in raise () from /lib64/libc.so.6
#1  0x00001555535443e5 in abort () from /lib64/libc.so.6
#2  0x000015554ba15b65 in ?? () from /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1
#3  0x000015554ba12b39 in ?? () from /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1
#4  0x000015554b9d0497 in ?? () from /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1
#5  0x00001555534de6ea in start_thread () from /lib64/libpthread.so.0
#6  0x000015555361049f in clone () from /lib64/libc.so.6

valassi commented 6 months ago

I also tried the following setup (see #807), but I still get the same crash

module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load LUMI/23.09 partition/G
module load cpeGNU/23.09
export CC="cc --cray-bypass-pkgconfig -craype-verbose"
export CXX="CC --cray-bypass-pkgconfig -craype-verbose"
export FC="ftn  --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132"

valassi commented 6 months ago

I tried setting export HSA_OVERRIDE_GFX_VERSION=9.0.10 but this does not help. See https://stackoverflow.com/a/74540452

valassi commented 6 months ago

This is a bit better: use rocgdb instead of gdb. See https://docs.amd.com/projects/HIP/en/docs-5.0.0/how_to_guides/debugging.html

rocgdb --args ./fgcheck.exe 2 64 2
GNU gdb (rocm-rel-5.2-109) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./fgcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fgcheck.exe 2 64 2
warning: File "/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6.0.30-gdb.py
line to your configuration file "/users/valassia/.config/gdb/gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/users/valassia/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
 GPUBLOCKS=             2
 GPUTHREADS=           64
 NITERATIONS=           2
__GpuRuntime: calling GpuSetDevice(0)
[New Thread 0x15554acb3700 (LWP 68800)]
WARNING! Instantiate device Bridge (nevt=128, gpublocks=1, gputhreads=128, gpublocks*gputhreads=128)
WARNING! Instantiate host Sampler (nevt=128)
Iteration #1
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 3 "fgcheck.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554aa992e8 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) () from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb) where
#0  0x000015554aa992e8 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) ()
   from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb)

valassi commented 6 months ago

And

(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is off (currently disabled).
(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently enabled).
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fgcheck.exe 2 64 2
warning: File "/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
 GPUBLOCKS=             2
 GPUTHREADS=           64
 NITERATIONS=           2
__GpuRuntime: calling GpuSetDevice(0)
[New Thread 0x15554acb3700 (LWP 69159)]
WARNING! Instantiate device Bridge (nevt=128, gpublocks=1, gputhreads=128, gpublocks*gputhreads=128)
WARNING! Instantiate host Sampler (nevt=128)
Iteration #1

Thread 3 "fgcheck.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554aa992ac in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) () from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb) where
#0  0x000015554aa992ac in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) ()
   from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb) up
Initial frame selected; you cannot go up.

valassi commented 6 months ago

This is a tough one. I have rebuilt the code in debug mode, to get more info about the memory crash, but... the program does not crash in debug builds. This may require some rather complex debugging.
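
One generic way to chase a fault that only shows up in optimized builds is to pre-fill every output buffer with a recognisable sentinel, so that entries which were never written can be counted afterwards instead of waiting for a crash. The sketch below only illustrates that idea: `kPoison`, `poisonedBuffer` and `countPoisoned` are hypothetical names invented for this example, not functions from madgraph4gpu, and nothing in this thread confirms that unwritten memory is the cause here. A plain magic value is used rather than NaN because NaN checks can be unreliable when the code is built with fast-math options.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sentinel (an assumption for this sketch): a value chosen to be
// implausible as a genuine result, so that leftovers are easy to recognise.
constexpr double kPoison = -7.0e300;

// Hypothetical helper (not part of madgraph4gpu): allocate a host buffer
// whose entries all start out holding the sentinel.
inline std::vector<double> poisonedBuffer( std::size_t n )
{
  return std::vector<double>( n, kPoison );
}

// Count the entries that still hold the sentinel, i.e. that were never
// overwritten after the buffer was created.
inline std::size_t countPoisoned( const std::vector<double>& buf )
{
  std::size_t n = 0;
  for( double x : buf )
    if( x == kPoison ) ++n;
  return n;
}
```

If a buffer that should be fully populated after a run still reports a non-zero count, some events were skipped or written somewhere else, which narrows the search even when the build does not crash.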

I definitely think we should merge #801 without waiting for this.

roiser commented 6 months ago

This may be related to #748 (though you mention that CUDA now succeeds?). I can have a look at it together with the other issue (after channelid).

valassi commented 6 months ago

Thanks Stefan. But I think it is almost certainly not related to #748, which is fixed:

#748 was about a cross section difference; this is about a crash
#748 was on all platforms including CUDA (and even C++!); this succeeds for CUDA and C++ and only fails for HIP
Also, this fails on HIP non-debug builds and succeeds on HIP debug builds, and the crash is deep inside the ROCm libraries, so it really looks like a memory issue that only affects HIP

Then of course I may be wrong, and maybe there is some similarity to 748?

By the way, this is also "easier" to debug than 748, because 748 required a complex script comparing CUDA and Fortran cross sections, while here you can see a crash simply by running './runTest.exe' or './fgcheck.exe 2 64 2', or even, as I just checked, the simplest

./gcheck.exe -p 1 8 1
Segmentation fault

(NB on gg_tt.mad ./gcheck.exe -p 1 8 1 succeeds instead)

This is also more interesting

[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > gdb --args ./gcheck.exe -p 1 8 1
GNU gdb (GDB; SUSE Linux Enterprise 15) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.31-150300.52.2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /opt/rocm/lib/libamdhip64.so.5
Try: zypper install -C "debuginfo(build-id)=674564d133650b93a5e2cf6338637fa80b4c1d75"
Missing separate debuginfo for /opt/rocm/lib/libamd_comgr.so.2
Try: zypper install -C "debuginfo(build-id)=b17a3eb35dda04190fe98e5e32aa1aefa968d82f"
Missing separate debuginfo for /opt/rocm/lib/libhsa-runtime64.so.1
Try: zypper install -C "debuginfo(build-id)=646f1143a4c60fefc869dda39d74bb0d24e8b2e2"
[New Thread 0x15554b3c2700 (LWP 2325)]

Thread 1 "gcheck.exe" received signal SIGSEGV, Segmentation fault.
0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
Missing separate debuginfos, use: zypper install libdrm2-debuginfo-2.4.107-150400.1.8.x86_64 libdrm_amdgpu1-debuginfo-2.4.107-150400.1.8.x86_64 libelf1-debuginfo-0.185-150400.5.3.1.x86_64 libgcc_s1-debuginfo-12.3.0+git1204-150000.1.10.1.x86_64 libncurses6-debuginfo-6.1-150000.5.15.1.x86_64 libnuma1-debuginfo-2.0.14.20.g4ee5e0c-150400.1.24.x86_64 libstdc++6-debuginfo-12.3.0+git1204-150000.1.10.1.x86_64 libz1-debuginfo-1.2.11-150000.3.45.1.x86_64
(gdb) where
#0  0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1  0x000000000021d289 in mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
    ()
#2  0x000000000021592e in main ()
(gdb)

And

[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > rocgdb --args ./gcheck.exe -p 1 8 1
GNU gdb (rocm-rel-5.2-109) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554b3c2700 (LWP 2482)]

Thread 1 "gcheck.exe" received signal SIGSEGV, Segmentation fault.
0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
(gdb) where
#0  0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1  0x000000000021d289 in mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
    ()
#2  0x000000000021592e in main ()
(gdb)

valassi commented 6 months ago

I also tried AMD_SERIALIZE_KERNEL and AMD_SERIALIZE_COPY, but this does not help

[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 rocgdb --args ./gcheck.exe -p 1 8 1
GNU gdb (rocm-rel-5.2-109) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554b3c2700 (LWP 3298)]

Thread 1 "gcheck.exe" received signal SIGSEGV, Segmentation fault.
0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
(gdb) where
#0  0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1  0x000000000021d289 in mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
    ()
#2  0x000000000021592e in main ()

valassi commented 6 months ago

Anyway. I think it looks like a nasty memory problem.

Since the above hinted at TimerMap::start(), I commented out the internal implementations of TimerMap start, stop and dump. Now I get another, equally silly stack trace:

[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 rocgdb --args ./gcheck.exe -p 1 8 1
GNU gdb (rocm-rel-5.2-109) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000155555302053 in _Unwind_Resume () from /lib64/libgcc_s.so.1
(gdb) where
#0  0x0000155555302053 in _Unwind_Resume () from /lib64/libgcc_s.so.1
#1  0x000000000020cc71 in main ()

Ok Stefan up to you (but I'll post more if I get other ideas to test). Thanks.

Anyway, again, I definitely think that this is not a blocker for PR #801.

Can you check if you have comments on #801? I will wait for @oliviermattelaer's review and go ahead anyway.

roiser commented 6 months ago

Then of course I may be wrong, and maybe there is some similarity to 748?

From what I remember of the issues I debugged last autumn, they were all related to using wrong array indexes in some function parameters for the ME calculations, and when the values read were garbage memory we also saw crashes. I have a few tricks for comparing those. Will follow up later.
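
As an illustration of the kind of index mistake described above, the sketch below wraps a flat array in a view whose accessor checks the (row, column) pair before touching memory, so that a wrong index raises an exception on the host instead of silently reading garbage. `CheckedView2D` and its row-major `nrows` x `ncols` layout are assumptions made for this example only; they are not the memory layout or an API of madgraph4gpu.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical host-side view (not part of madgraph4gpu): a non-owning,
// row-major 2D window onto a flat array, with bounds-checked access.
template<typename T>
class CheckedView2D
{
public:
  CheckedView2D( T* data, std::size_t nrows, std::size_t ncols )
    : m_data( data ), m_nrows( nrows ), m_ncols( ncols ) {}

  // Return element (irow, icol); throw instead of reading out of range.
  T& at( std::size_t irow, std::size_t icol ) const
  {
    if( irow >= m_nrows || icol >= m_ncols )
      throw std::out_of_range( "CheckedView2D: index (" + std::to_string( irow ) + "," + std::to_string( icol ) + ") outside " + std::to_string( m_nrows ) + "x" + std::to_string( m_ncols ) );
    return m_data[irow * m_ncols + icol];
  }

private:
  T* m_data;           // flat storage owned by the caller
  std::size_t m_nrows; // number of rows
  std::size_t m_ncols; // number of columns (stride of one row)
};
```

With a 2 x 3 view, a call such as `at( 2, 0 )` throws `std::out_of_range`, whereas the equivalent raw pointer arithmetic would read past the end of the array. A check like this only covers host code; catching the same mistake inside a GPU kernel would need a different mechanism.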

madgraph5 / madgraph4gpu

gq_ttq HIP tests crash on AMD GPUs at LUMI (only in non-debug builds) #806