Open valassi opened 6 months ago
For reference, the LUMI setup I used, which gives these failures, is:
module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load gcc/12.2.0
export FC=`which gfortran`
The crashes in fgcheck.exe and runTest.exe show the following stack trace:
(gdb) where
#0 0x0000155553542d2b in raise () from /lib64/libc.so.6
#1 0x00001555535443e5 in abort () from /lib64/libc.so.6
#2 0x000015554ba15b65 in ?? () from /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1
#3 0x000015554ba12b39 in ?? () from /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1
#4 0x000015554b9d0497 in ?? () from /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1
#5 0x00001555534de6ea in start_thread () from /lib64/libpthread.so.0
#6 0x000015555361049f in clone () from /lib64/libc.so.6
I also tried the following setup (see #807), but I still get the same crash:
module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load LUMI/23.09 partition/G
module load cpeGNU/23.09
export CC="cc --cray-bypass-pkgconfig -craype-verbose"
export CXX="CC --cray-bypass-pkgconfig -craype-verbose"
export FC="ftn --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132"
I also tried setting export HSA_OVERRIDE_GFX_VERSION=9.0.10 (see https://stackoverflow.com/a/74540452), but this does not help.
This is a bit better: use rocgdb instead of gdb. See https://docs.amd.com/projects/HIP/en/docs-5.0.0/how_to_guides/debugging.html
rocgdb --args ./fgcheck.exe 2 64 2
GNU gdb (rocm-rel-5.2-109) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./fgcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fgcheck.exe 2 64 2
warning: File "/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6.0.30-gdb.py
line to your configuration file "/users/valassia/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/users/valassia/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
GPUBLOCKS= 2
GPUTHREADS= 64
NITERATIONS= 2
__GpuRuntime: calling GpuSetDevice(0)
[New Thread 0x15554acb3700 (LWP 68800)]
WARNING! Instantiate device Bridge (nevt=128, gpublocks=1, gputhreads=128, gpublocks*gputhreads=128)
WARNING! Instantiate host Sampler (nevt=128)
Iteration #1
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate. See "show amdgpu precise-memory".
Thread 3 "fgcheck.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554aa992e8 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) () from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb) where
#0 0x000015554aa992e8 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) ()
from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb)
And
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is off (currently disabled).
(gdb) set amdgpu precise-memory on
(gdb) show amdgpu precise-memory
AMDGPU precise memory access reporting is on (currently enabled).
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fgcheck.exe 2 64 2
warning: File "/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
GPUBLOCKS= 2
GPUTHREADS= 64
NITERATIONS= 2
__GpuRuntime: calling GpuSetDevice(0)
[New Thread 0x15554acb3700 (LWP 69159)]
WARNING! Instantiate device Bridge (nevt=128, gpublocks=1, gputhreads=128, gpublocks*gputhreads=128)
WARNING! Instantiate host Sampler (nevt=128)
Iteration #1
Thread 3 "fgcheck.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 3, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554aa992ac in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) () from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb) where
#0 0x000015554aa992ac in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int, double*, double*, int*, int*) ()
from file:///pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/../../lib/libmg5amc_gu_ttxu_cuda.so#offset=53248&size=75624
(gdb) up
Initial frame selected; you cannot go up.
This is a tough one. I have rebuilt the code in debug mode, to get more info about the memory crash, but... the program does not crash in debug builds. This may require some rather complex debugging.
I definitely think we should merge #801 without waiting for this.
This may be related to #748 (though you mention that CUDA now succeeds?). I can have a look at it together with the other issue (after channelid).
Thanks Stefan. But I think it is almost certainly not related to #748, which is fixed.
Then of course I may be wrong, and maybe there is some similarity to 748?
By the way, this is also "easier" to debug than #748: that one required a complex script comparing CUDA and Fortran cross sections, while here you can see a crash simply by running './runTest.exe' or './fgcheck.exe 2 64 2', or even, as I just checked, the simplest case:
./gcheck.exe -p 1 8 1
Segmentation fault
(NB: on gg_tt.mad, ./gcheck.exe -p 1 8 1 succeeds instead.)
This is also more interesting:
[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > gdb --args ./gcheck.exe -p 1 8 1
GNU gdb (GDB; SUSE Linux Enterprise 15) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.31-150300.52.2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /opt/rocm/lib/libamdhip64.so.5
Try: zypper install -C "debuginfo(build-id)=674564d133650b93a5e2cf6338637fa80b4c1d75"
Missing separate debuginfo for /opt/rocm/lib/libamd_comgr.so.2
Try: zypper install -C "debuginfo(build-id)=b17a3eb35dda04190fe98e5e32aa1aefa968d82f"
Missing separate debuginfo for /opt/rocm/lib/libhsa-runtime64.so.1
Try: zypper install -C "debuginfo(build-id)=646f1143a4c60fefc869dda39d74bb0d24e8b2e2"
[New Thread 0x15554b3c2700 (LWP 2325)]
Thread 1 "gcheck.exe" received signal SIGSEGV, Segmentation fault.
0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
Missing separate debuginfos, use: zypper install libdrm2-debuginfo-2.4.107-150400.1.8.x86_64 libdrm_amdgpu1-debuginfo-2.4.107-150400.1.8.x86_64 libelf1-debuginfo-0.185-150400.5.3.1.x86_64 libgcc_s1-debuginfo-12.3.0+git1204-150000.1.10.1.x86_64 libncurses6-debuginfo-6.1-150000.5.15.1.x86_64 libnuma1-debuginfo-2.0.14.20.g4ee5e0c-150400.1.24.x86_64 libstdc++6-debuginfo-12.3.0+git1204-150000.1.10.1.x86_64 libz1-debuginfo-1.2.11-150000.3.45.1.x86_64
(gdb) where
#0 0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1 0x000000000021d289 in mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
()
#2 0x000000000021592e in main ()
(gdb)
And
[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > rocgdb --args ./gcheck.exe -p 1 8 1
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554b3c2700 (LWP 2482)]
Thread 1 "gcheck.exe" received signal SIGSEGV, Segmentation fault.
0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
(gdb) where
#0 0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1 0x000000000021d289 in mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
()
#2 0x000000000021592e in main ()
(gdb)
I also tried AMD_SERIALIZE_KERNEL=3 and AMD_SERIALIZE_COPY=3, but this does not help:
[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 rocgdb --args ./gcheck.exe -p 1 8 1
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x15554b3c2700 (LWP 3298)]
Thread 1 "gcheck.exe" received signal SIGSEGV, Segmentation fault.
0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
(gdb) where
#0 0x000000000021f9dd in std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1 0x000000000021d289 in mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
()
#2 0x000000000021592e in main ()
Anyway. I think it looks like a nasty memory problem.
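Frames #0 and #1 in the traces above place the crash inside std::map<std::string, float>::operator[], called from mgOnGpu::TimerMap::start(). To show how little such a code path typically does, here is a minimal sketch of a timer map built on the same container type. SketchTimerMap and all of its members are hypothetical stand-ins, written only for illustration; this is not the actual mgOnGpu::TimerMap implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a timer map (NOT the actual mgOnGpu::TimerMap):
// it accumulates elapsed seconds per named partition in a std::map.
class SketchTimerMap
{
public:
  // Stop the active partition (if any), then start timing partition `key`.
  // Returns the seconds spent in the partition that was just stopped.
  float start( const std::string& key )
  {
    assert( !key.empty() ); // precondition of this sketch
    const float last = stop();
    m_partitions[key]; // std::map::operator[]: creates a zero-valued entry if key is new
    m_active = key;
    m_start = Clock::now();
    return last;
  }
  // Stop the active partition and add its elapsed time to the map.
  // Returns 0 if no partition is active.
  float stop()
  {
    if( m_active.empty() ) return 0;
    const float seconds = std::chrono::duration<float>( Clock::now() - m_start ).count();
    m_partitions[m_active] += seconds;
    m_active.clear();
    return seconds;
  }
  // Number of distinct partitions seen so far.
  std::size_t size() const { return m_partitions.size(); }
private:
  using Clock = std::chrono::steady_clock;
  std::map<std::string, float> m_partitions;
  std::string m_active;
  Clock::time_point m_start;
};
```

If the real class is comparably simple, a segmentation fault inside operator[] would more plausibly come from memory that was already corrupted elsewhere than from the map itself, which would be consistent with the suspicion of a memory problem.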
Since the above hinted at TimerMap::start(), I commented out all internal implementations of TimerMap start, stop and dump. Now I get another, equally silly stack trace:
[valassia@nid005274 bash] ~/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu > AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3 rocgdb --args ./gcheck.exe -p 1 8 1
Reading symbols from ./gcheck.exe...
(gdb) run
Starting program: /pfs/lustrep1/users/valassia/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/gcheck.exe -p 1 8 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x0000155555302053 in _Unwind_Resume () from /lib64/libgcc_s.so.1
(gdb) where
#0 0x0000155555302053 in _Unwind_Resume () from /lib64/libgcc_s.so.1
#1 0x000000000020cc71 in main ()
Ok Stefan up to you (but I'll post more if I get other ideas to test). Thanks.
Anyway, again, I definitely think that this is not a blocker for PR #801.
Can you check if you have comments on #801? I will wait for @oliviermattelaer 's review and go ahead anyway.
"Then of course I may be wrong, and maybe there is some similarity to 748?"
As far as I remember, the issues I debugged last autumn were all caused by wrong array indexes in some function parameters for the ME calculations; when the values read were garbage memory, we also saw crashes. I have a few tricks for comparing those. Will follow up later.
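The failure mode described here, a wrong index into a parameter array that ends up reading garbage memory, can be illustrated with a small bounds-checked accessor. This is only a sketch under stated assumptions: checkedAt, sumEventValues and the flat event-by-event buffer layout are hypothetical and are not taken from the madgraph4gpu code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from madgraph4gpu): return buffer[index], but throw
// with a descriptive message instead of silently reading out-of-range memory.
inline double checkedAt( const std::vector<double>& buffer, std::size_t index, const std::string& name )
{
  if( index >= buffer.size() )
    throw std::out_of_range( name + ": index " + std::to_string( index ) +
                             " is out of range for size " + std::to_string( buffer.size() ) );
  return buffer[index];
}

// Hypothetical example: sum the ncomp values of event ievt in a flat buffer
// laid out event by event (an assumed layout, chosen only for illustration).
inline double sumEventValues( const std::vector<double>& values, std::size_t ievt, std::size_t ncomp )
{
  assert( ncomp > 0 ); // precondition of this sketch
  double sum = 0;
  for( std::size_t icomp = 0; icomp < ncomp; ++icomp )
    sum += checkedAt( values, ievt * ncomp + icomp, "values" );
  return sum;
}
```

With a six-element buffer and ncomp = 3, events 0 and 1 are summed normally, while ievt = 2 throws std::out_of_range rather than returning garbage. std::vector::at() gives the same protection without the custom message. This only covers host-side buffers; indexing inside a device kernel would need a different kind of check.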
When I reran the full battery of tests on LUMI, including those on AMD GPUs, there were several crashes (in both the tput and tmad tests).
Examples are in tput/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt and also in tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt.
This is strange and probably difficult to debug because it is specific to HIP and specific to gqttq.
I imagine that in any case this is not a blocker for PR #801. It is probably better to merge PR #801, also so that this code is readily available and can be tested. In any case the HIP stuff in PR #801 works for other physics processes so it is usable at least in some cases.