madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Segfault in fgcheck.exe on LUMI (should we link hip, c++, fortran using hipcc or the fortran compiler?) #802

Closed valassi closed 7 months ago

valassi commented 7 months ago

I am doing some tests on the LUMI AMD GPU for PR #801.

The gcheck.exe standard test seems ok.

However fgcheck.exe segfaults.

./fgcheck.exe 2 64 2

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x152d03be3640 in ???
#1  0x152d03be2873 in ???
#2  0x152d02670dbf in ???
#3  0x152d03d56300 in ???
#4  0x152d03d56b64 in ???
#5  0x152d03d548bf in ???
#6  0x20c597 in ???
#7  0x20cd28 in ???
#8  0x152d0265b24c in ???
#9  0x20c3e9 in _start
        at ../sysdeps/x86_64/start.S:120
#10  0xffffffffffffffff in ???
Segmentation fault

gdb does not help either:

(gdb) where
#0  0x0000155554f17300 in ?? () from /usr/lib64/libgfortran.so.4
#1  0x0000155554f17b65 in ?? () from /usr/lib64/libgfortran.so.4
#2  0x0000155554f158c0 in ?? () from /usr/lib64/libgfortran.so.4
#3  0x000000000020c598 in MAIN__ ()
#4  0x000000000020cd29 in main ()

I have done some poor man's debugging by disabling stuff in fcheck_sa.f. It turns out that the error is in very simple stuff: already the READ statements fail.

The above happens when I use gfortran as the Fortran compiler (FC) together with the default cudacpp.mk, where the link (of hip, fortran and c++) is done using hipcc. (For comparison, the same setup with nvcc works ok for cuda in my environments.)

The only thing that I was able to get to work, in this LUMI environment, involves two changes. One, use flang (hidden inside the ROCm installation) instead of gfortran for FC. Two, use that same flang instead of hipcc for linking fgcheck.exe, adding however -lstdc++ -L /opt/rocm-5.2.3/lib/ -lamdhip64 to the link command.
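For reference, the workaround link step can be sketched as a dry run that only prints the command. This is a minimal sketch, not the actual cudacpp.mk rule: the compiler and the three link flags are the ones quoted above, while the object file names are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: print the link command instead of executing it.
# FC and the link flags are taken from the comment above.
# The object file names are hypothetical placeholders.
FC=flang
OBJS="fcheck_sa.o fsampler_hip.o"
HIP_LINK_FLAGS="-lstdc++ -L /opt/rocm-5.2.3/lib/ -lamdhip64"

# Link with the Fortran compiler rather than hipcc, adding the
# C++ standard library and the HIP runtime library explicitly.
LINK_CMD="$FC -o fgcheck.exe $OBJS $HIP_LINK_FLAGS"
echo "$LINK_CMD"
```

The point of the sketch is that the Fortran compiler drives the link, so its own runtime is pulled in automatically, and the C++ and HIP libraries have to be named explicitly.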

This is a problem I have observed only for fgcheck.exe so far, but I guess that I would get the same for madevent? Maybe not, because it seems that we are actually linking madevent with the fortran compiler already (which is what would work here with flang). So I guess that we should probably always link fortran, c++ and GPU code with the fortran compiler? I will do more checks tomorrow.

By the way, if I tested this correctly, the workaround above does seem to need flang, so using gfortran for linking would not be ok. Maybe this would be easier with some nicer compiler combinations. I am not sure if @Jooorgen has observed anything like this?

I keep the details here for reference.

valassi commented 7 months ago

This is fixed in PR #801.

I ended up using FC for linking all fortran/c++ together in cudacpp.mk (I did not touch madevent and am not even sure what we are doing there for linking fortran/c++/hip). https://github.com/madgraph5/madgraph4gpu/pull/801/commits/5c27ed64ed7bd9ed37e439aac23284082c06e759

Note: in the end I went back to hip, gcc and gfortran. I had tried hip, clang and flang, which works in SA cudacpp, but fails on madevent because flang gives zillions of F90 errors on the madevent files (see #804). To use gfortran, I also had to add -lpthread explicitly: https://github.com/madgraph5/madgraph4gpu/pull/801/commits/2fc0d87823bdbfb461899e1454c3ea8a0b90490b
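As a rough illustration of that last point, the sketch below appends -lpthread only when FC is gfortran and prints the resulting link command. It is not the code from the linked commit. It assumes that the -lstdc++ and -lamdhip64 flags from the first comment are still needed when gfortran does the link, and the object file name is a hypothetical placeholder.

```shell
#!/bin/sh
# Dry-run sketch: print the link command instead of executing it.
# Assumption: the C++ and HIP libraries from the first comment are
# still required when gfortran is used as the linker driver.
FC=gfortran
LIBS="-lstdc++ -L /opt/rocm-5.2.3/lib/ -lamdhip64"

# With gfortran, -lpthread must be added explicitly.
case "$FC" in
  *gfortran*) LIBS="$LIBS -lpthread" ;;
esac

# fcheck_sa.o is a hypothetical placeholder for the real object list.
LINK_CMD="$FC -o fgcheck.exe fcheck_sa.o $LIBS"
echo "$LINK_CMD"
```

Matching on the compiler name keeps the extra flag out of builds that use a different FC, where it may not be needed.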

This can be closed.