madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Port to the cray compiler on LUMI? #807

Closed valassi closed 10 months ago

valassi commented 10 months ago

To debug #806, I am testing a different compiler version. This one seems to be among the versions recommended on the login screen.

module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load LUMI/23.09 partition/G
module load cpeGNU/23.09
export CC=cc
export CXX=CC
export FC=ftn

This is

ftn -v
Using built-in specs.
COLLECT_GCC=/opt/cray/pe/gcc/12.2.0/bin/../snos/bin/gfortran
COLLECT_LTO_WRAPPER=/opt/cray/pe/gcc/12.2.0/snos/libexec/gcc/x86_64-suse-linux/12.2.0/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../cpe-gcc-12.2.0-202304182231.7dfee50f41751/configure --prefix=/opt/cray/pe/gcc/12.2.0/snos --disable-nls --libdir=/opt/cray/pe/gcc/12.2.0/snos/lib --enable-languages=c,c++,fortran --with-gxx-include-dir=/opt/cray/pe/gcc/12.2.0/snos/include/g++ --with-slibdir=/opt/cray/pe/gcc/12.2.0/snos/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --build=x86_64-suse-linux --with-ppl --with-cloog --disable-multilib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.2.0 20220819 (HPE) (GCC) 

I found two issues so far.

First, the Fortran source line length is limited: in the error below, the line is cut off at column 72.

make
CUDACPP_BUILDDIR='.'
ccache ftn -w -fPIC -O3 -ffast-math -fbounds-check -extend-source  -w -cpp -c myamp.f -I../../Source/ -I../../Source/PDF/gammaUPC
genps.inc:9:72:

    9 |       parameter (ng = 96, maxdim = 3*(max_particles-2)-1, maxinvar= 4*max_particles, maxconfigs=10)
      |                                                                        1
Error: Symbol 'ma' at (1) has no IMPLICIT type
...

Strangely, there is no option to extend the line length in the madevent makefile, and I am not sure why this works with gfortran at CERN... I fixed this by adding an option to the compiler wrapper:

export FC="ftn -ffixed-line-length-132"
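
As a quick way to see which fixed-form sources are affected, the sketch below lists every line that extends past a given column. gfortran's default fixed-form limit is 72 columns, which matches the column reported in the error above. The helper name `long_lines` and the example paths are illustrative assumptions, not part of the repository.

```shell
# List source lines that extend past a given column.
# long_lines is a hypothetical helper name, not part of madgraph4gpu.
long_lines() {
  limit=$1
  shift
  # Print file:line:length for every line longer than the limit
  awk -v limit="$limit" 'length($0) > limit { print FILENAME ":" FNR ":" length($0) }' "$@"
}

# Example usage (paths are assumptions):
# long_lines 72 genps.inc ../../Source/*.inc
```

Any file reported here needs either `-ffixed-line-length-132` (as above) or manual line continuation to compile in fixed form.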

Second, some special Cray libraries must be added:

ccache CC -o check.exe ./check_sa.o  -ldl -pthread -L../../lib -lmg5amc_common -Wl,-rpath='$ORIGIN/../../lib'  -L../../lib -lmg5amc_gu_ttxu_cpp ./CommonRandomNumberKernel.o ./RamboSamplingKernels.o ./CurandRandomNumberKernel.o 
/usr/bin/ld: ./RamboSamplingKernels.o: in function `void mg5amcCpu::ramboGetMomentaFinal<KernelAccessRandomNumbers<false>, mg5amcCpu::KernelAccessMomenta<false>, KernelAccessWeights<false> >(double, double const*, double*, double*)':
RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0x81): undefined reference to `__cray3_ALOG'
/usr/bin/ld: RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0xb8): undefined reference to `__cray_COSS'
/usr/bin/ld: RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0x13d): undefined reference to `__cray3_ALOG'
/usr/bin/ld: RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0x177): undefined reference to `__cray_COSS'
/usr/bin/ld: RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0x1fb): undefined reference to `__cray3_ALOG'
/usr/bin/ld: RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0x22c): undefined reference to `__cray_COSS'
/usr/bin/ld: RamboSamplingKernels.cc:(.text._ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_[_ZN9mg5amcCpu20ramboGetMomentaFinalI25KernelAccessRandomNumbersILb0EENS_19KernelAccessMomentaILb0EEE19KernelAccessWeightsILb0EEEEvdPKdPdS9_]+0x459): undefined reference to `__cray3_ALOG'
/usr/bin/ld: ../../lib/libmg5amc_gu_ttxu_cpp.so: undefined reference to `__cray_dset_detect'
collect2: error: ld returned 1 exit status

I have not found a solution yet.
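
Since the linker log repeats the same few symbols inside very long mangled names, it can help to reduce it to the unique missing symbols. The following is a minimal sketch that assumes the build output was saved to a file (here called `link.log`, a hypothetical name); `undefined_symbols` is likewise just an illustrative helper.

```shell
# Summarise the unique undefined symbols in a GNU ld error log read from stdin.
# undefined_symbols is a hypothetical helper name.
undefined_symbols() {
  # Extract the name between the backtick and the closing quote,
  # then print each distinct symbol followed by its number of occurrences
  sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" \
    | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }'
}

# Example usage (link.log is an assumed file name):
# undefined_symbols < link.log
```

All symbols reported for the log above start with `__cray`, which is why the problem looks like a missing Cray runtime library rather than a bug in the project's own code.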

valassi commented 10 months ago

I was able to get rid of the link error with the following setup

module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load LUMI/23.09 partition/G
module load cpeGNU/23.09
export CC="cc --cray-bypass-pkgconfig -craype-verbose"
export CXX="CC --cray-bypass-pkgconfig -craype-verbose"
export FC="ftn  --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132"

Unfortunately, this does not fix the crash in #806. The same stack trace remains from gdb.

Note: to make the above setup work, I had to re-enable multi-word CXX for HIP (which must be disabled for CUDA, see #505). This is the fix: https://github.com/madgraph5/madgraph4gpu/pull/801/commits/68b589d49bcae9512d490089be7606e8a4a3f5b7
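
To illustrate what "multi-word CXX" means in practice: a value such as `CC --cray-bypass-pkgconfig -craype-verbose` is an executable followed by flags, so any build logic that treats `$CXX` as a single file name will break. The sketch below only shows the splitting in plain shell. It is not the makefile logic from the linked commit, and the helper names `compiler_exe` and `compiler_flags` are hypothetical.

```shell
# Split a multi-word compiler setting into the executable and its extra flags.
# compiler_exe and compiler_flags are hypothetical helper names.
# Assumes the value contains no glob characters or quoted spaces.
compiler_exe() {
  # Unquoted expansion is intentional: it word-splits "CC --flag" into words
  set -- $1
  printf '%s\n' "$1"
}

compiler_flags() {
  set -- $1
  # Drop the executable (if any) and print the remaining words
  [ $# -gt 0 ] && shift
  printf '%s\n' "$*"
}

# Example usage:
# compiler_exe "$CXX"     # the wrapper executable
# compiler_flags "$CXX"   # the flags appended to it
```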

I am closing this issue. We could try investigating other Cray options, but for avoiding the Cray library link errors the above setup is enough, I think.