madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Build and test with HIP on AMD GPUs #311

Closed — valassi closed this issue 9 months ago

valassi commented 2 years ago

Just opening a placeholder for a possible HIP implementation for AMD, in parallel to CUDA for Nvidia

In principle, only very little code should differ significantly between CUDA and HIP, and it can easily be ifdef'ed in the same way as is now done for CUDA vs C++.

valassi commented 2 years ago

Related to #70 about using abstraction layers (or other native implementations such as HIP)

valassi commented 1 year ago

Hi @Jooorgen (cc @roiser @oliviermattelaer) I thought a bit about our HIP discussions last week and I did a couple of cross checks, so I will report some findings here.

My point is that (unless I am mistaken) there is really very little cuda-specific code in our generated code. By design and from the start, I tried to keep this as encapsulated as possible, precisely so that we could eventually replace it easily with HIP for AMD GPUs (or even SYCL for Intel GPUs).

Therefore the amount of ifdeffing we need to do to switch from cuda to hip is really minimal. We can either do that using the single header you described (which would probably be very small), or in some cases maybe even with direct ifdefs in the code. For instance, for things like cudaMalloc/hipMalloc we may want a header, but for random numbers or complex types it may be clearer to ifdef directly in the single file responsible? To be discussed...
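
To make the single-header option concrete, a minimal sketch of such a wrapper could look like the following (the generic gpu* names, and the use of __HIPCC__ to detect the HIP compiler, are just illustrative assumptions here, not a final choice):

#if defined __CUDACC__ // CUDA build (nvcc)
#include <cuda_runtime.h>
#define gpuError_t cudaError_t
#define gpuSuccess cudaSuccess
#define gpuMalloc cudaMalloc
#define gpuFree cudaFree
#define gpuMemcpy cudaMemcpy
#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
#define gpuMemcpyDeviceToHost cudaMemcpyDeviceToHost
#elif defined __HIPCC__ // HIP build (hipcc)
#include <hip/hip_runtime.h>
#define gpuError_t hipError_t
#define gpuSuccess hipSuccess
#define gpuMalloc hipMalloc
#define gpuFree hipFree
#define gpuMemcpy hipMemcpy
#define gpuMemcpyHostToDevice hipMemcpyHostToDevice
#define gpuMemcpyDeviceToHost hipMemcpyDeviceToHost
#endif

With something like that in place, most of the calls listed in the greps below would stay identical apart from a cuda->gpu renaming.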

Anyway, as I mentioned at the dev meeting, I think that the main issue will be the handling of the various 'ifdef CUDACC' all over the code. The point is that in 90% of the cases those ifdefs really mean "if GPU implementation (as opposed to CPP/SIMD)" rather than "if CUDA implementation (as opposed to HIP)". See also #318 and #54, and the handling of the "mg5amcCpu" vs "mg5amcGpu" namespaces, where a lot of cleanup is needed. For this I imagine that we should have something like two alternative ifdef macros. Currently, for the CPP implementation we have MGONGPU_CPPSIMD (which is used in many places to indicate the SIMD vector length), while for the GPU we essentially have CUDACC, and the two are alternatives. Probably the best would be to rename all MGONGPU_xxx macros as MGONGPUCPP_xxx, to indicate that they relate to cudacpp generally (so MGONGPU_CPPSIMD becomes MGONGPUCPP_CPPSIMD), and then introduce MGONGPUCPP_GPUIMPL (GPUARCH?), which could be cuda or hip (or nvidia or amd). The idea is that every 'ifdef CUDACC' would become 'ifdef MGONGPUCPP_GPUIMPL', valid BOTH for cuda and hip, and then depending on its value one would choose cuda or hip. Comments?
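
As a very rough sketch of that proposal (nothing of this exists yet, the names are just the ones suggested above), mgOnGpuConfig.h could define the new macro from the compiler-provided ones, and the rest of the code would only test it:

#if defined __CUDACC__
#define MGONGPUCPP_GPUIMPL cuda // GPU implementation is cuda (nvidia)
#elif defined __HIPCC__
#define MGONGPUCPP_GPUIMPL hip // GPU implementation is hip (amd)
#endif

#ifdef MGONGPUCPP_GPUIMPL
// common GPU code path (valid for both cuda and hip)
#else
// C++/SIMD code path (MGONGPUCPP_CPPSIMD would give the SIMD vector length)
#endif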

About the actual occurrences of cuda-specific things, this is a grep (I may be forgetting something??)

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> \grep -l cuda . -r | egrep -v '\.(o|exe|txt|so)'
./src/cudacpp_src.mk
./src/mgOnGpuCxtypes.h
./src/mgOnGpuConfig.h
./src/mgOnGpuFptypes.h
./SubProcesses/MemoryAccessMatrixElements.h
./SubProcesses/RandomNumberKernels.cc
./SubProcesses/MemoryBuffers.h
./SubProcesses/profile.sh
./SubProcesses/MatrixElementKernels.cc
./SubProcesses/fbridge.inc
./SubProcesses/MemoryAccessGs.h
./SubProcesses/runTest.cc
./SubProcesses/cudacpp.mk
./SubProcesses/Bridge.h
./SubProcesses/CudaRuntime.h
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> \grep -l '<<<' . -r | egrep -v '\.(o|exe|txt)'
./SubProcesses/MatrixElementKernels.cc
./SubProcesses/Bridge.h
./SubProcesses/RamboSamplingKernels.cc

But also

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> grep -l  -i  thrust . -r --exclude='*.txt'  --exclude='*.o'  --exclude='*.exe'  --exclude='*.so'
./src/mgOnGpuCxtypes.h
./src/mgOnGpuConfig.h
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> grep -l  -i  cucomplex . -r --exclude='*.txt'  --exclude='*.o'  --exclude='*.exe'  --exclude='*.so'
./src/mgOnGpuCxtypes.h
./src/mgOnGpuConfig.h
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> grep -l  -i  curand . -r --exclude='*.txt'  --exclude='*.o'  --exclude='*.exe'  --exclude='*.so'
./src/cudacpp_src.mk
./src/mgOnGpuConfig.h
./SubProcesses/RandomNumberKernels.h
./SubProcesses/RandomNumberKernels.cc
./SubProcesses/CrossSectionKernels.cc
./SubProcesses/cudacpp.mk
./SubProcesses/fsampler.cc
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc

And then specifically for '<<<'

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> grep '<<<' . -r --exclude='*.txt'  --exclude='*.o'  --exclude='*.exe'  --exclude='*.so'
./SubProcesses/MatrixElementKernels.cc:    computeDependentCouplings<<<m_gpublocks, m_gputhreads>>>( m_gs.data(), m_couplings.data() );
./SubProcesses/MatrixElementKernels.cc:    sigmaKin_getGoodHel<<<m_gpublocks, m_gputhreads>>>( m_momenta.data(), m_couplings.data(), m_matrixElements.data(), m_numerators.data(), m_denominators.data(), devIsGoodHel.data() );
./SubProcesses/MatrixElementKernels.cc:    sigmaKin_getGoodHel<<<m_gpublocks, m_gputhreads>>>( m_momenta.data(), m_couplings.data(), m_matrixElements.data(), devIsGoodHel.data() );
./SubProcesses/MatrixElementKernels.cc:    computeDependentCouplings<<<m_gpublocks, m_gputhreads>>>( m_gs.data(), m_couplings.data() );
./SubProcesses/MatrixElementKernels.cc:    sigmaKin<<<m_gpublocks, m_gputhreads, sharedMemSize>>>( m_momenta.data(), m_couplings.data(), m_rndhel.data(), m_rndcol.data(), m_matrixElements.data(), channelId, m_numerators.data(), m_denominators.data(), m_selhel.data(), m_selcol.data() );
./SubProcesses/MatrixElementKernels.cc:    sigmaKin<<<m_gpublocks, m_gputhreads, sharedMemSize>>>( m_momenta.data(), m_couplings.data(), m_rndhel.data(), m_rndcol.data(), m_matrixElements.data(), m_selhel.data(), m_selcol.data() );
./SubProcesses/Bridge.h:      dev_transposeMomentaF2C<<<m_gpublocks * thrPerEvt, m_gputhreads>>>( m_devMomentaF.data(), m_devMomentaC.data(), m_nevt );
./SubProcesses/RamboSamplingKernels.cc:    getMomentaInitialDevice<<<m_gpublocks, m_gputhreads>>>( m_energy, m_momenta.data() );
./SubProcesses/RamboSamplingKernels.cc:    getMomentaFinalDevice<<<m_gpublocks, m_gputhreads>>>( m_energy, m_rndmom.data(), m_momenta.data(), m_weights.data() );

And then specifically for cuda

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa> grep cuda . -r --exclude='*.txt'  --exclude='*.o'  --exclude='*.exe'  --exclude='*.so'
./src/cudacpp_src.mk:#=== NB: assume that the same name (e.g. cudacpp.mk, Makefile...) is used in the Subprocess and src directories
./src/mgOnGpuCxtypes.h:// Complex type in cuda: thrust or cucomplex or cxsmpl
./src/mgOnGpuCxtypes.h:#ifdef __CUDACC__ // cuda
./src/mgOnGpuCxtypes.h:#if defined __CUDACC__ and defined MGONGPU_CUCXTYPE_THRUST // cuda + thrust
./src/mgOnGpuCxtypes.h:#if defined __CUDACC__ and defined MGONGPU_CUCXTYPE_CUCOMPLEX // cuda + cucomplex
./src/mgOnGpuCxtypes.h:#if defined MGONGPU_FPTYPE_DOUBLE // cuda + cucomplex + double
./src/mgOnGpuCxtypes.h:#elif defined MGONGPU_FPTYPE_FLOAT // cuda + cucomplex + float
./src/mgOnGpuConfig.h:// There are two different code bases for standalone_cudacpp (without multichannel) and madevent+cudacpp (with multichannel)
./src/mgOnGpuConfig.h:// Complex type in cuda: thrust or cucomplex or cxsmpl (CHOOSE ONLY ONE)
./src/mgOnGpuConfig.h:// SANITY CHECKS (cuda complex number implementation)
./src/mgOnGpuFptypes.h:#ifdef __CUDACC__ // cuda
./src/mgOnGpuFptypes.h:  // See https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__SINGLE.html
./src/mgOnGpuFptypes.h:  // See https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__DOUBLE.html
./SubProcesses/MemoryAccessMatrixElements.h:    KernelAccessHelper<MemoryAccessMatrixElementsBase, onDevice>::template kernelAccessField<>; // requires cuda 11.4
./SubProcesses/MemoryAccessMatrixElements.h:    KernelAccessHelper<MemoryAccessMatrixElementsBase, onDevice>::template kernelAccessFieldConst<>; // requires cuda 11.4
./SubProcesses/RandomNumberKernels.cc:    // [NB Timings are for GenRnGen host|device (cpp|cuda) generation of 256*32*1 events with nproc=1: rn(0) is host=0.0012s]
./SubProcesses/RandomNumberKernels.cc:      checkCuda( cudaMemcpy( data, m_rnarray.data(), m_rnarray.bytes(), cudaMemcpyDeviceToHost ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaMallocHost( &( this->m_data ), this->bytes() ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaFreeHost( this->m_data ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaMalloc( &( this->m_data ), this->bytes() ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaFree( this->m_data ) );
./SubProcesses/MemoryBuffers.h:    // NB (PR #45): cudaMemcpy involves an intermediate memcpy to pinned memory if host array is a not a pinned host array
./SubProcesses/MemoryBuffers.h:    checkCuda( cudaMemcpy( dst.data(), src.data(), src.bytes(), cudaMemcpyHostToDevice ) );
./SubProcesses/MemoryBuffers.h:    // NB (PR #45): cudaMemcpy involves an intermediate memcpy to pinned memory if host array is a not a pinned host array
./SubProcesses/MemoryBuffers.h:    checkCuda( cudaMemcpy( dst.data(), src.data(), src.bytes(), cudaMemcpyDeviceToHost ) );
./SubProcesses/profile.sh:###ccargs="  256 32 12" # Similar to cuda config, but faster than using "16384 32 12"
./SubProcesses/profile.sh:  # Profile C++ instead of cuda
./SubProcesses/profile.sh:###  ncu=/usr/local/cuda-11.0/bin/ncu
./SubProcesses/profile.sh:###  ###nsys=/usr/local/cuda-10.1/bin/nsys
./SubProcesses/profile.sh:###  ###nsys=/usr/local/cuda-10.2/bin/nsys
./SubProcesses/profile.sh:###  nsys=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.0RC-d9c38/x86_64-centos7-gcc62-opt/bin/nsys
./SubProcesses/profile.sh:  CUDA_NSIGHT_HOME=/usr/local/cuda-11.1
./SubProcesses/MatrixElementKernels.cc:    checkCuda( cudaPeekAtLastError() );
./SubProcesses/MatrixElementKernels.cc:    checkCuda( cudaPeekAtLastError() );
./SubProcesses/MatrixElementKernels.cc:    checkCuda( cudaDeviceSynchronize() );
./SubProcesses/fbridge.inc:C - NTOTHEL:  the output total number of helicities in cudacpp (aka NCOMB in Fortran)
./SubProcesses/MemoryAccessGs.h:    KernelAccessHelper<MemoryAccessGsBase, onDevice>::template kernelAccessField<>; // requires cuda 11.4
./SubProcesses/MemoryAccessGs.h:    KernelAccessHelper<MemoryAccessGsBase, onDevice>::template kernelAccessFieldConst<>; // requires cuda 11.4
./SubProcesses/runTest.cc:      checkCuda( cudaDeviceReset() ); // this is needed by cuda-memcheck --leak-check full
./SubProcesses/cudacpp.mk:#=== NB: different names (e.g. cudacpp.mk and cudacpp_src.mk) are used in the Subprocess and src directories
./SubProcesses/cudacpp.mk:CUDACPP_SRC_MAKEFILE = cudacpp_src.mk
./SubProcesses/cudacpp.mk:  # See https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html
./SubProcesses/cudacpp.mk:  CULIBFLAGS  = -L$(CUDA_HOME)/lib64/ -lcurand # NB: -lcuda is not needed here!
./SubProcesses/cudacpp.mk:  # If REQUIRE_CUDA is set but no cuda is found, stop here (e.g. for CI tests on GPU #443)
./SubProcesses/cudacpp.mk:  $(error No cuda installation found (set CUDA_HOME or make nvcc visible in PATH))
./SubProcesses/cudacpp.mk:  # No cuda. Switch cuda compilation off and go to common random numbers in C++
./SubProcesses/cudacpp.mk:  $(warning CUDA_HOME is not set or is invalid: export CUDA_HOME to compile with cuda)
./SubProcesses/cudacpp.mk:  # RPATH to cuda/cpp libs when linking executables
./SubProcesses/cudacpp.mk:  # RPATH to common lib when linking cuda/cpp libs
./SubProcesses/cudacpp.mk:MG5AMC_CULIB = mg5amc_$(processid_short)_cuda
./SubProcesses/cudacpp.mk:###$(testmain): LIBFLAGS += ???? # OMP is not supported yet by cudacpp for Apple clang (see #578 and #604)
./SubProcesses/cudacpp.mk:      $(NVCC) -o $@ $(cxx_objects_lib) $(cxx_objects_exe) $(cu_objects_lib) $(cu_objects_exe) -ldl $(LIBFLAGS) -lcuda $(CULIBFLAGS)
./SubProcesses/cudacpp.mk:# Target: memcheck (run the CUDA standalone executable gcheck.exe with a small number of events through cud-memcheck)
./SubProcesses/cudacpp.mk:      $(RUNTIME) $(CUDA_HOME)/bin/cuda-memcheck --check-api-memory-access yes --check-deprecated-instr yes --check-device-heap yes --demangle full --language c --leak-check full --racecheck-report all --report-api-errors all --show-backtrace yes --tool memcheck --track-unused-memory yes $(BUILDDIR)/gcheck.exe -p 2 32 2
./SubProcesses/Bridge.h:    // Return the total number of helicities (expose cudacpp ncomb in the Bridge interface to Fortran)
./SubProcesses/Bridge.h:      checkCuda( cudaMemcpy( m_devMomentaC.data(), momenta, m_devMomentaC.bytes(), cudaMemcpyHostToDevice ) );
./SubProcesses/Bridge.h:      checkCuda( cudaMemcpy( m_devMomentaF.data(), momenta, m_devMomentaF.bytes(), cudaMemcpyHostToDevice ) );
./SubProcesses/Bridge.h:        if constexpr( F2C )          // needs c++17 and cuda >=11.2 (#333)
./SubProcesses/CudaRuntime.h:// See https://docs.nvidia.com/cuda/cuda-runtime-api/driver-vs-runtime-api.html#driver-vs-runtime-api
./SubProcesses/CudaRuntime.h:inline void assertCuda( cudaError_t code, const char* file, int line, bool abort = true )
./SubProcesses/CudaRuntime.h:  if( code != cudaSuccess )
./SubProcesses/CudaRuntime.h:    printf( "ERROR! assertCuda: '%s' (%d) in %s:%d\n", cudaGetErrorString( code ), code, file, line );
./SubProcesses/CudaRuntime.h:    if( abort ) assert( code == cudaSuccess );
./SubProcesses/CudaRuntime.h:  // invoke cudaSetDevice(0) in the constructor and book a cudaDeviceReset() call in the destructor
./SubProcesses/CudaRuntime.h:    // Calling cudaSetDevice on startup is useful to properly book-keep the time spent in CUDA initialization
./SubProcesses/CudaRuntime.h:      // ** NB: it is useful to call cudaSetDevice, or cudaFree, to properly book-keep the time spent in CUDA initialization
./SubProcesses/CudaRuntime.h:      // ** NB: otherwise, the first CUDA operation (eg a cudaMemcpyToSymbol in CPPProcess ctor) appears to take much longer!
./SubProcesses/CudaRuntime.h:      // [We initially added cudaFree(0) to "ease profile analysis" only because it shows up as a big recognizable block!]
./SubProcesses/CudaRuntime.h:      // No explicit initialization is needed: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
./SubProcesses/CudaRuntime.h:      // It is not clear what cudaFree(0) does at all: https://stackoverflow.com/questions/69967813/
./SubProcesses/CudaRuntime.h:      if ( debug ) std::cout << "__CudaRuntime: calling cudaFree(0)" << std::endl;
./SubProcesses/CudaRuntime.h:      checkCuda( cudaFree( 0 ) ); // SLOW!
./SubProcesses/CudaRuntime.h:      // Replace cudaFree(0) by cudaSetDevice(0), even if it is not really needed either
./SubProcesses/CudaRuntime.h:      // (but see https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs)
./SubProcesses/CudaRuntime.h:      if( debug ) std::cout << "__CudaRuntime: calling cudaSetDevice(0)" << std::endl;
./SubProcesses/CudaRuntime.h:      checkCuda( cudaSetDevice( 0 ) ); // SLOW!
./SubProcesses/CudaRuntime.h:    // Tear down CUDA application (call cudaDeviceReset)
./SubProcesses/CudaRuntime.h:    // Calling cudaDeviceReset on shutdown is only needed for checking memory leaks in cuda-memcheck
./SubProcesses/CudaRuntime.h:    // See https://docs.nvidia.com/cuda/cuda-memcheck/index.html#leak-checking
./SubProcesses/CudaRuntime.h:      if( debug ) std::cout << "__CudaRuntime: calling cudaDeviceReset()" << std::endl;
./SubProcesses/CudaRuntime.h:      checkCuda( cudaDeviceReset() );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cHel, tHel, ncomb * npar * sizeof( short ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cIPD, tIPD, 2 * sizeof( fptype ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    //checkCuda( cudaMemcpyToSymbol( cIPC, tIPC, 0 * sizeof( cxtype ) ) ); // nicoup=0
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:                             , const int nevt     // input: #events (for cuda: nevt == ndim == gpublocks*gputhreads)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:                       const int nevt )            // input: #events (for cuda: nevt == ndim == gpublocks*gputhreads)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cNGoodHel, &nGoodHel, sizeof( int ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cGoodHel, goodHel, ncomb * sizeof( int ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:            , const int nevt               // input: #events (for cuda: nevt == ndim == gpublocks*gputhreads)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    // SANITY CHECKS for cudacpp code generation (see issues #272 and #343 and PRs #619, #626, #360 and #396)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:      // process_id corresponds to the index of DSIG1 Fortran functions (must be 1 because cudacpp is unable to handle DSIG2)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:      constexpr int process_id = 1; // code generation source: standalone_cudacpp
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:        const int ihelF = cGoodHel[ighel] + 1; // NB Fortran [1,ncomb], cudacpp [0,ncomb-1]
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:        allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:            const int ihelF = cGoodHel[ighel] + 1; // NB Fortran [1,ncomb], cudacpp [0,ncomb-1]
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:            const int ihelF = cGoodHel[ighel] + 1; // NB Fortran [1,ncomb], cudacpp [0,ncomb-1]
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:            allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:            allselcol[ievt2] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h:                             const int nevt );     // input: #events (for cuda: nevt == ndim == gpublocks*gputhreads)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h:                       const int nevt );           // input: #events (for cuda: nevt == ndim == gpublocks*gputhreads)
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.h:            const int nevt );              // input: #events (for cuda: nevt == ndim == gpublocks*gputhreads)
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc:  // --- 00. Initialise cuda
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc:  // invoke cudaSetDevice(0) in the constructor and book a cudaDeviceReset() call in the destructor
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc:  CudaRuntime cudaRuntime( debug );

However, among the cuda occurrences above, the only things that are not comments are:

./SubProcesses/RandomNumberKernels.cc:      checkCuda( cudaMemcpy( data, m_rnarray.data(), m_rnarray.bytes(), cudaMemcpyDeviceToHost ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaMallocHost( &( this->m_data ), this->bytes() ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaFreeHost( this->m_data ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaMalloc( &( this->m_data ), this->bytes() ) );
./SubProcesses/MemoryBuffers.h:      checkCuda( cudaFree( this->m_data ) );
./SubProcesses/MemoryBuffers.h:    checkCuda( cudaMemcpy( dst.data(), src.data(), src.bytes(), cudaMemcpyHostToDevice ) );
./SubProcesses/MemoryBuffers.h:    checkCuda( cudaMemcpy( dst.data(), src.data(), src.bytes(), cudaMemcpyDeviceToHost ) );
./SubProcesses/profile.sh:###  ncu=/usr/local/cuda-11.0/bin/ncu
./SubProcesses/profile.sh:###  nsys=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.0RC-d9c38/x86_64-centos7-gcc62-opt/bin/nsys
./SubProcesses/profile.sh:  CUDA_NSIGHT_HOME=/usr/local/cuda-11.1
./SubProcesses/MatrixElementKernels.cc:    checkCuda( cudaPeekAtLastError() );
./SubProcesses/MatrixElementKernels.cc:    checkCuda( cudaPeekAtLastError() );
./SubProcesses/MatrixElementKernels.cc:    checkCuda( cudaDeviceSynchronize() );
./SubProcesses/runTest.cc:      checkCuda( cudaDeviceReset() ); // this is needed by cuda-memcheck --leak-check full
./SubProcesses/cudacpp.mk:  CULIBFLAGS  = -L$(CUDA_HOME)/lib64/ -lcurand # NB: -lcuda is not needed here!
./SubProcesses/cudacpp.mk:MG5AMC_CULIB = mg5amc_$(processid_short)_cuda
./SubProcesses/cudacpp.mk:      $(NVCC) -o $@ $(cxx_objects_lib) $(cxx_objects_exe) $(cu_objects_lib) $(cu_objects_exe) -ldl $(LIBFLAGS) -lcuda $(CULIBFLAGS)
./SubProcesses/cudacpp.mk:      $(RUNTIME) $(CUDA_HOME)/bin/cuda-memcheck --check-api-memory-access yes --check-deprecated-instr yes --check-device-heap yes --demangle full --language c --leak-check full --racecheck-report all --report-api-errors all --show-backtrace yes --tool memcheck --track-unused-memory yes $(BUILDDIR)/gcheck.exe -p 2 32 2
./SubProcesses/Bridge.h:      checkCuda( cudaMemcpy( m_devMomentaC.data(), momenta, m_devMomentaC.bytes(), cudaMemcpyHostToDevice ) );
./SubProcesses/Bridge.h:      checkCuda( cudaMemcpy( m_devMomentaF.data(), momenta, m_devMomentaF.bytes(), cudaMemcpyHostToDevice ) );
./SubProcesses/CudaRuntime.h:inline void assertCuda( cudaError_t code, const char* file, int line, bool abort = true )
./SubProcesses/CudaRuntime.h:  if( code != cudaSuccess )
./SubProcesses/CudaRuntime.h:    printf( "ERROR! assertCuda: '%s' (%d) in %s:%d\n", cudaGetErrorString( code ), code, file, line );
./SubProcesses/CudaRuntime.h:    if( abort ) assert( code == cudaSuccess );
./SubProcesses/CudaRuntime.h:      checkCuda( cudaFree( 0 ) ); // SLOW!
./SubProcesses/CudaRuntime.h:      checkCuda( cudaSetDevice( 0 ) ); // SLOW!
./SubProcesses/CudaRuntime.h:      checkCuda( cudaDeviceReset() );
./SubProcesses/CudaRuntime.h:...[much more]...
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cHel, tHel, ncomb * npar * sizeof( short ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cIPD, tIPD, 2 * sizeof( fptype ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cNGoodHel, &nGoodHel, sizeof( int ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/CPPProcess.cc:    checkCuda( cudaMemcpyToSymbol( cGoodHel, goodHel, ncomb * sizeof( int ) ) );
./SubProcesses/P1_Sigma_sm_gg_ttx/check_sa.cc:  CudaRuntime cudaRuntime( debug );

The above is quite little. Again, it is not clear whether it makes more sense to have a single header and write the above code in terms of "gpuMalloc" instead of "cudaMalloc" or "hipMalloc", or whether it is better to just ifdef them directly in these files. For things like CudaRuntime.h I would say it is better to have an equivalent HipRuntime.h, maybe?
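
For instance, a HipRuntime.h mirroring the assertCuda/checkCuda helpers quoted above might look roughly like this (assertHip/checkHip are hypothetical names, only meant to illustrate that the mapping is one-to-one):

#include <hip/hip_runtime.h>
#include <cassert>
#include <cstdio>
inline void assertHip( hipError_t code, const char* file, int line, bool abort = true )
{
  if( code != hipSuccess )
  {
    printf( "ERROR! assertHip: '%s' (%d) in %s:%d\n", hipGetErrorString( code ), code, file, line );
    if( abort ) assert( code == hipSuccess );
  }
}
#define checkHip( code ) { assertHip( code, __FILE__, __LINE__ ); }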

Then the stuff for thrust and cucomplex is so encapsulated that I would keep it there rather than in a separate header, especially because each case is a choice. By the way, the default complex class is now cxsmpl in cpp and thrust in cuda. But cxsmpl can also be used in cuda, so it could also be used on HIP if we do not like the AMD complex class.
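
Just to illustrate the structure (the __HIPCC__ branch below is an assumption, not existing code, and cxtype_gpu is only an illustrative name; cxsmpl is the existing portable class defined earlier in mgOnGpuCxtypes.h), the complex-type selection could be extended along these lines:

#if defined __CUDACC__ and defined MGONGPU_CUCXTYPE_THRUST // cuda + thrust (current gpu default)
#include <thrust/complex.h>
typedef thrust::complex<double> cxtype_gpu;
#elif defined __HIPCC__ // hip: reuse the portable cxsmpl class, unless an AMD complex type is preferred
typedef mgOnGpu::cxsmpl<double> cxtype_gpu;
#else // c++: cxsmpl (current cpu default)
typedef mgOnGpu::cxsmpl<double> cxtype_gpu;
#endif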

Finally, about curand: similarly, I would leave this encapsulated, because each library is a choice. Thinking about it, the real issue here will be to try and make results strictly reproducible across GPUs. Now we have curandhost for cpp and curanddevice for nvidia gpus, and we get the same results. If we do not find a library giving us the same results as curand on amd gpus, it may be better to just use the cpp common random numbers everywhere. And note anyway that this is only for the standalone application, which is relatively low priority: for madevent we are still using whatever madevent uses (RANMAR, I think), on the cpu.
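
If a curand-like AMD library does turn out to give acceptable (ideally identical) sequences, the mapping would again be tiny; a hedged sketch, assuming hipRAND as the candidate and generic gpurand* aliases (the header path may depend on the ROCm installation):

#if defined __CUDACC__
#include <curand.h>
#define gpurandGenerator_t curandGenerator_t
#define gpurandCreateGenerator curandCreateGenerator
#elif defined __HIPCC__
#include <hiprand/hiprand.h>
#define gpurandGenerator_t hiprandGenerator_t
#define gpurandCreateGenerator hiprandCreateGenerator
#endif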

Some food for thought, we should have a chat then :-) Andrea

valassi commented 9 months ago

I have linked this to PR #801, which will finally close this issue. It includes a few additional fixes, code generation and tests that I added on top of Jorgen's earlier PR #774.

@nscottnichols this is where the preliminary discussion behind Jorgen's AMD work took place. It may be useful for the Intel GPU task #805.

valassi commented 9 months ago

I am marking this as closed because it will be completed by the upcoming PR #801.