madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Reduce build times in CUDA and C++ for complex processes (split kernels and more) #348

Open valassi opened 2 years ago

valassi commented 2 years ago

This is a followup of #346: on ggttggg, build times are clearly becoming very long again (20 minutes or more, mainly in CUDA, although the situation also looks bad in clang/C++).

The issue is clearly related to inlining of FFV functions (hence to their templating in PR #328) and more generally to LTO/RDC/inlining optimizations over very large code bases (#229 et al).

Removing inlining by hand is an option, but the small tests I have done in the past showed a really bad impact on performance.

The only viable solution is most likely splitting kernels (#310), not only for CUDA but also for C++. Once we have more than 1000 Feynman diagrams as in ggttggg, it makes no sense to do any optimizations across a single calculate_wavefunctions method with O(1k-10k) FFV calls. It looks better, even just for C++ and for build times, to split this into O(1k) functions, one per diagram.
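To make the "one function per diagram" idea concrete, here is a minimal C++ sketch. All names and signatures in it (`cxtype`, `JampArray`, `diagram1`..`diagram3`, `calculate_jamps`) are hypothetical stand-ins and not the actual generated code, and the arithmetic inside each diagram is a toy placeholder for the real FFV calls. Real diagrams also share wavefunctions, which this sketch ignores.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical stand-ins, NOT the actual madgraph4gpu types or signatures.
using cxtype = std::complex<double>;
constexpr std::size_t ncolor = 2;             // toy number of colour flows
using JampArray = std::array<cxtype, ncolor>; // partial amplitudes, one per colour flow

// One small function per Feynman diagram. In generated code each of these could
// live in its own translation unit, so that the optimizer only ever sees the few
// FFV calls of a single diagram. The bodies below are toy placeholders.
void diagram1( const double* momenta, JampArray& jamp )
{
  jamp[0] += cxtype( momenta[0], 0. );
}

void diagram2( const double* momenta, JampArray& jamp )
{
  jamp[1] += cxtype( 0., momenta[1] );
}

void diagram3( const double* momenta, JampArray& jamp )
{
  jamp[0] -= cxtype( momenta[1], 0. );
  jamp[1] += cxtype( momenta[0], 0. );
}

// Table of all diagrams: a code generator would emit O(1k) entries here.
using DiagramFn = void ( * )( const double*, JampArray& );
constexpr std::array<DiagramFn, 3> diagrams = { diagram1, diagram2, diagram3 };

// Stand-in for a monolithic calculate_wavefunctions: instead of one huge
// function body, loop over the per-diagram functions.
JampArray calculate_jamps( const double* momenta )
{
  assert( momenta != nullptr );
  JampArray jamp{}; // zero-initialised
  for( DiagramFn f : diagrams ) f( momenta, jamp );
  return jamp;
}
```

Calling `calculate_jamps` with a two-element momenta array runs the three toy diagrams in sequence. The price of this design is one indirect call per diagram, so its effect on throughput would need to be measured against the gain in build time.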

valassi commented 2 years ago

Note one comment from Olivier: ggttggg is already close to the limit of what MG can handle. The problem is that the color matrix becomes far too big beyond this level.

It would be interesting to check, actually, whether build times are related to the color matrix or to the FFV computations. When these are two different kernels, maybe we can put them in two different files?
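As an illustration of that separation, the color sum could be isolated in a function of its own that only consumes the partial amplitudes, so that it can be compiled (and timed) independently of the FFV code. The sketch below is only an assumption about how this might look: the names `color_sum`, `colorMatrix` and `denom`, the file name in the comment, and the 2x2 matrix values are toy placeholders, not the ggttggg color matrix.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical stand-ins, NOT the actual madgraph4gpu types or signatures.
using cxtype = std::complex<double>;
constexpr std::size_t ncolor = 2;
using JampArray = std::array<cxtype, ncolor>;

// --- would live in its own file, e.g. a hypothetical color_sum.cc ---
// Toy 2x2 colour matrix and common denominator (placeholder values only).
constexpr double colorMatrix[ncolor][ncolor] = { { 16., -2. }, { -2., 16. } };
constexpr double denom = 3.;

// Quadratic form sum_ij conj(jamp_i) * colorMatrix[i][j] * jamp_j / denom.
// It depends only on the jamps, not on how the FFV code produced them.
double color_sum( const JampArray& jamp )
{
  assert( denom > 0. );
  double me = 0.;
  for( std::size_t i = 0; i < ncolor; i++ )
  {
    cxtype ztemp( 0., 0. );
    for( std::size_t j = 0; j < ncolor; j++ ) ztemp += colorMatrix[i][j] * jamp[j];
    me += ( ztemp * std::conj( jamp[i] ) ).real() / denom;
  }
  return me;
}
```

With the FFV part in one file and a function like this in another, the compile time of each file can be compared directly to see which of the two dominates.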

valassi commented 2 years ago

Note the interesting situation with alpaka builds, where cuda builds essentially happen twice, once in cuda and once in alpaka.

[avalassi@itscrd70 bash] ~> ps -ef | grep valassi
...
avalassi 20870 14163  0 13:15 pts/3    00:00:00 make AVX=none USEBUILDDIR=1 -j

avalassi 20899 20870  0 13:15 pts/3    00:00:00 ccache /usr/local/cuda-11.6/bin/nvcc -O3 -lineinfo -std=c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch=compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -c gCPPProcess.cu -o build.none_d_inl0/gCPPProcess.o

avalassi 20904 20870  0 13:15 pts/3    00:00:00 ccache /usr/local/cuda-11.6/bin/nvcc -L/usr/local/cuda-11.6/lib64/ -lcurand -L../../lib/build.none_d_inl0 -lmodel_sm -O3 -lineinfo -std=c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch=compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -DALPAKA -DALPAKA_ACC_GPU_CUDA_ENABLED --expt-relaxed-constexpr -I/data/avalassi/GPU2020/ALPAKA/alpaka/include -I/cvmfs/sft.cern.ch/lcg/releases/LCG_101/Boost/1.77.0/x86_64-centos7-gcc10-opt/include -I/data/avalassi/GPU2020/CUPLA/cupla/include -dc gCPPProcess.cu -o build.none_d_inl0/alpCPPProcess.o

avalassi 21009 20899  0 13:15 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -O3 -lineinfo -std c++14 -arch compute_70 -use_fast_math -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -c -o build.none_d_inl0/gCPPProcess.o gCPPProcess.cu

avalassi 21022 21009 95 13:15 pts/3    00:05:27 cicc --c++14 --gnu_version=100200 --display_error_number --orig_src_file_name gCPPProcess.cu --orig_src_path_name /data/avalassi/GPU2020/madgraph4gpuX/epochX/alpaka/gg_ttggg.auto/SubProcesses/P1_Sigma_sm_gg_ttxggg/CPPProcess.cc --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=1 -prec_div=0 -prec_sqrt=0 -fmad=1 -fast-math --gen_div_approx_ftz --include_file_name tmpxft_00005211_00000000-3_gCPPProcess.fatbin.c -generate-line-info -tused --gen_module_id_file --module_id_file_name /tmp/avalassi/tmpxft_00005211_00000000-4_gCPPProcess.module_id --gen_c_file_name /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.cudafe1.c --stub_file_name /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.cudafe1.stub.c --gen_device_file_name /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.cudafe1.gpu /tmp/avalassi/tmpxft_00005211_00000000-7_gCPPProcess.cpp1.ii -o /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.ptx

avalassi 21061 20904  0 13:15 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -L/usr/local/cuda-11.6/lib64/ -lcurand -L../../lib/build.none_d_inl0 -lmodel_sm -O3 -lineinfo -std c++14 -arch compute_70 -use_fast_math --expt-relaxed-constexpr -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -DALPAKA -DALPAKA_ACC_GPU_CUDA_ENABLED -I/data/avalassi/GPU2020/ALPAKA/alpaka/include -I/cvmfs/sft.cern.ch/lcg/releases/LCG_101/Boost/1.77.0/x86_64-centos7-gcc10-opt/include -I/data/avalassi/GPU2020/CUPLA/cupla/include -dc -o build.none_d_inl0/alpCPPProcess.o gCPPProcess.cu

avalassi 21076 21061 95 13:15 pts/3    00:05:24 cicc --c++14 --gnu_version=100200 --display_error_number --orig_src_file_name gCPPProcess.cu --orig_src_path_name /data/avalassi/GPU2020/madgraph4gpuX/epochX/alpaka/gg_ttggg.auto/SubProcesses/P1_Sigma_sm_gg_ttxggg/CPPProcess.cc --allow_managed --relaxed_constexpr --device-c -arch compute_70 -m64 --no-version-ident -ftz=1 -prec_div=0 -prec_sqrt=0 -fmad=1 -fast-math --gen_div_approx_ftz --include_file_name tmpxft_00005245_00000000-3_gCPPProcess.fatbin.c -generate-line-info -tused --gen_module_id_file --module_id_file_name /tmp/avalassi/tmpxft_00005245_00000000-4_gCPPProcess.module_id --gen_c_file_name /tmp/avalassi/tmpxft_00005245_00000000-6_gCPPProcess.cudafe1.c --stub_file_name /tmp/avalassi/tmpxft_00005245_00000000-6_gCPPProcess.cudafe1.stub.c --gen_device_file_name /tmp/avalassi/tmpxft_00005245_00000000-6_gCPPProcess.cudafe1.gpu /tmp/avalassi/tmpxft_00005245_00000000-7_gCPPProcess.cpp1.ii -o /tmp/avalassi/tmpxft_00005245_00000000-6_gCPPProcess.ptx

Note also that here the C++ build has already succeeded, so it is really nvcc/cicc that takes 10-20 minutes...

(This is an "export CCACHE_RECACHE=true" rebuild, but this should not make a difference.)

valassi commented 2 years ago

And later on in the same build, nvlink seems to take even longer...

avalassi 20870 14163  0 13:15 pts/3    00:00:00 make AVX=none USEBUILDDIR=1 -j

avalassi 20899 20870  0 13:15 pts/3    00:00:00 ccache /usr/local/cuda-11.6/bin/nvcc -O3 -lineinfo -std=c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch=compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -c gCPPProcess.cu -o build.none_d_inl0/gCPPProcess.o

avalassi 21009 20899  0 13:15 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -O3 -lineinfo -std c++14 -arch compute_70 -use_fast_math -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -c -o build.none_d_inl0/gCPPProcess.o gCPPProcess.cu

avalassi 21022 21009 98 13:15 pts/3    00:14:36 cicc --c++14 --gnu_version=100200 --display_error_number --orig_src_file_name gCPPProcess.cu --orig_src_path_name /data/avalassi/GPU2020/madgraph4gpuX/epochX/alpaka/gg_ttggg.auto/SubProcesses/P1_Sigma_sm_gg_ttxggg/CPPProcess.cc --allow_managed -arch compute_70 -m64 --no-version-ident -ftz=1 -prec_div=0 -prec_sqrt=0 -fmad=1 -fast-math --gen_div_approx_ftz --include_file_name tmpxft_00005211_00000000-3_gCPPProcess.fatbin.c -generate-line-info -tused --gen_module_id_file --module_id_file_name /tmp/avalassi/tmpxft_00005211_00000000-4_gCPPProcess.module_id --gen_c_file_name /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.cudafe1.c --stub_file_name /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.cudafe1.stub.c --gen_device_file_name /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.cudafe1.gpu /tmp/avalassi/tmpxft_00005211_00000000-7_gCPPProcess.cpp1.ii -o /tmp/avalassi/tmpxft_00005211_00000000-6_gCPPProcess.ptx

avalassi 22380 20870  0 13:28 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc build.none_d_inl0/alpcheck_sa.o -o build.none_d_inl0/alpcheck.exe build.none_d_inl0/alpCPPProcess.o build.none_d_inl0/cupla/common.o build.none_d_inl0/cupla/device.o build.none_d_inl0/cupla/event.o build.none_d_inl0/cupla/memory.o build.none_d_inl0/cupla/stream.o build.none_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -L../../lib/build.none_d_inl0 -lmodel_sm -L/usr/local/cuda-11.6/lib64/ -lcurand

avalassi 22386 22380 99 13:28 pts/3    00:01:38 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_0000576c_00000000-3_alpcheck_dlink.reg.c -L../../lib/build.none_d_inl0 -L/usr/local/cuda-11.6/lib64/ -lmodel_sm -lcurand -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.none_d_inl0/alpcheck_sa.o build.none_d_inl0/alpCPPProcess.o build.none_d_inl0/cupla/common.o build.none_d_inl0/cupla/device.o build.none_d_inl0/cupla/event.o build.none_d_inl0/cupla/memory.o build.none_d_inl0/cupla/stream.o build.none_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 22419 20870  0 13:28 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -o build.none_d_inl0/runTest.exe build.none_d_inl0/CPPProcess.o build.none_d_inl0/runTest.o build.none_d_inl0/MadgraphTest.o build.none_d_inl0/testxxx.o build.none_d_inl0/alpCPPProcess.o build.none_d_inl0/runTest_alp.o build.none_d_inl0/cupla/common.o build.none_d_inl0/cupla/device.o build.none_d_inl0/cupla/event.o build.none_d_inl0/cupla/memory.o build.none_d_inl0/cupla/stream.o build.none_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -ldl -L../../lib/build.none_d_inl0 -lmodel_sm -L../../../../../test/googletest/build/lib/ -lgtest -lgtest_main -L/usr/local/cuda-11.6/lib64/ -lcurand -lcuda -lgomp

avalassi 22422 22419 99 13:28 pts/3    00:01:26 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_00005793_00000000-3_runTest_dlink.reg.c -L../../lib/build.none_d_inl0 -L../../../../../test/googletest/build/lib/ -L/usr/local/cuda-11.6/lib64/ -ldl -lmodel_sm -lgtest -lgtest_main -lcurand -lcuda -lgomp -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.none_d_inl0/CPPProcess.o build.none_d_inl0/runTest.o build.none_d_inl0/MadgraphTest.o build.none_d_inl0/testxxx.o build.none_d_inl0/alpCPPProcess.o build.none_d_inl0/runTest_alp.o build.none_d_inl0/cupla/common.o build.none_d_inl0/cupla/device.o build.none_d_inl0/cupla/event.o build.none_d_inl0/cupla/memory.o build.none_d_inl0/cupla/stream.o build.none_d_inl0/cupla/manager/Driver.o -lcudadevrt

valassi commented 2 years ago

And later on, a more complex case:

avalassi 25243 14163  0 13:40 pts/3    00:00:00 /bin/bash ./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -ggttgg -ggttggg
avalassi 25747 25243  0 13:40 pts/3    00:00:00 /bin/bash ./throughputX.sh -makeonly -makej -ggttggg -avxall
avalassi 25755 25747  0 13:40 pts/3    00:00:00 make -j avxall
avalassi 25781 25755  0 13:40 pts/3    00:00:00 make USEBUILDDIR=1 AVX=sse4
avalassi 25782 25755  0 13:40 pts/3    00:00:00 make USEBUILDDIR=1 AVX=avx2
avalassi 25787 25755  0 13:40 pts/3    00:00:00 make USEBUILDDIR=1 AVX=512y
avalassi 25798 25755  0 13:40 pts/3    00:00:00 make USEBUILDDIR=1 AVX=512z

avalassi 25999 25781  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc build.sse4_d_inl0/alpcheck_sa.o -o build.sse4_d_inl0/alpcheck.exe build.sse4_d_inl0/alpCPPProcess.o build.sse4_d_inl0/cupla/common.o build.sse4_d_inl0/cupla/device.o build.sse4_d_inl0/cupla/event.o build.sse4_d_inl0/cupla/memory.o build.sse4_d_inl0/cupla/stream.o build.sse4_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -L../../lib/build.sse4_d_inl0 -lmodel_sm -L/usr/local/cuda-11.6/lib64/ -lcurand

avalassi 26016 25798  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc build.512z_d_inl0/alpcheck_sa.o -o build.512z_d_inl0/alpcheck.exe build.512z_d_inl0/alpCPPProcess.o build.512z_d_inl0/cupla/common.o build.512z_d_inl0/cupla/device.o build.512z_d_inl0/cupla/event.o build.512z_d_inl0/cupla/memory.o build.512z_d_inl0/cupla/stream.o build.512z_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -L../../lib/build.512z_d_inl0 -lmodel_sm -L/usr/local/cuda-11.6/lib64/ -lcurand

avalassi 26018 25787  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc build.512y_d_inl0/alpcheck_sa.o -o build.512y_d_inl0/alpcheck.exe build.512y_d_inl0/alpCPPProcess.o build.512y_d_inl0/cupla/common.o build.512y_d_inl0/cupla/device.o build.512y_d_inl0/cupla/event.o build.512y_d_inl0/cupla/memory.o build.512y_d_inl0/cupla/stream.o build.512y_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -L../../lib/build.512y_d_inl0 -lmodel_sm -L/usr/local/cuda-11.6/lib64/ -lcurand

avalassi 26035 25782  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc build.avx2_d_inl0/alpcheck_sa.o -o build.avx2_d_inl0/alpcheck.exe build.avx2_d_inl0/alpCPPProcess.o build.avx2_d_inl0/cupla/common.o build.avx2_d_inl0/cupla/device.o build.avx2_d_inl0/cupla/event.o build.avx2_d_inl0/cupla/memory.o build.avx2_d_inl0/cupla/stream.o build.avx2_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -L../../lib/build.avx2_d_inl0 -lmodel_sm -L/usr/local/cuda-11.6/lib64/ -lcurand

avalassi 26039 26018 42 13:40 pts/3    00:02:01 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065a2_00000000-3_alpcheck_dlink.reg.c -L../../lib/build.512y_d_inl0 -L/usr/local/cuda-11.6/lib64/ -lmodel_sm -lcurand -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.512y_d_inl0/alpcheck_sa.o build.512y_d_inl0/alpCPPProcess.o build.512y_d_inl0/cupla/common.o build.512y_d_inl0/cupla/device.o build.512y_d_inl0/cupla/event.o build.512y_d_inl0/cupla/memory.o build.512y_d_inl0/cupla/stream.o build.512y_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26056 26016 43 13:40 pts/3    00:02:04 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065a0_00000000-3_alpcheck_dlink.reg.c -L../../lib/build.512z_d_inl0 -L/usr/local/cuda-11.6/lib64/ -lmodel_sm -lcurand -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.512z_d_inl0/alpcheck_sa.o build.512z_d_inl0/alpCPPProcess.o build.512z_d_inl0/cupla/common.o build.512z_d_inl0/cupla/device.o build.512z_d_inl0/cupla/event.o build.512z_d_inl0/cupla/memory.o build.512z_d_inl0/cupla/stream.o build.512z_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26057 26035 47 13:40 pts/3    00:02:15 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065b3_00000000-3_alpcheck_dlink.reg.c -L../../lib/build.avx2_d_inl0 -L/usr/local/cuda-11.6/lib64/ -lmodel_sm -lcurand -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.avx2_d_inl0/alpcheck_sa.o build.avx2_d_inl0/alpCPPProcess.o build.avx2_d_inl0/cupla/common.o build.avx2_d_inl0/cupla/device.o build.avx2_d_inl0/cupla/event.o build.avx2_d_inl0/cupla/memory.o build.avx2_d_inl0/cupla/stream.o build.avx2_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26061 25787  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -o build.512y_d_inl0/runTest.exe build.512y_d_inl0/CPPProcess.o build.512y_d_inl0/runTest.o build.512y_d_inl0/MadgraphTest.o build.512y_d_inl0/testxxx.o build.512y_d_inl0/alpCPPProcess.o build.512y_d_inl0/runTest_alp.o build.512y_d_inl0/cupla/common.o build.512y_d_inl0/cupla/device.o build.512y_d_inl0/cupla/event.o build.512y_d_inl0/cupla/memory.o build.512y_d_inl0/cupla/stream.o build.512y_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -ldl -L../../lib/build.512y_d_inl0 -lmodel_sm -L../../../../../test/googletest/build/lib/ -lgtest -lgtest_main -L/usr/local/cuda-11.6/lib64/ -lcurand -lcuda -lgomp

avalassi 26063 25999 48 13:40 pts/3    00:02:19 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_0000658f_00000000-3_alpcheck_dlink.reg.c -L../../lib/build.sse4_d_inl0 -L/usr/local/cuda-11.6/lib64/ -lmodel_sm -lcurand -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.sse4_d_inl0/alpcheck_sa.o build.sse4_d_inl0/alpCPPProcess.o build.sse4_d_inl0/cupla/common.o build.sse4_d_inl0/cupla/device.o build.sse4_d_inl0/cupla/event.o build.sse4_d_inl0/cupla/memory.o build.sse4_d_inl0/cupla/stream.o build.sse4_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26065 25798  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -o build.512z_d_inl0/runTest.exe build.512z_d_inl0/CPPProcess.o build.512z_d_inl0/runTest.o build.512z_d_inl0/MadgraphTest.o build.512z_d_inl0/testxxx.o build.512z_d_inl0/alpCPPProcess.o build.512z_d_inl0/runTest_alp.o build.512z_d_inl0/cupla/common.o build.512z_d_inl0/cupla/device.o build.512z_d_inl0/cupla/event.o build.512z_d_inl0/cupla/memory.o build.512z_d_inl0/cupla/stream.o build.512z_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -ldl -L../../lib/build.512z_d_inl0 -lmodel_sm -L../../../../../test/googletest/build/lib/ -lgtest -lgtest_main -L/usr/local/cuda-11.6/lib64/ -lcurand -lcuda -lgomp

avalassi 26066 25782  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -o build.avx2_d_inl0/runTest.exe build.avx2_d_inl0/CPPProcess.o build.avx2_d_inl0/runTest.o build.avx2_d_inl0/MadgraphTest.o build.avx2_d_inl0/testxxx.o build.avx2_d_inl0/alpCPPProcess.o build.avx2_d_inl0/runTest_alp.o build.avx2_d_inl0/cupla/common.o build.avx2_d_inl0/cupla/device.o build.avx2_d_inl0/cupla/event.o build.avx2_d_inl0/cupla/memory.o build.avx2_d_inl0/cupla/stream.o build.avx2_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -ldl -L../../lib/build.avx2_d_inl0 -lmodel_sm -L../../../../../test/googletest/build/lib/ -lgtest -lgtest_main -L/usr/local/cuda-11.6/lib64/ -lcurand -lcuda -lgomp

avalassi 26067 25781  0 13:40 pts/3    00:00:00 /usr/local/cuda-11.6/bin/nvcc -o build.sse4_d_inl0/runTest.exe build.sse4_d_inl0/CPPProcess.o build.sse4_d_inl0/runTest.o build.sse4_d_inl0/MadgraphTest.o build.sse4_d_inl0/testxxx.o build.sse4_d_inl0/alpCPPProcess.o build.sse4_d_inl0/runTest_alp.o build.sse4_d_inl0/cupla/common.o build.sse4_d_inl0/cupla/device.o build.sse4_d_inl0/cupla/event.o build.sse4_d_inl0/cupla/memory.o build.sse4_d_inl0/cupla/stream.o build.sse4_d_inl0/cupla/manager/Driver.o -O3 -lineinfo -std c++14 -I. -I../../src -I../../../../../tools -I../../../../../test/googletest/googletest/include -I/usr/local/cuda-11.6/include/ -DUSE_NVTX -arch compute_70 -use_fast_math -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -ldl -L../../lib/build.sse4_d_inl0 -lmodel_sm -L../../../../../test/googletest/build/lib/ -lgtest -lgtest_main -L/usr/local/cuda-11.6/lib64/ -lcurand -lcuda -lgomp

avalassi 26071 26066 42 13:40 pts/3    00:02:00 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065d2_00000000-3_runTest_dlink.reg.c -L../../lib/build.avx2_d_inl0 -L../../../../../test/googletest/build/lib/ -L/usr/local/cuda-11.6/lib64/ -ldl -lmodel_sm -lgtest -lgtest_main -lcurand -lcuda -lgomp -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.avx2_d_inl0/CPPProcess.o build.avx2_d_inl0/runTest.o build.avx2_d_inl0/MadgraphTest.o build.avx2_d_inl0/testxxx.o build.avx2_d_inl0/alpCPPProcess.o build.avx2_d_inl0/runTest_alp.o build.avx2_d_inl0/cupla/common.o build.avx2_d_inl0/cupla/device.o build.avx2_d_inl0/cupla/event.o build.avx2_d_inl0/cupla/memory.o build.avx2_d_inl0/cupla/stream.o build.avx2_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26073 26061 47 13:40 pts/3    00:02:13 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065cd_00000000-3_runTest_dlink.reg.c -L../../lib/build.512y_d_inl0 -L../../../../../test/googletest/build/lib/ -L/usr/local/cuda-11.6/lib64/ -ldl -lmodel_sm -lgtest -lgtest_main -lcurand -lcuda -lgomp -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.512y_d_inl0/CPPProcess.o build.512y_d_inl0/runTest.o build.512y_d_inl0/MadgraphTest.o build.512y_d_inl0/testxxx.o build.512y_d_inl0/alpCPPProcess.o build.512y_d_inl0/runTest_alp.o build.512y_d_inl0/cupla/common.o build.512y_d_inl0/cupla/device.o build.512y_d_inl0/cupla/event.o build.512y_d_inl0/cupla/memory.o build.512y_d_inl0/cupla/stream.o build.512y_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26082 26067 45 13:40 pts/3    00:02:09 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065d3_00000000-3_runTest_dlink.reg.c -L../../lib/build.sse4_d_inl0 -L../../../../../test/googletest/build/lib/ -L/usr/local/cuda-11.6/lib64/ -ldl -lmodel_sm -lgtest -lgtest_main -lcurand -lcuda -lgomp -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.sse4_d_inl0/CPPProcess.o build.sse4_d_inl0/runTest.o build.sse4_d_inl0/MadgraphTest.o build.sse4_d_inl0/testxxx.o build.sse4_d_inl0/alpCPPProcess.o build.sse4_d_inl0/runTest_alp.o build.sse4_d_inl0/cupla/common.o build.sse4_d_inl0/cupla/device.o build.sse4_d_inl0/cupla/event.o build.sse4_d_inl0/cupla/memory.o build.sse4_d_inl0/cupla/stream.o build.sse4_d_inl0/cupla/manager/Driver.o -lcudadevrt

avalassi 26085 26065 49 13:40 pts/3    00:02:20 nvlink -m64 --arch compute_70 --register-link-binaries /tmp/avalassi/tmpxft_000065d1_00000000-3_runTest_dlink.reg.c -L../../lib/build.512z_d_inl0 -L../../../../../test/googletest/build/lib/ -L/usr/local/cuda-11.6/lib64/ -ldl -lmodel_sm -lgtest -lgtest_main -lcurand -lcuda -lgomp -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib/stubs -L/usr/local/cuda-11.6/bin/../targets/x86_64-linux/lib -cpu-arch X86_64 build.512z_d_inl0/CPPProcess.o build.512z_d_inl0/runTest.o build.512z_d_inl0/MadgraphTest.o build.512z_d_inl0/testxxx.o build.512z_d_inl0/alpCPPProcess.o build.512z_d_inl0/runTest_alp.o build.512z_d_inl0/cupla/common.o build.512z_d_inl0/cupla/device.o build.512z_d_inl0/cupla/event.o build.512z_d_inl0/cupla/memory.o build.512z_d_inl0/cupla/stream.o build.512z_d_inl0/cupla/manager/Driver.o -lcudadevrt

valassi commented 2 years ago

What is also puzzling is that ccache does not seem to work with nvlink? I think that this used to be the case in the past?... Was it working with an older nvcc?

valassi commented 2 years ago

Hm, note that in https://github.com/madgraph5/madgraph4gpu/issues/174#issuecomment-826736369 I mentioned that ccache was not working correctly with "-x cu"... here, however, the issue mainly seems to be at link time rather than at compile time?