Jorgen's addition of HIP/CUDA abstraction to codegen and all processes (with extra fixes and tests: "jt774" replacing PR #774)

valassi commented 7 months ago

Hi @oliviermattelaer @roiser @nscottnichols as discussed this is the WIP PR replacing Jorgen's PR #774.

It includes all of Jorgen's changes for HIP in PR #774 (GpuAbstraction.h and hipcc build instructions) in CODEGEN, plus a merge of the current upstream/master and extra fixes that were necessary to fix codegen and/or CUDA/C++ builds. Those extra fixes were partly derived from earlier work I had done with Jorgen in July/August (that I kept for reference in a PR #800, that I opened and immediately closed).

This PR has a fully functioning CODEGEN and regerenated processes, with baisc builds tested for ggtt.mad, but it is still in WIP.

To start with, I need to run all cuda/c++ tests (some will run in the CI, the performance suite I will run tonight).
More importantly, I HAVE NOT TESTED ANY HIP YET , neither the build nor the runtime. I need to install hipcc to test the build. And I need to setup a node with an AMD GPU to test the code.

I will close PR #774 because it is replaced by this one.

valassi commented 7 months ago

Ok all CI tests succeed - this means builds and tests are ok for all mad and sa on all processes, at least, in CUDA and C++.

In particular this is the new CI https://github.com/madgraph5/madgraph4gpu/actions/runs/7658630729/

valassi commented 7 months ago

I have fixed the build using C++ and HIP on LUMI for gg_tt.sa.

Tomorrow I will backport to codegen, regenerate all processes and launch all tests also on LUMI for HIP.

valassi commented 7 months ago

I have completed the first basic tests of HIP stuff on LUMI. En passant I closed several issues, including #802 (segfaults in fgcheck: I moved to using the fortran linker for linking fortran and c++) and #803 (I removed filesystem headers because the LUMI compilers have issues with c++17 compliance in this area). I also tested flang as a fortran compiler, but this gives issues with F90 non-compliance in madevent files #804.

I backported to codegen and regenerated all processes. I am now rerunning all tests both on itscrd90 with CUDA and on LUMI with HIP.

By the way, I did these tests on SA first and then extended to madevent. It was much easier to do the debugging on SA first. This is an extra argument for keeping SA functional, and having madevent makefiles depend on the SA ones.

valassi commented 7 months ago

Once the tests are done, this is essentially ready to go.

There is only one thing missing, but maybe it can de deferred

TODO: fix the runcards to add HIP as a backen

I would then go ahead and add on top of these two related PRs

PR #798: the separation of cuda and cpp builds and 'targets' (which must be completed with a merge of master, and with handling now also HIP targets)
PR #368: removing gXXX.cu symlinks (which must be completed with a merge of master, and possibly with a cleanup of many files and scripts that only assume CUDA is present and nt HIP or SYCL etc)

valassi commented 7 months ago

Hi @oliviermattelaer @roiser @nscottnichols @hageboeck this is essentially ready. If you have any comments please let me know. (cc @Jooorgen for info: thanks again for this work, hopefully soon in production!)

@oliviermattelaer I add you as reviewer - let me know when/if it cane be merged please or if you have other suggestions.

This PR includes Jorgen's work (see #311) on GpuAbstracttion.h, the change of CUDACC ifdefs and the HIP build recipes. With respect to Jorgen's PR #774 I added a few additional fixes, I regenerated all processes and I have rerun all of my standard "tput" and "tmad" tests on AMD GPUs at LUMI.

The present code passes all tests on NVidia GPUs, and almost all tests on AMD GPUs, but it fails tests on AMD GPUs for gq_ttq (see #806). In my opinion this is not a blocker for merging, it is better to merge this and have the code available for further debugging of the gqttq issue on AMD, and for Nathan's work on Intel GPUs (see #805).

As discussed at the last meeting (and above), my suggestion on how to proceed would be the following:

Merge this PR #801.
Update PR #798 to this merge, and then merge 798 as well. This contains the separation of CUDA/C++ targets. So some work is also needed to cleanly separate HIP and CUDA (in this 801 I keep Jorgen's approach, so for instance if both CUDA and HIP are installed, CUDA is built and not HIP). One extra point that I would add in 798 (and which I did NOT add here in 801) is the handling of the HIP target in the runcards.
Update PR #368 to the two merges above, and then merge 368 as well. This removes the gXXX.cu symlinks, replacing gXXX.o objects by XXX_cu.o objects. This also needs some cleanup to better separate cuda and hip, so for instance the cuda and hip builds should produce separate XXX_cu.o and XXX_hip.o objects (these are in different directories anyway). And similarly madevent_cuda should be renames as madevent_hip when it is built for hip. These last few points can be done as part of 368, or even as a new PR otherwise to keep these issues separate from one another.

Let me know if you have comments. Thanks. Andrea

valassi commented 7 months ago

For reference, theLUMI setup I used is

module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load gcc/12.2.0
export FC=`which gfortran`

valassi commented 7 months ago

I just added a fix to allow multi-word CXX env variables for HIP builds. This muct be disabled for CUDA (see #505) because -ccbin is used for nvcc there, but I had by mistake disabled them for HIP and this is not necessary. The fix is necessary to alow the following LUMI setup (disabling cray libraries which are otherwise not found, see #807)

module load cray-python
export PATH=~/CCACHE/ccache-4.8.2-INSTALL/bin:$PATH
export CCACHE_DIR=~/CCACHE/ccache
export USECCACHE=1
module load LUMI/23.09 partition/G
module load cpeGNU/23.09
export CC="cc --cray-bypass-pkgconfig -craype-verbose"
export CXX="CC --cray-bypass-pkgconfig -craype-verbose"
export FC="ftn  --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132"

(I tried this setup to try and fix the crash in #806 for gqttq, but this does not fix that crash)

valassi commented 7 months ago

I also added another fix to remove the '-G' option in debug builds for HIP (see #808).

Note: to trigger debug builds, use make -f cudacpp.mk debug -j; make -j

roiser commented 7 months ago

Hi [...] @roiser [...] this is essentially ready. If you have any comments please let me know.

I have looked at one of the processes, this looks like what we discussed with Jorgen at the time --> Thanks @Jooorgen !!!!

I did NOT look at the makefiles though, I imagine those will be simplified in the future.

I only have one comment / suggestion. Judging from the code review it seems that random number generation falls back to the common random numbers for HIP, while quickly googling around there seems to exist a HIP version of curand. Especially for the standalone version it may be interesting to also have a HipRandomNumberKernel.cc implemented.

valassi commented 7 months ago

I did NOT look at the makefiles though, I imagine those will be simplified in the future.

Well, so far they are mainly those by Jorgen with HIP build instructions. As discussed I would go for the many simlifications in #798 and #368 after this

I only have one comment / suggestion. Judging from the code review it seems that random number generation falls back to the common random numbers for HIP, while quickly googling around there seems to exist a HIP version of curand. Especially for the standalone version it may be interesting to also have a HipRandomNumberKernel.cc implemented.

Good point, I opened a WIP #809, with a few commits over this one, still to be completed.

I would keep rocrand for a separate stage anyway. There is a lot of stuff we need to get in, I would go one step at a time. So first this 801, then maybe 809, and certainly 798 and 368. Or even much more simply this 801 and then 798 and 368, and rocrand 809 only at the end. (By the way I called this rocrand because it is rocrand at the bottom, but I start t have the impression that the actual API used will be hiprand instead)

valassi commented 7 months ago

Well... there are a few issues with complex numbers and hipcc flags (#810). I added a fix, but it seems that a couple of tests are failing. A bit more to do I guess.

valassi commented 7 months ago

ok the issue should be fixed now (_sm hardcoded, now fixed), rerunning all manual and CI tests

oliviermattelaer commented 7 months ago

So for me this simple madgraph test is not working anymore:

generate g g > t t~ g
output madevent_simd PROC_TEST
launch
set cudacpp_backend FORTRAN

crash is

Command "import /Users/omattelaer/Documents/git_workspace/3.1.1_lo_vectorization/cmd_multi" interrupted in sub-command:
"output madevent_simd PROC_TEST" with error:
AttributeError : 'SIMD_ProcessExporter' object has no attribute 'export_processes'
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in 'MG5_debug'.
Please attach this file to your report.

Does it work for you andrea? If yes which of the MG5aMC branch is needed to have this working? (I bet that this might be my issue)

oliviermattelaer commented 7 months ago

I was using gpucpp_wrap, indeed gpucpp is working. Let me check with that one that SIMD is working now on my mac

valassi commented 7 months ago

Hi @oliviermattelaer thanks yes I am using gpucpp branch. Or more precisely: I am using whatever mg5amcnlo commit we have in the module: https://github.com/valassi/madgraph4gpu/tree/jt774/MG5aMC Now this points to https://github.com/mg5amcnlo/mg5amcnlo/tree/23f61b93fdf268a1cdcbd363cd449c88b3511d7a

This is a rather older version of gpucpp. If you want, I can also try to include the upgrade to the latest gpucpp here. But first check with 23f61b9 please... again, I'd avoid mixing everything together if possible (it's ok if it works, but if it does not work I would rather debug the issues separately). Thanks!

valassi commented 7 months ago

PS 23f61b9 is also what we have now in master https://github.com/madgraph5/madgraph4gpu/tree/master/MG5aMC

oliviermattelaer commented 7 months ago

I can not test with that MG5aMC version since my python version is too recent for that "old" version of MG5aMC... So I had to update (both gpucpp and gpucpp_wrap) to merge with 3.5.3... But seems working so far...

oliviermattelaer commented 7 months ago

Ok, it does not work for me actually. They are some issue with the python interface of the plugin (nothing related to the hip part). I guess that such issue is solve in another branch (can not be working like that). Will push here some cherry-pick or other method to fix the issue that I have (and then work on the next one).

oliviermattelaer commented 7 months ago

note for later, we will need to re-run the test on that commit since the reported crash is an github API issue (i.e. not related to the commit per say)

valassi commented 7 months ago

Ok, it does not work for me actually. They are some issue with the python interface of the plugin (nothing related to the hip part). I guess that such issue is solve in another branch (can not be working like that). Will push here some cherry-pick or other method to fix the issue that I have (and then work on the next one).

Hi @oliviermattelaer thanks for testing and for adding the fix https://github.com/madgraph5/madgraph4gpu/pull/801/commits/9ed3aaf637941b12762b6d7a0bf51246f7a63cba

As you saw I opened PR #811 in parallel about the update of mg5amcnlo to the latest gpucpp. But I suggest we postpone that, and we use the older gpucpp for this PR #801.

And yes we do too much for github and its CI ;-)

oliviermattelaer commented 7 months ago

I have check the plugin/interface part and this is good to merge for me.

However, this would need to add a new backend within launch_plugin.py (for the moment, this only support Fortran, CPP, CUDA). Which means that a "normal" user can not run the code (even if all the script that we use for testing are working).

I would say that our target here would be to be able to run via the following script:

generate g g > t t~ g
output madevent_gpu PROC_TEST
launch
set cudacpp_backend HIP

(maybe a good time to change the name of that variable) I can implement that within this PR (this is easy for me) but I will not be able to test it (no access to LUMI). Or do you prefer that we discuss/add that possibility in another branch? (both are fine for me) If this is the case, then we can move forward for the moment.

valassi commented 7 months ago

I have check the plugin/interface part and this is good to merge for me.

However, this would need to add a new backend within launch_plugin.py (for the moment, this only support Fortran, CPP, CUDA). Which means that a "normal" user can not run the code (even if all the script that we use for testing are working).

I would say that our target here would be to be able to run via the following script:
generate g g > t t~ g
output madevent_gpu PROC_TEST
launch
set cudacpp_backend HIP

Thanks Olivier, I will merge. I committed a few more manual test logs, all OK. And the CI is good.

After this merge 801, I would suggest going in the following order:

adapt and merge #798 about separate cpp, cuda, hip builds (built over Jorgen's #775)
adapt and merge #368, removing gXXX.cu links and using XXX.o and XXX_cu.o instead (and possibly also XXX_hip.o, now)

Also, later on,

adapt and merge #811 about moving to the latest gpucpp
adapt and merge #809 about rocrand (this is actually almost ready too, I just realised that RocrandHost cannot be implemented apparently, can be studied in more detail, maybe needs a more recent HIP, but I think that's ok for the moment)

And then later all the other non hip stuff (warp size, more on makefiles, etc).

valassi commented 7 months ago

And thanks @Jooorgen again! :-)

madgraph5 / madgraph4gpu

Jorgen's addition of HIP/CUDA abstraction to codegen and all processes (with extra fixes and tests: "jt774" replacing PR #774) #801