madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

epochX - code generation, diffs, backports of latest cuda/c++ #244

Closed valassi closed 2 years ago

valassi commented 3 years ago

I put down a few ideas following this afternoon's code generation workshop https://indico.cern.ch/event/1061524/ about how I would suggest structuring the repo (and also the specific steps I plan to implement for cuda/c++).

What I would create is a new epochX directory, next to epoch1 and epoch2.

The plan is to backport my vectorization changes (now the latest are in epoch1/cuda/ee_mumu) to epochX/cuda.

This epochX would be an evolving directory (i.e. not a semi-fixed one like epoch1/2), containing both manually developed code and code-generating code infrastructure.

In a "steady state", when all new manual developments have been backported to the code generating code, one should be able to run the code generating code and obtain EXACTLY the same code as that which has been manually developed and committed to the repo. This requires a functioning code generating infrastructure (eg with only a few plugin files in github, but relying on an "external" installation of Mg5aMC somewhere configurable), and some relocation and diff scripts: essentially yo generate new code in a tmp area, then you compare it to what is committed in the repo. It may also require some beautifier to bypass trivial formatting differences (to be seen).

Beyond the steady state, one develops manual code and commits that to the github repo. At this stage, you have in github a code generating plugin that is "older" than the manual developments, and that is perfectly ok. At some point, you make the effort of backporting the new manual changes into the code generating plugin, and ensure that you reach the steady state again. Then you can iterate and do new manual developments.

Specifically for the "cuda" directory (native cuda, plus vectorized c++, with ifdefs for the moment), I would imagine working on the following steps in sequence:

  1. Create the basic infrastructure, with "manual" code and code generating code. Ensure that you get the external MG5aMC and diff scripts working. In practice, start with Olivier's current bazaar code generating code, and add that as a plugin to github. Generate processes (eemumu and ggttgg? probably both) with that code generating code, and commit that to github as the first "manual" version. This would be clearly different from the latest epoch1/eemumu (missing all vectorization), but also from epoch2/ggttgg (I checked today, there are small differences). This is a steady state. Make a PR for this.

  2. Replace epochX/ggttgg by the manual code that is currently in epoch2/ggttgg. This is beyond the steady state. Start doing actual work on the code generating plugin, and ensure that eventually the plugin is able to generate exactly the same ggttgg code. Make sure that the new plugin also works fine on eemumu: generate a new eemumu version, just check it looks ok, and commit that to github. This is again a steady state, for both ggttgg and eemumu. Make a second PR.

  3. Attack vectorization. Replace epochX/eemumu by the manual code that is currently in epoch1/eemumu. This is far beyond the steady state. Do the (potentially complex) work of backporting vectorization to the code generating code. Here you might also need the beautifier, if not used already before. Ensure that you are able to get exactly the same eemumu code from code generation as you have in the manual version. If done, you are in steady state for eemumu. Make a third PR.

  4. Here it gets interesting. Apply the latest epochX code generator to ggttgg. This should produce vectorized code for ggttgg, but it is likely to be missing some bits. Iterate on both sides, the manual ggttgg code and the code generating code, until you get something that actually works well for ggttgg, and can be reproduced exactly from code generation. If done, you are in the steady state for ggttgg. Make a fourth PR.

  5. At this point you probably have a more recent code generator than you had in step 3, and you should make sure that eemumu also looks ok. Iterate a bit on all sides (manual eemumu, manual ggttgg, and code generating code), until both processes look reasonable and can be fully auto generated. You are now in the final steady state for the backport of previous cuda/c++ code. Make a fifth PR.

From here onwards, go on with the normal development model outlined at the very beginning: more changes in manual cuda/c++ for eemumu and/or ggttgg (probably ggttgg will be the main focus by now, and we can fully forget eemumu?). At some point (either here, or maybe already in step 4) you probably need to do something special for color algebras. Anyway, from here onwards there are many more developments, eg Fortran integration, comparison to other backends, etc.

Concerning code generating code for cuda vs kokkos and others: I would probably do all of the steps 1-5 in isolation. Maybe the only things it makes sense to have in common at the beginning, in step 1, are the external MG5aMC installation, and possibly the tooling to do code generation tests and diffs (and possibly beautification). But the plugins themselves in the code generating code I would initially keep separate for cuda/c++ and kokkos (and alpaka, sycl etc). It looks already quite complex like this... Anyway, this can be rediscussed.

Voila, a long brain dump to clear my own ideas/plans, and explain to others what I'd suggest... feedback welcome of course! Andrea

valassi commented 3 years ago

Yesterday I created a first WIP PR #245 for the "step 1". One important (unexpected) issue is that the current code generation in Madgraph seems not to be reproducible, which in my opinion is a major problem that prevents easy validation of any new developments (both those we already have, such as vectorized c++ or kokkos, and any new development to come). Maybe this is not complex to fix (some python dictionaries to replace by sorted ones?), maybe it is. I will investigate when I have time.

valassi commented 3 years ago

I have fixed the first source of unreproducibility, but there are others. It was a python list(set(list)) - actually python dictionaries preserve insertion order as of python 3.6 (guaranteed since 3.7), so relying on them is not a problem (I use 3.8), and I even rely on this feature for the fix. It should be relatively fast to fix the other pending issues in a similar way.
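
For illustration only (this is not the actual MG5aMC patch), the kind of change involved looks like this in python:

  couplings = ["GC_11", "GC_10", "GC_11", "GC_12", "GC_10"]

  # Not reproducible: the iteration order of a set of strings can change from one
  # run to the next (string hash randomization), so the generated code can differ
  unique_unordered = list(set(couplings))

  # Reproducible: dict.fromkeys deduplicates while keeping the first-seen order
  # (dicts preserve insertion order as of python 3.6, guaranteed since 3.7)
  unique_ordered = list(dict.fromkeys(couplings))

  assert unique_ordered == ["GC_11", "GC_10", "GC_12"]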

valassi commented 3 years ago

Thanks to Olivier, the reproducibility issue is now fixed. I have produced basic code for eemumu and ggttgg. In the plugin I only added one minimal change to the Makefile, to bypass some build issue for tests, which I will fix more carefully in later versions of the c++/cuda code from epoch1/2. I added a basic throughput script and ran some profiling, logs are in the repo. Probably I also need to switch on fast math to reproduce the results I got in epoch1/2.

This completes my 'step1' PR #245 in the plan outlined in issue #244. Still a long way to go...

I keep this in WIP even if essentially complete. I will open a second PR (including this one) for the next steps.

PS Hm well, one last thing to really complete this could be to add CI tests for it. But for the moment I bypassed the tests, so I will do that in later steps instead.

valassi commented 3 years ago

I have created a (yet incomplete) PR #247 for step2, the backport of epoch2 ggttgg. Probably more than 50% done. Along the way I fixed several issues that would also have been in the way for vectorization in epoch1 eemumu (eg copying templates from the plugin). Still some work to do there as described in #247 (eg I would like to disable the code indenter that changes the indentation of templates like helas.h and helas.cc).

One general comment on the overall strategy I am following. I am now committing BOTH the manual code and the auto generated code. For instance, directories ee_mumu and ee_mumu.auto. This makes it much easier to understand what still needs to be ported to code generation. Sometimes changes are needed in the manual code (eg if the order in auto generation is different from that which had initially been used when developing the manual code).

valassi commented 3 years ago

Ok, the indentation comes from "class CPPWriter(FileWriter)" in madgraph.iolibs.file_writers. This is called from write_aloha_routines in UFOModelConverterGPU. To disable the indentation it is enough to use FileWriter instead of CPPWriter. This works better with large copied code (eg ixxx) but works poorly on the FFV routines. I will try to work like this anyway, and fix the FFV routines a posteriori. Otherwise getting diffs of the code becomes a nightmare.
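
As a minimal sketch of the switch (assuming both classes take the output path like FileWriter does; the call site below is purely illustrative, not the actual write_aloha_routines code):

  from madgraph.iolibs.file_writers import CPPWriter, FileWriter

  # Illustrative switch: CPPWriter re-indents the C++ as it writes,
  # while the plain FileWriter copies the template lines verbatim
  KEEP_TEMPLATE_INDENTATION = True

  def open_cpp_writer(path):
      writer_cls = FileWriter if KEEP_TEMPLATE_INDENTATION else CPPWriter
      return writer_cls(path)  # assumed: both accept the output path, as FileWriter does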

valassi commented 3 years ago

A couple of slides about the status and strategy in today's meeting https://indico.cern.ch/event/1077418/#2-topical-discussion

valassi commented 3 years ago

I have completed and merged the first two PRs:

STEP1 PR #245

STEP2 PR #247

I can now start working on 'step3', ie the backport of the vectorized code from eemumu "epoch1" (more recent than epoch2).

valassi commented 3 years ago

I have created a WIP PR #253 for step3. I will work on it next week

valassi commented 3 years ago

While working on step3 and PR #253, I realised that there were a few internal inconsistencies in the repos (eg duplicate files), and the cc/cu file name and symlink structure was not the same as in epoch1 eemumu. I have now fixed all these issues in an epoch2 "bis", with a corresponding PR #254, which I am about to merge.

valassi commented 3 years ago

The result of PR #254 is now a golden tag

  git tag -a golden_epochX2 20ba5b1893572c5999af38272baea752fefd8bdc
  git push --tags

valassi commented 3 years ago

Ok the tag is now recreated, same name https://github.com/madgraph5/madgraph4gpu/releases/tag/golden_epochX2

  git tag -a golden_epochX2 f15deb5f88dd8252aaae914803e25c84c578d970
  git push --tags

"Golden" tag for cudacpp epochX2(bis) after merging PRs #254, #248, #256. To be used for Kokkos/Alpaka/Sycl comparisons, Fortran/cuda bridging etc. This includes the backport to code generating code of epoch2 ggttgg. It also includes auto generated gg_tt for Fortran/cuda bridging.

valassi commented 3 years ago

The backport of the vectorized eemumu to code generating code (epochX3) is now complete. I have merged this in PR #253. The eemumu auto and manual code are the same.

The ggtt and ggttgg auto (and manual) code are instead still those that had been generated with the epochX2 (golden tag) plugin.

I will now start the generation of vectorized ggttgg code (epochX4). The code does generate out of the box, but there are a few build issues to fix. It will need some iterations in the plugin.

valassi commented 3 years ago

I have started epochX4 in PR #267.

I have the first vectorization results on ggttgg: without aggressive inlining, I see a good factor 4 speedup between "no SIMD" and "AVX 512y" in C++! Note that the former was (a factor 2?) slower than Fortran, so the real gain must be assessed. But this is very promising...

valassi commented 3 years ago

Step 4 (and step 5) is now also complete. The final merge requests are #267 and #270 (I forgot some commits in the former).

This vectorizes ggttgg and ggtt (with good SIMD speedups).

All code is in the steady state: the three auto directories (ggttgg, ggtt, eemumu) correspond to the plugin, the two manual directories (eemumu and ggttgg) are identical to the auto ones, and the committed logs are those of the code in the repo.

This saga #244 can essentially be closed...

valassi commented 3 years ago

Note that one important side effect of this task #244 and the various PRs in it is that the auto-generated code is now nicely formatted and indented. The code generator IS also a code beautifier, and is very precisely programmable. This means that we should not need clang-format, uncrustify or other tools (issue #49). The way this was implemented was to disable the "CPPWriter" in madgraph, using the non-formatting FileWriter instead, and then adding formatting in all the python fragments that write C++ fragments.
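
As an illustration of the last point (an invented helper, not the actual plugin code), the python fragments now emit C++ that is already indented the way it should appear in the output file:

  # Invented helper, for illustration only: since the plain FileWriter no longer
  # re-indents what it writes, the python side emits already-formatted C++ lines
  def format_ffv_call(indent, wavefunctions, coupling, amplitude):
      spaces = " " * indent
      args = ", ".join(wavefunctions + [coupling, amplitude])
      return f"{spaces}FFV1_0( {args} );\n"

  # Example output: "      FFV1_0( w_sv[0], w_sv[1], w_sv[4], COUPs[1], &amp_sv[0] );"
  line = format_ffv_call(6, ["w_sv[0]", "w_sv[1]", "w_sv[4]"], "COUPs[1]", "&amp_sv[0]")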

valassi commented 2 years ago

EpochX is now firmly established. This saga #244, which was about creating epochX from scratch, can now be closed.