madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Prototype multi-threaded MadEvent with shared GPU offload of MEs? #500

Open valassi opened 2 years ago

valassi commented 2 years ago

This is a followup of #495.

It is quite clear that, presently, Amdahl's law severely limits our madevent+MEs throughput on GPUs, because the Fortran-only madevent part ends up taking more than 80% of the total workflow time (see https://github.com/madgraph5/madgraph4gpu/pull/494#issuecomment-1163312052 : total 660s, with madevent 550s and MEs 110s, on a single CPU thread).
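To put this in Amdahl's-law terms, using just the numbers above: with the Fortran madevent part fixed at 550s out of 660s, even an infinitely fast ME computation on the GPU would cap the overall speedup at

$$ S_{\max} = \frac{660\ \mathrm{s}}{550\ \mathrm{s}} = 1.2 $$

so the madevent overhead itself has to be attacked one way or another.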

In parallel to the other approaches described in #495 (profiling this madevent overhead to understand where the time goes and then speeding it up; maybe it will already shrink once the color/helicity choice moves to the GPU?), one option we could investigate is parallelising madevent over several CPU threads that share the same GPU.

For instance, if we now have 8192 events on the GPU, and supposing we have 8 CPU logical cores, we could run 8 CPU threads each doing 1024 events on the GPU. The assumption I am making here (which may be wrong!) is that we would reach the same GPU occupancy and the same ME throughput by launching 8 identical kernels each doing 1024 events as we do now with 1 kernel doing 8192 events. Assume we always keep 32 threads per block. The idea is that the GPU grid of each of the eight kernels would have only 32 blocks (blocks*threads=1024), while the GPU grid of the single kernel had 256 blocks (blocks*threads=8192). In total there are always 256 blocks active, but in one case they all belong to one CPU thread and in the other they belong to 8 CPU threads (see the sketch below).
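A minimal CUDA sketch of the two launch configurations, assuming a dummy placeholder kernel (fakeMEKernel is invented here, it is not the real ME kernel or the Bridge API) and one stream per would-be CPU thread:

```cpp
// Minimal sketch of the two launch configurations discussed above, with a dummy
// placeholder kernel (fakeMEKernel is invented here, not the real ME kernel):
// (a) one 256-block launch over 8192 events, (b) eight 32-block launches of 1024
// events each, one per CUDA stream, as if issued by eight CPU threads.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fakeMEKernel( double* out, int eventOffset )
{
  // one GPU thread per event; eventOffset selects this launch's slice of events
  const int ievt = eventOffset + blockIdx.x * blockDim.x + threadIdx.x;
  out[ievt] = 1.0 * ievt; // placeholder for the real ME computation
}

int main()
{
  const int nevtTotal = 8192, nthreadsPerBlock = 32;
  const int nlaunches = 8, nevtPerLaunch = nevtTotal / nlaunches;
  double* devOut;
  cudaMalloc( &devOut, nevtTotal * sizeof( double ) ); // single shared allocation

  // (a) single launch: 256 blocks x 32 threads = 8192 events
  fakeMEKernel<<< nevtTotal / nthreadsPerBlock, nthreadsPerBlock >>>( devOut, 0 );
  cudaDeviceSynchronize();

  // (b) eight launches: 8 x (32 blocks x 32 threads) = 8192 events, one stream each
  cudaStream_t streams[nlaunches];
  for( int i = 0; i < nlaunches; i++ ) cudaStreamCreate( &streams[i] );
  for( int i = 0; i < nlaunches; i++ )
    fakeMEKernel<<< nevtPerLaunch / nthreadsPerBlock, nthreadsPerBlock, 0, streams[i] >>>( devOut, i * nevtPerLaunch );
  cudaDeviceSynchronize();

  for( int i = 0; i < nlaunches; i++ ) cudaStreamDestroy( streams[i] );
  cudaFree( devOut );
  printf( "done: in both cases 256 blocks of 32 threads were active in total\n" );
  return 0;
}
```

Whether the eight smaller grids really reach the same occupancy and ME throughput as the single large grid is exactly the assumption such a prototype would test.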

Maybe something like this could also be achieved with 8 CPU processes rather than 8 CPU threads, but I suspect that we are better off with (or maybe strictly need) a single master GPU context and a single GPU memory allocation.

In this CPU MT model, a single master thread would set up the GPU context and allocate the full memory for 8192 events. Probably we would still have a single fbridgecreate call (i.e. a single Bridge constructor, i.e. a single Bridge instance). However, we would need eight fbridgesequence calls, one from each thread. The API of the Bridge would need to change to add a "threadid" argument indicating the offset in GPU global memory of the 1024 events specific to that thread (in practice, a block-id offset).
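A purely hypothetical sketch of that API change (BridgeSketch, bridgeCreateSketch and bridgeSequenceSketch are invented names, not the existing fbridge/Bridge interface): one master device allocation, and a per-thread call that only touches its own 1024-event slice via the threadid offset.

```cpp
// Hypothetical API sketch only, NOT the current fbridge/Bridge interface:
// a single master allocation for all events, plus a "threadid" argument in the
// per-thread sequence call that selects that thread's slice of device memory.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

struct BridgeSketch
{
  double* devMEs = nullptr; // device MEs for ALL events (one master allocation)
  int nevtTotal = 0;
  int nevtPerThread = 0;
};

// single "create", called once by the master thread
void bridgeCreateSketch( BridgeSketch& b, int nevtTotal, int nevtPerThread )
{
  b.nevtTotal = nevtTotal;
  b.nevtPerThread = nevtPerThread;
  cudaMalloc( &b.devMEs, nevtTotal * sizeof( double ) );
}

// placeholder for the real ME kernel
__global__ void dummyMEKernel( double* mes )
{
  mes[blockIdx.x * blockDim.x + threadIdx.x] = 1.0;
}

// per-thread "sequence": threadId fixes the block/event offset into the shared buffers
void bridgeSequenceSketch( BridgeSketch& b, double* hstMEs, int threadId )
{
  const int offset = threadId * b.nevtPerThread;
  dummyMEKernel<<< b.nevtPerThread / 32, 32 >>>( b.devMEs + offset );
  cudaMemcpy( hstMEs + offset, b.devMEs + offset,
              b.nevtPerThread * sizeof( double ), cudaMemcpyDeviceToHost );
}

int main()
{
  const int nevtTotal = 8192, nCpuThreads = 8;
  BridgeSketch b;
  bridgeCreateSketch( b, nevtTotal, nevtTotal / nCpuThreads ); // one "Bridge instance"
  std::vector<double> hstMEs( nevtTotal );
  std::vector<std::thread> workers;
  for( int tid = 0; tid < nCpuThreads; tid++ ) // eight per-thread "sequence" calls
    workers.emplace_back( bridgeSequenceSketch, std::ref( b ), hstMEs.data(), tid );
  for( auto& w : workers ) w.join();
  cudaFree( b.devMEs );
  return 0;
}
```

Note that on the legacy default stream the eight kernel launches would still serialise on the GPU; a real prototype would presumably give each thread its own CUDA stream (or build with nvcc --default-stream per-thread) so that they can actually run concurrently.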

Maybe this is one of the options to prototype in Lugano, together with kernel splitting? (Note, handling several GPUs attached to the same CPU is another thing we eventually need to do: while it is a different problem, a common theme is that offloading from the CPU no longer means offloading to a single GPU as a whole, but to one specific part of one specific GPU.)

roiser commented 2 years ago

Do we know which Fortran part is taking how much time in the overall workflow execution? If the "pre-processing" (random numbers, phase space sampling) is taking more time, one could do that processing in parallel on the CPU and move all the data to the GPU in one shot, then let the post-processing be done in a single thread (see the sketch below). But we would need numbers first ...
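A rough sketch of that idea, under invented assumptions about the data layout (a single flat host buffer standing in for the random numbers / phase space points): the CPU threads fill their slices in parallel, then the data goes to the GPU in a single transfer.

```cpp
// Rough sketch of the idea above (assumed workflow and buffer layout, not the
// actual madevent code): fill the host-side "pre-processing" buffer with several
// CPU threads, then transfer it to the GPU in one shot before the ME kernel.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

int main()
{
  const int nevt = 8192, nCpuThreads = 8, nevtPerThread = nevt / nCpuThreads;
  std::vector<double> hstData( nevt ); // placeholder for momenta / random numbers

  // CPU "pre-processing" in parallel: each thread fills its own slice
  std::vector<std::thread> workers;
  for( int tid = 0; tid < nCpuThreads; tid++ )
    workers.emplace_back( [&hstData, tid, nevtPerThread]() {
      for( int i = 0; i < nevtPerThread; i++ )
        hstData[tid * nevtPerThread + i] = 0.5; // dummy phase space point
    } );
  for( auto& w : workers ) w.join();

  // single bulk transfer to the GPU; the ME kernel would then run on all events
  double* devData;
  cudaMalloc( &devData, nevt * sizeof( double ) );
  cudaMemcpy( devData, hstData.data(), nevt * sizeof( double ), cudaMemcpyHostToDevice );
  cudaFree( devData );
  return 0;
}
```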

valassi commented 2 years ago

Thanks Stefan! Not exactly, we do not know where the time is being spent (see #495), whether it is mainly pre-processing or post-processing (well, maybe that I could easily instrument).

Anyway, we would probably need a multithreaded CPU part, plus the GPU. Not the biggest priority now, but let's keep it in mind...

valassi commented 2 years ago

Another possibility is to launch several processes. This may work, as discussed here https://indico.cern.ch/event/1170924/contributions/4954511/

Note that we seem to get even higher throughput from several CPU processes sharing the same GPU (see the slides above).

One suggestion (thanks Vincent Maillou) is the following: "Hi, here is some information about multi-process CUDA applications: it is called MPS, for Multi-Process Service. From the doc, it seems that the only improvement one can get by offloading from multiple processes to the same GPU is if a single process is not completely using the GPU. https://docs.nvidia.com/deploy/mps/index.html"