madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

In Fortran MadEvent, use larger arrays (beyond 16 events, up to 16k) to allow efficient usage of GPUs #455

Open valassi opened 2 years ago

valassi commented 2 years ago

Hi @oliviermattelaer, I am filing this as a separate issue just to keep track of what I am doing.

This is probably not the best way, but it works. We can use it as a basis for discussion.

Essentially, your present code has a single parameter NB_PAGE (determined from the meexporter vector size), fixed at compile time. This is used for two things in Fortran: first, to allocate memory arrays with a dimension large enough to hold that many events; second, to loop over that many events.

The problem is that CUDA and C++ require rather different numbers. In C++ I typically use 32 (16 is the absolute maximum today for floats in AVX512: you can hold 16 32-bit single-precision floats in 512 bits). In CUDA on a V100, you need around 16k events for complex processes and even 512k for simpler processes.

In the C++ standalone application, this is handled at runtime: you pass the infamous three parameters at the beginning, where the product of the first two (blocks * threads) is the grid size at each iteration. It is easy to do this at runtime because we use malloc. In Fortran, I am not sure how to use a malloc equivalent, and in any case it would be quite cumbersome.
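To illustrate the idea, here is a minimal sketch (not the actual madgraph4gpu code; the array shapes and names are placeholders) of how the standalone application can size its buffers at runtime, once blocks and threads are known:

```cpp
// Minimal sketch only: runtime sizing of the event buffers from the
// "three parameters" (#blocks, #threads per block, #iterations).
#include <cstdio>
#include <cstdlib>
#include <vector>

int main( int argc, char** argv )
{
  const int gpublocks  = ( argc > 1 ? std::atoi( argv[1] ) : 2048 ); // #blocks
  const int gputhreads = ( argc > 2 ? std::atoi( argv[2] ) : 256 );  // #threads per block
  const int niter      = ( argc > 3 ? std::atoi( argv[3] ) : 1 );    // #iterations
  const int nevt = gpublocks * gputhreads; // events per iteration (the "grid size")

  const int npar = 4; // external particles (process dependent, placeholder value)
  const int np4 = 4;  // energy plus 3-momentum components
  std::vector<double> momenta( nevt * npar * np4 ); // sized at runtime, not at compile time
  std::vector<double> matrixElements( nevt );

  for( int iiter = 0; iiter < niter; ++iiter )
  {
    // ... sample momenta and compute matrix elements for nevt events ...
  }
  std::printf( "processed %d iteration(s) of %d events each\n", niter, nevt );
  return 0;
}
```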

The hack I am using is the following: I split your NB_PAGE into two parameters, NB_PAGE_MAX (fixed at compile time, used to allocate the arrays) and NB_PAGE_LOOP (chosen at runtime, used to loop over the events that are actually processed).
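As a C++-flavoured sketch of the idea (the real change is in the generated Fortran, and the numbers here are arbitrary): the arrays are dimensioned with the compile-time maximum, while a runtime value decides how many events each pass actually uses.

```cpp
// Minimal sketch only: compile-time maximum for allocation, runtime value for the loop.
#include <cassert>
#include <cstdio>

constexpr int NB_PAGE_MAX = 16384; // compile time: sizes the arrays

int main()
{
  static double momenta[NB_PAGE_MAX][4][4]; // allocated for up to NB_PAGE_MAX events
  int nb_page_loop = 0;                     // runtime: events actually used per pass
  if( std::scanf( "%d", &nb_page_loop ) != 1 ) return 1;
  assert( nb_page_loop >= 1 && nb_page_loop <= NB_PAGE_MAX ); // only consistency check needed here
  for( int ievt = 0; ievt < nb_page_loop; ++ievt )
  {
    momenta[ievt][0][0] = 0; // ... fill momenta and compute the matrix element for this event ...
  }
  std::printf( "using %d of up to %d events per pass\n", nb_page_loop, NB_PAGE_MAX );
  return 0;
}
```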

For the moment it seems to work. I will point here an example of what I have done.

valassi commented 2 years ago

The strategy seems to work, see for instance https://github.com/valassi/madgraph4gpu/commit/b24caf051f8cbb4676917592b7953011d256bccf It is possible to split NB_PAGE into two variables, one for the allocations and one for the loop.

The problem however is that in CUDA I only seem to be able to use up to 1024 events in a loop. If I increase to 2048, I get some strange errors: https://github.com/valassi/madgraph4gpu/commit/7e8fdf107decd160aa10d8656584718ba4100afc https://github.com/madgraph5/madgraph4gpu/pull/454/commits/0498ed54378068055e1f13dd5c5b5a41a5a97e5f

*** EXECUTE MADEVENT (create results.dat) ***
--------------------
16384 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
0 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number for single-diagram enhancement multi-channel (IGNORED as suppress amplitude is 0?)
--------------------
Executing ' ./madevent < /tmp/avalassi/tmp.HkGyl3s2lr_fortran > /tmp/avalassi/tmp.KL9aSwClfx'
 [XSECTION] Cross section = 435.3
 [COUNTERS] PROGRAM TOTAL          :    1.0303s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.7452s
 [COUNTERS] Fortran MEs      ( 1 ) :    0.2851s for    16416 events => throughput is 5.76E+04 events/s

*** EXECUTE CMADEVENT_CUDACPP (create events.lhe) ***
--------------------
32 ! Number of events in a single C++ iteration (nb_page_loop)
16384 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
0 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number for single-diagram enhancement multi-channel (IGNORED as suppress amplitude is 0?)
--------------------
Executing ' ./cmadevent_cudacpp < /tmp/avalassi/tmp.uw8UDuRF7g_cuda > /tmp/avalassi/tmp.F3HguHT9az'
 [XSECTION] Cross section = 435.3
 [MERATIOS] ME ratio CudaCpp/Fortran: MIN = 0.99992635 = 1 - 7.4e-05
 [MERATIOS] ME ratio CudaCpp/Fortran: MAX = 1.00000527 = 1 + 5.3e-06
 [MERATIOS] ME ratio CudaCpp/Fortran: NENTRIES =  16416
 [MERATIOS] ME ratio CudaCpp/Fortran - 1: AVG = -2.5e-05 +- 2.6e-07
 [COUNTERS] PROGRAM TOTAL          :    1.1288s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.8076s
 [COUNTERS] Fortran MEs      ( 1 ) :    0.2800s for    16416 events => throughput is 5.86E+04 events/s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0412s for    16416 events => throughput is 3.98E+05 events/s

*** EXECUTE GMADEVENT_CUDACPP (create events.lhe) ***
--------------------
2048 ! Number of events in a single CUDA iteration (nb_page_loop)
16384 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
0 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number for single-diagram enhancement multi-channel (IGNORED as suppress amplitude is 0?)
--------------------
Executing ' ./gmadevent_cudacpp < /tmp/avalassi/tmp.Zh3HNp9Or1_cuda > /tmp/avalassi/tmp.auMJh5vETN'
STOP 5
ERROR! ' ./gmadevent_cudacpp < /tmp/avalassi/tmp.Zh3HNp9Or1_cuda > /tmp/avalassi/tmp.auMJh5vETN' failed
  Particle       3       4
      Et >     0.0     0.0
       E >     0.0     0.0
     Eta <    -1.0    -1.0
   xqcut:      0.0     0.0
d R # 3  >    -0.0     0.0
s min # 3>     0.0     0.0
xqcutij # 3>     0.0     0.0
 form factor with fixed_couplings not supported anymore
  please update model to use lorentz structure

Question: is there some hardcoded cutoff around 1000 events elsewhere in the code?

valassi commented 2 years ago

Question: is there some hardcoded cutoff around 1000 events elsewhere in the code?

Yes, definitely there was. See #458: in some .inc files there was a hardcoded number, and in some .inc files there was a reference to NB_PAGE (now NB_PAGE_MAX for me).

Now this is under control.

Note that I only managed to go up to NB_PAGE_MAX around 32768. For 64k and above I get segfaults (I guess out of memory).

I am keeping this open to reference other issues, and I want to add some performance numbers.

valassi commented 2 years ago

I renamed this to indicate that it was about the very first step: going beyond 16 or 32 events per grid, up to around 16k. This is done in PR #454, which I am about to merge. I will therefore close this #455.

I have instead opened #460 as a follow-up to this one, to go beyond 16k events per grid.

Note that both #455 and #460 are related to Olivier's https://github.com/oliviermattelaer/mg5amc_test/issues/9

oliviermattelaer commented 2 years ago

Stupid question here: how can the code be vectorised correctly if "NB_PAGE_LOOP" is not known at compile time? Or does it only work if NB_PAGE_LOOP is a multiple of the vectorization size?

valassi commented 2 years ago

Hi Olivier, there are no stupid questions ;-)

At compile time you know NB_PAGE_MAX. This allocates the memory.

At runtime you know NB_PAGE_LOOP, which must be less than or equal to NB_PAGE_MAX. You can set the max to 100 and the loop to 10: it means you use only 10 events in the array (you do several loops, always on the same 10), while the other 90 are wasted, unused space.

I know it's ugly, but I wanted the flexibility of changing NB_PAGE_LOOP at runtime without a rebuild every time. (Ok, the Fortran rebuilds should not be too slow - it is the C++/CUDA MEs that are much slower to rebuild - but still.)

In the C++ standalone we have a single parameter (as if NB_PAGE_LOOP=NB_PAGE_MAX) because you do a malloc with the right size. Maybe in Fortran one could do something similar, but it would become much more complex...

oliviermattelaer commented 2 years ago

My question is more: what happens if the vector size of the CPU is 4 but you do not pass a multiple of 4? The compiler will need an extra loop to handle the remainder, which is in principle not needed if you have the correct multiple hardcoded within the code.
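For concreteness, the concern is roughly the following (an illustrative sketch only, not the generated code):

```cpp
// If n is not known to be a multiple of the SIMD width, the compiler emits the
// vectorised body plus a scalar "remainder" loop for the last n % width
// iterations; with a suitable multiple hardcoded, only the vector body is needed.
void scale( float* out, const float* in, float w, int n )
{
  for( int i = 0; i < n; ++i ) // vector body + possible scalar remainder
    out[i] = w * in[i];
}
```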

At runtime you know NB_PAGE_LOOP, which must be less than or equal to NB_PAGE_MAX. You can set the max to 100 and the loop to 10: it means you use only 10 events in the array (you do several loops, always on the same 10), while the other 90 are wasted, unused space.

Ok, we will need to discuss if we really want to do that (since the point is to be able to bypass Fortran anyway). I understand that it is useful for tests, but maybe we can keep it for tests only...

I actually started to split nb_page in two ... but in a fully different way: I was planning to replace nb_page by block_size and warp_size, such that I ensure lockstep on the warp_size even if the memory for the vectorization is set on block_size (which is a step in order to prepare for GPUs).

valassi commented 2 years ago

My question is more: what happens if the vector size of the CPU is 4 but you do not pass a multiple of 4? The compiler will need an extra loop to handle the remainder, which is in principle not needed if you have the correct multiple hardcoded within the code.

Ah, ok, maybe I see what you mean, but this is handled elsewhere. In the "pure Fortran/MadEvent" handling of NB_PAGE_LOOP and NB_PAGE_MAX there is no check like the one you describe; I think I only check NB_PAGE_LOOP <= NB_PAGE_MAX.

Conversely, the place where some checks are indeed done that NB_PAGE_LOOP is a sensible number (a multiple of two, etc.) is in the Bridge and other C++ classes. For instance, here you check that nevt (NB_PAGE_LOOP) is at least 32 on GPUs and is a multiple of 32: https://github.com/madgraph5/madgraph4gpu/blob/05363606836c1ca7fe545d214fd537724e22cbe8/epochX/cudacpp/gg_tt.mad/SubProcesses/Bridge.h#L192 And here you check that it is a multiple of neppM, which is used for vectorization (hm, maybe it should be neppV, but I think they are the same now): https://github.com/madgraph5/madgraph4gpu/blob/05363606836c1ca7fe545d214fd537724e22cbe8/epochX/cudacpp/gg_tt.mad/SubProcesses/MatrixElementKernels.cc#L37
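For reference, a rough paraphrase of the kind of check those two links point to (not a verbatim copy; the exact constants and messages may differ):

```cpp
// Sketch of the runtime sanity checks on nevt (NB_PAGE_LOOP) in the C++/CUDA layer.
#include <sstream>
#include <stdexcept>

void sanityCheckNevt( int nevt, bool onGpu, int neppV /* SIMD event page size */ )
{
  if( onGpu )
  {
    if( nevt < 32 || nevt % 32 != 0 )
      throw std::runtime_error( "nevt must be at least 32 and a multiple of 32 on GPU" );
  }
  else if( nevt % neppV != 0 )
  {
    std::ostringstream msg;
    msg << "nevt " << nevt << " is not a multiple of neppV " << neppV;
    throw std::logic_error( msg.str() );
  }
}
```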

Ok, we will need to discuss if we really want to do that (since the point is to be able to bypass Fortran anyway). I understand that it is useful for tests, but maybe we can keep it for tests only...

Sure, I agree: having NB_PAGE_LOOP different from NB_PAGE_MAX is maybe not something for production. I still think it may be better to keep two separate variables with different meanings (max for the array allocation, loop for how many events are actually used), and then have the top-level driver impose that they are the same (or not, in test mode). Do you think that's a problem?

I actually started to split nb_page in two ... but in a fully different way: I was planning to replace nb_page by block_size and warp_size, such that I ensure lockstep on the warp_size even if the memory for the vectorization is set on block_size (which is a step in order to prepare for GPUs).

I am not sure I would go that way. For the moment, as explained above, Fortran (MadEvent) knows nothing about vectorization or GPUs, and does not need to know anything. It is only the C++ classes that handle this, and in any case they are already full of cross-checks and/or assumptions here and there.

Maybe the thing that goes most in the direction of what you mention is the Bridge class. Here, for instance, I hardcoded 256 threads per block on the GPU, and all the rest comes from there: https://github.com/madgraph5/madgraph4gpu/blob/05363606836c1ca7fe545d214fd537724e22cbe8/epochX/cudacpp/gg_tt.mad/SubProcesses/Bridge.h#L174 The only thing I would change is that maybe we could make blocks and threads user configurable, but Fortran/MadEvent would only need to handle the reading of command line input. What I mean is that there is no need for different algorithms in Fortran/MadEvent depending on gpublocks. And maybe eventually we can even go further (I think we had discussed this), e.g. reading a configuration file for the number of threads/blocks on the GPU, and maybe even the number of threads on the CPU, and all this reading can be done in C++.
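To illustrate that point (a sketch only; the function and parameter names are assumptions, the real logic lives in Bridge.h): the GPU grid can be derived entirely on the C++ side from the number of events passed in by Fortran/MadEvent.

```cpp
// Sketch: derive the CUDA grid from nevt with a hardcoded 256 threads per block.
#include <stdexcept>

struct GpuGrid { int gpublocks; int gputhreads; };

GpuGrid gridFromNevt( int nevt, int gputhreads = 256 )
{
  if( nevt % gputhreads != 0 )
    throw std::runtime_error( "nevt must be a multiple of the threads per block" );
  return GpuGrid{ nevt / gputhreads, gputhreads }; // MadEvent never sees these numbers
}
```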

In MadEvent, you really only need to know how many events you are going to pass to the Bridge at the same time; then the Bridge (in C++/CUDA) takes care of everything else. Would you agree, or would you do this differently?

Thanks! Andrea

valassi commented 2 years ago

I am reopening it, just so that we can use it for discussion.

oliviermattelaer commented 2 years ago

In MadEvent, you really only need to know how many events you are going to pass to the Bridge at the same time; then the Bridge (in C++/CUDA) takes care of everything else. Would you agree, or would you do this differently?

I need to know the warp size (i.e. the number of consecutive events that need to be executed without thread divergence). Without that information, the matrix element and the multi-channel factor might/will differ between two consecutive events, which will be problematic.

Now, it is true that I do not need to know the block size, but rather the number of events that I have to keep in memory (i.e. block_size*nb_block).

valassi commented 2 years ago

I need to know the warp size (i.e. the number of consecutive events that need to be executed without thread divergence). Without that information, the matrix element and the multi-channel factor might/will differ between two consecutive events, which will be problematic.

Now, it is true that I do not need to know the block size, but rather the number of events that I have to keep in memory (i.e. block_size*nb_block).

I am a bit confused: is this not NB_PAGE_LOOP already? I mean, in the "worst" case (the one with the largest number of events in parallel), you are on the GPU, and you may have NB_PAGE_LOOP between 8k and 524k approximately. For all these events, you offload them all at once from MadEvent and you get them all back at once into MadEvent. What I mean is, whether the GPU internally uses several blocks or threads (e.g. whether 8k is 8 times 1024 or 16 times 512) is not even visible from MadEvent; it should not have any impact on what you do in MadEvent.