madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

master_june24: why is WARP_SIZE the same as VECSIZE_MEMMAX? (and document WARP_SIZE in vector.inc) #887

Open valassi opened 5 days ago

valassi commented 5 days ago

Hi @oliviermattelaer and @roiser,

in debugging #885 (in WIP PR #882) I realised that the code I generate out of the box has WARP_SIZE equal to VECSIZE_MEMMAX, with NB_WARP=1 hardcoded. In particular WARP_SIZE and VECSIZE_MEMMAX seem to be both controlled by vector_size in the runcards.

I am very surprised by this. I thought that on a GPU one would for instance use VECSIZE_MEMMAX=16384, while keeping WARP_SIZE=32. Actually, I thought that WARP_SIZE=32 would need to be hardcoded (this is the typical spec on an Nvidia GPU).
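To make the expectation concrete, here is a minimal sketch of the relationship I would have assumed (illustrative only; the names mirror vector.inc, but the numbers are my assumption, not what the current code does):

```python
# Hypothetical sketch of the expected GPU layout (assumption, not current code):
# WARP_SIZE is fixed by the hardware (32 on an Nvidia GPU), and the total
# event buffer VECSIZE_MEMMAX is split into NB_WARP warps of that size.
WARP_SIZE = 32                          # hardware lockstep width on Nvidia
VECSIZE_MEMMAX = 16384                  # e.g. events processed per GPU grid
NB_WARP = VECSIZE_MEMMAX // WARP_SIZE   # derived, not the other way around
assert NB_WARP * WARP_SIZE == VECSIZE_MEMMAX
print(NB_WARP)  # 512
```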

As for NB_WARP, I do not understand what this means.

Can you please explain which values WARP_SIZE and NB_WARP should have, and how this functionality can be tested?

(On top of this, note that the actually used VECSIZE_USED can be lower than VECSIZE_MEMMAX. The crash in #885 comes from the fact that this does not seem to be handled correctly now).
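The MEMMAX/USED distinction I have in mind could be sketched as follows (my assumption about the intended semantics, not the current cudacpp implementation): buffers are allocated once at VECSIZE_MEMMAX, but only the first VECSIZE_USED slots are filled and processed in a given call.

```python
# Hypothetical illustration of VECSIZE_MEMMAX vs VECSIZE_USED (assumption,
# not the current code): allocate at the maximum, loop over the used size.
VECSIZE_MEMMAX = 16384   # allocation size of the multi-event buffers
VECSIZE_USED = 8192      # events actually filled in this call (runtime choice)
assert VECSIZE_USED <= VECSIZE_MEMMAX  # the invariant that #885 seems to break

momenta = [None] * VECSIZE_MEMMAX      # allocated once at MEMMAX
for ievt in range(VECSIZE_USED):       # only the USED entries are processed
    momenta[ievt] = ("event", ievt)
assert momenta[VECSIZE_USED:] == [None] * (VECSIZE_MEMMAX - VECSIZE_USED)
```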

Thanks Andrea

PS This is related to, and largely overlaps with, #765. But it is a question specifically about what exists now in master_june24. I would like to understand how this is supposed to work and be tested.

valassi commented 5 days ago

Some specific questions from looking at the code.

I see this in the runcard

   16 = vector_size ! size of fortran arrays allocated in the multi-event API for SIMD/GPU (VECSIZE_MEMMAX)
   1 = nb_warp ! total number of warp/frontwave

And I see this in banner.py

        self.add_param('vector_size', 1, include='vector.inc', hidden=True, comment='lockstep size for parralelism run', 
                       fortran_name='WARP_SIZE', fct_mod=(self.reset_simd,(),{}))
        self.add_param('nb_warp', 1, include='vector.inc', hidden=True, comment='number of warp for parralelism run', 
                       fortran_name='NB_WARP', fct_mod=(self.reset_simd,(),{}))
        self.add_param('vecsize_memmax', 0, include='vector.inc', system=True)
...
        self['vecsize_memmax'] = self['nb_warp'] * self['vector_size']       
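To make the mapping explicit, here is a minimal paraphrase of the logic quoted above (my re-implementation for illustration, not the real banner.py class): vector_size is written out under the fortran name WARP_SIZE, and vecsize_memmax is derived as nb_warp * vector_size, so with the default nb_warp=1 the two values coincide.

```python
# Minimal paraphrase of the quoted banner.py logic (illustration only):
# 'vector_size' maps to the fortran name WARP_SIZE, and vecsize_memmax is
# computed as nb_warp * vector_size.
card = {'vector_size': 16, 'nb_warp': 1}      # the run_card defaults above
fortran = {'WARP_SIZE': card['vector_size'],  # fortran_name='WARP_SIZE'
           'NB_WARP': card['nb_warp']}        # fortran_name='NB_WARP'
card['vecsize_memmax'] = card['nb_warp'] * card['vector_size']
# With nb_warp=1, WARP_SIZE and VECSIZE_MEMMAX are indeed identical:
assert fortran['WARP_SIZE'] == card['vecsize_memmax'] == 16
```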

I think that the comments in the run_card are clearly wrong: vector_size says it is associated with VECSIZE_MEMMAX, but it is not!!

I would suggest to

oliviermattelaer commented 5 days ago

Hi Andrea,

For sure, we should update the template of the run_card (while the help message from banner.py is actually correct)

Now about the naming: this is something that we already discussed (I think), since I believe that it was originally set to wrap_size and that you or Stefan asked me to change it... But anyway we can change it in any case.

Unfortunately, I do not have a perfect naming scheme that covers OpenMP, SIMD and GPU nicely. But yes, I'm also in favour of changing it to wrap_size (which is the name at the fortran level)

valassi commented 4 days ago

Now about the naming: this is something that we already discussed (I think), since I believe that it was originally set to wrap_size and that you or Stefan asked me to change it... But anyway we can change it in any case.

Hi @oliviermattelaer thanks a lot for the feedback.

(I do not remember all discussions on this in the past, apologies if I made you take a wrong direction).

But yes, I'm also in favour of changing it to wrap_size (which is the name at the fortran level)

Yes, but then please WARP not WRAP :-)

Can you please explain which values WARP_SIZE and NB_WARP should have, and how this functionality can be tested?

(This was my question above)

Can you confirm that we are supposed to have NB_WARP>1 for any of this work to make sense?

For instance, is NB_WARP=512 with WARP_SIZE=32, i.e. VECSIZE_MEMMAX=16384, a reasonable choice? (@roiser said in https://github.com/madgraph5/madgraph4gpu/issues/888#issuecomment-2210714590 that he used 256 and 32, which is similar, so I guess that should be ok.)

And can you confirm how this is supposed to work normally (forget USED for now), i.e. why do you need warps here? Do you have warps ONLY so that all events in one warp have the same channelid? Or do you need the concept of warps for something else?

In this example, is this what is supposed to happen?

(For SIMD I think we had agreed there would be a check that all channels in the same SIMD vector would be the same? Is this implemented?)

I guess CUDACPP_RUNTIME_VECSIZEUSED is not well propagated/handled somewhere?

(This is from https://github.com/madgraph5/madgraph4gpu/issues/885#issuecomment-2209381868)

Let's discuss the case VECSIZE_USED now.

I would imagine the following, can you confirm?

Is the above correct?

And is the above implemented? If it is not, this explains the crash #885. And it is a blocker as far as I am concerned.

Speaking of which, @roiser @oliviermattelaer, how did you test the code in master_june24?

- am I supposed to use a different input.txt file to pipe to madevent to specify a range of iconfig's, or will the current one with a single iconfig value be enough?

- if I am supposed to use the same input.txt with a single iconfig (by looking at driver.f, which has not changed, I would guess this is the case), can you confirm that the code will still exercise the new functionality you have created and produce a channelid array with different values, or will this result in a channelid array whose entries all have the same value?

(@oliviermattelaer for my information, not directly or immediately relevant for tests: is the madevent fortran/python/bash infrastructure to orchestrate fewer G* jobs with many channels per job complete, or is this still under development?)

(and also for my information, if I should have issues in the code: do I remember correctly that a channelid array, e.g. of 32k entries, will be segmented such that inside each 32-entry warp the channelid is the same, but different warps can have different channelids? or did you eventually modify this logic?)
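The segmentation I am describing could be sketched as follows (my understanding of the intended layout, to be confirmed, with a small NB_WARP instead of 512 for readability): the channelid array is constant inside each WARP_SIZE-wide block, but may differ between blocks.

```python
# Hypothetical sketch of a warp-segmented channelid array (my understanding
# of the intended layout, to be confirmed): the channelid is constant within
# each warp, but possibly different across warps.
WARP_SIZE = 32
NB_WARP = 4                              # small example instead of 512
iconfigs = [1, 2, 1, 3]                  # one channel/iconfig per warp
channelid = [iconfigs[iw] for iw in range(NB_WARP) for _ in range(WARP_SIZE)]

# Check the invariant: all entries inside one warp are identical.
for iw in range(NB_WARP):
    warp = channelid[iw * WARP_SIZE : (iw + 1) * WARP_SIZE]
    assert len(set(warp)) == 1
assert len(channelid) == NB_WARP * WARP_SIZE  # 128 events in total
```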

(This instead is a related question from https://github.com/madgraph5/madgraph4gpu/pull/882#issue-2390983670)

Can you answer the points above? In particular, if I understand correctly that for instance the first warp has ICONFIG=1 but the second warp has ICONFIG=2, how can I test this? Is this implemented yet or not?

Thanks Andrea