madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

master_june24: why is WARP_SIZE the same as VECSIZE_MEMMAX? (and document WARP_SIZE in vector.inc) #887

Open valassi opened 5 days ago

valassi commented 5 days ago

Hi @oliviermattelaer and @roiser,

in debugging #885 (in WIP PR #882) I realised that the code I generate out of the box has WARP_SIZE equal to VECSIZE_MEMMAX, with NB_WARP=1 hardcoded. In particular WARP_SIZE and VECSIZE_MEMMAX seem to be both controlled by vector_size in the runcards.

I am very surprised by this. I thought that on a GPU one would for instance use VECSIZE_MEMMAX=16384, while keeping WARP_SIZE=32. Actually, I thought that WARP_SIZE=32 would need to be hardcoded (this is the typical spec on an Nvidia GPU).
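To make the expectation concrete, here is a minimal sketch of the relationship I would have assumed (illustrative only; the names mirror vector.inc, but the numbers are my assumption, not what the current code does):

```python
# Hypothetical sketch of the expected GPU layout (assumption, not current code):
# WARP_SIZE is fixed by the hardware (32 on an Nvidia GPU), and the total
# event buffer VECSIZE_MEMMAX is split into NB_WARP warps of that size.
WARP_SIZE = 32                          # hardware lockstep width on Nvidia
VECSIZE_MEMMAX = 16384                  # e.g. events processed per GPU grid
NB_WARP = VECSIZE_MEMMAX // WARP_SIZE   # derived, not the other way around
assert NB_WARP * WARP_SIZE == VECSIZE_MEMMAX
print(NB_WARP)  # 512
```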

As for NB_WARP, I do not understand what this means.

Can you please explain which values WARP_SIZE and NB_WARP should have, and how this functionality can be tested?

(On top of this, note that the actually used VECSIZE_USED can be lower than VECSIZE_MEMMAX. The crash in #885 comes from the fact that this does not seem to be handled correctly now).
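The MEMMAX/USED distinction I have in mind could be sketched as follows (my assumption about the intended semantics, not the current cudacpp implementation): buffers are allocated once at VECSIZE_MEMMAX, but only the first VECSIZE_USED slots are filled and processed in a given call.

```python
# Hypothetical illustration of VECSIZE_MEMMAX vs VECSIZE_USED (assumption,
# not the current code): allocate at the maximum, loop over the used size.
VECSIZE_MEMMAX = 16384   # allocation size of the multi-event buffers
VECSIZE_USED = 8192      # events actually filled in this call (runtime choice)
assert VECSIZE_USED <= VECSIZE_MEMMAX  # the invariant that #885 seems to break

momenta = [None] * VECSIZE_MEMMAX      # allocated once at MEMMAX
for ievt in range(VECSIZE_USED):       # only the USED entries are processed
    momenta[ievt] = ("event", ievt)
assert momenta[VECSIZE_USED:] == [None] * (VECSIZE_MEMMAX - VECSIZE_USED)
```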

Thanks Andrea

PS This is related to, and largely overlaps with, #765. But it is a question specifically about what exists now in master_june24. I would like to understand how this is supposed to work and be tested.

valassi commented 5 days ago

Some specific questions from looking at the code.

I see this in the runcard

   16 = vector_size ! size of fortran arrays allocated in the multi-event API for SIMD/GPU (VECSIZE_MEMMAX)
   1 = nb_warp ! total number of warp/frontwave

And I see this in banner.py

        self.add_param('vector_size', 1, include='vector.inc', hidden=True, comment='lockstep size for parralelism run', 
                       fortran_name='WARP_SIZE', fct_mod=(self.reset_simd,(),{}))
        self.add_param('nb_warp', 1, include='vector.inc', hidden=True, comment='number of warp for parralelism run', 
                       fortran_name='NB_WARP', fct_mod=(self.reset_simd,(),{}))
        self.add_param('vecsize_memmax', 0, include='vector.inc', system=True)
...
        self['vecsize_memmax'] = self['nb_warp'] * self['vector_size']       
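To make the mapping explicit, here is a minimal paraphrase of the logic quoted above (my re-implementation for illustration, not the real banner.py class): vector_size is written out under the fortran name WARP_SIZE, and vecsize_memmax is derived as nb_warp * vector_size, so with the default nb_warp=1 the two values coincide.

```python
# Minimal paraphrase of the quoted banner.py logic (illustration only):
# 'vector_size' maps to the fortran name WARP_SIZE, and vecsize_memmax is
# computed as nb_warp * vector_size.
card = {'vector_size': 16, 'nb_warp': 1}      # the run_card defaults above
fortran = {'WARP_SIZE': card['vector_size'],  # fortran_name='WARP_SIZE'
           'NB_WARP': card['nb_warp']}        # fortran_name='NB_WARP'
card['vecsize_memmax'] = card['nb_warp'] * card['vector_size']
# With nb_warp=1, WARP_SIZE and VECSIZE_MEMMAX are indeed identical:
assert fortran['WARP_SIZE'] == card['vecsize_memmax'] == 16
```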

I think that the comments in the run_card are clearly wrong: vector_size says it is associated with VECSIZE_MEMMAX, but it is not!!

I would suggest to

oliviermattelaer commented 5 days ago

Hi Andrea,

For sure, we should update the template of the run_card (while the help message from banner.py is actually correct)

Now about the naming: this is something that we already discussed (I think), since I believe that it was originally set to wrap_size and that you or Stefan asked me to change it... But anyway we can change it in any case.

Unfortunately, I do not have a perfect naming scheme that covers OpenMP, SIMD and GPU nicely. But yes, I'm also in favour of changing it to wrap_size (which is the name at the fortran level)

valassi commented 4 days ago

Now about the naming: this is something that we already discussed (I think), since I believe that it was originally set to wrap_size and that you or Stefan asked me to change it... But anyway we can change it in any case.

Hi @oliviermattelaer thanks a lot for the feedback.

(I do not remember all discussions on this in the past, apologies if I made you take a wrong direction).

But yes, I'm also in favour of changing it to wrap_size (which is the name at the fortran level)

Yes, but then please WARP not WRAP :-)

Can you please explain which values WARP_SIZE and NB_WARP should have, and how this functionality can be tested?

(This was my question above)

Can you confirm that we are supposed to have NB_WARP>1 for any of this work to make sense?

For instance, is NB_WARP=512 with WARP_SIZE=32, i.e. VECSIZE_MEMMAX=16384, a reasonable choice? (@roiser said in https://github.com/madgraph5/madgraph4gpu/issues/888#issuecomment-2210714590 that he used 256 and 32, which is similar, so I guess that should be ok.)

And can you confirm how this is supposed to work normally (forget USED for now), i.e. why do you need warps here? Do you have warps ONLY so that all events in one warp have the same channelid? Or do you need the concept of warps for something else?

In this example, is this what is supposed to happen?

(For SIMD I think we had agreed there would be a check that all channels in the same SIMD vector would be the same? Is this implemented?)

I guess CUDACPP_RUNTIME_VECSIZEUSED is not well propagated/handled somewhere?

(This is from https://github.com/madgraph5/madgraph4gpu/issues/885#issuecomment-2209381868)

Let's discuss the case VECSIZE_USED now.

I would imagine the following, can you confirm?

Is the above correct?

And is the above implemented? If it is not, this explains the crash #885. And it is a blocker as far as I am concerned.

Speaking of which, @roiser @oliviermattelaer, how did you test the code in master_june24?

- am I supposed to use a different input.txt file to pipe to madevent to specify a range of iconfig's, or will the current one with a single iconfig value be enough?

- if I am supposed to use the same input.txt with a single iconfig (by looking at driver.f, which has not changed, I would guess this is the case), can you confirm that the code will still exercise the new functionality you have created and produce a channelid array with different values, or will this result in a channelid array whose entries all have the same value?

(@oliviermattelaer for my information, not directly or immediately relevant for tests: is the madevent fortran/python/bash infrastructure to orchestrate fewer G* jobs with many channels per job complete, or is this still under development?)

(and also for my information, if I should have issues in the code: do I remember correctly that a channelid array, e.g. of 32k entries, will be segmented such that inside each 32-entry warp the channelid is the same, but different warps can have different channelids? or did you eventually modify this logic?)
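The segmentation I am describing could be sketched as follows (my understanding of the intended layout, to be confirmed, with a small NB_WARP instead of 512 for readability): the channelid array is constant inside each WARP_SIZE-wide block, but may differ between blocks.

```python
# Hypothetical sketch of a warp-segmented channelid array (my understanding
# of the intended layout, to be confirmed): the channelid is constant within
# each warp, but possibly different across warps.
WARP_SIZE = 32
NB_WARP = 4                              # small example instead of 512
iconfigs = [1, 2, 1, 3]                  # one channel/iconfig per warp
channelid = [iconfigs[iw] for iw in range(NB_WARP) for _ in range(WARP_SIZE)]

# Check the invariant: all entries inside one warp are identical.
for iw in range(NB_WARP):
    warp = channelid[iw * WARP_SIZE : (iw + 1) * WARP_SIZE]
    assert len(set(warp)) == 1
assert len(channelid) == NB_WARP * WARP_SIZE  # 128 events in total
```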

(This instead is a related question from https://github.com/madgraph5/madgraph4gpu/pull/882#issue-2390983670)

Can you answer the points above? In particular, if I understand correctly that for instance the first warp has ICONFIG=1 but the second warp has ICONFIG=2, how can I test this? Is this implemented yet or not?

Thanks Andrea