madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

splitting vecsize_memmax into wrap_size and nb_wrap #765

Open oliviermattelaer opened 1 year ago

oliviermattelaer commented 1 year ago

status of the implementation

Other Validation

oliviermattelaer commented 1 year ago

point to check so far

oliviermattelaer commented 1 year ago

@roiser: I have tested a scan on nb_wrap with the current fortran implementation: (here wrap_size=32)

| grid size | nb_wrap | cross-section (p p > e+ e-) | timing survey | timing refine |
|---|---|---|---|---|
| 32 | 1 | 844.9 +- 1.978 pb | 1s | 12s |
| 320 | 10 | 842.8 +- 1.973 pb | 2s | 10s |
| 3200 | 100 | 844.6 +- 1.055 pb | 4s | 34s |
| 8000 | 250 | 843.2 +- 1.028 pb | 5s | 47s |
| 16000 | 500 | 902.9 +- 31.41 pb | 7s | 69s |
| 32,000 | 1000 | 841.9 +- 1.274 pb | 16s | 0s |
| 320,000 | 10,000 | 776.9 +- 52.13 pb | 69s | 0s |
| 3,200,000 | 100,000 | not enough memory | | |

So this worries me a bit; it looks like at some point we have an issue like overfitting (maybe this is also related to the fact that I picked a simple process). In principle this has nothing to do with the wrap_size but rather with the grid size, which automatically increases the number of events per iteration (for the first few iterations).

Did you try something like that on GPU with the master branch? I would suggest avoiding grid sizes > 10k for the moment.

roiser commented 1 year ago

Hi @oliviermattelaer, here are the numbers for the GPU (A100). I have redone your numbers in the first table, and then added a second table with more optimized ones, IIUC, using vector sizes that are multiples of 108 (the number of SMs on the A100).

The timings in seconds are, to my mind, not conclusive, but I also counted the number of refine jobs that are run, which goes to 0 from a certain stage on. Could this be the same indication of overfitting?

| vector size | cross-section (p p > e+ e-) | timing survey | timing refine | # refine jobs |
|---|---|---|---|---|
| 32 | 846.1 +- 2.143 pb | 3s | 45s | 14 |
| 320 | 842.6 +- 1.857 pb | 3s | 18s | 13 |
| 3200 | 845.1 +- 1.193 pb | 7s | 27s | 11 |
| 8000 | 843.7 +- 1.13 pb | 8s | 35s | 11 |
| 16000 | 845 +- 0.9486 pb | 13s | 53s | 11 |
| 32000 | 843.1 +- 0.9455 pb | 23s | 53s | 6 |
| 40000 | 842.9 +- 1.103 pb | 29s | 29s | 0 |
| 320,000 | not enough memory | | | |

| vector size | cross-section (p p > e+ e-) | timing survey | timing refine | # refine jobs |
|---|---|---|---|---|
| 3456 | 844.7 +- 1.226 pb | 7s | 27s | 11 |
| 6912 | 843.8 +- 1.09 pb | 9s | 33s | 11 |
| 13824 | 843.2 +- 1.252 pb | 13s | 28s | 6 |
| 27648 | 842 +- 1.044 pb | 20s | 47s | 6 |
| 31104 | 843.6 +- 1.032 pb | 22s | 51s | 6 |
| 34560 | 843.5 +- 1.243 pb | 24s | 24s | 0 |
| 38016 | 842.7 +- 1.123 pb | 26s | 26s | 0 |
| 41472 | 841.9 +- 1.077 pb | 29s | 29s | 0 |
| 55296 | not enough memory | | | |
oliviermattelaer commented 1 year ago

Those are much better numbers than the ones I got. This is likely good news, meaning that the issue is probably related to my wrap implementation and not to real over-fitting. So I will need to investigate this to understand and (more importantly) fix it. Thanks for the check.

roiser commented 1 year ago

I would not give too much importance to the CPU seconds though; I guess what we are mostly measuring there is the fortran part.

oliviermattelaer commented 1 year ago

I agree about the timing; the critical point here is the fact that you have a stable cross-section, which I do not have at all.

PS: I have pushed the change for having the list of channels (with the name channels, with an s). Remember that you need to update both the plugin (new_interface_wrap) and the MG5aMC interface (gpucpp_wrap). Thanks.

valassi commented 11 months ago

Hi @oliviermattelaer @roiser, thanks for all your work on this. Just a couple of points from a discussion with Stefan, since we were trying to remember what had been agreed in the past and I have been unable to find any notes about that.

First Olivier a quick point: eventually can you please change all "wrap" into "warp" in the code? Even today I got confused again what wrap is! ;-)

Then about the main issue: my main question is what cudacpp should do internally.

Thanks... just trying to refresh my memory. Andrea

oliviermattelaer commented 11 months ago

Hi,

   First Olivier a quick point: eventually can you please change all "wrap" into "warp" in the code? Even today I got confused again what wrap is! ;-)

consider it as done (just need to do a git push)

      if( channelId == 1 ) numerators_sv += cxabs2( amp_sv[0] );
      if( channelId != 0 ) denominators_sv += cxabs2( amp_sv[0] );
into something like

      channelids_sv = CHANNEL_ACCESS::kernelAccess( pchannelIds ); // the 4 channels in the SIMD vector
      bool_sv mask_sv = ( channelids_sv == 1 );
      numerators_sv += mask_sv * cxabs2( amp_sv[0] );
      if( pchannelIds != nullptr ) denominators_sv += cxabs2( amp_sv[0] );

Please put comments around those lines, since you pass from easy-to-read code to code that I simply do not understand (which I'm fine with, obviously, but it would be nice to keep readability wherever you have those cryptic accessors).

The next question is about performance: are we not shooting ourselves in the foot with such complicated accessors? But in any case, yes, that was the point.

  *   Given the above, there is no need to pass warp_size from fortran to cudacpp... and I think we had agreed on this, right?

This is indeed our current strategy/agreement.

But clearly it has some drawbacks, which I quote here for reference:

  1. You transfer a lot of useless data from CPU to GPU (since each event carries its own information rather than each warp).
  2. You could have put those values in shared memory on the GPU (reducing register pressure), which would have been a natural fit.
  3. You need to implement this masking procedure (which we hope will never be needed in practice).

While the advantages are: 1) an easier interface from fortran to cpp; 2) the possibility to compile for multiple hardware and be bulletproof for future hardware (if the warp_size is too large, this ensures the MC is not biased); 3) the possibility to not trust the fortran.
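
For concreteness, a minimal sketch of the two layouts being contrasted above (the names and sizes are purely illustrative, not the actual fortran/cudacpp interface):

    #include <vector>
    int main()
    {
      const int nevt = 16384;   // hypothetical number of events in one cudacpp call
      const int warp_size = 32; // the warp_size discussed above
      // Current agreement: one channel ID per event, i.e. nevt values are passed from fortran
      std::vector<unsigned int> channelIdPerEvent( nevt );
      // Alternative behind drawback 1: one channel ID per warp, i.e. only nevt/warp_size values,
      // since all events within one warp share the same channel
      std::vector<unsigned int> channelIdPerWarp( nevt / warp_size );
      return 0;
    }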

roiser commented 11 months ago

Oh, I may have completely misunderstood you then, @oliviermattelaer. Are you saying you are passing in a vector of size [n events / warp size] instead of [n events]? I remember we discussed that, but I thought we had agreed on n events. Either is of course fine with me.

I have for now developed everything in the branch to pass the vector in up to calculate_wavefunctions; now I need to handle the individual array entries (so e.g. the code that we discussed in a previous meeting with masking and that @valassi has spelled out above).

oliviermattelaer commented 11 months ago

Hi @roiser,

Are you saying you are passing in a vector of size [n events / warp size] instead of [n events]?

I do pass a vector of size [n events]. So this is as you remembered/agreed; sorry if that was not clear.

roiser commented 11 months ago

No, I do pass in a vector of n events (NOT n events / warp size), ok then I may have misunderstood you from the comment above. All good, thanks; sorry for my misunderstanding, reading it again it's all clear.

valassi commented 11 months ago

No, I do pass in a vector of n events (NOT n events / warp size), ok then I may have misunderstood you from the comment above. All good, thanks; sorry for my misunderstanding, reading it again it's all clear.

Thanks Olivier and Stefan!

Stefan, I think Olivier means that an alternative to passing n events would be to pass n_warp (n_events/warp_size) events. But this is not what we had agreed (we agreed to pass n events).

Olivier, I see your point: indeed we might pass less data around if we passed n_warp channels only. The disadvantage, however, in my opinion, is that the computation becomes more complex in cudacpp, as we need to introduce the concept of a warp. Yes, it's debatable; maybe the performance gain would be large. Anyway, I suggest we stick to our original plan: we implement the functionality (i.e. the possibility to have different channels in different events) and then we reconsider later on?

consider it as done (just need to do a git push)

Thanks! (for wrap->warp)

Please put comments around those lines, since you pass from easy-to-read code to code that I simply do not understand (which I'm fine with, obviously, but it would be nice to keep readability wherever you have those cryptic accessors).

Yes good point, TODO Andrea put some comments.

The next question is about performance: are we not shooting ourselves in the foot with such complicated accessors?

Listen, maybe yes. Again, up for review later on... but as this is done elsewhere, I would keep the same style everywhere, and then change everything if/when we decide to change it. Personally I think that the readability is ~ok (yes, I should add comments). What I find less nice is that the use of templates all over the place ends up exploding build times, I fear.

My suggestion on this is: if we model the accessors differently in the future, let's follow the C++ standard parallelism model. From what I understood of the recent seminar, the model is a bit similar: there is mdspan, which takes a non-owning pointer plus some metadata defining an accessor that depends on the chosen layout. See https://en.cppreference.com/w/cpp/container/mdspan. In principle this should eventually work for GPUs too. Anyway, about this one too, let's rediscuss later, ok?

valassi commented 11 months ago

My suggestion on this is: if we model the accessors differently in the future, let's follow the C++ standard parallelism model. From what I understood of the recent seminar, the model is a bit similar: there is mdspan, which takes a non-owning pointer plus some metadata defining an accessor that depends on the chosen layout. See https://en.cppreference.com/w/cpp/container/mdspan. In principle this should eventually work for GPUs too. Anyway, about this one too, let's rediscuss later, ok?

PS: As the speaker had mentioned, this has close links to Kokkos, or is even modeled on it. Just for reference, this is a tutorial I found from Kokkos about mdspan: https://github.com/kokkos/mdspan/wiki/A-Gentle-Introduction-to-mdspan
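
Just to make the mdspan idea concrete, here is a minimal sketch (it needs a C++23 standard library; the 2D shape is purely illustrative and has nothing to do with our memory layouts):

    #include <iostream>
    #include <mdspan>
    #include <vector>
    int main()
    {
      // mdspan is a non-owning view: a raw pointer plus extents (and optionally a layout/accessor policy)
      std::vector<double> buf( 4 * 8, 0. );
      std::mdspan view( buf.data(), 4, 8 ); // view buf as a 4 x 8 array (CTAD deduces the extents)
      view[2, 3] = 42.;                     // C++23 multidimensional subscript
      std::cout << view[2, 3] << std::endl;
      return 0;
    }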

valassi commented 9 months ago

I think at some point I had proposed (in a discussion with @hageboeck too?) that we can simply change this code

      if( channelId == 1 ) numerators_sv += cxabs2( amp_sv[0] );
      if( channelId != 0 ) denominators_sv += cxabs2( amp_sv[0] );

into something like

      channelids_sv = CHANNEL_ACCESS::kernelAccess( pchannelIds ); // the 4 channels in the SIMD vector
      bool_sv mask_sv = ( channelids_sv == 1 );
      numerators_sv += mask_sv * cxabs2( amp_sv[0] );
      if( pchannelIds != nullptr ) denominators_sv += cxabs2( amp_sv[0] );

Hi @roiser @oliviermattelaer , concerning the discussion we had yesterday at the development meeting. I checked that indeed

numerators_sv += mask_sv * cxabs2( amp_sv[0] );

would fail compilation with something like invalid operands to binary * ('__vector(4) long int' and '__vector(4) double').

However there is an extremely simple mechanism that I had already implemented in the code for the ixxxxx SIMD implementation (or the reimplementation after FPE fixes), based on fpternary. It is enough to do something like

constexpr fptype_sv fpZERO_sv{}; // 0000
//numerators_sv += mask_sv * cxabs2( amp_sv[0] );
numerators_sv += fpternary( mask_sv, cxabs2( amp_sv[0] ), fpZERO_sv );

This function fpternary (there is also a cxternary for complex numbers) is simply a vector version of the ternary operator ( bool ? if_true : if_false ). The first argument is a vector mask (or scalar flag); the second and third arguments are fptype_sv (vectors or scalars of FPs). If you use this with a vector of zeros as the third argument, then you are effectively doing what I meant initially by mask_sv * cxabs2( amp_sv[0] ). Actually this is even more correct logically: it is a check on a bool and not a multiplication by a bool; even in the scalar case this is really a ternary operator, not a multiplication.
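
For readers following this thread, here is a self-contained sketch of the idea (a simplification, not the actual mgOnGpuVectors.h code: neppV=4 and double are hardcoded, and only the vector-vector overload of fpternary is shown):

    #include <iostream>
    constexpr int neppV = 4; // SIMD width hardcoded for this sketch only
    typedef double fptype_v __attribute__( ( vector_size( neppV * sizeof( double ) ) ) );
    typedef long int bool_v __attribute__( ( vector_size( neppV * sizeof( long int ) ) ) );
    // Vector version of the ternary operator: out[i] = ( mask[i] ? a[i] : b[i] )
    inline fptype_v fpternary( const bool_v& mask, const fptype_v& a, const fptype_v& b )
    {
      fptype_v out;
      for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b[i] );
      return out;
    }
    int main()
    {
      const fptype_v absamp_sv = { 1., 2., 3., 4. };       // stands in for cxabs2( amp_sv[0] )
      constexpr fptype_v fpZERO_sv{};                      // 0000
      const bool_v channelids_sv = { 1, 2, 1, 3 };         // the 4 channels in the SIMD vector
      const bool_v one_sv = { 1, 1, 1, 1 };
      const bool_v mask_sv = ( channelids_sv == one_sv );  // -1 (true) or 0 (false) in each lane
      fptype_v numerators_sv{};
      numerators_sv += fpternary( mask_sv, absamp_sv, fpZERO_sv ); // expect "1 0 3 0"
      for( int i = 0; i < neppV; i++ ) std::cout << numerators_sv[i] << " ";
      std::cout << std::endl;
      return 0;
    }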

I added a simple test to show this more in detail in testmisc.cc here https://github.com/madgraph5/madgraph4gpu/pull/796/commits/0d2a5081013363b89c9f94edbe152a0171030c3b

roiser commented 9 months ago

Hi @valassi, what I tried to do was to go through the whole chain only with SIMD operations. If I understand your solution correctly, it would loop through the SIMD entries and do operations on each of the entries one by one. Yes, that's a possibility, but my preference would be to stay in SIMD; and from an operational point of view we also only need the channelIDs for the GPU, where the number of parallel operations is huge compared to the CPU.

valassi commented 9 months ago

Hi @valassi, what I tried to do was to go through the whole chain only with SIMD operations. If I understand your solution correctly, it would loop through the SIMD entries and do operations on each of the entries one by one. Yes, that's a possibility, but my preference would be to stay in SIMD; and from an operational point of view we also only need the channelIDs for the GPU, where the number of parallel operations is huge compared to the CPU.

Stefan, we have those fpternary and cxternary all over the place. I spent weeks, months, to get the SIMD implementation working in all its details. In particular fixing the FPEs (#701 #733 #736 etc) for SIMD. And the SIMD gymnastics for the mixed floating point mode may contain even "uglier" stuff. But it WORKS, meaning it is FAST and achieves performance where it is needed.

My suggestion is: let's get the LO release out, COMPLETE with all the SIMD stuff that is there now, including all the for( int i = 0; i < neppV; i++ ) loops it takes (which IMHO are presently mostly harmless, otherwise we would never be seeing factors x16 speedups). Then after the LO release you do whatever you want with this code, and if you really think you want to start using intrinsics, you do it, good luck.

All the SIMD stuff we have so far uses a good pragmatic combination of SIMD compiler vector extensions to get the speed, and of explicit loops in a few places to get the functionality, at a very limited performance cost. This is the design I enforced, and it works. If you want to completely change the design, please do it only after the first LO release, without me. Thanks.

valassi commented 9 months ago

from an operation point of view we also need the channelIDs only for the GPU where the number of parallel operations is huge compared to CPU.

PS My point here is: it would take one line to implement this in SIMD, and it would work out of the box. Therefore I see no reason at all not to implement it also for SIMD. Saying that "this is only really needed for GPU" might be an argument if this were overly complicated and required weeks of work (and even then I might object), but IMO it is a completely irrelevant argument against something that requires one minute of work.

oliviermattelaer commented 9 months ago

Well, I would say that none of the approaches is optimal.

Using non-SIMD instructions when they are not really needed, and when a workaround does exist, is not optimal; neither is the opposite.

To me this is Stefan's choice at this point.

valassi commented 9 months ago

@oliviermattelaer as you want.

Using non-SIMD instructions

Please take this into account: ALL of the code we call "SIMD" in madgraph, ALL of the x16 speedups we get, pass through functions which here and there use explicit loops over vectors, for instance ixxxx. This is needed here and there to ensure the functionality. Saying "we will only do SIMD if all the code has only SIMD instructions" is wishful thinking.

IF I had said "I will not implement ixxxx for SIMD because internally it potentially uses instructions that are not vectorized", THEN we would not have a fully functioning madgraph with x16 speedups, because pieces of the code would be missing. Period.

Therefore, PLEASE, let's just get this LO release out of the way with the existing approach and use it everywhere. Then for NLO I will just disappear. But let's just get LO out of the way and let's have it fully "SIMD" even if it has internal loops. PLEASE.

PS See this

[avalassi@itscrd90 gcc12.1/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> grep neppV gg_tt.mad -r
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:                           fptype_sv* jamp2_sv            // output: jamp2[nParity][ncolor][neppV] for color choice (nullptr if disabled)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    //MemoryBufferWavefunctions w_buffer[nwf]{ neppV };
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ievt00 + iParity * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    for( int ipagV = 0; ipagV < nevt / neppV; ++ipagV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ipagV * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    // Allocate arrays at build time to contain at least 16 events (or at least neppV events if neppV>16, e.g. in future VPUs)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    constexpr int maxtry0 = std::max( 16, neppV ); // 16, but at least neppV (otherwise the npagV loop does not even start)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    // Loop over only nevt events if nevt is < 16 (note that nevt is always >= neppV)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    assert( nevt >= neppV );
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV = maxtry / neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    assert( npagV % 2 == 0 );     // SANITY CHECK for mixed fptypes: two neppV-pages are merged to one 2*neppV-page
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV / 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV;            // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV * 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV; // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:          const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:          const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV = nevt / neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ipagV * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    assert( npagV % 2 == 0 );     // SANITY CHECK for mixed fptypes: two neppV-pages are merged to one 2*neppV-page
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV / 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV;            // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      // (jamp2[nParity][ncolor][neppV] for the SIMD vector - or the two SIMD vectors - of events processed in calculate_wavefunctions)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      fptype_sv MEs_ighel[ncomb] = { 0 };    // sum of MEs for all good helicities up to ighel (for the first - and/or only - neppV page)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      fptype_sv MEs_ighel2[ncomb] = { 0 };   // sum of MEs for all good helicities up to ighel (for the second neppV page)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV * 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV; // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        MEs_ighel2[ighel] = E_ACCESS::kernelAccess( E_ACCESS::ieventAccessRecord( allMEs, ievt00 + neppV ) );
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:          const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ipagV * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      //for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      //  const unsigned int ievt = ipagV * neppV + ieppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = SCALAR ('none': ~vector[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/MemoryAccessMatrixElements.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessGs.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessGs.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC >= neppV );                              // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC % neppV == 0 );                          // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC >= neppV ); // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC % neppV == 0 ); // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 1; i < neppV; i++ ) out << ", " << ( v[i] ? "T" : "F" );
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 1; i < neppV; i++ )
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Cast one non-const fptype_v reference (one vector of neppV fptype values) from one non-const fptype reference (#435),
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Cast one const fptype_v reference (one vector of neppV fptype values) from one const fptype reference,
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Build one fptype_v (one vector of neppV fptype values) from one fptype reference,
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Build one fptype_v (one vector of neppV fptype values) from one fptype reference,
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- Note that neppR is hardcoded and may differ from neppM and neppV on some platforms
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- CPUs: neppM is best set equal to the number of fptype's (neppV) in a vector register
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- However, neppM is now decoupled from neppV (issue #176) and can be separately hardcoded
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- In practice, neppR, neppM and neppV could now (in principle) all be different
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    static constexpr int neppM = MGONGPU_CPPSIMD; // (DEFAULT) neppM=neppV for optimal performance
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    static constexpr int neppM = 1; // (DEFAULT) neppM=neppV for optimal performance (NB: this is equivalent to AOS)
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:      if constexpr( useContiguousEventsIfPossible && ( neppM >= neppV ) && ( neppM % neppV == 0 ) )
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:          return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:          // This does not require buffer alignment, but it requires AOSOA with neppM>=neppV and neppM%neppV==0
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:          return mg5amcCpu::fptypevFromUnalignedArray( out ); // SIMD bulk load of neppV, do not use reinterpret_cast (fewer SIMD operations)
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:        // This does not even require AOSOA with neppM>=neppV and neppM%neppV==0 (e.g. can be used with AOS neppM==1)
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:        return mg5amcCpu::fptypevFromArbitraryArray( decoderIeppv ); // iterate over ieppV in neppV (no SIMD)
gg_tt.mad/SubProcesses/testxxx.cc:    std::cerr << "Floating Point Exception (CPU neppV=" << neppV << "): '" << FPEhandlerMessage << "' ievt=" << FPEhandlerIevt << std::endl;
gg_tt.mad/SubProcesses/testxxx.cc:  assert( nevt % neppV == 0 ); // nevt must be a multiple of neppV
gg_tt.mad/SubProcesses/testxxx.cc:      const int ieppV = ievt % neppV; // #event in the current event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:          const int ieppV = ievt % neppV; // #event in the current event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:          const int ieppV = ievt % neppV; // #event in the current event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:      const int ipagV = ievt / neppV; // #event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:      const fptype* ievt0Momenta = MemoryAccessMomenta::ieventAccessRecordConst( hstMomenta.data(), ipagV * neppV );
gg_tt.mad/src/mgOnGpuVectors.h:  const int neppV = MGONGPU_CPPSIMD;
gg_tt.mad/src/mgOnGpuVectors.h:  // SANITY CHECK: cppAlign must be a multiple of neppV * sizeof(fptype)
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( mgOnGpu::cppAlign % ( neppV * sizeof( fptype ) ) == 0 );
gg_tt.mad/src/mgOnGpuVectors.h:  // SANITY CHECK: check that neppV is a power of two
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( ispoweroftwo( neppV ), "neppV is not a power of 2" );
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype fptype_v __attribute__( ( ext_vector_type( neppV ) ) ); // RRRR
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) ); // RRRR
gg_tt.mad/src/mgOnGpuVectors.h:  const int neppV2 = MGONGPU_CPPSIMD * 2;
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( mgOnGpu::cppAlign % ( neppV2 * sizeof( fptype2 ) ) == 0 );
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( ispoweroftwo( neppV2 ), "neppV2 is not a power of 2" );
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype2 fptype2_v __attribute__( ( ext_vector_type( neppV2 ) ) ); // RRRRRRRR
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype2 fptype2_v __attribute__( ( vector_size( neppV2 * sizeof( fptype2 ) ) ) ); // RRRRRRRR
gg_tt.mad/src/mgOnGpuVectors.h:  typedef long int bool_v __attribute__( ( ext_vector_type( neppV ) ) ); // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  typedef int bool_v __attribute__( ( ext_vector_type( neppV ) ) );                         // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  typedef long int bool_v __attribute__( ( vector_size( neppV * sizeof( long int ) ) ) ); // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  typedef int bool_v __attribute__( ( vector_size( neppV * sizeof( int ) ) ) ); // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  const int neppV = 1;
gg_tt.mad/src/mgOnGpuVectors.h://using mgOnGpu::neppV;
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=1; i<neppV; i++ ) out << ", " << (bool)(v[i]);
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 1; i < neppV; i++ ) out << ", " << v[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 1; i < neppV; i++ ) out << ", " << v[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 1; i < neppV; i++ ) out << ", " << cxmake( v.real()[i], v.imag()[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = fpsqrt( v[i] );
gg_tt.mad/src/mgOnGpuVectors.h:  fpvmake( const fptype v[neppV] )
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=0; i<neppV; i++ ) out[i] = v[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=0; i<neppV; i++ ) out[i] = c;
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b );
gg_tt.mad/src/mgOnGpuVectors.h:    //for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b[i] ); // OLD error-prone depends on "cxtype_ref& operator=( cxtype_ref&& c )"
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = cxtype( mask[i] ? a[i] : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=0; i<neppV; i++ ) out = out || mask[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out = out && mask[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/src/mgOnGpuVectors.h:      out[ieppV+neppV] = v2[ieppV];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/src/mgOnGpuVectors.h:      out[ieppV] = v[ieppV+neppV];
gg_tt.mad/src/mgOnGpuConfig.h:// C++ SIMD vectorization width (this will be used to set neppV)
gg_tt.mad/src/mgOnGpuConfig.h:// For SANITY CHECKS: check that neppR, neppM, neppV... are powers of two (https://stackoverflow.com/a/108360)
valassi commented 9 months ago

And see this

[avalassi@itscrd90 gcc12.1/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> grep ternary gg_tt.mad -r
gg_tt.mad/SubProcesses/testmisc.cc:    fptype_sv numerators0_sv = fpternary( mask0_sv, absamp0_sv, fpZERO_sv ); // equivalent to "mask0_sv * absamp0_sv"
gg_tt.mad/SubProcesses/testmisc.cc:    fptype_sv numerators1_sv = fpternary( mask1_sv, absamp1_sv, fpZERO_sv ); // equivalent to "mask1_sv * absamp1_sv"
gg_tt.mad/src/mgOnGpuVectors.h:  // Functions and operators for bool_v (ternary and masks)
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype_v& a, const fptype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype_v& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype& a, const fptype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype_v& a, const cxtype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype_v& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype& a, const cxtype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool& mask, const fptype& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool& mask, const cxtype& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( b < a ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( b < a ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( b < a ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( a < b ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( a < b ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( a < b ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool& mask, const fptype& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool& mask, const cxtype& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuCxtypes.h:    //__host__ __device__ cxtype_ref& operator=( cxtype_ref&& c ) {...} // REMOVED! Should copy refs or copy values? No longer needed in cxternary
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. );    // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v pp3DENOM = fpternary( pp3 != 0, pp3, 1. ); // hack: pp3DENOM[ieppV]=1 if pp3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                                cxternary( ( pp3 == 0. ),
gg_tt.mad/src/HelAmps_sm.h:      fi[2] = cxternary( mask, fiA_2, fiB_2 );
gg_tt.mad/src/HelAmps_sm.h:      fi[3] = cxternary( mask, fiA_3, fiB_3 );
gg_tt.mad/src/HelAmps_sm.h:      fi[4] = cxternary( mask, fiA_4, fiB_4 );
gg_tt.mad/src/HelAmps_sm.h:      fi[5] = cxternary( mask, fiA_5, fiB_5 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3DENOM = fpternary( sqp0p3 != 0, (fptype_sv)sqp0p3, 1. ); // hack: dummy sqp0p3DENOM[ieppV]=1 if sqp0p3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                           cxternary( sqp0p3 == 0,
gg_tt.mad/src/HelAmps_sm.h:      const fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. ); // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ptDENOM = fpternary( pt != 0, pt, 1. );                                                     // hack: ptDENOM[ieppV]=1 if pt[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      const cxtype_v vcB2_4 = cxmake( 0., (fptype)nsvahl * fpternary( ( pvec3 < 0 ), -sqh, sqh ) ); // AV: removed an abs here
gg_tt.mad/src/HelAmps_sm.h:      vc[2] = cxternary( mask, vcA_2, vcB_2 );
gg_tt.mad/src/HelAmps_sm.h:      vc[3] = cxternary( mask, vcA_3, cxternary( maskB, vcB1_3, vcB2_3 ) );
gg_tt.mad/src/HelAmps_sm.h:      vc[4] = cxternary( mask, vcA_4, cxternary( maskB, vcB1_4, vcB2_4 ) );
gg_tt.mad/src/HelAmps_sm.h:      vc[5] = cxternary( mask, vcA_5, vcB_5 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ptDENOM = fpternary( pt != 0, pt, 1. );                             // hack: ptDENOM[ieppV]=1 if pt[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      const cxtype_v vcB_4 = cxmake( 0, (fptype)nsv * fpternary( ( pvec3 < 0 ), -sqh, sqh ) ); // AV: removed an abs here
gg_tt.mad/src/HelAmps_sm.h:      vc[3] = cxternary( mask, vcA_3, vcB_3 );
gg_tt.mad/src/HelAmps_sm.h:      vc[4] = cxternary( mask, vcA_4, vcB_4 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. );    // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v pp3DENOM = fpternary( pp3 != 0, pp3, 1. ); // hack: pp3DENOM[ieppV]=1 if pp3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                                ( cxternary( ( pp3 == 0. ),
gg_tt.mad/src/HelAmps_sm.h:      fo[2] = cxternary( mask, foA_2, foB_2 );
gg_tt.mad/src/HelAmps_sm.h:      fo[3] = cxternary( mask, foA_3, foB_3 );
gg_tt.mad/src/HelAmps_sm.h:      fo[4] = cxternary( mask, foA_4, foB_4 );
gg_tt.mad/src/HelAmps_sm.h:      fo[5] = cxternary( mask, foA_5, foB_5 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v sqp0p3DENOM = fpternary( sqp0p3 != 0, (fptype_sv)sqp0p3, 1. ); // hack: sqp0p3DENOM[ieppV]=1 if sqp0p3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                                cxternary( ( sqp0p3 == 0. ),
gg_tt.mad/src/HelAmps_sm.h:      const fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. ) and ( pvec2 == 0. ) and ( pvec3 < 0. ),
valassi commented 9 months ago

Rephrasing: the current, existing code that has SIMD speedups is full of explicit for ( int i=0; i<neppV; i++ ) loops and is full of fpternary and cxternary calls which internally contain more of those explicit for ( int i=0; i<neppV; i++ ) loops. If we were not using those, presently there would not be a vectorized version of MG5AMC with x16 SIMD speedups.

You may want to use a different approach, intrinsics, or whatever else. But please keep this for a next phase of the project. For this phase, where we are so close to a production release, please stick to the existing approach and let's get this done. Then change it in whichever other way you find appropriate. Is this a reasonable compromise? Thanks, Andrea

oliviermattelaer commented 9 months ago

Hi Andrea,

Thanks for the input on this, and I do see your point. I would add that simple loops are also auto-vectorisable by the compiler and typically do not need advanced functions to achieve this (which is likely the case here).
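
As an illustration of that remark, a hypothetical standalone function (not code from the repository) with a plain per-event loop of this kind is typically auto-vectorized by gcc or clang at -O3 (and often already at -O2):

    // Per-event masked accumulation written as a plain scalar loop;
    // gcc/clang will usually vectorize this automatically (check e.g. with gcc's -fopt-info-vec)
    void addNumerators( int nevt, const long int* channelIds, const double* absamp2, double* numerators )
    {
      for( int ievt = 0; ievt < nevt; ievt++ )
        numerators[ievt] += ( channelIds[ievt] == 1 ) ? absamp2[ievt] : 0.;
    }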

My preference would be to move forward on one solution rather than debating a solution for a long period; let's be practical and move forward. So given: 1) we agreed last time to implement it in a given way; 2) having a solution implemented is more important than having the best solution implemented; 3) you have clearly presented to Stefan a way to do it which will not break the SIMD/GPU handling:

I will re-iterate my previous position which is to give some freedom to the person doing the implementation (so here Stefan) to choose what he thinks is the best solution (or the solution that he thinks is the most appropriate or ...)

Cheers,

Olivier

PS @roiser: If you want me to take the decision on the implementation to pick, just tell me and I can make the cut

PPS @roiser: given the above post, I'm putting the part of the implementation that I was planning to do on hold as long as this is not decided.

roiser commented 9 months ago

Hi,

there are some misunderstandings in the posts further up (e.g. about using intrinsics, the need for channelID arrays on CPU, and using SIMD when not needed).

I do agree with Olivier's post above that having one solution now, instead of discussing this for possibly a long time, is what we need. Therefore I will implement the workaround with looping over the SIMD types, then we get the release out and move on to the next phase of the project as quickly as possible.

regards, Stefan

valassi commented 9 months ago

My preference would be to move forward on one solution rather than debating a solution for a long period; let's be practical and move forward.

Thanks @oliviermattelaer

I do agree with Olivier's post above that having one solution now, instead of discussing this for possibly a long time, is what we need. Therefore I will implement the workaround with looping over the SIMD types, then we get the release out and move on to the next phase of the project as quickly as possible.

Thanks @roiser