madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

splitting vecsize_memmax into wrap_size and nb_wrap #765

Open oliviermattelaer opened 1 year ago

oliviermattelaer commented 1 year ago

status of the implementation

Other Validation

oliviermattelaer commented 1 year ago

point to check so far

oliviermattelaer commented 1 year ago

@roiser: I have tested a scan on nb_wrap with the current fortran implementation: (here wrap_size=32)

| grid size | nb_wrap | cross-section (p p > e+ e-) | timing survey | timing refine |
|---|---|---|---|---|
| 32 | 1 | 844.9 +- 1.978 pb | 1s | 12s |
| 320 | 10 | 842.8 +- 1.973 pb | 2s | 10s |
| 3200 | 100 | 844.6 +- 1.055 pb | 4s | 34s |
| 8000 | 250 | 843.2 +- 1.028 pb | 5s | 47s |
| 16000 | 500 | 902.9 +- 31.41 pb | 7s | 69s |
| 32,000 | 1000 | 841.9 +- 1.274 pb | 16s | 0s |
| 320,000 | 10,000 | 776.9 +- 52.13 pb | 69s | 0s |
| 3,200,000 | 100,000 | not enough memory | | |

So this worries me a bit; it looks like at some point we have an issue like overfitting (maybe this is also related to the fact that I picked a simple process). In principle this has nothing to do with the wrap_size but rather with the grid size, which automatically increases the number of events per iteration (for the first few iterations).

Did you try something like that on GPU with the master branch? I would suggest avoiding grid sizes > 10k for the moment.

roiser commented 1 year ago

Hi @oliviermattelaer, here are the numbers for the GPU (A100). I have redone your numbers in the first table, and then added a second table with more optimized ones, IIUC, using vector sizes that are multiples of 108 (the number of SMs on the A100).

The timings in seconds are, to my mind, not conclusive, but I also counted the number of refine jobs that are run, which goes to 0 from a certain stage on. Could this be the same indication of overfitting?

| vector size | cross-section (p p > e+ e-) | timing survey | timing refine | # refine jobs |
|---|---|---|---|---|
| 32 | 846.1 +- 2.143 pb | 3s | 45s | 14 |
| 320 | 842.6 +- 1.857 pb | 3s | 18s | 13 |
| 3200 | 845.1 +- 1.193 pb | 7s | 27s | 11 |
| 8000 | 843.7 +- 1.13 pb | 8s | 35s | 11 |
| 16000 | 845 +- 0.9486 pb | 13s | 53s | 11 |
| 32000 | 843.1 +- 0.9455 pb | 23s | 53s | 6 |
| 40000 | 842.9 +- 1.103 pb | 29s | 29s | 0 |
| 320,000 | not enough memory | | | |

| vector size | cross-section (p p > e+ e-) | timing survey | timing refine | # refine jobs |
|---|---|---|---|---|
| 3456 | 844.7 +- 1.226 pb | 7s | 27s | 11 |
| 6912 | 843.8 +- 1.09 pb | 9s | 33s | 11 |
| 13824 | 843.2 +- 1.252 pb | 13s | 28s | 6 |
| 27648 | 842 +- 1.044 pb | 20s | 47s | 6 |
| 31104 | 843.6 +- 1.032 pb | 22s | 51s | 6 |
| 34560 | 843.5 +- 1.243 pb | 24s | 24s | 0 |
| 38016 | 842.7 +- 1.123 pb | 26s | 26s | 0 |
| 41472 | 841.9 +- 1.077 pb | 29s | 29s | 0 |
| 55296 | not enough memory | | | |
oliviermattelaer commented 1 year ago

Those are much better numbers than the ones I got. This is likely good news, meaning that the issue is probably related to my wrap implementation and not to real over-fitting. So I will need to investigate this to understand and (more importantly) fix it. Thanks for the check.

roiser commented 1 year ago

I would not give too much importance to the CPU seconds though; I guess what we are mostly measuring there is the fortran part.

oliviermattelaer commented 1 year ago

I agree about the timing; the critical point here is the fact that you have a stable cross-section, which I do not have at all.

PS: I have pushed the change for having the list of channels (with the name channels, with an s). Remember that you need to update both the plugin (new_interface_wrap) and the MG5aMC interface (gpucpp_wrap). Thanks.

valassi commented 11 months ago

Hi @oliviermattelaer @roiser, thanks for all your work on this. Just a couple of points from a discussion with Stefan, since we were trying to remember what had been agreed in the past and I have been unable to find any notes about that.

First Olivier a quick point: eventually can you please change all "wrap" into "warp" in the code? Even today I got confused again what wrap is! ;-)

Then about the main issue: my main question is what cudacpp should do internally.

Thanks... just trying to refresh my memory. Andrea

oliviermattelaer commented 11 months ago

Hi,

   First Olivier a quick point: eventually can you please change all "wrap" into "warp" in the code? Even today I got confused again what wrap is! ;-)

consider it as done (just need to do a git push)

      if( channelId == 1 ) numerators_sv += cxabs2( amp_sv[0] );
      if( channelId != 0 ) denominators_sv += cxabs2( amp_sv[0] );
into something like

      channelids_sv = CHANNEL_ACCESS::kernelAccess( pchannelIds ); // the 4 channels in the SIMD vector
      bool_sv mask_sv = ( channelids_sv == 1 );
      numerators_sv += mask_sv * cxabs2( amp_sv[0] );
      if( pchannelIds != nullptr ) denominators_sv += cxabs2( amp_sv[0] );

Please put comments around those lines, since you pass from easy-to-read code to code that I simply do not understand (which I'm fine with, obviously, but it would be nice to keep readability wherever you have those cryptic accessors).

The next question is about performance: are we not shooting ourselves in the foot with such complicated accessors? But in any case, yes, that was the point.

  *   Given the above, there is no need to pass warp_size from fortran to cudacpp... and I think we had agreed on this, right?

This is indeed our current strategy/agreement.

But clearly it has some drawbacks, which I quote here for reference:

  1. You transfer a lot of useless data from CPU to GPU (since each event carries its own information rather than each warp).
  2. You could have put those values in shared memory on the GPU (reducing register pressure), which would have been a natural fit.
  3. You need to implement this masking procedure (which we hope will never be needed in practice).

While the advantages are: 1) an easier interface from fortran to cpp; 2) the possibility to compile for multiple hardware and be bulletproof for future hardware (if the warp_size is too large, this ensures the MC is not biased); 3) the possibility to not trust the fortran.
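
For concreteness, a minimal sketch of the two layouts being contrasted above (the names and sizes are purely illustrative, not the actual fortran/cudacpp interface):

    #include <vector>
    int main()
    {
      const int nevt = 16384;   // hypothetical number of events in one cudacpp call
      const int warp_size = 32; // the warp_size discussed above
      // Current agreement: one channel ID per event, i.e. nevt values are passed from fortran
      std::vector<unsigned int> channelIdPerEvent( nevt );
      // Alternative behind drawback 1: one channel ID per warp, i.e. only nevt/warp_size values,
      // since all events within one warp share the same channel
      std::vector<unsigned int> channelIdPerWarp( nevt / warp_size );
      return 0;
    }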

roiser commented 11 months ago

Oh, I may have completely misunderstood you then, @oliviermattelaer. Are you saying you are passing in a vector of size [n events / warp size] instead of [n events]? I remember we discussed that, but I thought we had agreed on n events. Either is of course fine with me.

I have for now developed everything in the branch to pass the vector in up to calculate_wavefunctions; now I need to handle the individual array entries (so e.g. the code that we discussed in a previous meeting with masking and that @valassi has spelled out above).

oliviermattelaer commented 11 months ago

Hi @roiser,

Are you saying you are passing in a vector of size [n events / warp size] instead of [n events]?

I do pass a vector of size [n events]. So this is as you remembered/agreed; sorry if that was not clear.

roiser commented 11 months ago

No, I do pass in a vector of n events (NOT n events / warp size), ok then I may have misunderstood you from the comment above. All good, thanks; sorry for my misunderstanding, reading it again it's all clear.

valassi commented 11 months ago

No, I do pass in a vector of n events (NOT n events / warp size), ok then I may have misunderstood you from the comment above. All good, thanks; sorry for my misunderstanding, reading it again it's all clear.

Thanks Olivier and Stefan!

Stefan, I think Olivier means that an alternative to passing n events would be to pass n_warp (n_events/warp_size) events. But this is not what we had agreed (we agreed to pass n events).

Olivier, I see your point: indeed we might pass less data around if we passed n_warp channels only. The disadvantage, however, in my opinion, is that the computation becomes more complex in cudacpp, as we need to introduce the concept of a warp. Yes, it's debatable; maybe the performance gain would be large. Anyway, I suggest we stick to our original plan: we implement the functionality (i.e. the possibility to have different channels in different events) and then we reconsider later on?

consider it as done (just need to do a git push)

Thanks! (for wrap->warp)

Please put comments around those lines, since you pass from easy-to-read code to code that I simply do not understand (which I'm fine with, obviously, but it would be nice to keep readability wherever you have those cryptic accessors).

Yes good point, TODO Andrea put some comments.

The next question is about performance: are we not shooting ourselves in the foot with such complicated accessors?

Listen, maybe yes. Again, up for review later on... but as this is done elsewhere, I would keep the same style everywhere, and then change everything if/when we decide to change it. Personally I think that the readability is ~ok (yes, I should add comments). What I find less nice is that the use of templates all over the place ends up exploding build times, I fear.

My suggestion on this is: if we model the accessors differently in the future, let's follow the C++ standard parallelism model. From what I understood of the recent seminar, the model is a bit similar: there is mdspan, which takes a non-owning pointer plus some metadata defining an accessor that depends on the chosen layout. See https://en.cppreference.com/w/cpp/container/mdspan. In principle this should eventually work for GPUs too. Anyway, about this one too, let's rediscuss later, ok?

valassi commented 11 months ago

My suggestion on this is: if we model the accessors differently in the future, let's follow the C++ standard parallelism model. From what I understood of the recent seminar, the model is a bit similar: there is mdspan, which takes a non-owning pointer plus some metadata defining an accessor that depends on the chosen layout. See https://en.cppreference.com/w/cpp/container/mdspan. In principle this should eventually work for GPUs too. Anyway, about this one too, let's rediscuss later, ok?

PS: As the speaker had mentioned, this has close links to Kokkos, or is even modeled on it. Just for reference, this is a tutorial I found from Kokkos about mdspan: https://github.com/kokkos/mdspan/wiki/A-Gentle-Introduction-to-mdspan
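
Just to make the mdspan idea concrete, here is a minimal sketch (it needs a C++23 standard library; the 2D shape is purely illustrative and has nothing to do with our memory layouts):

    #include <iostream>
    #include <mdspan>
    #include <vector>
    int main()
    {
      // mdspan is a non-owning view: a raw pointer plus extents (and optionally a layout/accessor policy)
      std::vector<double> buf( 4 * 8, 0. );
      std::mdspan view( buf.data(), 4, 8 ); // view buf as a 4 x 8 array (CTAD deduces the extents)
      view[2, 3] = 42.;                     // C++23 multidimensional subscript
      std::cout << view[2, 3] << std::endl;
      return 0;
    }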

valassi commented 9 months ago

I think at some point I had proposed (in a discussion with @hageboeck too?) that we can simply change this code

      if( channelId == 1 ) numerators_sv += cxabs2( amp_sv[0] );
      if( channelId != 0 ) denominators_sv += cxabs2( amp_sv[0] );

into something like

      channelids_sv = CHANNEL_ACCESS::kernelAccess( pchannelIds ); // the 4 channels in the SIMD vector
      bool_sv mask_sv = ( channelids_sv == 1 );
      numerators_sv += mask_sv * cxabs2( amp_sv[0] );
      if( pchannelIds != nullptr ) denominators_sv += cxabs2( amp_sv[0] );

Hi @roiser @oliviermattelaer , concerning the discussion we had yesterday at the development meeting. I checked that indeed

numerators_sv += mask_sv * cxabs2( amp_sv[0] );

would fail compilation with something like invalid operands to binary * ('__vector(4) long int' and '__vector(4) double').

However there is an extremely simple mechanism that I had already implemented in the code for the ixxxxx SIMD implementation (or the reimplementation after FPE fixes), based on fpternary. It is enough to do something like

constexpr fptype_sv fpZERO_sv{}; // 0000
//numerators_sv += mask_sv * cxabs2( amp_sv[0] );
numerators_sv += fpternary( mask_sv, cxabs2( amp_sv[0] ), fpZERO_sv );

This function fpternary (there is also a cxternary for complex numbers) is simply a vector version of the ternary operator ( bool ? if_true : if_false ). The first argument is a vector mask (or scalar flag); the second and third arguments are fptype_sv (vectors or scalars of FPs). If you use this with a vector of zeros as the third argument, then you are effectively doing what I meant initially by mask_sv * cxabs2( amp_sv[0] ). Actually this is even more correct logically: it is a check on a bool and not a multiplication by a bool; even in the scalar case this is really a ternary operator, not a multiplication.
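
For readers following this thread, here is a self-contained sketch of the idea (a simplification, not the actual mgOnGpuVectors.h code: neppV=4 and double are hardcoded, and only the vector-vector overload of fpternary is shown):

    #include <iostream>
    constexpr int neppV = 4; // SIMD width hardcoded for this sketch only
    typedef double fptype_v __attribute__( ( vector_size( neppV * sizeof( double ) ) ) );
    typedef long int bool_v __attribute__( ( vector_size( neppV * sizeof( long int ) ) ) );
    // Vector version of the ternary operator: out[i] = ( mask[i] ? a[i] : b[i] )
    inline fptype_v fpternary( const bool_v& mask, const fptype_v& a, const fptype_v& b )
    {
      fptype_v out;
      for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b[i] );
      return out;
    }
    int main()
    {
      const fptype_v absamp_sv = { 1., 2., 3., 4. };       // stands in for cxabs2( amp_sv[0] )
      constexpr fptype_v fpZERO_sv{};                      // 0000
      const bool_v channelids_sv = { 1, 2, 1, 3 };         // the 4 channels in the SIMD vector
      const bool_v one_sv = { 1, 1, 1, 1 };
      const bool_v mask_sv = ( channelids_sv == one_sv );  // -1 (true) or 0 (false) in each lane
      fptype_v numerators_sv{};
      numerators_sv += fpternary( mask_sv, absamp_sv, fpZERO_sv ); // expect "1 0 3 0"
      for( int i = 0; i < neppV; i++ ) std::cout << numerators_sv[i] << " ";
      std::cout << std::endl;
      return 0;
    }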

I added a simple test to show this more in detail in testmisc.cc here https://github.com/madgraph5/madgraph4gpu/pull/796/commits/0d2a5081013363b89c9f94edbe152a0171030c3b

roiser commented 9 months ago

Hi @valassi, what I tried to do was to go through the whole chain only with SIMD operations. If I understand your solution correctly, it would loop through the SIMD entries and do operations on each of the entries one by one. Yes, that's a possibility, but my preference would be to stay in SIMD; and from an operational point of view we also only need the channelIDs for the GPU, where the number of parallel operations is huge compared to the CPU.

valassi commented 9 months ago

Hi @valassi, what I tried to do was to go through the whole chain only with SIMD operations. If I understand your solution correctly, it would loop through the SIMD entries and do operations on each of the entries one by one. Yes, that's a possibility, but my preference would be to stay in SIMD; and from an operational point of view we also only need the channelIDs for the GPU, where the number of parallel operations is huge compared to the CPU.

Stefan, we have those fpternary and cxternary all over the place. I spent weeks, months, to get the SIMD implementation working in all its details. In particular fixing the FPEs (#701 #733 #736 etc) for SIMD. And the SIMD gymnastics for the mixed floating point mode may contain even "uglier" stuff. But it WORKS, meaning it is FAST and achieves performance where it is needed.

My suggestion is: let's get the LO release out, COMPLETE with all the SIMD stuff that is there now, including all the for( int i = 0; i < neppV; i++ ) loops it takes (which IMHO are presently mostly harmless, otherwise we would never be seeing factors x16 speedups). Then after the LO release you do whatever you want with this code, and if you really think you want to start using intrinsics, you do it, good luck.

All the SIMD stuff we have so far uses a good pragmatic combination of SIMD compiler vector extensions to get the speed, and of explicit loops in a few places to get the functionality, at a very limited performance cost. This is the design I enforced, and it works. If you want to completely change the design, please do it only after the first LO release, without me. Thanks.

valassi commented 9 months ago

from an operation point of view we also need the channelIDs only for the GPU where the number of parallel operations is huge compared to CPU.

PS My point here is: it would take one line to implement this in SIMD, and it would work out of the box. Therefore I see no reason at all not to implement it also for SIMD. Saying that "this is only really needed for GPU" might be an argument if this were overly complicated and required weeks of work (and even then I might object), but IMO it is a completely irrelevant argument against something that requires one minute of work.

oliviermattelaer commented 9 months ago

Well, I would say that none of the approaches is optimal.

Using non-SIMD instructions when they are not really needed, and when a workaround does exist, is not optimal; neither is the opposite.

To me this is Stefan's choice at this point.

valassi commented 9 months ago

@oliviermattelaer as you want.

Using non-SIMD instructions

Please take this into account: ALL of the code we call "SIMD" in madgraph, ALL of the x16 speedups we get, pass through functions which here and there use explicit loops over vectors, for instance ixxxx. This is needed here and there to ensure the functionality. Saying "we will only do SIMD if all the code has only SIMD instructions" is wishful thinking.

IF I had said "I will not implement ixxxx for SIMD because internally it potentially uses instructions that are not vectorized", THEN we would not have a fully functioning madgraph with x16 speedups, because pieces of the code would be missing. Period.

Therefore, PLEASE, let's just get this LO release out of the way with the existing approach and use it everywhere. Then for NLO I will just disappear. But let's just get LO out of the way and let's have it fully "SIMD" even if it has internal loops. PLEASE.

PS See this

[avalassi@itscrd90 gcc12.1/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> grep neppV gg_tt.mad -r
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:                           fptype_sv* jamp2_sv            // output: jamp2[nParity][ncolor][neppV] for color choice (nullptr if disabled)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    //MemoryBufferWavefunctions w_buffer[nwf]{ neppV };
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ievt00 + iParity * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    for( int ipagV = 0; ipagV < nevt / neppV; ++ipagV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ipagV * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    // Allocate arrays at build time to contain at least 16 events (or at least neppV events if neppV>16, e.g. in future VPUs)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    constexpr int maxtry0 = std::max( 16, neppV ); // 16, but at least neppV (otherwise the npagV loop does not even start)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    // Loop over only nevt events if nevt is < 16 (note that nevt is always >= neppV)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    assert( nevt >= neppV );
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV = maxtry / neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    assert( npagV % 2 == 0 );     // SANITY CHECK for mixed fptypes: two neppV-pages are merged to one 2*neppV-page
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV / 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV;            // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV * 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV; // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:          const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:          const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV = nevt / neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ipagV * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    assert( npagV % 2 == 0 );     // SANITY CHECK for mixed fptypes: two neppV-pages are merged to one 2*neppV-page
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV / 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:    const int npagV2 = npagV;            // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      // (jamp2[nParity][ncolor][neppV] for the SIMD vector - or the two SIMD vectors - of events processed in calculate_wavefunctions)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      fptype_sv MEs_ighel[ncomb] = { 0 };    // sum of MEs for all good helicities up to ighel (for the first - and/or only - neppV page)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      fptype_sv MEs_ighel2[ncomb] = { 0 };   // sum of MEs for all good helicities up to ighel (for the second neppV page)
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV * 2; // loop on two SIMD pages (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt00 = ipagV2 * neppV; // loop on one SIMD page (neppV events) at a time
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        MEs_ighel2[ighel] = E_ACCESS::kernelAccess( E_ACCESS::ieventAccessRecord( allMEs, ievt00 + neppV ) );
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:        for( int ieppV = 0; ieppV < neppV; ++ieppV )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:          const int ievt2 = ievt00 + ieppV + neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      const int ievt0 = ipagV * neppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      //for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:      //  const unsigned int ievt = ipagV * neppV + ieppV;
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = SCALAR ('none': ~vector[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/P1_gg_ttx/check_sa.cc:              << "Internal loops fptype_sv    = VECTOR[" << neppV
gg_tt.mad/SubProcesses/MemoryAccessMatrixElements.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessGs.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessGs.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC >= neppV );                              // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC % neppV == 0 );                          // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC >= neppV ); // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      static_assert( neppC % neppV == 0 ); // ASSUME CONTIGUOUS ARRAYS
gg_tt.mad/SubProcesses/MemoryAccessCouplings.h:      return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 1; i < neppV; i++ ) out << ", " << ( v[i] ? "T" : "F" );
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 1; i < neppV; i++ )
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/SubProcesses/testmisc.cc:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Cast one non-const fptype_v reference (one vector of neppV fptype values) from one non-const fptype reference (#435),
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Cast one const fptype_v reference (one vector of neppV fptype values) from one const fptype reference,
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Build one fptype_v (one vector of neppV fptype values) from one fptype reference,
gg_tt.mad/SubProcesses/MemoryAccessVectors.h:  // Build one fptype_v (one vector of neppV fptype values) from one fptype reference,
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- Note that neppR is hardcoded and may differ from neppM and neppV on some platforms
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- CPUs: neppM is best set equal to the number of fptype's (neppV) in a vector register
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- However, neppM is now decoupled from neppV (issue #176) and can be separately hardcoded
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    // --- In practice, neppR, neppM and neppV could now (in principle) all be different
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    static constexpr int neppM = MGONGPU_CPPSIMD; // (DEFAULT) neppM=neppV for optimal performance
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:    static constexpr int neppM = 1; // (DEFAULT) neppM=neppV for optimal performance (NB: this is equivalent to AOS)
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:      if constexpr( useContiguousEventsIfPossible && ( neppM >= neppV ) && ( neppM % neppV == 0 ) )
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:          return mg5amcCpu::fptypevFromAlignedArray( out ); // SIMD bulk load of neppV, use reinterpret_cast
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:          // This does not require buffer alignment, but it requires AOSOA with neppM>=neppV and neppM%neppV==0
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:          return mg5amcCpu::fptypevFromUnalignedArray( out ); // SIMD bulk load of neppV, do not use reinterpret_cast (fewer SIMD operations)
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:        // This does not even require AOSOA with neppM>=neppV and neppM%neppV==0 (e.g. can be used with AOS neppM==1)
gg_tt.mad/SubProcesses/MemoryAccessMomenta.h:        return mg5amcCpu::fptypevFromArbitraryArray( decoderIeppv ); // iterate over ieppV in neppV (no SIMD)
gg_tt.mad/SubProcesses/testxxx.cc:    std::cerr << "Floating Point Exception (CPU neppV=" << neppV << "): '" << FPEhandlerMessage << "' ievt=" << FPEhandlerIevt << std::endl;
gg_tt.mad/SubProcesses/testxxx.cc:  assert( nevt % neppV == 0 ); // nevt must be a multiple of neppV
gg_tt.mad/SubProcesses/testxxx.cc:      const int ieppV = ievt % neppV; // #event in the current event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:          const int ieppV = ievt % neppV; // #event in the current event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:          const int ieppV = ievt % neppV; // #event in the current event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:      const int ipagV = ievt / neppV; // #event vector in this iteration
gg_tt.mad/SubProcesses/testxxx.cc:      const fptype* ievt0Momenta = MemoryAccessMomenta::ieventAccessRecordConst( hstMomenta.data(), ipagV * neppV );
gg_tt.mad/src/mgOnGpuVectors.h:  const int neppV = MGONGPU_CPPSIMD;
gg_tt.mad/src/mgOnGpuVectors.h:  // SANITY CHECK: cppAlign must be a multiple of neppV * sizeof(fptype)
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( mgOnGpu::cppAlign % ( neppV * sizeof( fptype ) ) == 0 );
gg_tt.mad/src/mgOnGpuVectors.h:  // SANITY CHECK: check that neppV is a power of two
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( ispoweroftwo( neppV ), "neppV is not a power of 2" );
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype fptype_v __attribute__( ( ext_vector_type( neppV ) ) ); // RRRR
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) ); // RRRR
gg_tt.mad/src/mgOnGpuVectors.h:  const int neppV2 = MGONGPU_CPPSIMD * 2;
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( mgOnGpu::cppAlign % ( neppV2 * sizeof( fptype2 ) ) == 0 );
gg_tt.mad/src/mgOnGpuVectors.h:  static_assert( ispoweroftwo( neppV2 ), "neppV2 is not a power of 2" );
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype2 fptype2_v __attribute__( ( ext_vector_type( neppV2 ) ) ); // RRRRRRRR
gg_tt.mad/src/mgOnGpuVectors.h:  typedef fptype2 fptype2_v __attribute__( ( vector_size( neppV2 * sizeof( fptype2 ) ) ) ); // RRRRRRRR
gg_tt.mad/src/mgOnGpuVectors.h:  typedef long int bool_v __attribute__( ( ext_vector_type( neppV ) ) ); // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  typedef int bool_v __attribute__( ( ext_vector_type( neppV ) ) );                         // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  typedef long int bool_v __attribute__( ( vector_size( neppV * sizeof( long int ) ) ) ); // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  typedef int bool_v __attribute__( ( vector_size( neppV * sizeof( int ) ) ) ); // bbbb
gg_tt.mad/src/mgOnGpuVectors.h:  const int neppV = 1;
gg_tt.mad/src/mgOnGpuVectors.h://using mgOnGpu::neppV;
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=1; i<neppV; i++ ) out << ", " << (bool)(v[i]);
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 1; i < neppV; i++ ) out << ", " << v[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 1; i < neppV; i++ ) out << ", " << v[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 1; i < neppV; i++ ) out << ", " << cxmake( v.real()[i], v.imag()[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = fpsqrt( v[i] );
gg_tt.mad/src/mgOnGpuVectors.h:  fpvmake( const fptype v[neppV] )
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=0; i<neppV; i++ ) out[i] = v[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=0; i<neppV; i++ ) out[i] = c;
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b );
gg_tt.mad/src/mgOnGpuVectors.h:    //for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b[i] ); // OLD error-prone depends on "cxtype_ref& operator=( cxtype_ref&& c )"
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = cxtype( mask[i] ? a[i] : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a[i] : b );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b[i] );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out[i] = ( mask[i] ? a : b );
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for ( int i=0; i<neppV; i++ ) out = out || mask[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int i = 0; i < neppV; i++ ) out = out && mask[i];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/src/mgOnGpuVectors.h:      out[ieppV+neppV] = v2[ieppV];
gg_tt.mad/src/mgOnGpuVectors.h:    for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/src/mgOnGpuVectors.h:    for( int ieppV = 0; ieppV < neppV; ieppV++ )
gg_tt.mad/src/mgOnGpuVectors.h:      out[ieppV] = v[ieppV+neppV];
gg_tt.mad/src/mgOnGpuConfig.h:// C++ SIMD vectorization width (this will be used to set neppV)
gg_tt.mad/src/mgOnGpuConfig.h:// For SANITY CHECKS: check that neppR, neppM, neppV... are powers of two (https://stackoverflow.com/a/108360)
valassi commented 9 months ago

And see this

[avalassi@itscrd90 gcc12.1/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp> grep ternary gg_tt.mad -r
gg_tt.mad/SubProcesses/testmisc.cc:    fptype_sv numerators0_sv = fpternary( mask0_sv, absamp0_sv, fpZERO_sv ); // equivalent to "mask0_sv * absamp0_sv"
gg_tt.mad/SubProcesses/testmisc.cc:    fptype_sv numerators1_sv = fpternary( mask1_sv, absamp1_sv, fpZERO_sv ); // equivalent to "mask1_sv * absamp1_sv"
gg_tt.mad/src/mgOnGpuVectors.h:  // Functions and operators for bool_v (ternary and masks)
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype_v& a, const fptype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype_v& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype& a, const fptype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool_v& mask, const fptype& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype_v& a, const cxtype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype_v& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype& a, const cxtype_v& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool_v& mask, const cxtype& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool& mask, const fptype& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool& mask, const cxtype& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( b < a ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( b < a ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( b < a ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( a < b ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( a < b ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:    return fpternary( ( a < b ), a, b );
gg_tt.mad/src/mgOnGpuVectors.h:  fpternary( const bool& mask, const fptype& a, const fptype& b )
gg_tt.mad/src/mgOnGpuVectors.h:  cxternary( const bool& mask, const cxtype& a, const cxtype& b )
gg_tt.mad/src/mgOnGpuCxtypes.h:    //__host__ __device__ cxtype_ref& operator=( cxtype_ref&& c ) {...} // REMOVED! Should copy refs or copy values? No longer needed in cxternary
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. );    // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v pp3DENOM = fpternary( pp3 != 0, pp3, 1. ); // hack: pp3DENOM[ieppV]=1 if pp3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                                cxternary( ( pp3 == 0. ),
gg_tt.mad/src/HelAmps_sm.h:      fi[2] = cxternary( mask, fiA_2, fiB_2 );
gg_tt.mad/src/HelAmps_sm.h:      fi[3] = cxternary( mask, fiA_3, fiB_3 );
gg_tt.mad/src/HelAmps_sm.h:      fi[4] = cxternary( mask, fiA_4, fiB_4 );
gg_tt.mad/src/HelAmps_sm.h:      fi[5] = cxternary( mask, fiA_5, fiB_5 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3DENOM = fpternary( sqp0p3 != 0, (fptype_sv)sqp0p3, 1. ); // hack: dummy sqp0p3DENOM[ieppV]=1 if sqp0p3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                           cxternary( sqp0p3 == 0,
gg_tt.mad/src/HelAmps_sm.h:      const fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. ); // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ptDENOM = fpternary( pt != 0, pt, 1. );                                                     // hack: ptDENOM[ieppV]=1 if pt[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      const cxtype_v vcB2_4 = cxmake( 0., (fptype)nsvahl * fpternary( ( pvec3 < 0 ), -sqh, sqh ) ); // AV: removed an abs here
gg_tt.mad/src/HelAmps_sm.h:      vc[2] = cxternary( mask, vcA_2, vcB_2 );
gg_tt.mad/src/HelAmps_sm.h:      vc[3] = cxternary( mask, vcA_3, cxternary( maskB, vcB1_3, vcB2_3 ) );
gg_tt.mad/src/HelAmps_sm.h:      vc[4] = cxternary( mask, vcA_4, cxternary( maskB, vcB1_4, vcB2_4 ) );
gg_tt.mad/src/HelAmps_sm.h:      vc[5] = cxternary( mask, vcA_5, vcB_5 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ptDENOM = fpternary( pt != 0, pt, 1. );                             // hack: ptDENOM[ieppV]=1 if pt[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      const cxtype_v vcB_4 = cxmake( 0, (fptype)nsv * fpternary( ( pvec3 < 0 ), -sqh, sqh ) ); // AV: removed an abs here
gg_tt.mad/src/HelAmps_sm.h:      vc[3] = cxternary( mask, vcA_3, vcB_3 );
gg_tt.mad/src/HelAmps_sm.h:      vc[4] = cxternary( mask, vcA_4, vcB_4 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v ppDENOM = fpternary( pp != 0, pp, 1. );    // hack: ppDENOM[ieppV]=1 if pp[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v pp3DENOM = fpternary( pp3 != 0, pp3, 1. ); // hack: pp3DENOM[ieppV]=1 if pp3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                                ( cxternary( ( pp3 == 0. ),
gg_tt.mad/src/HelAmps_sm.h:      fo[2] = cxternary( mask, foA_2, foB_2 );
gg_tt.mad/src/HelAmps_sm.h:      fo[3] = cxternary( mask, foA_3, foB_3 );
gg_tt.mad/src/HelAmps_sm.h:      fo[4] = cxternary( mask, foA_4, foB_4 );
gg_tt.mad/src/HelAmps_sm.h:      fo[5] = cxternary( mask, foA_5, foB_5 );
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. and pvec2 == 0. and pvec3 < 0. ),
gg_tt.mad/src/HelAmps_sm.h:      volatile fptype_v sqp0p3DENOM = fpternary( sqp0p3 != 0, (fptype_sv)sqp0p3, 1. ); // hack: sqp0p3DENOM[ieppV]=1 if sqp0p3[ieppV]==0
gg_tt.mad/src/HelAmps_sm.h:                                cxternary( ( sqp0p3 == 0. ),
gg_tt.mad/src/HelAmps_sm.h:      const fptype_sv sqp0p3 = fpternary( ( pvec1 == 0. ) and ( pvec2 == 0. ) and ( pvec3 < 0. ),
valassi commented 9 months ago

Rephrasing: the current, existing code that has SIMD speedups is full of explicit for ( int i=0; i<neppV; i++ ) loops and is full of fpternary and cxternary calls which internally contain more of those explicit for ( int i=0; i<neppV; i++ ) loops. If we were not using those, presently there would not be a vectorized version of MG5AMC with x16 SIMD speedups.

You may want to use a different approach, intrinsics, or whatever else. But please keep this for a next phase of the project. For this phase, where we are so close to a production release, please stick to the existing approach and let's get this done. Then change it in whichever other way you find appropriate. Is this a reasonable compromise? Thanks, Andrea

oliviermattelaer commented 9 months ago

Hi Andrea,

Thanks for the input on this, and I do see your point. I would add that simple loops are also auto-vectorisable by the compiler and typically do not need advanced functions to achieve this (which is likely the case here).
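
As an illustration of that remark, a hypothetical standalone function (not code from the repository) with a plain per-event loop of this kind is typically auto-vectorized by gcc or clang at -O3 (and often already at -O2):

    // Per-event masked accumulation written as a plain scalar loop;
    // gcc/clang will usually vectorize this automatically (check e.g. with gcc's -fopt-info-vec)
    void addNumerators( int nevt, const long int* channelIds, const double* absamp2, double* numerators )
    {
      for( int ievt = 0; ievt < nevt; ievt++ )
        numerators[ievt] += ( channelIds[ievt] == 1 ) ? absamp2[ievt] : 0.;
    }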

My preference would be to move forward on one solution rather than debating a solution for a long period; let's be practical and move forward. So given: 1) we agreed last time to implement it in a given way; 2) having a solution implemented is more important than having the best solution implemented; 3) you have clearly presented to Stefan a way to do it which will not break the SIMD/GPU handling:

I will re-iterate my previous position which is to give some freedom to the person doing the implementation (so here Stefan) to choose what he thinks is the best solution (or the solution that he thinks is the most appropriate or ...)

Cheers,

Olivier

PS @roiser: If you want me to take the decision on the implementation to pick, just tell me and I can make the cut

PPS @roiser: given the above post, I'm putting the part of the implementation that I was planning to do on hold as long as this is not decided.

roiser commented 9 months ago

Hi,

there are some misunderstandings in the posts further up (e.g. about using intrinsics, the need for channelID arrays on CPU, and using SIMD when not needed).

I do agree with Olivier's post above that having one solution now, instead of discussing this for possibly a long time, is what we need. Therefore I will implement the workaround with looping over the SIMD types, then we get the release out and move on to the next phase of the project as quickly as possible.

regards, Stefan

valassi commented 9 months ago

My preference would be to move forward on one solution rather than debating a solution for a long period; let's be practical and move forward.

Thanks @oliviermattelaer

I do agree with Olivier's post above that having one solution now, instead of discussing this for possibly a long time, is what we need. Therefore I will implement the workaround with looping over the SIMD types, then we get the release out and move on to the next phase of the project as quickly as possible.

Thanks @roiser