madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

set_gpugrid is doing nothing - fix propagation of number of GPU threads to bridge (test GPU branch efficiency) #543

Open valassi opened 2 years ago

valassi commented 2 years ago

I am recreating several summary tables for ichep22 and more.

I realised that the test I had added to check GPU branch efficiency does actually seem to work: my "8tpb" tests with 8 threads per block do seem to give a factor 4 lower throughput than an optimized version with 32 or more threads per block.

This is a nice showcase, equivalent to reducing CPU SIMD to none: if we artificially reduce the "vector processing" in GPU SIMT, then we lose throughput. The fact that we lose maximally means that we are actually exploiting it maximally :-)
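
For illustration only, a minimal CUDA sketch (made-up names, not madgraph4gpu code): with 8 threads per block, each block occupies a single 32-lane warp with only 8 active lanes, so roughly a factor 4 of the SIMT throughput is thrown away, matching the factor 4 seen in the tests.

    // Minimal CUDA sketch (illustrative names only, not madgraph4gpu code):
    // the same 16384 "events" computed with full 32-lane warps and with
    // blocks of 8 threads, where 24 of the 32 lanes of each warp stay idle.
    __global__ void dummyME( float* out )
    {
      const int ievt = blockIdx.x * blockDim.x + threadIdx.x; // one event per thread
      out[ievt] = 2.f * ievt; // placeholder for the real matrix element computation
    }

    int main()
    {
      const int nevt = 16384;
      float* out;
      cudaMalloc( &out, nevt * sizeof( float ) );
      dummyME<<<512, 32>>>( out );  // optimized grid: full warps (512*32*1)
      dummyME<<<2048, 8>>>( out );  // "8tpb" grid: same nevt, only 1/4 of the lanes active (2k*8*1)
      cudaDeviceSynchronize();
      cudaFree( out );
      return 0;
    }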

Anyway, the issue is that the tables do not seem right: in non-bridge mode I see the throughput loss, in bridge mode I do not. There is already a set_gpugrid method, but I guess it is not correctly called. See this latest ggttggg table at CERN:

===========================================================================================================
|            | mad                        | mad               | mad               | sa/brdg   | sa/full   |
-----------------------------------------------------------------------------------------------------------
| ggttggg    | [sec] tot = mad + MEs      | [TOT/sec]         | [MEs/sec]         | [MEs/sec] | [MEs/sec] |
===========================================================================================================
| nevt/grid  |                       8192 |              8192 |              8192 |      8192 |      8192 |
| nevt total |                      90112 |             90112 |             90112 |  256*32*1 |  256*32*1 |
-----------------------------------------------------------------------------------------------------------
| FORTRAN    | 1283.80 =  61.57 + 1222.22 |  7.02e+01 (= 1.0) |  7.37e+01 (= 1.0) |       --- |       --- |
| CPP/none   | 1544.77 = 170.24 + 1374.53 |  5.83e+01 (x 0.8) |  6.56e+01 (x 0.9) |  7.64e+01 |  7.63e+01 |
| CPP/sse4   |  441.71 =  89.75 +  351.96 |  2.04e+02 (x 2.9) |  2.56e+02 (x 3.5) |  2.87e+02 |  2.87e+02 |
| CPP/avx2   |  281.23 =  78.45 +  202.77 |  3.20e+02 (x 4.6) |  4.44e+02 (x 6.0) |  5.04e+02 |  5.03e+02 |
| CPP/512y   |  261.51 =  76.08 +  185.44 |  3.45e+02 (x 4.9) |  4.86e+02 (x 6.6) |  5.64e+02 |  5.63e+02 |
| CPP/512z   |  241.72 =  76.22 +  165.50 |  3.73e+02 (x 5.3) |  5.44e+02 (x 7.4) |  5.75e+02 |  5.77e+02 |
| CUDA/8192  |   69.69 =  64.20 +    5.49 |  1.29e+03 (x18.4) |  1.64e+04 (x222.) |  1.66e+04 |  1.67e+04 |
===========================================================================================================
| nevt/grid  |                                                                    |     16384 |     16384 |
| nevt total |                                                                    |  512*32*1 |  512*32*1 |
--------------                                                                    -------------------------
| CUDA/max   |                                                                    |  2.37e+04 |  2.40e+04 |
|            |                                                                    |           |   (x326.) |
==============                                                                    =========================
| nevt/grid  |                                                                    |     16384 |     16384 |
| nevt total |                                                                    |    2k*8*1 |    2k*8*1 |
--------------                                                                    -------------------------
| CUDA/8tpb  |                                                                    |  2.34e+04 |  6.49e+03 |
|            |                                                                    |           |   (x88.1) |
==============                                                                    =========================

The x88 is 1/4 of the x326. However, the 2.34e+04 in the sa/brdg column of the last row should be lower than the 6.49e+03 in sa/full. The fact that it is 4x larger means that the bridge test is still running with 32 or more threads per block, not the nominal 8 that the test is supposed to exercise.
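
One way to pin this down (a debugging sketch with hypothetical names, not code that exists in the repository) would be to launch a tiny reporting kernel with exactly the same blocks/threads values that the bridge passes to the ME kernel, and check that it really reports 2048 blocks of 8 threads after set_gpugrid:

    // Debugging sketch (hypothetical, not existing madgraph4gpu code): if this
    // kernel is launched with the same grid that the bridge uses for the ME
    // kernel, it should report 2048x8 after set_gpugrid, not 512x32 or similar.
    #include <cstdio>

    __global__ void reportGrid()
    {
      if( blockIdx.x == 0 && threadIdx.x == 0 )
        printf( "ME kernel grid: %d blocks x %d threads\n", gridDim.x, blockDim.x );
    }

    int main()
    {
      reportGrid<<<2048, 8>>>(); // in the bridge this would use the stored blocks/threads values
      cudaDeviceSynchronize();
      return 0;
    }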

valassi commented 2 years ago

Note: these issues were noticed during the changes now merged in #499.

valassi commented 4 months ago

I have just added -Wunused-parameter to the CUDA builds (I am not sure why -Wall is still off) and I noticed this:

    MatrixElementKernels.cc: In member function ‘void mg5amcGpu::MatrixElementKernelDevice::setGrid(int, int)’:
    MatrixElementKernels.cc:233:51: warning: unused parameter ‘gpublocks’ [-Wunused-parameter]
      233 |   void MatrixElementKernelDevice::setGrid( const int gpublocks, const int gputhreads )
          |                                         ~~~~~~~~~~^~~~~~~~~
    MatrixElementKernels.cc:233:72: warning: unused parameter ‘gputhreads’ [-Wunused-parameter]
      233 |   void MatrixElementKernelDevice::setGrid( const int gpublocks, const int gputhreads )
          |                                                              ~~~~~~~~~~^~~~~~~~~~

The whole chain for changing the GPU grid needs fixing. The number of blocks and threads is set in the MEK constructor; is it confirmed that these can be changed a posteriori? To be reassessed. Anyway, the build warnings are there for a good reason (though I will hide these two for now).
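
As a sketch of the kind of fix implied here (the member names and the surrounding class layout are assumptions, not the actual MatrixElementKernels.cc), setGrid should store its arguments in the members that the later kernel launches read, which would also remove the two warnings above:

    // Sketch only: a minimal stand-in class showing the intended behaviour of
    // setGrid, i.e. updating the members that later <<<blocks,threads>>> launches
    // use. Member names are assumptions about the real MatrixElementKernelDevice.
    #include <stdexcept>

    class MatrixElementKernelDeviceSketch
    {
    public:
      MatrixElementKernelDeviceSketch( const int gpublocks, const int gputhreads )
        : m_gpublocks( gpublocks ), m_gputhreads( gputhreads ) {}
      void setGrid( const int gpublocks, const int gputhreads )
      {
        if( gpublocks <= 0 || gputhreads <= 0 )
          throw std::logic_error( "setGrid: invalid GPU grid" );
        m_gpublocks = gpublocks;   // later launches must use <<<m_gpublocks, m_gputhreads>>>
        m_gputhreads = gputhreads; // so that the "8tpb" bridge test really runs 8 tpb
      }
      int nevt() const { return m_gpublocks * m_gputhreads; } // events per grid iteration
    private:
      int m_gpublocks;
      int m_gputhreads;
    };

    int main()
    {
      MatrixElementKernelDeviceSketch mek( 512, 32 ); // grid set in the ctor: 16384 events
      mek.setGrid( 2048, 8 );                         // the "8tpb" test grid, same nevt
      return ( mek.nevt() == 16384 ? 0 : 1 );
    }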