Closed — GiovanniBussi closed this 6 months ago
The improvement with the Intel compiler on my workstation is smaller (overhead drops from 29% to 27%), but still measurable.
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.000 +- 0.000
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.424078 0.424078 0.424078 0.424078
BENCH: B0 First step 1 0.017435 0.017435 0.017435 0.017435
BENCH: B1 Warm-up 99 0.899508 0.009086 0.008296 0.012992
BENCH: B2 Calculation part 1 200 1.743061 0.008715 0.008302 0.011589
BENCH: B3 Calculation part 2 200 1.745426 0.008727 0.008314 0.011583
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 4.821678 4.821678 4.821678 4.821678
PLUMED: 1 Prepare dependencies 500 0.001949 0.000004 0.000002 0.000015
PLUMED: 2 Sharing data 500 0.404638 0.000809 0.000504 0.002490
PLUMED: 3 Waiting for data 500 0.000810 0.000002 0.000001 0.000013
PLUMED: 4 Calculating (forward loop) 500 3.145875 0.006292 0.006110 0.012230
PLUMED: 5 Applying (backward loop) 500 0.824214 0.001648 0.001609 0.002594
PLUMED: 6 Update 500 0.001113 0.000002 0.000001 0.000006
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.287 +- 0.003
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.440200 0.440200 0.440200 0.440200
BENCH: B0 First step 1 0.022010 0.022010 0.022010 0.022010
BENCH: B1 Warm-up 99 1.163765 0.011755 0.010770 0.017656
BENCH: B2 Calculation part 1 200 2.243936 0.011220 0.010745 0.017048
BENCH: B3 Calculation part 2 200 2.243593 0.011218 0.010752 0.017610
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 6.106817 6.106817 6.106817 6.106817
PLUMED: 1 Prepare dependencies 500 0.002200 0.000004 0.000002 0.000014
PLUMED: 2 Sharing data 500 0.343587 0.000687 0.000527 0.001934
PLUMED: 3 Waiting for data 500 0.002487 0.000005 0.000004 0.000024
PLUMED: 4 Calculating (forward loop) 500 4.122769 0.008246 0.007840 0.015409
PLUMED: 5 Applying (backward loop) 500 1.176878 0.002354 0.002314 0.004537
PLUMED: 6 Update 500 0.004813 0.000010 0.000009 0.000021
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.267 +- 0.002
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.445317 0.445317 0.445317 0.445317
BENCH: B0 First step 1 0.027140 0.027140 0.027140 0.027140
BENCH: B1 Warm-up 99 1.140178 0.011517 0.010588 0.016933
BENCH: B2 Calculation part 1 200 2.206530 0.011033 0.010579 0.015644
BENCH: B3 Calculation part 2 200 2.213058 0.011065 0.010590 0.015673
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 6.026409 6.026409 6.026409 6.026409
PLUMED: 1 Prepare dependencies 500 0.002142 0.000004 0.000003 0.000014
PLUMED: 2 Sharing data 500 0.344054 0.000688 0.000530 0.003223
PLUMED: 3 Waiting for data 500 0.002326 0.000005 0.000004 0.000026
PLUMED: 4 Calculating (forward loop) 500 4.091521 0.008183 0.007775 0.019535
PLUMED: 5 Applying (backward loop) 500 1.123146 0.002246 0.002201 0.004265
PLUMED: 6 Update 500 0.004250 0.000009 0.000008 0.000018
Hi @GiovanniBussi
This looks fine to me. As far as I am concerned you can go ahead and merge.
Thanks for taking the time to optimise the code.
I am again working with this input file:
In this PR I tried to optimize loops like this:
Into loops like this:
Here `atom_value_ind_grouped` is constructed when `atom_value_ind` is updated, and basically stores the same information in a different way, exploiting the fact that, for most of the iterations in the first implementation, `nn` is constant.

@gtribello can you have a look and check if this makes sense? Is this a reasonable assumption on the memory access pattern?
The improvement is quite measurable. The comparison is shown below:
The overhead decreases from 18% (using blas) to 12% (this commit).