Replaced some bottleneck loop with blas

GiovanniBussi commented 3 months ago

I am working with this input file:

WHOLEMOLECULES ENTITY0=1-100000
c: CENTER ATOMS=1-100000
pos: POSITION ATOM=c
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1

By just replacing a loop with a BLAS call I can see a significant gain. I have only tried this on my laptop, but I guess it could be quite general (when an optimized BLAS is available).

Below are the full results, comparing v2.9, master before this commit, and master with this commit. The slowdown with respect to v2.9 is reduced from 26% to 18%.

BENCH:  Kernel:      /Users/bussi/plumed2/tmp/v2.9-mpi-ref/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.162258     0.162258     0.162258     0.162258
BENCH:  B0 First step                                      1     0.008525     0.008525     0.008525     0.008525
BENCH:  B1 Warm-up                                        99     0.459748     0.004644     0.004194     0.008631
BENCH:  B2 Calculation part 1                            200     0.898147     0.004491     0.004087     0.004991
BENCH:  B3 Calculation part 2                            200     0.900684     0.004503     0.004103     0.005355
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     2.426477     2.426477     2.426477     2.426477
PLUMED: 1 Prepare dependencies                           500     0.001066     0.000002     0.000001     0.000013
PLUMED: 2 Sharing data                                   500     0.142706     0.000285     0.000233     0.000956
PLUMED: 3 Waiting for data                               500     0.000290     0.000001     0.000000     0.000008
PLUMED: 4 Calculating (forward loop)                     500     1.655126     0.003310     0.002919     0.007286
PLUMED: 5 Applying (backward loop)                       500     0.457173     0.000914     0.000828     0.001423
PLUMED: 6 Update                                         500     0.000556     0.000001     0.000001     0.000012
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/tmp/reference/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.260 +- 0.004
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.171702     0.171702     0.171702     0.171702
BENCH:  B0 First step                                      1     0.011417     0.011417     0.011417     0.011417
BENCH:  B1 Warm-up                                        99     0.569624     0.005754     0.005128     0.007905
BENCH:  B2 Calculation part 1                            200     1.135651     0.005678     0.005146     0.007376
BENCH:  B3 Calculation part 2                            200     1.130528     0.005653     0.005228     0.006545
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     3.016210     3.016210     3.016210     3.016210
PLUMED: 1 Prepare dependencies                           500     0.001550     0.000003     0.000002     0.000018
PLUMED: 2 Sharing data                                   500     0.170590     0.000341     0.000289     0.000747
PLUMED: 3 Waiting for data                               500     0.001610     0.000003     0.000001     0.000048
PLUMED: 4 Calculating (forward loop)                     500     1.898306     0.003797     0.003423     0.008020
PLUMED: 5 Applying (backward loop)                       500     0.763545     0.001527     0.001363     0.002595
PLUMED: 6 Update                                         500     0.002246     0.000004     0.000003     0.000032
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/src/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.182 +- 0.004
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.172252     0.172252     0.172252     0.172252
BENCH:  B0 First step                                      1     0.010180     0.010180     0.010180     0.010180
BENCH:  B1 Warm-up                                        99     0.533851     0.005392     0.004919     0.007108
BENCH:  B2 Calculation part 1                            200     1.063894     0.005319     0.004862     0.006318
BENCH:  B3 Calculation part 2                            200     1.062286     0.005311     0.004843     0.006118
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     2.840387     2.840387     2.840387     2.840387
PLUMED: 1 Prepare dependencies                           500     0.001577     0.000003     0.000001     0.000011
PLUMED: 2 Sharing data                                   500     0.168367     0.000337     0.000287     0.000634
PLUMED: 3 Waiting for data                               500     0.001345     0.000003     0.000001     0.000012
PLUMED: 4 Calculating (forward loop)                     500     1.893861     0.003788     0.003393     0.007536
PLUMED: 5 Applying (backward loop)                       500     0.594073     0.001188     0.001054     0.002308
PLUMED: 6 Update                                         500     0.002268     0.000005     0.000003     0.000026
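
To see where the gain comes from, the per-stage totals of the last two runs can be compared directly. The following is a minimal sketch in Python that uses only numbers copied from the timer output above; the helper name `relative_change` is introduced here for illustration and is not part of PLUMED.

```python
# Total time in seconds over 500 cycles, copied from the PLUMED timer
# output above (laptop run).
before = {  # master before this commit
    "Sharing data": 0.170590,
    "Calculating (forward loop)": 1.898306,
    "Applying (backward loop)": 0.763545,
}
after = {  # master with this commit
    "Sharing data": 0.168367,
    "Calculating (forward loop)": 1.893861,
    "Applying (backward loop)": 0.594073,
}


def relative_change(old, new):
    """Fractional change in time; a negative value means faster."""
    return (new - old) / old


changes = {stage: relative_change(before[stage], after[stage]) for stage in before}

for stage, change in changes.items():
    print(f"{stage:30s} {change * 100:+6.1f}%")
```

On these numbers, essentially all of the gain is in stage 5 (the backward loop, roughly 22% faster), while the forward loop changes by less than 1%.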

codecov-commenter commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.25%. Comparing base (267c68f) to head (8701a7d). Report is 1 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1043      +/-   ##
==========================================
- Coverage   83.27%   83.25%   -0.03%
==========================================
  Files         619      619
  Lines       59216    59216
==========================================
- Hits        49315    49300      -15
- Misses       9901     9916      +15
```


GiovanniBussi commented 3 months ago

The speedup is smaller but still measurable when using the Intel compiler with the system BLAS on my workstation (slowdown reduced from 32% to 29%), so I think I can merge this.

BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.808588     0.808588     0.808588     0.808588
BENCH:  B0 First step                                      1     0.016654     0.016654     0.016654     0.016654
BENCH:  B1 Warm-up                                       399     3.560118     0.008923     0.008333     0.026649
BENCH:  B2 Calculation part 1                            800     7.011192     0.008764     0.008338     0.012098
BENCH:  B3 Calculation part 2                            800     6.990790     0.008738     0.008322     0.011580
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    18.354650    18.354650    18.354650    18.354650
PLUMED: 1 Prepare dependencies                          2000     0.008286     0.000004     0.000002     0.000018
PLUMED: 2 Sharing data                                  2000     1.602436     0.000801     0.000499     0.005897
PLUMED: 3 Waiting for data                              2000     0.003293     0.000002     0.000001     0.000014
PLUMED: 4 Calculating (forward loop)                    2000    12.499940     0.006250     0.006114     0.018466
PLUMED: 5 Applying (backward loop)                      2000     3.349057     0.001675     0.001641     0.004945
PLUMED: 6 Update                                        2000     0.004910     0.000002     0.000002     0.000013
BENCH:
BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.325 +- 0.001
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.989982     0.989982     0.989982     0.989982
BENCH:  B0 First step                                      1     0.021248     0.021248     0.021248     0.021248
BENCH:  B1 Warm-up                                       399     4.729897     0.011854     0.011131     0.029778
BENCH:  B2 Calculation part 1                            800     9.287099     0.011609     0.011132     0.020115
BENCH:  B3 Calculation part 2                            800     9.269063     0.011586     0.011125     0.017037
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    24.268297    24.268297    24.268297    24.268297
PLUMED: 1 Prepare dependencies                          2000     0.011475     0.000006     0.000003     0.000019
PLUMED: 2 Sharing data                                  2000     1.379957     0.000690     0.000537     0.003343
PLUMED: 3 Waiting for data                              2000     0.010423     0.000005     0.000004     0.000018
PLUMED: 4 Calculating (forward loop)                    2000    16.268256     0.008134     0.007786     0.018418
PLUMED: 5 Applying (backward loop)                      2000     5.528992     0.002764     0.002709     0.019713
PLUMED: 6 Update                                        2000     0.021584     0.000011     0.000008     0.000031
BENCH:
BENCH:  Kernel:      this
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.287 +- 0.001
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.494030     0.494030     0.494030     0.494030
BENCH:  B0 First step                                      1     0.030694     0.030694     0.030694     0.030694
BENCH:  B1 Warm-up                                       399     4.585869     0.011493     0.010771     0.037033
BENCH:  B2 Calculation part 1                            800     9.031793     0.011290     0.010769     0.017739
BENCH:  B3 Calculation part 2                            800     8.991378     0.011239     0.010752     0.017647
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    23.114919    23.114919    23.114919    23.114919
PLUMED: 1 Prepare dependencies                          2000     0.010794     0.000005     0.000003     0.000023
PLUMED: 2 Sharing data                                  2000     1.394624     0.000697     0.000533     0.003837
PLUMED: 3 Waiting for data                              2000     0.009667     0.000005     0.000004     0.000035
PLUMED: 4 Calculating (forward loop)                    2000    16.324953     0.008162     0.007791     0.029984
PLUMED: 5 Applying (backward loop)                      2000     4.799277     0.002400     0.002368     0.010953
PLUMED: 6 Update                                        2000     0.021809     0.000011     0.000009     0.000024
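
The slowdown percentages quoted in this thread can be reproduced from the `Comparative` lines. The sketch below assumes that `Comparative` is the runtime ratio relative to the first kernel listed (v2.9), which matches the 26% and 18% figures quoted for the laptop; the function names `slowdown_percent` and `gap_recovered` are introduced here for illustration only.

```python
# "Comparative" values copied from the BENCH output in this thread.
# Assumption: each value is the runtime ratio relative to the v2.9 kernel.
runs = {
    "laptop": {"before": 1.260, "after": 1.182},
    "workstation (intel)": {"before": 1.325, "after": 1.287},
}


def slowdown_percent(ratio):
    """Slowdown relative to the reference kernel, in percent."""
    return (ratio - 1.0) * 100.0


def gap_recovered(before, after):
    """Fraction of the slowdown with respect to v2.9 removed by the commit."""
    return (before - after) / (before - 1.0)


for machine, r in runs.items():
    print(
        f"{machine}: {slowdown_percent(r['before']):.1f}% -> "
        f"{slowdown_percent(r['after']):.1f}% slowdown, "
        f"{gap_recovered(r['before'], r['after']) * 100:.0f}% of the gap recovered"
    )
```

Read this way, the commit removes about 30% of the regression with respect to v2.9 on the laptop and about 12% on the workstation.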
