Optimize wholemolecules

GiovanniBussi commented 6 months ago

@gtribello here I tried a trick similar to #1044 to optimize wholemolecules. Notice that I didn't touch the EMST stuff, so tests that use it are expected to fail. Tests with simple wholemolecules should work anyway.
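For readers who do not have the action in mind, here is a minimal sketch of the chain-wise reconstruction that a simple (non-EMST) wholemolecules performs as I understand its documented behaviour: the first atom stays put and each later atom is moved to the periodic image closest to the atom listed before it. This is not the code touched by this PR; `make_whole` is a hypothetical name and the orthorhombic box is a simplifying assumption.

```python
def make_whole(positions, box):
    """Rebuild one molecule in an orthorhombic box (illustrative only).

    positions: list of [x, y, z], in the order the atoms are listed.
    box: [Lx, Ly, Lz] edge lengths.
    """
    whole = [list(positions[0])]  # the first atom is never moved
    for atom in positions[1:]:
        prev = whole[-1]
        shifted = []
        for x, x_prev, length in zip(atom, prev, box):
            d = x - x_prev
            d -= length * round(d / length)  # minimum-image displacement
            shifted.append(x_prev + d)
        whole.append(shifted)
    return whole


# A three-atom chain split across the x boundary of a 10 x 10 x 10 box.
broken = [[9.5, 0.0, 0.0], [0.3, 0.0, 0.0], [1.1, 0.0, 0.0]]
whole = make_whole(broken, [10.0, 10.0, 10.0])
```

Each atom depends on the previous one, so the loop is inherently sequential within a molecule, which is part of why this action shows up in the timings.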

On my usual input I get some further speedup, with overhead relative to v2.9 going from about 13% to about 11% (Comparative 1.134 -> 1.106 in the logs below).

On @carlocamilloni's input (see here) it's even better. Before this commit, I get the same performance as v2.9. After this commit I gain something like 5%.

Maybe @gtribello you want to have a look at this code and think about whether there's some other reasonable (and simpler) solution. Notice that these PRs are all stacked onto each other, so you should only look at the last commit (but the performance is measured including the previous PRs).

With my input:

BENCH:  Running comparative analysis, 800 blocks with size 1
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/tmp/v2.9-mpi-ref/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.162008     0.162008     0.162008     0.162008
BENCH:  B0 First step                                      1     0.008341     0.008341     0.008341     0.008341
BENCH:  B1 Warm-up                                       199     0.881318     0.004429     0.004052     0.006962
BENCH:  B2 Calculation part 1                            400     1.863671     0.004659     0.004073     0.013056
BENCH:  B3 Calculation part 2                            400     1.888927     0.004722     0.004100     0.012414
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     4.798456     4.798456     4.798456     4.798456
PLUMED: 1 Prepare dependencies                          1000     0.002245     0.000002     0.000001     0.000047
PLUMED: 2 Sharing data                                  1000     0.293262     0.000293     0.000232     0.001090
PLUMED: 3 Waiting for data                              1000     0.000644     0.000001     0.000000     0.000012
PLUMED: 4 Calculating (forward loop)                    1000     3.360588     0.003361     0.002908     0.009978
PLUMED: 5 Applying (backward loop)                      1000     0.963043     0.000963     0.000824     0.009227
PLUMED: 6 Update                                        1000     0.001164     0.000001     0.000001     0.000022
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/tmp/reference/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.134 +- 0.005
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.171429     0.171429     0.171429     0.171429
BENCH:  B0 First step                                      1     0.010750     0.010750     0.010750     0.010750
BENCH:  B1 Warm-up                                       199     0.993128     0.004991     0.004647     0.006216
BENCH:  B2 Calculation part 1                            400     2.108566     0.005271     0.004644     0.013903
BENCH:  B3 Calculation part 2                            400     2.146938     0.005367     0.004638     0.011803
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     5.425472     5.425472     5.425472     5.425472
PLUMED: 1 Prepare dependencies                          1000     0.003131     0.000003     0.000001     0.000033
PLUMED: 2 Sharing data                                  1000     0.347163     0.000347     0.000284     0.002668
PLUMED: 3 Waiting for data                              1000     0.002986     0.000003     0.000001     0.000037
PLUMED: 4 Calculating (forward loop)                    1000     3.781152     0.003781     0.003307     0.009611
PLUMED: 5 Applying (backward loop)                      1000     1.101889     0.001102     0.000954     0.003914
PLUMED: 6 Update                                        1000     0.004631     0.000005     0.000003     0.000095
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/src/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.106 +- 0.006
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.169502     0.169502     0.169502     0.169502
BENCH:  B0 First step                                      1     0.010354     0.010354     0.010354     0.010354
BENCH:  B1 Warm-up                                       199     0.969736     0.004873     0.004506     0.005702
BENCH:  B2 Calculation part 1                            400     2.062366     0.005156     0.004521     0.012825
BENCH:  B3 Calculation part 2                            400     2.085763     0.005214     0.004497     0.011861
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     5.293271     5.293271     5.293271     5.293271
PLUMED: 1 Prepare dependencies                          1000     0.003049     0.000003     0.000001     0.000043
PLUMED: 2 Sharing data                                  1000     0.353476     0.000353     0.000280     0.004270
PLUMED: 3 Waiting for data                              1000     0.002797     0.000003     0.000001     0.000017
PLUMED: 4 Calculating (forward loop)                    1000     3.641979     0.003642     0.003180     0.008875
PLUMED: 5 Applying (backward loop)                      1000     1.104568     0.001105     0.000953     0.006503
PLUMED: 6 Update                                        1000     0.004674     0.000005     0.000003     0.000105

With Carlo's input:

BENCH:  Kernel:      /Users/bussi/plumed2/tmp/v2.9-mpi-ref/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.050107     0.050107     0.050107     0.050107
BENCH:  B0 First step                                      1     0.013779     0.013779     0.013779     0.013779
BENCH:  B1 Warm-up                                       199     1.104527     0.005550     0.000092     0.030359
BENCH:  B2 Calculation part 1                            400     2.557483     0.006394     0.000092     0.033349
BENCH:  B3 Calculation part 2                            400     2.640922     0.006602     0.000091     0.034374
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     6.364339     6.364339     6.364339     6.364339
PLUMED: 1 Prepare dependencies                          1000     0.000861     0.000001     0.000000     0.000016
PLUMED: 2 Sharing data                                  1000     0.049020     0.000049     0.000036     0.001395
PLUMED: 3 Waiting for data                              1000     0.000258     0.000000     0.000000     0.000004
PLUMED: 4 Calculating (forward loop)                    1000     6.229546     0.006230     0.000037     0.034195
PLUMED: 5 Applying (backward loop)                      1000     0.026668     0.000027     0.000012     0.000322
PLUMED: 6 Update                                        1000     0.000623     0.000001     0.000000     0.000325
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/tmp/reference/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.003 +- 0.002
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.053169     0.053169     0.053169     0.053169
BENCH:  B0 First step                                      1     0.012605     0.012605     0.012605     0.012605
BENCH:  B1 Warm-up                                       199     1.100971     0.005533     0.000093     0.030146
BENCH:  B2 Calculation part 1                            400     2.563696     0.006409     0.000093     0.033939
BENCH:  B3 Calculation part 2                            400     2.651712     0.006629     0.000090     0.034148
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     6.379922     6.379922     6.379922     6.379922
PLUMED: 1 Prepare dependencies                          1000     0.001426     0.000001     0.000000     0.000034
PLUMED: 2 Sharing data                                  1000     0.049450     0.000049     0.000030     0.000857
PLUMED: 3 Waiting for data                              1000     0.001059     0.000001     0.000000     0.000030
PLUMED: 4 Calculating (forward loop)                    1000     6.249724     0.006250     0.000055     0.033957
PLUMED: 5 Applying (backward loop)                      1000     0.015287     0.000015     0.000000     0.000734
PLUMED: 6 Update                                        1000     0.001769     0.000002     0.000000     0.000377
BENCH:  
BENCH:  Kernel:      /Users/bussi/plumed2/src/lib/libplumedKernel.dylib
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 0.950 +- 0.002
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.052167     0.052167     0.052167     0.052167
BENCH:  B0 First step                                      1     0.011640     0.011640     0.011640     0.011640
BENCH:  B1 Warm-up                                       199     1.035437     0.005203     0.000089     0.030170
BENCH:  B2 Calculation part 1                            400     2.425335     0.006063     0.000088     0.032003
BENCH:  B3 Calculation part 2                            400     2.511439     0.006279     0.000087     0.033047
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     6.034105     6.034105     6.034105     6.034105
PLUMED: 1 Prepare dependencies                          1000     0.001266     0.000001     0.000000     0.000028
PLUMED: 2 Sharing data                                  1000     0.050709     0.000051     0.000031     0.000961
PLUMED: 3 Waiting for data                              1000     0.001011     0.000001     0.000000     0.000023
PLUMED: 4 Calculating (forward loop)                    1000     5.903333     0.005903     0.000050     0.032839
PLUMED: 5 Applying (backward loop)                      1000     0.015169     0.000015     0.000000     0.000758
PLUMED: 6 Update                                        1000     0.001846     0.000002     0.000000     0.000371
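The overhead percentages quoted in this comment can be read straight off the `Comparative` lines. A small sketch, assuming each `Comparative` value is the run time relative to the first kernel listed (which is what the 1.000 +- 0.000 on the first entry suggests); `overheads` is a hypothetical helper, not part of PLUMED.

```python
import re

# Matches lines such as "BENCH:  Comparative: 1.134 +- 0.005"
COMPARATIVE = re.compile(r"BENCH:\s+Comparative:\s+([0-9.]+)\s+\+-\s+([0-9.]+)")


def overheads(log_text):
    """Return one (overhead %, error %) pair per kernel, in log order."""
    result = []
    for match in COMPARATIVE.finditer(log_text):
        ratio, err = float(match.group(1)), float(match.group(2))
        result.append(((ratio - 1.0) * 100.0, err * 100.0))
    return result


# Comparative lines copied from the "With my input" run above.
log = """\
BENCH:  Comparative: 1.000 +- 0.000
BENCH:  Comparative: 1.134 +- 0.005
BENCH:  Comparative: 1.106 +- 0.006
"""
for pct, err in overheads(log):
    print(f"{pct:+.1f}% +- {err:.1f}%")
```

A negative value, as for the last kernel in the run with Carlo's input (0.950), means the kernel is faster than the v2.9 reference.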

GiovanniBussi commented 6 months ago

And here are the results for the Intel compiler on my workstation. The reference is current master (just pulled); I then time master plus this optimization of wholemolecules.

My input: 28% -> 26% overhead

Carlo's input: no measurable overhead in either case (<1%)

BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.439503     0.439503     0.439503     0.439503
BENCH:  B0 First step                                      1     0.018843     0.018843     0.018843     0.018843
BENCH:  B1 Warm-up                                       199     1.762787     0.008858     0.008241     0.012670
BENCH:  B2 Calculation part 1                            400     3.468244     0.008671     0.008249     0.012408
BENCH:  B3 Calculation part 2                            400     3.465002     0.008663     0.008249     0.011359
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1     9.140007     9.140007     9.140007     9.140007
PLUMED: 1 Prepare dependencies                          1000     0.003898     0.000004     0.000002     0.000016
PLUMED: 2 Sharing data                                  1000     0.802382     0.000802     0.000497     0.002601
PLUMED: 3 Waiting for data                              1000     0.001643     0.000002     0.000001     0.000014
PLUMED: 4 Calculating (forward loop)                    1000     6.209310     0.006209     0.006058     0.013588
PLUMED: 5 Applying (backward loop)                      1000     1.643378     0.001643     0.001610     0.002541
PLUMED: 6 Update                                        1000     0.002249     0.000002     0.000001     0.000011
BENCH:  
BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.276 +- 0.002
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.458510     0.458510     0.458510     0.458510
BENCH:  B0 First step                                      1     0.021890     0.021890     0.021890     0.021890
BENCH:  B1 Warm-up                                       199     2.256600     0.011340     0.010602     0.018078
BENCH:  B2 Calculation part 1                            400     4.421297     0.011053     0.010608     0.016984
BENCH:  B3 Calculation part 2                            400     4.424044     0.011060     0.010600     0.015301
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    11.568168    11.568168    11.568168    11.568168
PLUMED: 1 Prepare dependencies                          1000     0.004773     0.000005     0.000003     0.000015
PLUMED: 2 Sharing data                                  1000     0.681942     0.000682     0.000530     0.001941
PLUMED: 3 Waiting for data                              1000     0.006367     0.000006     0.000006     0.000022
PLUMED: 4 Calculating (forward loop)                    1000     8.122590     0.008123     0.007776     0.015270
PLUMED: 5 Applying (backward loop)                      1000     2.252827     0.002253     0.002219     0.004536
PLUMED: 6 Update                                        1000     0.008153     0.000008     0.000007     0.000020
BENCH:  
BENCH:  Kernel:      /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.255 +- 0.001
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.441582     0.441582     0.441582     0.441582
BENCH:  B0 First step                                      1     0.027241     0.027241     0.027241     0.027241
BENCH:  B1 Warm-up                                       199     2.213178     0.011121     0.010403     0.017219
BENCH:  B2 Calculation part 1                            400     4.352620     0.010882     0.010401     0.017213
BENCH:  B3 Calculation part 2                            400     4.348310     0.010871     0.010397     0.016596
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    11.370794    11.370794    11.370794    11.370794
PLUMED: 1 Prepare dependencies                          1000     0.004210     0.000004     0.000002     0.000016
PLUMED: 2 Sharing data                                  1000     0.686542     0.000687     0.000526     0.003764
PLUMED: 3 Waiting for data                              1000     0.004769     0.000005     0.000004     0.000038
PLUMED: 4 Calculating (forward loop)                    1000     7.938052     0.007938     0.007563     0.017875
PLUMED: 5 Applying (backward loop)                      1000     2.260929     0.002261     0.002225     0.005465
PLUMED: 6 Update                                        1000     0.007763     0.000008     0.000007     0.000022

BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.000 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.138326     0.138326     0.138326     0.138326
BENCH:  B0 First step                                      1     0.185941     0.185941     0.185941     0.185941
BENCH:  B1 Warm-up                                       199     3.294411     0.016555     0.000207     0.090417
BENCH:  B2 Calculation part 1                            400     7.457271     0.018643     0.000207     0.095864
BENCH:  B3 Calculation part 2                            400     7.553318     0.018883     0.000207     0.097540
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    18.619273    18.619273    18.619273    18.619273
PLUMED: 1 Prepare dependencies                          1000     0.001505     0.000002     0.000001     0.000012
PLUMED: 2 Sharing data                                  1000     0.097701     0.000098     0.000088     0.001876
PLUMED: 3 Waiting for data                              1000     0.000837     0.000001     0.000001     0.000010
PLUMED: 4 Calculating (forward loop)                    1000    18.178325     0.018178     0.000062     0.097289
PLUMED: 5 Applying (backward loop)                      1000     0.043801     0.000044     0.000028     0.000621
PLUMED: 6 Update                                        1000     0.141444     0.000141     0.000001     0.140419
BENCH:  
BENCH:  Kernel:      /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.002 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.152307     0.152307     0.152307     0.152307
BENCH:  B0 First step                                      1     0.061523     0.061523     0.061523     0.061523
BENCH:  B1 Warm-up                                       199     3.296157     0.016564     0.000220     0.090576
BENCH:  B2 Calculation part 1                            400     7.479278     0.018698     0.000219     0.096017
BENCH:  B3 Calculation part 2                            400     7.562410     0.018906     0.000220     0.097389
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    18.541842    18.541842    18.541842    18.541842
PLUMED: 1 Prepare dependencies                          1000     0.001789     0.000002     0.000001     0.000013
PLUMED: 2 Sharing data                                  1000     0.091294     0.000091     0.000071     0.002350
PLUMED: 3 Waiting for data                              1000     0.004282     0.000004     0.000004     0.000026
PLUMED: 4 Calculating (forward loop)                    1000    18.245589     0.018246     0.000115     0.097074
PLUMED: 5 Applying (backward loop)                      1000     0.024121     0.000024     0.000001     0.001297
PLUMED: 6 Update                                        1000     0.002991     0.000003     0.000002     0.000521
BENCH:  
BENCH:  Kernel:      /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH:  Input:       plumed.dat
BENCH:  Comparative: 1.003 +- 0.000
BENCH:                                                Cycles        Total      Average      Minimum      Maximum
BENCH:  A Initialization                                   1     0.148465     0.148465     0.148465     0.148465
BENCH:  B0 First step                                      1     0.064416     0.064416     0.064416     0.064416
BENCH:  B1 Warm-up                                       199     3.300723     0.016587     0.000198     0.090649
BENCH:  B2 Calculation part 1                            400     7.484045     0.018710     0.000198     0.096261
BENCH:  B3 Calculation part 2                            400     7.564513     0.018911     0.000198     0.097447
PLUMED:                                               Cycles        Total      Average      Minimum      Maximum
PLUMED:                                                    1    18.552670    18.552670    18.552670    18.552670
PLUMED: 1 Prepare dependencies                          1000     0.001862     0.000002     0.000001     0.000012
PLUMED: 2 Sharing data                                  1000     0.092785     0.000093     0.000071     0.003759
PLUMED: 3 Waiting for data                              1000     0.004292     0.000004     0.000003     0.000028
PLUMED: 4 Calculating (forward loop)                    1000    18.260201     0.018260     0.000094     0.097126
PLUMED: 5 Applying (backward loop)                      1000     0.022730     0.000023     0.000001     0.001235
PLUMED: 6 Update                                        1000     0.002970     0.000003     0.000002     0.000508
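To see which phase a change actually affects, the per-section `PLUMED:` timer rows can be compared between two kernels. A rough sketch, assuming the row layout shown in the logs above (numbered label, cycle count, then total, average, minimum and maximum); `parse_timers` is a hypothetical helper. The sample rows are copied from the first Intel run in this comment (intel-reference versus the patched source tree).

```python
import re

# A numbered timer row: label, cycles, total, average, minimum, maximum.
ROW = re.compile(
    r"^PLUMED:\s+(\d+ .+?)\s+(\d+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s+([0-9.]+)\s*$"
)


def parse_timers(block):
    """Map each numbered section label to its total time in seconds."""
    totals = {}
    for line in block.splitlines():
        match = ROW.match(line)
        if match:
            totals[match.group(1)] = float(match.group(3))
    return totals


reference = """\
PLUMED: 4 Calculating (forward loop)                    1000     8.122590     0.008123     0.007776     0.015270
PLUMED: 5 Applying (backward loop)                      1000     2.252827     0.002253     0.002219     0.004536
"""
patched = """\
PLUMED: 4 Calculating (forward loop)                    1000     7.938052     0.007938     0.007563     0.017875
PLUMED: 5 Applying (backward loop)                      1000     2.260929     0.002261     0.002225     0.005465
"""
ref, new = parse_timers(reference), parse_timers(patched)
for label in ref:
    change = (new[label] / ref[label] - 1.0) * 100.0
    print(f"{label:32s} {change:+.1f}%")
```

On these rows the difference sits in the forward loop, while the backward loop is essentially unchanged.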

gtribello commented 6 months ago

Hello @GiovanniBussi

What you have done here seems sensible. I don't think I can do it better.

GiovanniBussi commented 6 months ago

@gtribello I really don't like the way it's done, because it's intrusive in what's supposed to be "user code" (wholemolecules), with modifications that are difficult to understand. I also want to repeat the timings, because there is an interplay between all the optimizations we are doing. This one might not be so relevant, so I would keep it on hold.

plumed / plumed2

Optimize wholemolecules #1045