GiovanniBussi closed this 4 months ago
@GiovanniBussi I had a look at this and I think it is fine to merge it. It looks like it is the same.
I don't know why my computer was not reporting the slowdown for WHOLEMOLECULES. Perhaps my compiler automatically inlined those functions?
Maybe it can inline across objects. Did you try on the ARM Mac? I will test with mine as well.
I tested it on the ARM Mac. It is the only one I have :-(
I actually managed to try. Very interesting, and very weird. It looks like the apply phase doesn't do anything in htt (I guess this optimization was missing before), while calculate is instead slightly slower in htt. This is consistent with what Gareth has seen before.
Raw data below.
bash-3.2$ cat pre
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.551619 1.551619 1.551619 1.551619
PLUMED: 1 Prepare dependencies 2000 0.000081 0.000000 0.000000 0.000009
PLUMED: 2 Sharing data 2000 0.172640 0.000086 0.000085 0.000242
PLUMED: 3 Waiting for data 2000 0.000051 0.000000 0.000000 0.000000
PLUMED: 4 Calculating (forward loop) 2000 0.992900 0.000496 0.000489 0.000566
PLUMED: 5 Applying (backward loop) 2000 0.314155 0.000157 0.000154 0.000210
PLUMED: 6 Update 2000 0.000054 0.000000 0.000000 0.000007
bash-3.2$ cat master
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.327825 1.327825 1.327825 1.327825
PLUMED: 1 Prepare dependencies 2000 0.000158 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.185173 0.000093 0.000092 0.000193
PLUMED: 3 Waiting for data 2000 0.000146 0.000000 0.000000 0.000009
PLUMED: 4 Calculating (forward loop) 2000 1.096683 0.000548 0.000543 0.000621
PLUMED: 5 Applying (backward loop) 2000 0.000077 0.000000 0.000000 0.000001
PLUMED: 6 Update 2000 0.000061 0.000000 0.000000 0.000003
bash-3.2$ cat opt
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.328747 1.328747 1.328747 1.328747
PLUMED: 1 Prepare dependencies 2000 0.000151 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.187392 0.000094 0.000093 0.000196
PLUMED: 3 Waiting for data 2000 0.000129 0.000000 0.000000 0.000001
PLUMED: 4 Calculating (forward loop) 2000 1.095795 0.000548 0.000542 0.000596
PLUMED: 5 Applying (backward loop) 2000 0.000096 0.000000 0.000000 0.000000
PLUMED: 6 Update 2000 0.000054 0.000000 0.000000 0.000002
bash-3.2$ cat pre.force
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.519698 1.519698 1.519698 1.519698
PLUMED: 1 Prepare dependencies 2000 0.000092 0.000000 0.000000 0.000000
PLUMED: 2 Sharing data 2000 0.171945 0.000086 0.000085 0.000245
PLUMED: 3 Waiting for data 2000 0.000050 0.000000 0.000000 0.000001
PLUMED: 4 Calculating (forward loop) 2000 0.990990 0.000495 0.000492 0.000564
PLUMED: 5 Applying (backward loop) 2000 0.311235 0.000156 0.000153 0.000229
PLUMED: 6 Update 2000 0.000073 0.000000 0.000000 0.000005
bash-3.2$ cat master.force
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.789899 1.789899 1.789899 1.789899
PLUMED: 1 Prepare dependencies 2000 0.000223 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.187675 0.000094 0.000093 0.000193
PLUMED: 3 Waiting for data 2000 0.000144 0.000000 0.000000 0.000002
PLUMED: 4 Calculating (forward loop) 2000 1.149530 0.000575 0.000562 0.000631
PLUMED: 5 Applying (backward loop) 2000 0.407120 0.000204 0.000202 0.000246
PLUMED: 6 Update 2000 0.000109 0.000000 0.000000 0.000006
bash-3.2$ cat opt.force
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.764365 1.764365 1.764365 1.764365
PLUMED: 1 Prepare dependencies 2000 0.000219 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.185834 0.000093 0.000092 0.000196
PLUMED: 3 Waiting for data 2000 0.000135 0.000000 0.000000 0.000002
PLUMED: 4 Calculating (forward loop) 2000 1.128362 0.000564 0.000555 0.000620
PLUMED: 5 Applying (backward loop) 2000 0.404132 0.000202 0.000200 0.000243
PLUMED: 6 Update 2000 0.000113 0.000000 0.000000 0.000006
And both tests were done with identical compilers, just a different processor:
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
vs
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: x86_64-apple-darwin22.6.0
Thread model: posix
Actually, I found the mistake. My original timings were wrong; I think there was some issue with loading the proper PLUMED_KERNEL when doing the tests. I likely compared pre-htt with itself. Now, also on INTEL, I get a significant difference between pre-htt and current master when applying forces:
pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 2.593459 2.593459 2.593459 2.593459
PLUMED: 1 Prepare dependencies 2000 0.000949 0.000000 0.000000 0.000027
PLUMED: 2 Sharing data 2000 0.493889 0.000247 0.000188 0.000744
PLUMED: 3 Waiting for data 2000 0.000353 0.000000 0.000000 0.000007
PLUMED: 4 Calculating (forward loop) 2000 1.490857 0.000745 0.000613 0.001737
PLUMED: 5 Applying (backward loop) 2000 0.497509 0.000249 0.000194 0.000770
PLUMED: 6 Update 2000 0.000829 0.000000 0.000000 0.000037
master:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 4.278036 4.278036 4.278036 4.278036
PLUMED: 1 Prepare dependencies 2000 0.002487 0.000001 0.000001 0.000073
PLUMED: 2 Sharing data 2000 0.625973 0.000313 0.000240 0.000894
PLUMED: 3 Waiting for data 2000 0.001588 0.000001 0.000000 0.000032
PLUMED: 4 Calculating (forward loop) 2000 2.611230 0.001306 0.001100 0.002691
PLUMED: 5 Applying (backward loop) 2000 0.917656 0.000459 0.000372 0.001163
PLUMED: 6 Update 2000 0.000984 0.000000 0.000000 0.000018
@GiovanniBussi was there anything in the old master code that knew that forces didn't need to be applied to atoms that were only requested by WHOLEMOLECULES? I ask because if you think about the input you are using:
WHOLEMOLECULES ENTITY0=1-100000
pos: POSITION ATOM=1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
There is only a force on atom 1. I think in hack-the-tree you can get a lot of speed-up in apply (for that input) if you skip adding forces to atoms 2-100000. I think the same is true of master, though, as there is nothing in there that recognises that many atoms in that input have no force acting on them. I am worried this is a pretty niche thing to optimise, though, and I am not sure it would be useful in general.
How is the performance of the various versions if you do this input:
WHOLEMOLECULES ENTITY0=1-100000
c1: COM ATOMS=1-100000
pos: POSITION ATOM=c1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
I think that the optimization you are mentioning is not used anywhere (in either master or pre-htt). Anyway, I tried what you suggested and got this:
pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 9.718018 9.718018 9.718018 9.718018
PLUMED: 1 Prepare dependencies 2000 0.003129 0.000002 0.000001 0.000028
PLUMED: 2 Sharing data 2000 0.569606 0.000285 0.000230 0.000903
PLUMED: 3 Waiting for data 2000 0.000819 0.000000 0.000000 0.000017
PLUMED: 4 Calculating (forward loop) 2000 6.908385 0.003454 0.002874 0.007172
PLUMED: 5 Applying (backward loop) 2000 2.000433 0.001000 0.000823 0.002401
PLUMED: 6 Update 2000 0.001787 0.000001 0.000000 0.000050
current master:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 12.107791 12.107791 12.107791 12.107791
PLUMED: 1 Prepare dependencies 2000 0.004344 0.000002 0.000001 0.000027
PLUMED: 2 Sharing data 2000 0.686253 0.000343 0.000280 0.000946
PLUMED: 3 Waiting for data 2000 0.002432 0.000001 0.000000 0.000084
PLUMED: 4 Calculating (forward loop) 2000 9.105021 0.004553 0.003803 0.008684
PLUMED: 5 Applying (backward loop) 2000 2.057898 0.001029 0.000838 0.002042
PLUMED: 6 Update 2000 0.001813 0.000001 0.000001 0.000024
I guess the difference might now come from both CENTER and WHOLEMOLECULES, however.
@GiovanniBussi OK, but now you see that the difference in the timings for applying forces disappears, i.e. step 5 (Applying) is the same for both setups.
I think I might be missing something here. Is the problem that this input:
WHOLEMOLECULES ENTITY0=1-100000
pos: POSITION ATOM=1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
Is significantly slower than this one:
WHOLEMOLECULES ENTITY0=1-100000
?
I think the problem is not (only) related to force application. Even when not applying forces you get a significantly slower calculate.
I suspect there might also be a bug. Pre htt code looks like this:
void WholeMolecules::calculate() {
  for(unsigned i=0; i<groups.size(); ++i) {
    if(addref) {
      Vector & first (modifyGlobalPosition(groups[i][0]));
      first = refs[i]+pbcDistance(refs[i],first);
    }
    for(unsigned j=0; j<groups[i].size()-1; ++j) {
      const Vector & first (getGlobalPosition(roots[i][j]));
      Vector & second (modifyGlobalPosition(groups[i][j+1]));
      second=first+pbcDistance(first,second);
    }
  }
}
Master looks like this:
for(unsigned i=0; i<p_groups.size(); ++i) {
  Vector first = getGlobalPosition(p_groups[i][0]);
  if(addref) {
    first = refs[i]+pbcDistance(refs[i],first);
    setGlobalPosition( p_groups[i][0], first );
  }
  for(unsigned j=1; j<p_groups[i].size(); ++j) {
    Vector second=getGlobalPosition(p_groups[i][j]);
    first = first+pbcDistance(first,second);
    setGlobalPosition(p_groups[i][j], first );
  }
}
I think you are not using p_roots. From a superficial look, it seems this will introduce a bug in EMST, which by the way is not tested.
EDIT: fixed at 1b05292e4c2bb34400aec1c6bbeb207092944b38
@GiovanniBussi So does:
WHOLEMOLECULES ENTITY0=1-100000
take the same time as:
WHOLEMOLECULES ENTITY0=1-100000
pos: POSITION ATOM=1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
or not?
From what you wrote earlier in this thread it seems that the slowdown is much larger for the second input. My reading was that the timings for the first input are the same in htt and master. I'm confused and concerned that we seem to be discussing two things at once.
I think you are confused by my mistake above, I made a mess sorry.
I ran them again now; I get something like this on INTEL. Timings on ARM are similar (same "ranking", but the differences between branches are somewhat smaller).
without htt stuff
no force, pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 2.769249 2.769249 2.769249 2.769249
PLUMED: 1 Prepare dependencies 2000 0.000914 0.000000 0.000000 0.000021
PLUMED: 2 Sharing data 2000 0.541431 0.000271 0.000201 0.000891
PLUMED: 3 Waiting for data 2000 0.000609 0.000000 0.000000 0.000023
PLUMED: 4 Calculating (forward loop) 2000 1.577261 0.000789 0.000632 0.001633
PLUMED: 5 Applying (backward loop) 2000 0.526821 0.000263 0.000202 0.000754
PLUMED: 6 Update 2000 0.000821 0.000000 0.000000 0.000074
force, pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 2.777388 2.777388 2.777388 2.777388
PLUMED: 1 Prepare dependencies 2000 0.001080 0.000001 0.000000 0.000022
PLUMED: 2 Sharing data 2000 0.524391 0.000262 0.000199 0.000941
PLUMED: 3 Waiting for data 2000 0.000480 0.000000 0.000000 0.000021
PLUMED: 4 Calculating (forward loop) 2000 1.613729 0.000807 0.000661 0.002137
PLUMED: 5 Applying (backward loop) 2000 0.521073 0.000261 0.000201 0.000836
PLUMED: 6 Update 2000 0.000708 0.000000 0.000000 0.000041
Here you can see that pre-htt there's no significant optimization in place when you do not add any force. Also the cost of calculate is not affected, as expected.
with htt stuff
no force master (post htt)
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 3.054768 3.054768 3.054768 3.054768
PLUMED: 1 Prepare dependencies 2000 0.001544 0.000001 0.000000 0.000023
PLUMED: 2 Sharing data 2000 0.559304 0.000280 0.000205 0.000992
PLUMED: 3 Waiting for data 2000 0.001210 0.000001 0.000000 0.000024
PLUMED: 4 Calculating (forward loop) 2000 2.377854 0.001189 0.000981 0.002303
PLUMED: 5 Applying (backward loop) 2000 0.000898 0.000000 0.000000 0.000018
PLUMED: 6 Update 2000 0.000822 0.000000 0.000000 0.000284
force master (post htt)
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 4.343856 4.343856 4.343856 4.343856
PLUMED: 1 Prepare dependencies 2000 0.002183 0.000001 0.000000 0.000081
PLUMED: 2 Sharing data 2000 0.656710 0.000328 0.000248 0.004149
PLUMED: 3 Waiting for data 2000 0.001657 0.000001 0.000000 0.000035
PLUMED: 4 Calculating (forward loop) 2000 2.612328 0.001306 0.001071 0.005927
PLUMED: 5 Applying (backward loop) 2000 0.942530 0.000471 0.000371 0.004881
PLUMED: 6 Update 2000 0.001132 0.000001 0.000000 0.000021
The optimization is instead present in master (post htt merge). Notice that "calculate" also slows down when you apply forces; I am not sure why (it didn't happen pre htt). This optimization makes the overall performance without force closer to pre htt. Still, calculate is slower post htt than pre htt.
When using ARM, the slowdown of calculate that I was seeing was smaller (~10%) and was cancelled out by the optimization. I think this is why you reported htt as being faster than pre-htt with WHOLEMOLECULES (without forces).
I hope this is clearer (and there are no other mistakes).
In any case, the results (with forces) are significant on ARM as well, which means you can try yourself. What you should do is just check out master, compile it, and source it. Then run:
plumed-runtime benchmark --kernel /path/to/lib/libplumedKernel.dylib
with the kernel you want to test. If you use the installed master (and not the "sourceme") you have to do:
export PLUMED_KERNEL=/path/to/master/lib/libplumedKernel.dylib
/path/to/master/lib/plumed/plumed-runtime benchmark --kernel /path/to/lib/libplumedKernel.dylib
You need these tricks to run a different version of plumed within master.
@GiovanniBussi thanks for explaining. It is much clearer now. I will try to take a look at this in the next couple of days.
I tried to see (on INTEL, where the discrepancy is larger) where the slowdown comes from.
From the attempts I did here (hack-whole-molecules-do-not-merge) I guess it's memory access, though I am sure there's a better way to test this. In this branch I slowly destroyed wholemolecules functionality to make it faster.
@gtribello related to the instructions for running the benchmarks correctly: I should have fixed things on the master branch, so that if you install plumed you should be able to do:
/path/to/master/lib/plumed/plumed-runtime benchmark --kernel /path/to/lib/libplumedKernel.dylib
without setting PLUMED_KERNEL, both on MacOS and Linux.
I think that running plumed in this way (as if it was chosen at runtime) has a number of advantages, so we might decide at some point to swap this in for the real executable. I mean having /path/to/bin/plumed actually load the kernel at runtime, and at the same time making --runtime the default option when patching MD codes.
@gtribello I tried to optimize wholemolecules; it's mostly inlining stuff, but you might want to check whether I did things right.
What worries me a bit is that even with this impressive improvement (almost a factor of 2) I am only back to the pre-htt timings. Here are my timings (no graphs, but it should be easy to understand):
This is my input file:
So far so good, but @gtribello's previous tests reported wholemolecules as being faster in htt than in master, whereas I see it slower by a factor of ~2x (before these modifications).
Is there something that I am missing? I am doing my tests on an intel MacBook Pro.