GiovanniBussi closed this 4 months ago
@GiovanniBussi I had a look at this and I think it is fine to merge it. It looks like it is the same.
I don't know why my computer was not reporting the slowdown for WHOLEMOLECULES. Perhaps my compiler automatically inlined those functions?
Maybe it can inline across objects. Did you try on the ARM Mac? I will test with mine as well.
I tested it on the ARM Mac. It is the only one I have :-(
I actually managed to try. Very interesting, and very weird. It looks like the apply phase doesn't do anything in htt (I guess this optimization was missing before), while calculate is instead slightly slower in htt. This is consistent with what Gareth has seen before.
Raw data below.
bash-3.2$ cat pre
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.551619 1.551619 1.551619 1.551619
PLUMED: 1 Prepare dependencies 2000 0.000081 0.000000 0.000000 0.000009
PLUMED: 2 Sharing data 2000 0.172640 0.000086 0.000085 0.000242
PLUMED: 3 Waiting for data 2000 0.000051 0.000000 0.000000 0.000000
PLUMED: 4 Calculating (forward loop) 2000 0.992900 0.000496 0.000489 0.000566
PLUMED: 5 Applying (backward loop) 2000 0.314155 0.000157 0.000154 0.000210
PLUMED: 6 Update 2000 0.000054 0.000000 0.000000 0.000007
bash-3.2$ cat master
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.327825 1.327825 1.327825 1.327825
PLUMED: 1 Prepare dependencies 2000 0.000158 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.185173 0.000093 0.000092 0.000193
PLUMED: 3 Waiting for data 2000 0.000146 0.000000 0.000000 0.000009
PLUMED: 4 Calculating (forward loop) 2000 1.096683 0.000548 0.000543 0.000621
PLUMED: 5 Applying (backward loop) 2000 0.000077 0.000000 0.000000 0.000001
PLUMED: 6 Update 2000 0.000061 0.000000 0.000000 0.000003
bash-3.2$ cat opt
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.328747 1.328747 1.328747 1.328747
PLUMED: 1 Prepare dependencies 2000 0.000151 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.187392 0.000094 0.000093 0.000196
PLUMED: 3 Waiting for data 2000 0.000129 0.000000 0.000000 0.000001
PLUMED: 4 Calculating (forward loop) 2000 1.095795 0.000548 0.000542 0.000596
PLUMED: 5 Applying (backward loop) 2000 0.000096 0.000000 0.000000 0.000000
PLUMED: 6 Update 2000 0.000054 0.000000 0.000000 0.000002
bash-3.2$ cat pre.force
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.519698 1.519698 1.519698 1.519698
PLUMED: 1 Prepare dependencies 2000 0.000092 0.000000 0.000000 0.000000
PLUMED: 2 Sharing data 2000 0.171945 0.000086 0.000085 0.000245
PLUMED: 3 Waiting for data 2000 0.000050 0.000000 0.000000 0.000001
PLUMED: 4 Calculating (forward loop) 2000 0.990990 0.000495 0.000492 0.000564
PLUMED: 5 Applying (backward loop) 2000 0.311235 0.000156 0.000153 0.000229
PLUMED: 6 Update 2000 0.000073 0.000000 0.000000 0.000005
bash-3.2$ cat master.force
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.789899 1.789899 1.789899 1.789899
PLUMED: 1 Prepare dependencies 2000 0.000223 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.187675 0.000094 0.000093 0.000193
PLUMED: 3 Waiting for data 2000 0.000144 0.000000 0.000000 0.000002
PLUMED: 4 Calculating (forward loop) 2000 1.149530 0.000575 0.000562 0.000631
PLUMED: 5 Applying (backward loop) 2000 0.407120 0.000204 0.000202 0.000246
PLUMED: 6 Update 2000 0.000109 0.000000 0.000000 0.000006
bash-3.2$ cat opt.force
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 1.764365 1.764365 1.764365 1.764365
PLUMED: 1 Prepare dependencies 2000 0.000219 0.000000 0.000000 0.000001
PLUMED: 2 Sharing data 2000 0.185834 0.000093 0.000092 0.000196
PLUMED: 3 Waiting for data 2000 0.000135 0.000000 0.000000 0.000002
PLUMED: 4 Calculating (forward loop) 2000 1.128362 0.000564 0.000555 0.000620
PLUMED: 5 Applying (backward loop) 2000 0.404132 0.000202 0.000200 0.000243
PLUMED: 6 Update 2000 0.000113 0.000000 0.000000 0.000006
And both tests were done with identical compilers, just a different processor:
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: arm64-apple-darwin23.2.0
Thread model: posix
vs
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: x86_64-apple-darwin22.6.0
Thread model: posix
Actually, I found the mistake. My original timings were wrong; I think there was some issue with loading the proper PLUMED_KERNEL when doing the tests. I likely compared pre-htt with itself. Now, also on INTEL, I get a significant difference between pre-htt and current master when applying forces:
pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 2.593459 2.593459 2.593459 2.593459
PLUMED: 1 Prepare dependencies 2000 0.000949 0.000000 0.000000 0.000027
PLUMED: 2 Sharing data 2000 0.493889 0.000247 0.000188 0.000744
PLUMED: 3 Waiting for data 2000 0.000353 0.000000 0.000000 0.000007
PLUMED: 4 Calculating (forward loop) 2000 1.490857 0.000745 0.000613 0.001737
PLUMED: 5 Applying (backward loop) 2000 0.497509 0.000249 0.000194 0.000770
PLUMED: 6 Update 2000 0.000829 0.000000 0.000000 0.000037
master:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 4.278036 4.278036 4.278036 4.278036
PLUMED: 1 Prepare dependencies 2000 0.002487 0.000001 0.000001 0.000073
PLUMED: 2 Sharing data 2000 0.625973 0.000313 0.000240 0.000894
PLUMED: 3 Waiting for data 2000 0.001588 0.000001 0.000000 0.000032
PLUMED: 4 Calculating (forward loop) 2000 2.611230 0.001306 0.001100 0.002691
PLUMED: 5 Applying (backward loop) 2000 0.917656 0.000459 0.000372 0.001163
PLUMED: 6 Update 2000 0.000984 0.000000 0.000000 0.000018
@GiovanniBussi was there anything in the old master code that knew that forces didn't need to be applied to atoms that were only requested by WHOLEMOLECULES? I ask because if you think about the input you are using:
WHOLEMOLECULES ENTITY0=1-100000
pos: POSITION ATOM=1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
There is only a force on atom 1. I think in hack-the-tree you can get a lot of speed-up in apply (for that input) if you skip adding forces to atoms 2-100000. I think the same is true of master, though, as there is nothing in there that recognises that many atoms in that input have no force acting on them. I am worried this is a pretty niche thing to optimise, though, and I am not sure it would be useful in general.
How is the performance of the various versions if you do this input:
WHOLEMOLECULES ENTITY0=1-100000
c1: COM ATOMS=1-100000
pos: POSITION ATOM=c1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
I think that the optimization you are mentioning is not used anywhere (in either master or pre-htt). Anyway, I tried what you suggested and got this:
pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 9.718018 9.718018 9.718018 9.718018
PLUMED: 1 Prepare dependencies 2000 0.003129 0.000002 0.000001 0.000028
PLUMED: 2 Sharing data 2000 0.569606 0.000285 0.000230 0.000903
PLUMED: 3 Waiting for data 2000 0.000819 0.000000 0.000000 0.000017
PLUMED: 4 Calculating (forward loop) 2000 6.908385 0.003454 0.002874 0.007172
PLUMED: 5 Applying (backward loop) 2000 2.000433 0.001000 0.000823 0.002401
PLUMED: 6 Update 2000 0.001787 0.000001 0.000000 0.000050
current master:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 12.107791 12.107791 12.107791 12.107791
PLUMED: 1 Prepare dependencies 2000 0.004344 0.000002 0.000001 0.000027
PLUMED: 2 Sharing data 2000 0.686253 0.000343 0.000280 0.000946
PLUMED: 3 Waiting for data 2000 0.002432 0.000001 0.000000 0.000084
PLUMED: 4 Calculating (forward loop) 2000 9.105021 0.004553 0.003803 0.008684
PLUMED: 5 Applying (backward loop) 2000 2.057898 0.001029 0.000838 0.002042
PLUMED: 6 Update 2000 0.001813 0.000001 0.000001 0.000024
I guess the difference might now come from both CENTER and WHOLEMOLECULES, however.
@GiovanniBussi OK, but now you see that the difference in the timings for applying forces disappears, i.e. step 5 (Applying) is the same for both setups.
I think I might be missing something here. Is the problem that this input:
WHOLEMOLECULES ENTITY0=1-100000
pos: POSITION ATOM=1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
Is significantly slower than this one:
WHOLEMOLECULES ENTITY0=1-100000
?
I think the problem is not (only) related to force application. Even when not applying forces you get a significantly slower calculate.
I suspect there might also be a bug. Pre htt code looks like this:
void WholeMolecules::calculate() {
  for(unsigned i=0; i<groups.size(); ++i) {
    if(addref) {
      Vector & first (modifyGlobalPosition(groups[i][0]));
      first = refs[i]+pbcDistance(refs[i],first);
    }
    for(unsigned j=0; j<groups[i].size()-1; ++j) {
      const Vector & first (getGlobalPosition(roots[i][j]));
      Vector & second (modifyGlobalPosition(groups[i][j+1]));
      second=first+pbcDistance(first,second);
    }
  }
}
Master looks like this:
for(unsigned i=0; i<p_groups.size(); ++i) {
  Vector first = getGlobalPosition(p_groups[i][0]);
  if(addref) {
    first = refs[i]+pbcDistance(refs[i],first);
    setGlobalPosition( p_groups[i][0], first );
  }
  for(unsigned j=1; j<p_groups[i].size(); ++j) {
    Vector second=getGlobalPosition(p_groups[i][j]);
    first = first+pbcDistance(first,second);
    setGlobalPosition(p_groups[i][j], first );
  }
}
I think you are not using p_roots. From a superficial look, it seems this will introduce a bug in EMST, which by the way is not tested.
EDIT: fixed at 1b05292e4c2bb34400aec1c6bbeb207092944b38
@GiovanniBussi So does:
WHOLEMOLECULES ENTITY0=1-100000
take the same time as:
WHOLEMOLECULES ENTITY0=1-100000
pos: POSITION ATOM=1
RESTRAINT ARG=pos.x AT=0.0 KAPPA=1
or not?
From what you wrote earlier in this thread it seems that the slowdown is much larger for the second input. My reading was that the timings for the first input are the same in htt and master. I'm confused and concerned that we seem to be discussing two things at once.
I think you are confused by my mistake above, I made a mess sorry.
I ran them again now; I get something like this on INTEL. Timings on ARM are similar (same "ranking", but the differences between branches are somewhat smaller).
without htt stuff
no force, pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 2.769249 2.769249 2.769249 2.769249
PLUMED: 1 Prepare dependencies 2000 0.000914 0.000000 0.000000 0.000021
PLUMED: 2 Sharing data 2000 0.541431 0.000271 0.000201 0.000891
PLUMED: 3 Waiting for data 2000 0.000609 0.000000 0.000000 0.000023
PLUMED: 4 Calculating (forward loop) 2000 1.577261 0.000789 0.000632 0.001633
PLUMED: 5 Applying (backward loop) 2000 0.526821 0.000263 0.000202 0.000754
PLUMED: 6 Update 2000 0.000821 0.000000 0.000000 0.000074
force, pre htt:
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 2.777388 2.777388 2.777388 2.777388
PLUMED: 1 Prepare dependencies 2000 0.001080 0.000001 0.000000 0.000022
PLUMED: 2 Sharing data 2000 0.524391 0.000262 0.000199 0.000941
PLUMED: 3 Waiting for data 2000 0.000480 0.000000 0.000000 0.000021
PLUMED: 4 Calculating (forward loop) 2000 1.613729 0.000807 0.000661 0.002137
PLUMED: 5 Applying (backward loop) 2000 0.521073 0.000261 0.000201 0.000836
PLUMED: 6 Update 2000 0.000708 0.000000 0.000000 0.000041
Here you can see that pre-htt there's no significant optimization in place when you do not add any force. Also the cost of calculate is not affected, as expected.
with htt stuff
no force master (post htt)
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 3.054768 3.054768 3.054768 3.054768
PLUMED: 1 Prepare dependencies 2000 0.001544 0.000001 0.000000 0.000023
PLUMED: 2 Sharing data 2000 0.559304 0.000280 0.000205 0.000992
PLUMED: 3 Waiting for data 2000 0.001210 0.000001 0.000000 0.000024
PLUMED: 4 Calculating (forward loop) 2000 2.377854 0.001189 0.000981 0.002303
PLUMED: 5 Applying (backward loop) 2000 0.000898 0.000000 0.000000 0.000018
PLUMED: 6 Update 2000 0.000822 0.000000 0.000000 0.000284
force master (post htt)
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 4.343856 4.343856 4.343856 4.343856
PLUMED: 1 Prepare dependencies 2000 0.002183 0.000001 0.000000 0.000081
PLUMED: 2 Sharing data 2000 0.656710 0.000328 0.000248 0.004149
PLUMED: 3 Waiting for data 2000 0.001657 0.000001 0.000000 0.000035
PLUMED: 4 Calculating (forward loop) 2000 2.612328 0.001306 0.001071 0.005927
PLUMED: 5 Applying (backward loop) 2000 0.942530 0.000471 0.000371 0.004881
PLUMED: 6 Update 2000 0.001132 0.000001 0.000000 0.000021
The optimization is instead present in master (post htt merge). Notice that "calculate" also slows down when you apply forces; I am not sure why (it didn't happen pre htt). This optimization makes the overall performance without force closer to pre htt. Still, calculate is slower post htt than pre htt.
When using ARM, the slowdown of calculate that I was seeing was smaller (~10%) and was cancelled out by the optimization. I think this is why you reported htt as being faster than pre-htt with WHOLEMOLECULES (without forces).
I hope this is clearer (and there are no other mistakes).
In any case, the results (with forces) are significant on ARM as well, which means you can try yourself. What you should do is just check out master, compile it, and source it. Then run:
plumed-runtime benchmark --kernel /path/to/lib/libplumedKernel.dylib
with the kernel you want to test. If you use the installed master (and not the "sourceme") you have to do:
export PLUMED_KERNEL=/path/to/master/lib/libplumedKernel.dylib
/path/to/master/lib/plumed/plumed-runtime benchmark --kernel /path/to/lib/libplumedKernel.dylib
You need these tricks to run a different version of plumed within master.
@GiovanniBussi thanks for explaining. It is much clearer now. I will try to take a look at this in the next couple of days.
I tried to see (on INTEL, where the discrepancy is larger) where the slowdown comes from.
From the attempts I did here (hack-whole-molecules-do-not-merge) I guess it's memory access, though I am sure there's a better way to test this. In this branch I slowly destroyed wholemolecules functionality to make it faster.
@gtribello related to the instructions for running the benchmarks correctly: I should have fixed things on the master branch, so that if you install plumed you should be able to do:
/path/to/master/lib/plumed/plumed-runtime benchmark --kernel /path/to/lib/libplumedKernel.dylib
without setting PLUMED_KERNEL, both on MacOS and Linux.
I think that running plumed in this way (as if it was chosen at runtime) has a number of advantages, so we might decide at some point to swap this in for the real executable. I mean having /path/to/bin/plumed actually load the kernel at runtime, and at the same time making --runtime the default option when patching MD codes.
@gtribello I tried to optimize wholemolecules; it's mostly inlining stuff, but you might want to check whether I did things right.
What worries me a bit is that even with this impressive improvement (almost a factor of 2) I am only back to the pre-htt timings. Here are my timings (no graphs, but it should be easy to understand):
This is my input file:
So far so good, but @gtribello's previous tests reported wholemolecules as being faster in htt than in master, whereas I see it slower by a factor of ~2x (before these modifications).
Is there something that I am missing? I am doing my tests on an intel MacBook Pro.