Open GiovanniBussi opened 6 months ago
And here are the results for the Intel compiler on my workstation. The reference is current master (just pulled); I then time master plus this optimization of wholemolecules.
My input: 28% -> 26% overhead
Carlo's input: no measurable overhead in either case (<1%)
With my input:
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.000 +- 0.000
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.439503 0.439503 0.439503 0.439503
BENCH: B0 First step 1 0.018843 0.018843 0.018843 0.018843
BENCH: B1 Warm-up 199 1.762787 0.008858 0.008241 0.012670
BENCH: B2 Calculation part 1 400 3.468244 0.008671 0.008249 0.012408
BENCH: B3 Calculation part 2 400 3.465002 0.008663 0.008249 0.011359
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 9.140007 9.140007 9.140007 9.140007
PLUMED: 1 Prepare dependencies 1000 0.003898 0.000004 0.000002 0.000016
PLUMED: 2 Sharing data 1000 0.802382 0.000802 0.000497 0.002601
PLUMED: 3 Waiting for data 1000 0.001643 0.000002 0.000001 0.000014
PLUMED: 4 Calculating (forward loop) 1000 6.209310 0.006209 0.006058 0.013588
PLUMED: 5 Applying (backward loop) 1000 1.643378 0.001643 0.001610 0.002541
PLUMED: 6 Update 1000 0.002249 0.000002 0.000001 0.000011
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.276 +- 0.002
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.458510 0.458510 0.458510 0.458510
BENCH: B0 First step 1 0.021890 0.021890 0.021890 0.021890
BENCH: B1 Warm-up 199 2.256600 0.011340 0.010602 0.018078
BENCH: B2 Calculation part 1 400 4.421297 0.011053 0.010608 0.016984
BENCH: B3 Calculation part 2 400 4.424044 0.011060 0.010600 0.015301
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 11.568168 11.568168 11.568168 11.568168
PLUMED: 1 Prepare dependencies 1000 0.004773 0.000005 0.000003 0.000015
PLUMED: 2 Sharing data 1000 0.681942 0.000682 0.000530 0.001941
PLUMED: 3 Waiting for data 1000 0.006367 0.000006 0.000006 0.000022
PLUMED: 4 Calculating (forward loop) 1000 8.122590 0.008123 0.007776 0.015270
PLUMED: 5 Applying (backward loop) 1000 2.252827 0.002253 0.002219 0.004536
PLUMED: 6 Update 1000 0.008153 0.000008 0.000007 0.000020
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.255 +- 0.001
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.441582 0.441582 0.441582 0.441582
BENCH: B0 First step 1 0.027241 0.027241 0.027241 0.027241
BENCH: B1 Warm-up 199 2.213178 0.011121 0.010403 0.017219
BENCH: B2 Calculation part 1 400 4.352620 0.010882 0.010401 0.017213
BENCH: B3 Calculation part 2 400 4.348310 0.010871 0.010397 0.016596
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 11.370794 11.370794 11.370794 11.370794
PLUMED: 1 Prepare dependencies 1000 0.004210 0.000004 0.000002 0.000016
PLUMED: 2 Sharing data 1000 0.686542 0.000687 0.000526 0.003764
PLUMED: 3 Waiting for data 1000 0.004769 0.000005 0.000004 0.000038
PLUMED: 4 Calculating (forward loop) 1000 7.938052 0.007938 0.007563 0.017875
PLUMED: 5 Applying (backward loop) 1000 2.260929 0.002261 0.002225 0.005465
PLUMED: 6 Update 1000 0.007763 0.000008 0.000007 0.000022
With Carlo's input:
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.000 +- 0.000
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.138326 0.138326 0.138326 0.138326
BENCH: B0 First step 1 0.185941 0.185941 0.185941 0.185941
BENCH: B1 Warm-up 199 3.294411 0.016555 0.000207 0.090417
BENCH: B2 Calculation part 1 400 7.457271 0.018643 0.000207 0.095864
BENCH: B3 Calculation part 2 400 7.553318 0.018883 0.000207 0.097540
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 18.619273 18.619273 18.619273 18.619273
PLUMED: 1 Prepare dependencies 1000 0.001505 0.000002 0.000001 0.000012
PLUMED: 2 Sharing data 1000 0.097701 0.000098 0.000088 0.001876
PLUMED: 3 Waiting for data 1000 0.000837 0.000001 0.000001 0.000010
PLUMED: 4 Calculating (forward loop) 1000 18.178325 0.018178 0.000062 0.097289
PLUMED: 5 Applying (backward loop) 1000 0.043801 0.000044 0.000028 0.000621
PLUMED: 6 Update 1000 0.141444 0.000141 0.000001 0.140419
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.002 +- 0.000
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.152307 0.152307 0.152307 0.152307
BENCH: B0 First step 1 0.061523 0.061523 0.061523 0.061523
BENCH: B1 Warm-up 199 3.296157 0.016564 0.000220 0.090576
BENCH: B2 Calculation part 1 400 7.479278 0.018698 0.000219 0.096017
BENCH: B3 Calculation part 2 400 7.562410 0.018906 0.000220 0.097389
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 18.541842 18.541842 18.541842 18.541842
PLUMED: 1 Prepare dependencies 1000 0.001789 0.000002 0.000001 0.000013
PLUMED: 2 Sharing data 1000 0.091294 0.000091 0.000071 0.002350
PLUMED: 3 Waiting for data 1000 0.004282 0.000004 0.000004 0.000026
PLUMED: 4 Calculating (forward loop) 1000 18.245589 0.018246 0.000115 0.097074
PLUMED: 5 Applying (backward loop) 1000 0.024121 0.000024 0.000001 0.001297
PLUMED: 6 Update 1000 0.002991 0.000003 0.000002 0.000521
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.003 +- 0.000
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.148465 0.148465 0.148465 0.148465
BENCH: B0 First step 1 0.064416 0.064416 0.064416 0.064416
BENCH: B1 Warm-up 199 3.300723 0.016587 0.000198 0.090649
BENCH: B2 Calculation part 1 400 7.484045 0.018710 0.000198 0.096261
BENCH: B3 Calculation part 2 400 7.564513 0.018911 0.000198 0.097447
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 18.552670 18.552670 18.552670 18.552670
PLUMED: 1 Prepare dependencies 1000 0.001862 0.000002 0.000001 0.000012
PLUMED: 2 Sharing data 1000 0.092785 0.000093 0.000071 0.003759
PLUMED: 3 Waiting for data 1000 0.004292 0.000004 0.000003 0.000028
PLUMED: 4 Calculating (forward loop) 1000 18.260201 0.018260 0.000094 0.097126
PLUMED: 5 Applying (backward loop) 1000 0.022730 0.000023 0.000001 0.001235
PLUMED: 6 Update 1000 0.002970 0.000003 0.000002 0.000508
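For anyone who wants to reproduce the overhead percentages from the logs, here is a minimal sketch in Python. It assumes that the `Comparative` value is the run time relative to the first kernel listed (here the v2.9 build), which is consistent with the 28% -> 26% quoted above. The function name `overhead_by_kernel` is hypothetical and not part of PLUMED.

```python
def overhead_by_kernel(log_text):
    """Map each kernel path to its overhead in percent.

    Assumption: "Comparative" is the run time relative to the baseline
    kernel, so a value of 1.276 is read as 27.6% overhead.
    """
    results = {}
    kernel = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("BENCH: Kernel:"):
            # Remember which kernel the following lines belong to.
            kernel = line.split("Kernel:", 1)[1].strip()
        elif line.startswith("BENCH: Comparative:") and kernel is not None:
            # The first token after the label is the ratio; "+- err" follows.
            ratio = float(line.split("Comparative:", 1)[1].split()[0])
            results[kernel] = (ratio - 1.0) * 100.0
            kernel = None
    return results


# Kernel and Comparative lines copied from the log for my input.
sample = """\
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH: Comparative: 1.000 +- 0.000
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH: Comparative: 1.276 +- 0.002
BENCH: Kernel: /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH: Comparative: 1.255 +- 0.001
"""

for path, percent in overhead_by_kernel(sample).items():
    print(f"{percent:5.1f}%  {path}")
```

Run on the lines for my input, this gives about 27.6% for the reference build and 25.5% for the build with this change, matching the rounded figures in the summary.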
Hello @GiovanniBussi
What you have done here seems sensible. I don't think I can do it better.
@gtribello I really don't like the way it's done, because it's intrusive in what's supposed to be "user code" (wholemolecules), with modifications that are difficult to understand. I also want to repeat the timings, because there is an interplay between all the optimizations we are doing. This one might not be so relevant, so I would keep it on hold.
@gtribello here I tried a trick similar to #1044 to optimize wholemolecules. Notice that I didn't touch the EMST stuff, so tests are expected to fail. Anyway, tests with simple wholemolecules should work.
On my usual input I get some further speedup, with overhead going from 13% to 10%.
On @carlocamilloni's input (see here) it's even better. Before this commit, I get the same performance as v2.9. After this commit I gain something like 5%.
Maybe @gtribello you want to have a look at this code and think about whether there's some other reasonable (and simpler) solution. Notice that these PRs are all stacked onto each other, so you should only look at the last commit (but the performance is measured including the previous PRs).
With my input:
With Carlo's input: