Closed — GiovanniBussi closed this 6 months ago
The improvement with the Intel compiler on my workstation is smaller (overhead drops from 29% to 27%), but still measurable.
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-v2.9/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.000 +- 0.000
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.424078 0.424078 0.424078 0.424078
BENCH: B0 First step 1 0.017435 0.017435 0.017435 0.017435
BENCH: B1 Warm-up 99 0.899508 0.009086 0.008296 0.012992
BENCH: B2 Calculation part 1 200 1.743061 0.008715 0.008302 0.011589
BENCH: B3 Calculation part 2 200 1.745426 0.008727 0.008314 0.011583
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 4.821678 4.821678 4.821678 4.821678
PLUMED: 1 Prepare dependencies 500 0.001949 0.000004 0.000002 0.000015
PLUMED: 2 Sharing data 500 0.404638 0.000809 0.000504 0.002490
PLUMED: 3 Waiting for data 500 0.000810 0.000002 0.000001 0.000013
PLUMED: 4 Calculating (forward loop) 500 3.145875 0.006292 0.006110 0.012230
PLUMED: 5 Applying (backward loop) 500 0.824214 0.001648 0.001609 0.002594
PLUMED: 6 Update 500 0.001113 0.000002 0.000001 0.000006
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/tmp/intel-reference/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.287 +- 0.003
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.440200 0.440200 0.440200 0.440200
BENCH: B0 First step 1 0.022010 0.022010 0.022010 0.022010
BENCH: B1 Warm-up 99 1.163765 0.011755 0.010770 0.017656
BENCH: B2 Calculation part 1 200 2.243936 0.011220 0.010745 0.017048
BENCH: B3 Calculation part 2 200 2.243593 0.011218 0.010752 0.017610
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 6.106817 6.106817 6.106817 6.106817
PLUMED: 1 Prepare dependencies 500 0.002200 0.000004 0.000002 0.000014
PLUMED: 2 Sharing data 500 0.343587 0.000687 0.000527 0.001934
PLUMED: 3 Waiting for data 500 0.002487 0.000005 0.000004 0.000024
PLUMED: 4 Calculating (forward loop) 500 4.122769 0.008246 0.007840 0.015409
PLUMED: 5 Applying (backward loop) 500 1.176878 0.002354 0.002314 0.004537
PLUMED: 6 Update 500 0.004813 0.000010 0.000009 0.000021
BENCH:
BENCH: Kernel: /scratch/bussi/plumed2/src/lib/libplumedKernel.so
BENCH: Input: plumed.dat
BENCH: Comparative: 1.267 +- 0.002
BENCH: Cycles Total Average Minimum Maximum
BENCH: A Initialization 1 0.445317 0.445317 0.445317 0.445317
BENCH: B0 First step 1 0.027140 0.027140 0.027140 0.027140
BENCH: B1 Warm-up 99 1.140178 0.011517 0.010588 0.016933
BENCH: B2 Calculation part 1 200 2.206530 0.011033 0.010579 0.015644
BENCH: B3 Calculation part 2 200 2.213058 0.011065 0.010590 0.015673
PLUMED: Cycles Total Average Minimum Maximum
PLUMED: 1 6.026409 6.026409 6.026409 6.026409
PLUMED: 1 Prepare dependencies 500 0.002142 0.000004 0.000003 0.000014
PLUMED: 2 Sharing data 500 0.344054 0.000688 0.000530 0.003223
PLUMED: 3 Waiting for data 500 0.002326 0.000005 0.000004 0.000026
PLUMED: 4 Calculating (forward loop) 500 4.091521 0.008183 0.007775 0.019535
PLUMED: 5 Applying (backward loop) 500 1.123146 0.002246 0.002201 0.004265
PLUMED: 6 Update 500 0.004250 0.000009 0.000008 0.000018
Hi @GiovanniBussi
This looks fine to me. As far as I am concerned you can go ahead and merge.
Thanks for taking the time to optimise the code.
I am again working with this input file:
In this PR I tried to optimize loops like this:
Into loops like this:
Here `atom_value_ind_grouped` is constructed when `atom_value_ind` is updated, and basically stores the same information in a different way, exploiting the fact that, for most of the iterations in the first implementation, `nn` is constant.

@gtribello can you have a look and check if this makes sense? Is this a reasonable assumption on the memory access pattern?
The improvement is quite measurable. The comparison is shown below:
The overhead decreases from 18% (using blas) to 12% (this commit).