Closed: Iximiel closed this pull request 7 months ago
Codecov report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (ed060dd) 84.18% compared to head (19a6657) 84.20%. Report is 20 commits behind head on master.

:exclamation: Current head 19a6657 differs from pull request most recent head 5fd6ba7. Consider uploading reports for the commit 5fd6ba7 to get more accurate results.
Files | Patch % | Lines |
---|---|---|
src/core/PlumedMain.cpp | 90.90% | 2 Missing :warning: |
src/core/TargetDist.cpp | 0.00% | 2 Missing :warning: |
src/tools/Pbc.cpp | 95.83% | 1 Missing :warning: |
src/tools/Pbc.h | 75.00% | 1 Missing :warning: |
src/tools/Tools.cpp | 91.66% | 1 Missing :warning: |
So there are three unrelated points:
- the speed-up, which I don't understand where it comes from (in the benchmark you are using distance, not apply, right?). If this is confirmed, it should immediately affect other variables (such as coordination), right? This would be amazing
The two benches are in the same file and have the same shape and arguments; I copy-pasted the one with the distances. Here's the one with apply:
```cpp
#include <chrono>
#include <ostream>
#include <vector>

// Pbc, Vector and Tensor come from PLUMED; perfclock and duration are
// aliases defined elsewhere in the benchmark, e.g.:
//   using perfclock = std::chrono::high_resolution_clock;
//   using duration  = std::chrono::microseconds;

void test(int operations, int repeat, const Pbc& pbc, std::ostream& stream) {
  std::vector<Vector> distances(operations);
  Tensor box = pbc.getBox();
  box /= box.determinant();
  duration d = duration::zero();
  const double t = 1.0 / operations;
  for (int r = 0; r < repeat; ++r) {
    // refill the input outside the timed region
    for (int i = 0; i < operations; ++i) {
      distances[i] = 4 * i * t * box.getRow(i % 3);
    }
    auto beg = perfclock::now();
    pbc.apply(distances);
    auto end = perfclock::now();
    d += std::chrono::duration_cast<duration>(end - beg);
  }
  stream << operations << " " << d.count() << "\n";
}
```
About the benches: the one that I listed in the starting post gives performance improvements that change between +3% and -3% on different runs, both for ortho and generic. So calling Pbc::distance on its own does not get benefits (in this synthetic benchmark). The one that benches Pbc::apply, in the graph that I posted above, is more consistent between different runs.
I think the cause is that Tools::pbc, simplified by the constexpr, is inlined by the compiler into Pbc::apply (and into Pbc::distance), so ortho also benefits from this. But since in the benchmark the kernel is dynamically linked, Pbc::distance cannot be inlined by the compiler, and on top of that there is the cost of the call, in a for loop. The loop in Pbc::apply, on the other hand, is inside the function, among its internals, and gets optimized by the compiler.
Another thing that may have an impact is that extra `int* nshifts` argument, which in the codebase is used only in regtest/basic/rt-make-1/ (I searched with `git grep 'distance(.*,.*,.*)'` for a call of that signature with more than 2 arguments). But I have to measure this to be sure, because the branch predictor on modern CPUs should work around it (and maybe there is an `[[unlikely]]` attribute deep inside the C++ documentation that may help with this).
- the doubts on the rounding thing, which I think we should sort out in a more robust way

In `<cmath>` there are a few "choose your own rounding" functions, like `floor`, which PLUMED already uses as the fallback choice in Tools::pbc.
I see, I think it makes perfect sense. A single PBC calculation is too short and should be inlined. In addition, the `if` over types is done at every iteration, and the compiler might not be able to understand that the variable `type` is not modified during the execution.
If you think it's ready I can merge. A few more suggestions: change the setting at runtime with `<cfenv>` and show that you get incorrect results. If this is the case, then a possible workaround would be to reset it at every `cmd("setStep")`, which should be cheap and safe.

I think it is ok to merge as it is. The Views can be moved to their own headers in case they are needed in other places.
- Did you try to use a single function to avoid the duplication of the code inside the Pbc.cpp source? Usually transforming these constants into template parameters, so that you write the code with `if`s that are resolved at compile time, works. Otherwise I can try to do it after the merge.
I tried to write a Pbc::calculateGeneric, hoping that being in the same object it would be inlined both in Pbc::apply and in Pbc::distance, but it didn't perform well; I tried only with gcc 9.4.
I think there is something that gets optimized very well in the expression `Vector s=matmul(Vector{dlist[k][0],dlist[k][1],dlist[k][2]},invReduced);` that I didn't manage to grasp and make work for Pbc::calculateGeneric.
I suspect it is that `int* nshifts` argument, but I haven't measured it yet, and I should, to be 100% certain of it.
Given the quite faster results using the vectorized version, I was wondering if it's worth going to expensive CVs (like coordination) and modifying the source code so that:

- we preallocate a buffer to store a bunch of distances (we have to be careful with OpenMP though)
- we use it to store the differences and apply PBCs on the buffer
- then we use it to compute the rest of the function

Before trying, it might be a good idea to check what's the relative weight of PBCs in coordination (with a non-orthorhombic cell).
It may be worth a try. I think I have to retrieve the flamegraphs again.
- I am still scared by the possibility to change the round-off at runtime. Can you write a test that shows how this would fail? (I mean: change the setting at runtime with `<cfenv>` and show that you get incorrect results.)
I ~~have made this pedantic example~~ modified the example from cppreference when I fell into the runtime-rounding-change rabbit hole.
@Iximiel I checked the issue about rounding. The problem is with `std::nearbyint()`. We are using `int()`, which seems unaffected by `std::fesetround`.
@GiovanniBussi @Iximiel after this merge PLUMED does not compile anymore on my Mac, even after reconfiguring it:
```
Pbc.cpp:202:37: error: no viable conversion from 'const gch::small_vector<Vector, 27>' (aka 'const small_vector<VectorGeneric<3>, 27>') to 'const std::vector
```
ok I see, @GiovanniBussi, you just fixed it ;-)
@Iximiel can you have a look at branch optimize-pbc?

What I did is add `if constexpr (compute_nshifts)` in front of this line. Now `nshifts` is not even used, so I can remove it. This last change makes the implementation of `apply` even faster. The performance of `distance` should be unchanged wrt master.

Maybe you can check it on your test as well.
@GiovanniBussi I tried to run it and I get a SIGSEGV at the very end of the execution (valgrind suspects an invalid free). But that happens after the benchmark data is collected:
what | 10 (ns) | 100 (ns) | 1000 (ns) | 10000 (ns) |
---|---|---|---|---|
master[ortho]: | 504308 | 3200724 | 30097406 | 295965221 |
optimize-pbc[ortho]: | 518536 | 3260674 | 30622721 | 303423997 |
master[generic]: | 1231679 | 11304539 | 101428943 | 1005759963 |
optimize-pbc[generic]: | 939713 | 7325162 | 71680727 | 707344598 |
I find it strange that the ortho calculation is now slightly slower, because that code has not been touched.
@Iximiel hmm, does it mean that with your compiler this optimization is useless? Performance looks very similar to what you posted at the beginning of this thread.
Description
As suggested in the review of #1001 I implemented a "memory view" for Pbc::apply. Now Pbc::apply can be used with any data stored in an "xyzxyzxyz..." fashion (aligned like a `std::vector<PLMD::Vector>`): PLUMED will process it like a `std::vector<PLMD::Vector>`. It is a "zero cost abstraction" in the sense that you may pay a little more compilation time, but there should be no cost at runtime.

As of now the memory view is in a very crude state, with only the parts needed to make Pbc::apply work. It may need to be relocated into an ad hoc header.
You should note that I inlined the "generic" branch of Pbc::distance into Pbc::apply, because calling the original Pbc::distance from Pbc::apply with the new changes results in a heavy slowdown.

Now Pbc::apply is 20-30% faster when using non-orthogonal PBCs, as you can see in the graphs:
master is updated to commit ed060dd96 and the bench is:
NB: I think this speed-up is not dependent on the memory view.

NB2: I also tried to write the common part between Pbc::apply and Pbc::distance as a single separate function to be called by both, but that approach slowed down the computation.
The modification in Tools::pbc is out of the scope of the original intent of the PR. The rationale is: since `std::numeric_limits<int>::round_style` is known at compile time (and actually is constexpr), I made the branches `constexpr` to guarantee that they are precalculated at compile time. This unexpectedly gave a slight (circa 0.5-1%) improvement in the calculation time also for the orthogonal case (at least with gcc 9.4.0).

Modifying Tools::pbc I studied the rounding a bit: it is possible to change the rounding at runtime, whereas `std::numeric_limits<int>::round_style` is known ONLY at compile time. Meaning that if an external code that is calling PLUMED changes the rounding style with `#include <cfenv>`, we are stuck with the compile-time choice (with or without these changes accepted). I know this is a 0.000x% scenario, and even more out of scope for this PR, but it is better to be aware of it.

Target release
I would like my code to appear in release 2.10
Type of contribution
Copyright

COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command `cd src && ./header.sh mymodulename` in order to make sure the headers of the module are correct.

Tests