plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Memory view (for pycv) and adding constexpr in Tools::pbc #1022

Closed: Iximiel closed this 7 months ago

Iximiel commented 7 months ago
Description

As suggested in the review of #1001, I implemented a "memory view" for Pbc::apply. Pbc::apply can now be used with any data stored in an "xyzxyzxyz..." fashion (laid out like a std::vector<PLMD::Vector>): PLUMED will process it as if it were a std::vector<PLMD::Vector>.

It is "zero cost abstraction" in the sense that you may pay little more compilation time but there should be no cost at runtime.

As of now the memory view is in a very crude state, with only the parts needed to make Pbc::apply work. It may need to be relocated into an ad hoc header.
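
The idea of the view, in a much simplified and hypothetical form (not the PR's actual class), is:

```cpp
#include <cstddef>

// Minimal sketch (illustrative only, not the PR's actual class):
// a non-owning view over contiguous "xyzxyzxyz..." memory that
// exposes it as rows of three doubles, like a std::vector<Vector>.
class VectorView {
  double* ptr_;        // borrowed pointer: the view owns nothing
  std::size_t nrows_;  // number of 3-component rows
public:
  VectorView(double* p, std::size_t ndoubles)
    : ptr_(p), nrows_(ndoubles / 3) {}
  std::size_t size() const { return nrows_; }
  // row i starts at ptr_ + 3*i and holds x, y, z
  double* operator[](std::size_t i) { return ptr_ + 3 * i; }
};
```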

Note that I inlined the "generic" branch of Pbc::distance into Pbc::apply, because calling the original Pbc::distance from Pbc::apply with the new changes results in a heavy slowdown.

Pbc::apply is now 20-30% faster when using non-orthogonal PBCs, as you can see in the graphs:

(benchmark plot; note that the time scale is logarithmic)

| what | 10 (ns) | 100 (ns) | 1000 (ns) | 10000 (ns) |
|---|---:|---:|---:|---:|
| master[ortho] | 4875589 | 31115489 | 292556383 | 2901822551 |
| arrayLatest[ortho] | 4838736 | 30786646 | 290585504 | 2899403853 |
| master[generic] | 12096329 | 100735040 | 982072095 | 9793919925 |
| arrayLatest[generic] | 9092234 | 73282452 | 709869215 | 7111919112 |

master is at commit ed060dd96, and the benchmark is:

```cpp
// NOTE: perfclock and duration are defined elsewhere in the benchmark
// file; something like the following is assumed:
//   using perfclock = std::chrono::steady_clock;
//   using duration  = std::chrono::nanoseconds;
void testdistance(int operations, int repeat, const Pbc &pbc,
                  std::ostream &stream) {
  std::vector<Vector> distances(operations);
  Tensor box = pbc.getBox();
  box /= box.determinant();
  duration d = duration::zero();
  const double t = 1.0 / operations;
  for (int r = 0; r < repeat; ++r) {
    // refill the inputs at every repetition
    for (int i = 0; i < operations; ++i) {
      distances[i] = 4 * i * t * box.getRow(i % 3);
    }
    auto beg = perfclock::now();
    // time one distance() call per pair of consecutive vectors
    for (int i = 0; i < operations - 1; ++i) {
      pbc.distance(distances[i], distances[i + 1]);
    }
    auto end = perfclock::now();
    d += std::chrono::duration_cast<duration>(end - beg);
  }
  stream << operations << " " << d.count() << "\n";
}
```

NB: I think this speedup does not depend on the memory view.

NB2: I also tried writing the common part of Pbc::apply and Pbc::distance as a single separate function called by both, but that approach slowed down the computation.


The modification in Tools::pbc is out of the scope of the original intent of the PR. The rationale is: since std::numeric_limits<int>::round_style is known at compile time (and is in fact constexpr), I made the branches constexpr to guarantee that they are resolved at compile time. This unexpectedly gave a slight (circa 0.5-1%) improvement in the calculation time, also for the orthogonal case (at least with gcc 9.4.0).
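
The pattern is roughly the following (a minimal sketch of the idea, not PLUMED's actual Tools::pbc; requires C++17):

```cpp
#include <cmath>
#include <limits>

// Minimal sketch of the constexpr-branch idea: round_style is a
// constexpr member, so "if constexpr" guarantees the dead branch
// is discarded at compile time.
inline double pbcSketch(double x) {
  if constexpr (std::numeric_limits<int>::round_style == std::round_toward_zero) {
    // int() truncates toward zero; bias by +/-0.5 to round to nearest
    return x - int(x > 0.0 ? x + 0.5 : x - 0.5);
  } else {
    return x - std::floor(x + 0.5);  // floor-based fallback
  }
}
```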

While modifying Tools::pbc I studied the rounding a bit: it is possible to change the rounding mode at runtime, whereas std::numeric_limits<int>::round_style is known ONLY at compile time. Meaning that if external code calling PLUMED changes the rounding mode via <cfenv>, we are stuck with the compile-time choice (with or without these changes accepted). I know this is a 0.000x% scenario, and even more out of the scope of this PR, but it is better to be aware of it.


Target release

I would like my code to appear in release 2.10

Type of contribution
Copyright
Tests
codecov-commenter commented 7 months ago

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (ed060dd) 84.18% compared to head (19a6657) 84.20%. Report is 20 commits behind head on master.

:exclamation: Current head 19a6657 differs from pull request most recent head 5fd6ba7. Consider uploading reports for the commit 5fd6ba7 to get more accurate results

| Files | Patch % | Lines |
|---|---:|---|
| src/core/PlumedMain.cpp | 90.90% | 2 Missing :warning: |
| src/core/TargetDist.cpp | 0.00% | 2 Missing :warning: |
| src/tools/Pbc.cpp | 95.83% | 1 Missing :warning: |
| src/tools/Pbc.h | 75.00% | 1 Missing :warning: |
| src/tools/Tools.cpp | 91.66% | 1 Missing :warning: |


Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1022      +/-   ##
==========================================
+ Coverage   84.18%   84.20%    +0.01%
==========================================
  Files         612      613        +1
  Lines       56497    56592       +95
==========================================
+ Hits        47563    47651       +88
- Misses       8934     8941        +7
```


GiovanniBussi commented 7 months ago

So there are three unrelated points:

Iximiel commented 7 months ago
  • the speed-up, for which I don't understand where it comes from (in the benchmark you are using distance, not apply, right?). If this is confirmed, it should immediately affect other variables (such as coordination), right? This would be amazing

The two benchmarks are in the same file and have the same shape and arguments; I copy-pasted the one with the distances. Here is the one with apply:

```cpp
void test(int operations, int repeat, const Pbc &pbc, std::ostream &stream) {
  std::vector<Vector> distances(operations);
  Tensor box = pbc.getBox();
  box /= box.determinant();
  duration d = duration::zero();
  const double t = 1.0 / operations;
  for (int r = 0; r < repeat; ++r) {
    for (int i = 0; i < operations; ++i) {
      distances[i] = 4 * i * t * box.getRow(i % 3);
    }
    auto beg = perfclock::now();
    // a single call processes the whole buffer
    pbc.apply(distances);
    auto end = perfclock::now();
    d += std::chrono::duration_cast<duration>(end - beg);
  }
  stream << operations << " " << d.count() << "\n";
}
```

About the benchmarks: the one I listed in the starting post gives performance changes that vary between +3% and -3% on different runs, both for ortho and generic, so calling Pbc::distance on its own gets no benefit (in this synthetic benchmark). The one that benchmarks Pbc::apply, shown in the graph I posted above, is more consistent between runs.

I think the cause is that Tools::pbc, simplified by the constexpr, is inlined by the compiler into Pbc::apply (and into Pbc::distance), so the ortho case also benefits. But since in the benchmark the kernel is dynamically linked, Pbc::distance cannot be inlined by the compiler, and on top of that there is the cost of the call, inside a for loop. The loop in Pbc::apply, instead, is inside the function, among its internals, and is optimized by the compiler.

Another thing that may have an impact is that extra int* nshifts argument, which in the codebase is used only in regtest/basic/rt-make-1/ (I searched with git grep 'distance(.*,.*,.*)' for a call of that signature with more than 2 arguments). But I have to measure this to be sure, because the branch predictor on modern CPUs should work around it (and the C++20 [[unlikely]] attribute may help with this).
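
For reference, the attribute usage would look something like this (a hypothetical sketch, not a change made in this PR; [[unlikely]] is standard since C++20):

```cpp
// Hypothetical sketch: mark the rarely-taken branch so the compiler
// lays out the hot path first. nshifts mimics the optional counter
// argument of Pbc::distance.
inline double distanceSketch(double a, double b, int* nshifts = nullptr) {
  if (nshifts != nullptr) [[unlikely]] {
    ++(*nshifts);  // in practice exercised only by one regression test
  }
  return b - a;
}
```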

  • the doubts on the rounding thing that I think we should sort out in a more robust way

In <cmath> there are a few "choose your own rounding" functions, like floor, which PLUMED already uses as the fallback in Tools::pbc.

GiovanniBussi commented 7 months ago

I see, I think it makes perfect sense. A single PBC calculation is too short and should be inlined. In addition, the if over the box type is done at every iteration, and the compiler might not be able to see that the type variable is not modified during the execution.

If you think it's ready I can merge. A few more suggestions:

Iximiel commented 7 months ago

I think it is ok to merge as it is. The views can be moved to their own header in case they are needed elsewhere.

  • Did you try to use a single function to avoid duplicating the code inside the Pbc.cpp source? Usually transforming these constants into template parameters, so that you write the code with ifs that are resolved at compile time, works. Otherwise I can try to do it after the merge

I tried a Pbc::calculateGeneric, hoping that being in the same object it would be inlined both in Pbc::apply and in Pbc::distance, but it didn't perform well (I tried only with gcc 9.4). I think there is something that gets optimized very well in the expression Vector s=matmul(Vector{dlist[k][0],dlist[k][1],dlist[k][2]},invReduced); that I didn't manage to grasp and reproduce in Pbc::calculateGeneric.

I suspect it is that int* nshifts argument, but I haven't measured it yet, and I should, to be 100% certain of it.
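
The template-parameter idea from the bullet above could look roughly like this (a minimal sketch with hypothetical names and stand-in payloads, not PLUMED's code; requires C++17 for if constexpr):

```cpp
#include <vector>

// Sketch: hoist the runtime constant into a template parameter so
// each instantiation carries no branch in the hot loop.
enum class BoxType { Ortho, Generic };

template <BoxType T>
void applyImpl(std::vector<double>& d) {
  for (auto& x : d) {
    if constexpr (T == BoxType::Ortho) {
      x *= 0.5;          // stand-in for the orthorhombic fast path
    } else {
      x = x * 0.5 + 1.0; // stand-in for the generic path
    }
  }
}

void apply(std::vector<double>& d, bool ortho) {
  // single runtime dispatch, then branch-free inner loops
  if (ortho) applyImpl<BoxType::Ortho>(d);
  else       applyImpl<BoxType::Generic>(d);
}
```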

  • Given the distinctly faster results using the vectorized version, I was wondering if it's worth going to expensive CVs (like coordination) and modifying the source code so that:

    • we preallocate a buffer to store a bunch of distances (we have to be careful with OpenMP, though)

    • we use it to store the differences and apply PBCs on the buffer

    • then we use it to compute the rest of the function

    Before trying, it might be a good idea to check the relative weight of PBCs in coordination (with a non-orthorhombic cell); see the sketch below.

It may be worth a try. I think I have to dig out the flamegraphs again.
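
For concreteness, the buffered pattern from the list above might look like this (a rough sketch only: coordinationSketch and the pair list are hypothetical, while Pbc::apply on a std::vector<Vector> is the batched call from this PR):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the buffered pattern (hypothetical, not PLUMED's code):
// 1) fill a preallocated buffer with raw differences,
// 2) apply PBCs on the whole buffer at once (the vectorized path),
// 3) compute the rest of the CV from the wrapped differences.
void coordinationSketch(const Pbc& pbc,
                        const std::vector<Vector>& pos,
                        const std::vector<std::pair<int,int>>& pairs) {
  // one buffer per thread would be needed with OpenMP
  static thread_local std::vector<Vector> buffer;
  buffer.resize(pairs.size());
  for (std::size_t k = 0; k < pairs.size(); ++k) {
    buffer[k] = pos[pairs[k].first] - pos[pairs[k].second];
  }
  pbc.apply(buffer);  // batched PBC application
  // ... use buffer[k] to accumulate the coordination function ...
}
```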

  • I am still scared by the possibility of changing the rounding mode at runtime. Can you write a test that shows how this would fail? (I mean: change the setting at runtime with <cfenv> and show that you get incorrect results.)

I ~~have made this pedantic example~~ modified the example from cppreference when I tripped into the runtime-rounding-change rabbit hole.

GiovanniBussi commented 7 months ago

@Iximiel I checked the issue about rounding. The problem is with std::nearbyint(). We are using int(), which seems unaffected by std::fesetround.

See https://wandbox.org/permlink/vGmUXk2wa1tI4YKa
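
A minimal reproduction along those lines (my own sketch, not the exact wandbox snippet):

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  // std::nearbyint honors the runtime rounding mode...
  std::fesetround(FE_DOWNWARD);
  std::printf("%.0f %d\n", std::nearbyint(2.7), int(2.7));  // prints: 2 2
  std::fesetround(FE_UPWARD);
  std::printf("%.0f %d\n", std::nearbyint(2.7), int(2.7));  // prints: 3 2
  // ...while int() conversion always truncates toward zero,
  // regardless of std::fesetround.
  return 0;
}
```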

carlocamilloni commented 7 months ago

@GiovanniBussi @Iximiel after this merge plumed does not compile anymore on my mac, even after reconfiguring it:

```
Pbc.cpp:202:37: error: no viable conversion from 'const gch::small_vector<Vector, 27>'
(aka 'const small_vector<VectorGeneric<3>, 27>') to 'const std::vector<Vector>'
(aka 'const vector<VectorGeneric<3>>')
const std::vector<Vector> & myshifts(shifts[(s[0]>0?1:0)][(s[1]>0?1:0)][(s[2]>0?1:0)]);
```

carlocamilloni commented 7 months ago

ok I see, @GiovanniBussi, you just fixed it ;-)

GiovanniBussi commented 7 months ago

@Iximiel can you have a look at branch optimize-pbc?

What I did is:

This last change makes the implementation of apply even faster. The performance of distance should be unchanged wrt master.

Maybe you can check it on your test as well

Iximiel commented 7 months ago

@GiovanniBussi I tried to run it and I get a SIGSEGV at the very end of the execution (valgrind suspects an invalid free). But that happens after the benchmark data is collected:

| what | 10 (ns) | 100 (ns) | 1000 (ns) | 10000 (ns) |
|---|---:|---:|---:|---:|
| master[ortho] | 504308 | 3200724 | 30097406 | 295965221 |
| optimize-pbc[ortho] | 518536 | 3260674 | 30622721 | 303423997 |
| master[generic] | 1231679 | 11304539 | 101428943 | 1005759963 |
| optimize-pbc[generic] | 939713 | 7325162 | 71680727 | 707344598 |

(benchmark plot of the numbers above)

I find it strange that the ortho calculation is now slightly slower, because that code path has not been touched.

GiovanniBussi commented 7 months ago

@Iximiel mm, does it mean that with your compiler this optimization is useless? The performance looks very similar to what you posted at the beginning of this thread.