plumed / plumed2

Development version of plumed 2
https://www.plumed.org
GNU Lesser General Public License v3.0

Memory view (for pycv) and adding constexpr in Tools::pbc #1022

Closed: Iximiel closed this 7 months ago

Iximiel commented 7 months ago
Description

As suggested in the review of #1001, I implemented a "memory view" for Pbc::apply. Pbc::apply can now be used with any data stored in an "xyzxyzxyz..." fashion (laid out like a std::vector<PLMD::Vector>): PLUMED will process it as if it were a std::vector<PLMD::Vector>.

It is "zero cost abstraction" in the sense that you may pay little more compilation time but there should be no cost at runtime.

As of now the memory view is in a very crude state, with only the parts needed to make Pbc::apply work. It may need to be relocated into an ad hoc header.
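
The idea of the view, in a much simplified and hypothetical form (not the PR's actual class), is:

```cpp
#include <cstddef>

// Minimal sketch (illustrative only, not the PR's actual class):
// a non-owning view over contiguous "xyzxyzxyz..." memory that
// exposes it as rows of three doubles, like a std::vector<Vector>.
class VectorView {
  double* ptr_;        // borrowed pointer: the view owns nothing
  std::size_t nrows_;  // number of 3-component rows
public:
  VectorView(double* p, std::size_t ndoubles)
    : ptr_(p), nrows_(ndoubles / 3) {}
  std::size_t size() const { return nrows_; }
  // row i starts at ptr_ + 3*i and holds x, y, z
  double* operator[](std::size_t i) { return ptr_ + 3 * i; }
};
```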

Note that I inlined the "generic" branch of Pbc::distance into Pbc::apply, because calling the original Pbc::distance from Pbc::apply with the new changes results in a heavy slowdown.

Pbc::apply is now 20-30% faster when using non-orthogonal PBCs, as you can see in the graphs:

(benchmark plot; note that the time scale is logarithmic)

| what | 10 (ns) | 100 (ns) | 1000 (ns) | 10000 (ns) |
|---|---:|---:|---:|---:|
| master[ortho] | 4875589 | 31115489 | 292556383 | 2901822551 |
| arrayLatest[ortho] | 4838736 | 30786646 | 290585504 | 2899403853 |
| master[generic] | 12096329 | 100735040 | 982072095 | 9793919925 |
| arrayLatest[generic] | 9092234 | 73282452 | 709869215 | 7111919112 |

master is at commit ed060dd96, and the benchmark is:

```cpp
// NOTE: perfclock and duration are defined elsewhere in the benchmark
// file; something like the following is assumed:
//   using perfclock = std::chrono::steady_clock;
//   using duration  = std::chrono::nanoseconds;
void testdistance(int operations, int repeat, const Pbc &pbc,
                  std::ostream &stream) {
  std::vector<Vector> distances(operations);
  Tensor box = pbc.getBox();
  box /= box.determinant();
  duration d = duration::zero();
  const double t = 1.0 / operations;
  for (int r = 0; r < repeat; ++r) {
    // refill the inputs at every repetition
    for (int i = 0; i < operations; ++i) {
      distances[i] = 4 * i * t * box.getRow(i % 3);
    }
    auto beg = perfclock::now();
    // time one distance() call per pair of consecutive vectors
    for (int i = 0; i < operations - 1; ++i) {
      pbc.distance(distances[i], distances[i + 1]);
    }
    auto end = perfclock::now();
    d += std::chrono::duration_cast<duration>(end - beg);
  }
  stream << operations << " " << d.count() << "\n";
}
```

NB: I think this speedup does not depend on the memory view.

NB2: I also tried writing the common part of Pbc::apply and Pbc::distance as a single separate function called by both, but that approach slowed down the computation.


The modification in Tools::pbc is out of the scope of the original intent of the PR. The rationale is: since std::numeric_limits<int>::round_style is known at compile time (and is in fact constexpr), I made the branches constexpr to guarantee that they are resolved at compile time. This unexpectedly gave a slight (circa 0.5-1%) improvement in the calculation time, also for the orthogonal case (at least with gcc 9.4.0).
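
The pattern is roughly the following (a minimal sketch of the idea, not PLUMED's actual Tools::pbc; requires C++17):

```cpp
#include <cmath>
#include <limits>

// Minimal sketch of the constexpr-branch idea: round_style is a
// constexpr member, so "if constexpr" guarantees the dead branch
// is discarded at compile time.
inline double pbcSketch(double x) {
  if constexpr (std::numeric_limits<int>::round_style == std::round_toward_zero) {
    // int() truncates toward zero; bias by +/-0.5 to round to nearest
    return x - int(x > 0.0 ? x + 0.5 : x - 0.5);
  } else {
    return x - std::floor(x + 0.5);  // floor-based fallback
  }
}
```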

While modifying Tools::pbc I studied the rounding a bit: it is possible to change the rounding mode at runtime, whereas std::numeric_limits<int>::round_style is known ONLY at compile time. Meaning that if external code calling PLUMED changes the rounding mode via <cfenv>, we are stuck with the compile-time choice (with or without these changes accepted). I know this is a 0.000x% scenario, and even more out of the scope of this PR, but it is better to be aware of it.


Target release

I would like my code to appear in release 2.10

Type of contribution
Copyright
Tests
codecov-commenter commented 7 months ago

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (ed060dd) 84.18% compared to head (19a6657) 84.20%. Report is 20 commits behind head on master.

:exclamation: Current head 19a6657 differs from pull request most recent head 5fd6ba7. Consider uploading reports for the commit 5fd6ba7 to get more accurate results

| Files | Patch % | Lines |
|---|---:|---|
| src/core/PlumedMain.cpp | 90.90% | 2 Missing :warning: |
| src/core/TargetDist.cpp | 0.00% | 2 Missing :warning: |
| src/tools/Pbc.cpp | 95.83% | 1 Missing :warning: |
| src/tools/Pbc.h | 75.00% | 1 Missing :warning: |
| src/tools/Tools.cpp | 91.66% | 1 Missing :warning: |


Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #1022      +/-   ##
==========================================
+ Coverage   84.18%   84.20%    +0.01%
==========================================
  Files         612      613        +1
  Lines       56497    56592       +95
==========================================
+ Hits        47563    47651       +88
- Misses       8934     8941        +7
```


GiovanniBussi commented 7 months ago

So there are three unrelated points:

Iximiel commented 7 months ago
  • the speed-up, for which I don't understand where it comes from (in the benchmark you are using distance, not apply, right?). If this is confirmed, it should immediately affect other variables (such as coordination), right? This would be amazing

The two benchmarks are in the same file and have the same shape and arguments; I copy-pasted the one with the distances. Here is the one with apply:

```cpp
void test(int operations, int repeat, const Pbc &pbc, std::ostream &stream) {
  std::vector<Vector> distances(operations);
  Tensor box = pbc.getBox();
  box /= box.determinant();
  duration d = duration::zero();
  const double t = 1.0 / operations;
  for (int r = 0; r < repeat; ++r) {
    for (int i = 0; i < operations; ++i) {
      distances[i] = 4 * i * t * box.getRow(i % 3);
    }
    auto beg = perfclock::now();
    // a single call processes the whole buffer
    pbc.apply(distances);
    auto end = perfclock::now();
    d += std::chrono::duration_cast<duration>(end - beg);
  }
  stream << operations << " " << d.count() << "\n";
}
```

About the benchmarks: the one I listed in the starting post gives performance changes that vary between +3% and -3% on different runs, both for ortho and generic, so calling Pbc::distance on its own gets no benefit (in this synthetic benchmark). The one that benchmarks Pbc::apply, shown in the graph I posted above, is more consistent between runs.

I think the cause is that Tools::pbc, simplified by the constexpr, is inlined by the compiler into Pbc::apply (and into Pbc::distance), so the ortho case also benefits. But since in the benchmark the kernel is dynamically linked, Pbc::distance cannot be inlined by the compiler, and on top of that there is the cost of the call, inside a for loop. The loop in Pbc::apply, instead, is inside the function, among its internals, and is optimized by the compiler.

Another thing that may have an impact is that extra int* nshifts argument, which in the codebase is used only in regtest/basic/rt-make-1/ (I searched with git grep 'distance(.*,.*,.*)' for a call of that signature with more than 2 arguments). But I have to measure this to be sure, because the branch predictor on modern CPUs should work around it (and the C++20 [[unlikely]] attribute may help with this).
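
For reference, the attribute usage would look something like this (a hypothetical sketch, not a change made in this PR; [[unlikely]] is standard since C++20):

```cpp
// Hypothetical sketch: mark the rarely-taken branch so the compiler
// lays out the hot path first. nshifts mimics the optional counter
// argument of Pbc::distance.
inline double distanceSketch(double a, double b, int* nshifts = nullptr) {
  if (nshifts != nullptr) [[unlikely]] {
    ++(*nshifts);  // in practice exercised only by one regression test
  }
  return b - a;
}
```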

  • the doubts on the rounding thing that I think we should sort out in a more robust way

In <cmath> there are a few "choose your own rounding" functions, like floor, which PLUMED already uses as the fallback in Tools::pbc.

GiovanniBussi commented 7 months ago

I see, I think it makes perfect sense. A single PBC calculation is too short and should be inlined. In addition, the if over the box type is done at every iteration, and the compiler might not be able to see that the type variable is not modified during the execution.

If you think it's ready I can merge. A few more suggestions:

Iximiel commented 7 months ago

I think it is ok to merge as it is. The views can be moved to their own header in case they are needed elsewhere.

  • Did you try to use a single function to avoid duplicating the code inside the Pbc.cpp source? Usually transforming these constants into template parameters, so that you write the code with ifs that are resolved at compile time, works. Otherwise I can try to do it after the merge

I tried a Pbc::calculateGeneric, hoping that being in the same object it would be inlined both in Pbc::apply and in Pbc::distance, but it didn't perform well (I tried only with gcc 9.4). I think there is something that gets optimized very well in the expression Vector s=matmul(Vector{dlist[k][0],dlist[k][1],dlist[k][2]},invReduced); that I didn't manage to grasp and reproduce in Pbc::calculateGeneric.

I suspect it is that int* nshifts argument, but I haven't measured it yet, and I should, to be 100% certain of it.
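
The template-parameter idea from the bullet above could look roughly like this (a minimal sketch with hypothetical names and stand-in payloads, not PLUMED's code; requires C++17 for if constexpr):

```cpp
#include <vector>

// Sketch: hoist the runtime constant into a template parameter so
// each instantiation carries no branch in the hot loop.
enum class BoxType { Ortho, Generic };

template <BoxType T>
void applyImpl(std::vector<double>& d) {
  for (auto& x : d) {
    if constexpr (T == BoxType::Ortho) {
      x *= 0.5;          // stand-in for the orthorhombic fast path
    } else {
      x = x * 0.5 + 1.0; // stand-in for the generic path
    }
  }
}

void apply(std::vector<double>& d, bool ortho) {
  // single runtime dispatch, then branch-free inner loops
  if (ortho) applyImpl<BoxType::Ortho>(d);
  else       applyImpl<BoxType::Generic>(d);
}
```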

  • Given the distinctly faster results using the vectorized version, I was wondering if it's worth going to expensive CVs (like coordination) and modifying the source code so that:

    • we preallocate a buffer to store a bunch of distances (we have to be careful with OpenMP, though)

    • we use it to store the differences and apply PBCs on the buffer

    • then we use it to compute the rest of the function

    Before trying, it might be a good idea to check the relative weight of PBCs in coordination (with a non-orthorhombic cell); see the sketch below.

It may be worth a try. I think I have to dig out the flamegraphs again.
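
For concreteness, the buffered pattern from the list above might look like this (a rough sketch only: coordinationSketch and the pair list are hypothetical, while Pbc::apply on a std::vector<Vector> is the batched call from this PR):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the buffered pattern (hypothetical, not PLUMED's code):
// 1) fill a preallocated buffer with raw differences,
// 2) apply PBCs on the whole buffer at once (the vectorized path),
// 3) compute the rest of the CV from the wrapped differences.
void coordinationSketch(const Pbc& pbc,
                        const std::vector<Vector>& pos,
                        const std::vector<std::pair<int,int>>& pairs) {
  // one buffer per thread would be needed with OpenMP
  static thread_local std::vector<Vector> buffer;
  buffer.resize(pairs.size());
  for (std::size_t k = 0; k < pairs.size(); ++k) {
    buffer[k] = pos[pairs[k].first] - pos[pairs[k].second];
  }
  pbc.apply(buffer);  // batched PBC application
  // ... use buffer[k] to accumulate the coordination function ...
}
```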

  • I am still scared by the possibility of changing the rounding mode at runtime. Can you write a test that shows how this would fail? (I mean: change the setting at runtime with <cfenv> and show that you get incorrect results.)

I ~~have made this pedantic example~~ modified the example from cppreference when I tripped into the runtime-rounding-change rabbit hole.

GiovanniBussi commented 7 months ago

@Iximiel I checked the issue about rounding. The problem is with std::nearbyint(). We are using int(), which seems unaffected by std::fesetround.

See https://wandbox.org/permlink/vGmUXk2wa1tI4YKa
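
A minimal reproduction along those lines (my own sketch, not the exact wandbox snippet):

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  // std::nearbyint honors the runtime rounding mode...
  std::fesetround(FE_DOWNWARD);
  std::printf("%.0f %d\n", std::nearbyint(2.7), int(2.7));  // prints: 2 2
  std::fesetround(FE_UPWARD);
  std::printf("%.0f %d\n", std::nearbyint(2.7), int(2.7));  // prints: 3 2
  // ...while int() conversion always truncates toward zero,
  // regardless of std::fesetround.
  return 0;
}
```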

carlocamilloni commented 7 months ago

@GiovanniBussi @Iximiel after this merge plumed does not compile anymore on my mac, even after reconfiguring it:

```
Pbc.cpp:202:37: error: no viable conversion from 'const gch::small_vector<Vector, 27>'
(aka 'const small_vector<VectorGeneric<3>, 27>') to 'const std::vector<Vector>'
(aka 'const vector<VectorGeneric<3>>')
const std::vector<Vector> & myshifts(shifts[(s[0]>0?1:0)][(s[1]>0?1:0)][(s[2]>0?1:0)]);
```

carlocamilloni commented 7 months ago

ok I see, @GiovanniBussi, you just fixed it ;-)

GiovanniBussi commented 7 months ago

@Iximiel can you have a look at branch optimize-pbc?

What I did is:

This last change makes the implementation of apply even faster. The performance of distance should be unchanged wrt master.

Maybe you can check it on your test as well

Iximiel commented 7 months ago

@GiovanniBussi I tried to run it and I get a SIGSEGV at the very end of the execution (valgrind suspects an invalid free). But that happens after the benchmark data is collected:

| what | 10 (ns) | 100 (ns) | 1000 (ns) | 10000 (ns) |
|---|---:|---:|---:|---:|
| master[ortho] | 504308 | 3200724 | 30097406 | 295965221 |
| optimize-pbc[ortho] | 518536 | 3260674 | 30622721 | 303423997 |
| master[generic] | 1231679 | 11304539 | 101428943 | 1005759963 |
| optimize-pbc[generic] | 939713 | 7325162 | 71680727 | 707344598 |

(benchmark plot of the numbers above)

I find it strange that the ortho calculation is now slightly slower, because that code path has not been touched.

GiovanniBussi commented 7 months ago

@Iximiel mm, does it mean that with your compiler this optimization is useless? The performance looks very similar to what you posted at the beginning of this thread.