stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License
749 stars 187 forks

Eigen member function for reading fvar values and derivatives #1745

Closed andrjohns closed 4 years ago

andrjohns commented 4 years ago

Description

As first raised in this issue, it would be useful to have an Eigen function for reading a matrix of fvar<T> into two matrices of type T. By implementing this through Eigen, the matrix only needs to be traversed once, and the implementation could possibly take advantage of Eigen's SIMD vectorisation facilities.

This would be used like the following:

  MatrixXd vals(100, 100);
  MatrixXd derivs(100, 100);

  fvar_mat.read_fvar(vals, derivs);

Current Version:

v3.1.0

t4c1 commented 4 years ago

What you did not specify here is that assigning to this expression should modify the underlying matrices. That would be the advantage over to_fvar.

andrjohns commented 4 years ago

I'm not sure what you mean by assigning to the expression; can you give me an example?


t4c1 commented 4 years ago

I meant the example I posted in the linked issue. I'm making it a bit more complete here:

Eigen::MatrixXd vals(10,10), derivs(10,10);
Eigen::Matrix<fvar<double>,Eigen::Dynamic, Eigen::Dynamic> x(10,10);
vals.make_fvar(derivs) = x;

EXPECT_EQ(x(0,0).val_, vals(0,0));
EXPECT_EQ(x(0,0).d_, derivs(0,0));
...
andrjohns commented 4 years ago

Oh, I misunderstood what you meant in the issue: you want to create two matrices of doubles from a single matrix of fvars, not the other way around?

t4c1 commented 4 years ago

Exactly. Though I also would not mind a generalized to_fvar that would work the other way around.

andrjohns commented 4 years ago

That makes sense, I'll look into it

andrjohns commented 4 years ago

The simplest way to add it (that I can see) would be a member function like the following:

template <typename OtherDerived>
void read_fvar(OtherDerived& other1, OtherDerived& other2) {
  // Single pass over the fvar matrix, copying the value and tangent of
  // each coefficient into the two destination matrices.
  for (Eigen::Index i = 0; i < derived().size(); ++i) {
    other1.derived().coeffRef(i) = derived().coeff(i).val_;
    other2.derived().coeffRef(i) = derived().coeff(i).d_;
  }
}

Which, borrowing your example above, would be used like:

Eigen::MatrixXd vals(10,10), derivs(10,10);
Eigen::Matrix<fvar<double>,Eigen::Dynamic, Eigen::Dynamic> x(10,10);
x.read_fvar(vals, derivs);

So it would just be a single loop on the backend copying the values and derivatives into separate matrices. How does that sound?

bob-carpenter commented 4 years ago

Yes, please. This'd be nice in the same way that adjoint-Jacobian-apply is nice for reverse mode. For forward-mode matrix derivatives, we'd take a matrix of fvars, split it into matrices of values and tangents, perform the usual matrix hijinks on those, then put the results back together into a new matrix of fvars.

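A minimal sketch of that pattern, assuming the proposed read_fvar member exists (multiply_fwd and the surrounding scaffolding are illustrative, not Stan Math code), shown for matrix multiplication:

using stan::math::fvar;
using fvar_mat_t = Eigen::Matrix<fvar<double>, Eigen::Dynamic, Eigen::Dynamic>;

// Forward-mode matrix multiply: split the inputs into value/tangent
// matrices, do plain double matrix algebra, then reassemble fvars.
fvar_mat_t multiply_fwd(const fvar_mat_t& A, const fvar_mat_t& B) {
  Eigen::MatrixXd A_val(A.rows(), A.cols()), A_d(A.rows(), A.cols());
  Eigen::MatrixXd B_val(B.rows(), B.cols()), B_d(B.rows(), B.cols());
  A.read_fvar(A_val, A_d);  // proposed member function
  B.read_fvar(B_val, B_d);
  Eigen::MatrixXd Y_val = A_val * B_val;
  Eigen::MatrixXd Y_d = A_d * B_val + A_val * B_d;  // product rule
  fvar_mat_t Y(Y_val.rows(), Y_val.cols());
  for (Eigen::Index i = 0; i < Y.size(); ++i) {
    Y(i) = fvar<double>(Y_val(i), Y_d(i));
  }
  return Y;
}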

t4c1 commented 4 years ago

I was hoping for a way that would let Eigen evaluate the expression in a vectorized fashion. But that might be too hard to achieve.

Your suggestion has a few issues. First, not all expressions support linear indexing; for example, Eigen::Ref does not. Also, if you have a row-major expression and a column-major destination, linear indexing will address elements in different orders, producing a wrong result.

So to be general we must index matrices in two dimensions. We can either use two loops or leave the ordering up to Eigen with a somewhat hacky solution (I did not test this; I hope it works):

template<typename T_val, typename T_deriv>
void read_fvar(T_val& val, T_deriv& deriv) {
    val = T_val::NullaryExpr(val.rows(), val.cols(),
        [this, &deriv](Eigen::Index i, Eigen::Index j){
            deriv.coeffRef(i, j) = this->coeff(i, j).d_;
            return this->coeff(i, j).val_;
        });
}

I got the idea from here: https://eigen.tuxfamily.org/dox/TopicCustomizing_NullaryExpr.html

bob-carpenter commented 4 years ago


I was hoping for a way that would let Eigen evaluate the expression in a vectorized fashion. But that might be too hard to achieve.

I think it's hard to do that without paying the copy penalty.

You might be interested in how Autograd or JAX set this up, which is to have matrices that you can't assign into.

Your suggestion has a few issues. First, not all expressions support linear indexing; for example, Eigen::Ref does not. Also, if you have a row-major expression and a column-major destination, linear indexing will address elements in different orders, producing a wrong result.

This isn't a problem with a top-level Matrix<T, -1, -1>, which is column-major, but it can be a problem for template expressions.

So to be general we must index matrices in two dimensions. We can either use two loops or leave the ordering up to Eigen with a somewhat hacky solution (I did not test this; I hope it works):

template<typename T_val, typename T_deriv>
void read_fvar(T_val& val, T_deriv& deriv) {
    val = T_val::NullaryExpr(val.rows(), val.cols(),
        [this, &deriv](Eigen::Index i, Eigen::Index j){
            deriv.coeffRef(i, j) = this->coeff(i, j).d_;
            return this->coeff(i, j).val_;
        });
}

If we can specialize for when linear indexing is available, it's a huge win over two-dimensional indexing, which involves index arithmetic.
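A rough sketch of what that specialization could look like inside the plugin (untested; assumes C++17's if constexpr for exposition, and that Derived and derived() are in scope, as they are in an Eigen plugin):

template <typename OtherDerived>
void read_fvar(OtherDerived& other1, OtherDerived& other2) {
  // Dispatch at compile time on Eigen's LinearAccessBit: use single-index
  // traversal when the expression supports it, so no per-element index
  // arithmetic is needed.
  if constexpr (Eigen::internal::evaluator<Derived>::Flags
                & Eigen::LinearAccessBit) {
    for (Eigen::Index i = 0; i < derived().size(); ++i) {
      other1.derived().coeffRef(i) = derived().coeff(i).val_;
      other2.derived().coeffRef(i) = derived().coeff(i).d_;
    }
  } else {
    // Fall back to two-dimensional, column-major traversal.
    for (Eigen::Index j = 0; j < derived().cols(); ++j) {
      for (Eigen::Index i = 0; i < derived().rows(); ++i) {
        other1.derived().coeffRef(i, j) = derived().coeff(i, j).val_;
        other2.derived().coeffRef(i, j) = derived().coeff(i, j).d_;
      }
    }
  }
}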

andrjohns commented 4 years ago

@t4c1 Got a NullaryExpr version working; how does this look:

template <typename OtherDerived>
auto read_fvar(OtherDerived& other1, OtherDerived& other2) {
  return Derived::NullaryExpr(derived().rows(), derived().cols(),
                              [&](Index i, Index j){
                                other1.derived().coeffRef(i,j) = derived().coeff(i,j).val_;
                                other2.derived().coeffRef(i,j) = derived().coeff(i,j).d_;
                              });
}
t4c1 commented 4 years ago

I thought we needed to return some value from the lambda in NullaryExpr. Can you give me an example of how you call that? Anyway, if it works, it's fine.

andrjohns commented 4 years ago

Doesn't seem to be an issue, which is nice. As for usage, given a matrix of fvar<double>, which is fvar_mat in the example below:

  MatrixXd val_out(100, 100);
  MatrixXd d_out(100, 100);

  fvar_mat.read_fvar(val_out, d_out);
t4c1 commented 4 years ago

Interesting. I thought all Eigen expressions used lazy evaluation. Maybe they anticipated such a use case and made NullaryExpr evaluate eagerly if its functor returns void, even though it does not seem to be documented. Anyway, this makes the code easier to read than my suggestion.

andrjohns commented 4 years ago

Great, I'll write some tests and update some existing functions before I open a pull request with the added method.

andrjohns commented 4 years ago

I've put in an initial version for matrices of var (fvar not yet tested), comparing looping against two Eigen methods:

EigMethod (Current approach):

vals = mat.val();
adj = mat.adj();

EigMethod_New (New method):

mat.read_var(vals, adj);

Overall it looks like the new method is on par with plain looping (which is unsurprising, since that's what it's doing on the backend):

2020-04-11 12:19:03
Running ./EigCopyMethods
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.58, 0.57, 0.62
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
Matrix_EigMethod        3778165 ns      3778080 ns          172
Matrix_EigMethod_New    2610224 ns      2609719 ns          277
Matrix_LoopRMajor      19648612 ns     19652926 ns           34
Matrix_LoopCMajor       2562938 ns      2563043 ns          281

RowVec_EigMethod         160340 ns       160375 ns         4430
RowVec_EigMethod_New     128850 ns       128877 ns         5398
RowVec_Loop              130614 ns       130643 ns         5233

ColVec_EigMethod         263109 ns       263168 ns         2785
ColVec_EigMethod_New     163409 ns       163442 ns         3833
ColVec_Loop              169458 ns       169492 ns         3820

The added Eigen method is below, which checks whether the input is row- or column-major and loops appropriately:

template <typename OtherDerived>
inline void read_var(OtherDerived& other1, OtherDerived& other2) {
  Eigen::Index R = derived().rows();
  Eigen::Index C = derived().cols();
  // Iterate in storage order so the source matrix is traversed contiguously.
  if (derived().IsRowMajor) {
    for (Eigen::Index r = 0; r < R; ++r) {
      for (Eigen::Index c = 0; c < C; ++c) {
        other1.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->val_;
        other2.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->adj_;
      }
    }
  } else {
    for (Eigen::Index c = 0; c < C; ++c) {
      for (Eigen::Index r = 0; r < R; ++r) {
        other1.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->val_;
        other2.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->adj_;
      }
    }
  }
}

Overall, it's looking good!

t4c1 commented 4 years ago

That looks nice! I think Eigen's NullaryExpr might do some more optimizations (a single loop with linear indexing, where possible), but the benefit of that is likely small.

bob-carpenter commented 4 years ago

Cool idea. The SIMD speedups can be really big if the loops have the right unrolling strides.

Would it be possible to do the same kind of thing with methods for var? I don't think the same speedup would be possible because of the vari* chasing.

t4c1 commented 4 years ago

This is not actually using SIMD. The speedup comes from more efficient use of the cache: we only do a single pass over the matrix of var.
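To spell out the cache argument with the two methods from the benchmark above:

// Current approach: two full passes over mat, each of val() and adj()
// dereferencing every vari pointer once.
vals = mat.val();
adj = mat.adj();

// New method: one pass; each vari is dereferenced once, and val_ and adj_
// sit next to each other, so the second read hits the same cache line.
mat.read_var(vals, adj);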

bob-carpenter commented 4 years ago

SSE and AVX on modern chips can give you SIMD from unrolled loops that contain multiple copies of an operation. Keeping the SSE and AVX pipelines full requires efficient use of the cache and the right byte alignments. See the Wikipedia article on SIMD.

Code written specifically to allow SSE or AVX optimizations often looks like this, say, to copy an array:

for (int i = 0; i < N; i += 4) {
  a[i] = b[i];
  a[i + 1] = b[i + 1];
  a[i + 2] = b[i + 2];
  a[i + 3] = b[i + 3];
}

It doesn't look like it should do anything special, but with SSE, AVX, etc., the compiler can turn these unrolled bodies into parallelized operations that happen 4 at a time (or even 8 at a time with later AVX, I think). In some cases I think it can even do this automatically from a simple loop.
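For instance, a generic copy loop (an illustration, not Stan code) will typically auto-vectorize at -O3 once the compiler knows the pointers cannot alias:

// With __restrict the compiler knows a and b do not overlap, so
// g++ -O3 -march=native can emit SSE/AVX loads and stores without
// any manual unrolling.
void copy(double* __restrict a, const double* __restrict b, int N) {
  for (int i = 0; i < N; ++i) {
    a[i] = b[i];
  }
}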

See, for example, Alex Barnett's C++ benchmarks for n-body problems. There are examples using loops, OpenCL, and OpenCL with loop unrolling and Agner Fog's vector class, plus further optimizations using specialized inverse-sqrt functions (one of which is implemented in Eigen).

t4c1 commented 4 years ago

Our case is not that simple: we are not copying consecutive memory locations. We are also accessing values through different pointers. I doubt the compiler can deduce that they point to consecutive locations. Even if it could, we have interleaved val_s and adj_s, which we are copying to different destinations.
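For reference, a stripped-down picture of the types involved (simplified from Stan Math's actual var and vari):

struct vari {
  double val_;  // value
  double adj_;  // adjoint
};
struct var {
  vari* vi_;  // pointer to heap-allocated value/adjoint pair
};
// Reading x(i).vi_->val_ for each i chases one pointer per element, and
// the values it reads are interleaved with adjoints rather than contiguous.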

bob-carpenter commented 4 years ago

Our case is not that simple - we are not copying consecutive memory locations.

Right. That's why I was wondering if it'd still be useful. In practice, the vari* pointers for a Matrix<var, R, C> will be consecutive, but that breaks as soon as we assign into the matrix.

t4c1 commented 4 years ago

Yeah. That is why this can never be as efficient as what we can do once we have those static matrices (#1805).

andrjohns commented 4 years ago

@t4c1 Thanks! I tried the NullaryExpr approach, but couldn't get it to evaluate properly (it always returned 0). I've got an example up on godbolt here (a bit verbose, since I wrote out the full functor rather than a lambda) where you can see the issue.

@bob-carpenter I ran the program through g++ with the flag -fopt-info-vec-all, and it looks like it isn't being vectorised:

math/stan/math/prim/eigen_plugins.h:210:22: missed: couldn't vectorize loop
math/stan/math/prim/eigen_plugins.h:212:72: missed: not vectorized: data ref analysis failed: _356 = _348->val_;
math/stan/math/prim/eigen_plugins.h:211:24: missed: couldn't vectorize loop
math/stan/math/prim/eigen_plugins.h:212:72: missed: versioning for alias not supported for: can't determine dependence between _348->val_ and *_355

Which I think indicates pointer overlap (where multiple pointers point to the same location in memory), something that inhibits auto-vectorisation (more info here)

bob-carpenter commented 4 years ago

That option -fopt-info-vec-all is super cool! Thanks for sharing.

I'd guess it means the pointers could overlap, not that they necessarily do, because that info probably isn't available through static (compile-time) analysis. The compiler's message seems to indicate that, too: "can't determine dependence between _348->val_ and *_355"

andrjohns commented 4 years ago

That's the most verbose of the flags; there are a bunch of others for specific optimizations as well: https://www.codingame.com/playgrounds/283/sse-avx-vectorization/autovectorization

t4c1 commented 4 years ago

@andrjohns I got your NullaryExpr example working here. I also removed the redundant .derived() calls and added a single-argument operator() to the functor (Eigen will use this one if linear indexing is possible).

Since you already have a benchmark, could you also benchmark this implementation? It might be a bit faster, since in most cases no index arithmetic is required.
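For reference, the functor shape in question looks roughly like this (a sketch based on the linked godbolt example, not a verbatim copy; names are illustrative):

// NullaryExpr functor: returns the value and writes the derivative as a
// side effect. Eigen calls the single-index overload when linear access
// is possible, so no index arithmetic is needed in that case.
template <typename FvarMat, typename DerivMat>
struct read_fvar_functor {
  const FvarMat& src_;
  DerivMat& derivs_;
  read_fvar_functor(const FvarMat& src, DerivMat& derivs)
      : src_(src), derivs_(derivs) {}
  // Two-dimensional access.
  double operator()(Eigen::Index i, Eigen::Index j) const {
    derivs_.coeffRef(i, j) = src_.coeff(i, j).d_;
    return src_.coeff(i, j).val_;
  }
  // Linear access, used when the destination allows it.
  double operator()(Eigen::Index i) const {
    derivs_.coeffRef(i) = src_.coeff(i).d_;
    return src_.coeff(i).val_;
  }
};

// Usage sketch: fills vals via the returned expression and derivs as a
// side effect, in a single pass.
// vals = Eigen::MatrixXd::NullaryExpr(x.rows(), x.cols(),
//     read_fvar_functor<decltype(x), Eigen::MatrixXd>(x, derivs));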

andrjohns commented 4 years ago

Nice! I hadn't thought of getting it to evaluate that way. I added it to the benchmark and there is a slight performance edge, but it's pretty small. However, the NullaryExpr version would likely generalise to expressions better than mine.

2020-04-14 16:15:30
Running ./EigCopyMethods
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.72, 0.67, 0.63
-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
Matrix_EigMethod          3786747 ns      3786828 ns          178
Matrix_EigMethod_New      2529627 ns      2529648 ns          276
Matrix_EigMethod_Tadej    2466528 ns      2466532 ns          279
Matrix_LoopRMajor        18095283 ns     18094845 ns           38
Matrix_LoopCMajor         2496594 ns      2496625 ns          278

RowVec_EigMethod           168161 ns       168159 ns         4176
RowVec_EigMethod_New       136515 ns       136516 ns         5105
RowVec_EigMethod_Tadej     135394 ns       135397 ns         3931
RowVec_Loop                136652 ns       136653 ns         5057

ColVec_EigMethod           175110 ns       175113 ns         4000
ColVec_EigMethod_New       139783 ns       139783 ns         5099
ColVec_EigMethod_Tadej     138213 ns       138213 ns         4978
ColVec_Loop                132974 ns       132975 ns         5121
t4c1 commented 4 years ago

Nice! This seems ready for a PR. Will you do it, or should I?

t4c1 commented 4 years ago

Actually, looking at the results again, the differences between your implementation and mine seem almost negligible (and a bit random). I still prefer NullaryExpr.

andrjohns commented 4 years ago

Yeah, I'm in favour of NullaryExpr as well, since there's also the possibility of future benefits if the Eigen team further optimises nullary expressions. I'll take care of the PR; the only thing to clarify is the number of member functions.

For matrices of var, headers generally need various combinations of val/adj/vari, rather than all three. So we could have member functions like the following:

mat.read_vari_val_adj(out1, out2, out3)
mat.read_vari_val(out1, out2)
mat.read_vari_adj(out1, out3)
mat.read_val_adj(out2, out3)

Or is there a better way to handle this?

t4c1 commented 4 years ago

I don't know of any case where we need vari. That would leave only mat.read_val_adj(out2, out3). But if you know of any, go ahead and implement all of those.

andrjohns commented 4 years ago

It gets used a bit in the rev matrix functions (see the multiply header, where .vi() and .val() are used separately).

Which makes me think that the naming should be read_vi_* for consistency with the other member functions.
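For illustration, the set proposed above would then become (same signatures, just renamed; still subject to the PR):

mat.read_vi_val_adj(out1, out2, out3)
mat.read_vi_val(out1, out2)
mat.read_vi_adj(out1, out3)
mat.read_val_adj(out2, out3)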