Closed: andrjohns closed this issue 4 years ago.
What you did not specify here is that assigning to this expression should modify the underlying matrices. That would be the advantage over to_fvar.
I'm not sure what you mean by assigning to the expression, can you give me an example?
On Tuesday, February 25, 2020, Tadej Ciglarič wrote (re: [stan-dev/math] Eigen member function for constructing fvar matrix, #1745):
> What you did not specify here is that assigning to this expression should modify the underlying matrices. That would be the advantage over to_fvar.
I meant the example I posted in the linked issue. I'm making it a bit more complete here:
Eigen::MatrixXd vals(10,10), derivs(10,10);
Eigen::Matrix<fvar<double>,Eigen::Dynamic, Eigen::Dynamic> x(10,10);
vals.make_fvar(derivs) = x;
EXPECT_EQ(x(0,0).val_, vals(0,0));
EXPECT_EQ(x(0,0).d_, derivs(0,0));
...
Oh I misunderstood what you meant in the issue, you want to create two matrices of doubles from a single matrix of fvars, not the other way around?
Exactly. Though I also would not mind a generalized to_fvar that would work the other way around.
That makes sense, I'll look into it
The simplest way to add it (that I can see) would be a member function like the following:
template <typename OtherDerived>
void read_fvar(OtherDerived& other1, OtherDerived& other2) {
  for (size_t i = 0; i < derived().size(); ++i) {
    other1.derived().coeffRef(i) = derived().coeff(i).val_;
    other2.derived().coeffRef(i) = derived().coeff(i).d_;
  }
}
Which, borrowing your example above, would be used like:
Eigen::MatrixXd vals(10,10), derivs(10,10);
Eigen::Matrix<fvar<double>,Eigen::Dynamic, Eigen::Dynamic> x(10,10);
x.read_fvar(vals, derivs);
So it would just be a single loop on the backend copying the values and derivatives into separate matrices. How does that sound?
Yes, please. This'd be nice in the same way that adjoint-Jacobian-apply is nice for reverse mode. For forward-mode matrix derivatives, we'd take a matrix of fvars, produce the matrix of values and tangents, then perform the usual matrix hijinks to calculate values and tangents, then put the results back together into a new matrix of fvars.
On Feb 25, 2020, at 8:22 AM, Tadej Ciglarič wrote:
> Exactly. Though I also would not mind a generalized to_fvar that would work the other way around.
I was hoping for a way that would let Eigen evaluate the expression in a vectorized fashion. But that might be too hard to achieve.
Your suggestion has a few issues. First, not all expressions support linear indexing; for example, Eigen::Ref does not. Also, if you get a row-major expression and a column-major destination, linear indexing will address elements in different orders, producing a wrong result.
So to be general we must index matrices in two dimensions. We can either use two loops or leave the order up to Eigen in a somewhat hacky solution (I did not test it; I hope it works):
template <typename T_val, typename T_deriv>
void read_fvar(T_val& val, T_deriv& deriv) {
  val = T_val::NullaryExpr(val.rows(), val.cols(),
    [this, &deriv](Eigen::Index i, Eigen::Index j) {
      deriv.coeffRef(i, j) = this->coeff(i, j).d_;
      return this->coeff(i, j).val_;
    });
}
I got the idea from here: https://eigen.tuxfamily.org/dox/TopicCustomizing_NullaryExpr.html
On Feb 25, 2020, at 11:45 AM, Tadej Ciglarič wrote:
> I was hoping for a way that would let Eigen evaluate the expression in a vectorized fashion. But that might be too hard to achieve.
I think it's hard to do that without paying the copy penalty.
You might be interested in how Autograd or JAX set this up, which is to have matrices which you can't assign into.
> Your suggestion has a few issues. First, not all expressions support linear indexing; for example, Eigen::Ref does not. Also, if you get a row-major expression and a column-major destination, linear indexing will address elements in different orders, producing a wrong result.
This isn't a problem with a top-level Matrix<T, -1, -1>, which has column-major order, but can be a problem for template expressions.
> So to be general we must index matrices in two dimensions. We can either use two loops or leave the order up to Eigen in a somewhat hacky solution:
>
> template <typename T_val, typename T_deriv>
> void read_fvar(T_val& val, T_deriv& deriv) {
>   val = T_val::NullaryExpr(val.rows(), val.cols(),
>     [this, &deriv](Eigen::Index i, Eigen::Index j) {
>       deriv.coeffRef(i, j) = this->coeff(i, j).d_;
>       return this->coeff(i, j).val_;
>     });
> }
If we can specialize for when linear indexing is available, it's a huge win over two-dimensional indexing, which involves index arithmetic.
@t4c1 Got a NullaryExpr version working, how does this look:
template <typename OtherDerived>
auto read_fvar(OtherDerived& other1, OtherDerived& other2) {
  return Derived::NullaryExpr(derived().rows(), derived().cols(),
    [&](Index i, Index j) {
      other1.derived().coeffRef(i, j) = derived().coeff(i, j).val_;
      other2.derived().coeffRef(i, j) = derived().coeff(i, j).d_;
    });
}
I thought we needed to return some value from the lambda in NullaryExpr. Can you give me an example of how you call that? Anyway, if it works, it's fine.
Doesn't seem to be an issue, which is nice. As for usage, given a matrix of fvar<double>, which is fvar_mat in the example below:
MatrixXd val_out(100, 100);
MatrixXd d_out(100, 100);
fvar_mat.read_fvar(val_out, d_out);
Interesting. I thought all Eigen expressions use lazy evaluation. Maybe they anticipated such a use case and made NullaryExpr evaluate eagerly if its functor returns void, even though that does not seem to be documented. Anyway, this makes the code easier to read than my suggestion.
Great, I'll write some tests and update some existing functions before I open a pull with the added method
I've put in an initial version for matrices of var (fvar not yet tested), comparing looping against two Eigen methods:
EigMethod (Current approach):
vals = mat.val();
adj = mat.adj();
EigMethod_New (New method):
mat.read_var(vals, adj);
Overall it looks like the new method is on par with plain looping (which is unsurprising, since that's what it's doing on the backend):
2020-04-11 12:19:03
Running ./EigCopyMethods
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.58, 0.57, 0.62
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
Matrix_EigMethod 3778165 ns 3778080 ns 172
Matrix_EigMethod_New 2610224 ns 2609719 ns 277
Matrix_LoopRMajor 19648612 ns 19652926 ns 34
Matrix_LoopCMajor 2562938 ns 2563043 ns 281
RowVec_EigMethod 160340 ns 160375 ns 4430
RowVec_EigMethod_New 128850 ns 128877 ns 5398
RowVec_Loop 130614 ns 130643 ns 5233
ColVec_EigMethod 263109 ns 263168 ns 2785
ColVec_EigMethod_New 163409 ns 163442 ns 3833
ColVec_Loop 169458 ns 169492 ns 3820
The added Eigen method is below, which checks whether the input is row- or column-major and loops appropriately:
template <typename OtherDerived>
inline decltype(auto) read_var(OtherDerived& other1, OtherDerived& other2) {
  size_t R = derived().rows();
  size_t C = derived().cols();
  if (derived().IsRowMajor) {
    for (size_t r = 0; r < R; ++r) {
      for (size_t c = 0; c < C; ++c) {
        other1.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->val_;
        other2.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->adj_;
      }
    }
  } else {
    for (size_t c = 0; c < C; ++c) {
      for (size_t r = 0; r < R; ++r) {
        other1.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->val_;
        other2.derived().coeffRef(r, c) = derived().coeffRef(r, c).vi_->adj_;
      }
    }
  }
}
Overall, it's looking good!
That looks nice! I think Eigen's NullaryExpr might do some more optimizations (a single loop with linear indexing, where possible), but the benefit of that is likely small.
Cool idea. The SIMD speedups can be really big if they have the right unfolding strides.
Would it be possible to do the same kind of thing with methods for var? I don't think the same speedup would be possible because of the vari* chasing.
This is not actually using simd. The speedup comes from more efficient use of cache. We only do a single pass over the matrix of var.
SSE and AVX on the chips can give you SIMD from unrolled loops that contain multiple copies of operations. Keeping the SSE and AVX pipelines full requires efficient use of cache and the right byte alignments. See the Wikipedia on SIMD.
Code written specifically to allow SSE or AVX optimizations often looks like this, say, to copy an array:
for (int i = 0; i < N; i += 4) {
  a[i] = b[i];
  a[i + 1] = b[i + 1];
  a[i + 2] = b[i + 2];
  a[i + 3] = b[i + 3];
}
It doesn't look like it'll do anything, but with SSE, AVX, etc., the compiler can target unrolling these into parallelized operations that happen 4 at a time (or even 8 at a time with the later AVX, I think). In some cases I think it can even do it automatically from a simple loop.
See, for example, Alex Barnett's C++ benchmarks for n-body problems. There are examples of using loops, OpenCL, and OpenCL with loop unrolling and Agner Fog's vector class, plus further optimizations using specialized inverse sqrt functions (one of which is implemented in Eigen).
Our case is not that simple: we are not copying consecutive memory locations. We are also accessing values through different pointers. I doubt the compiler can deduce they point to consecutive locations. Even if it could, we have interleaved val_s and adj_s, which we are copying to different destinations.
> Our case is not that simple: we are not copying consecutive memory locations.
Right. That's why I was wondering if it'd still be useful. In practice, the vari* pointers for a Matrix<var, R, C> will be consecutive, but that breaks as soon as we assign into the matrix.
Yeah. That is why this can never be as efficient as what we can do once we have those static matrices (#1805).
@t4c1 Thanks! I tried the NullaryExpr approach, but couldn't get it to evaluate properly (it always returned 0). I've got an example up on godbolt here (a bit verbose since I wrote out the full functor rather than a lambda) where you can see the issue.
@bob-carpenter I ran the program through g++ with the flag -fopt-info-vec-all, and it looks like it isn't being vectorised:
math/stan/math/prim/eigen_plugins.h:210:22: missed: couldn't vectorize loop
math/stan/math/prim/eigen_plugins.h:212:72: missed: not vectorized: data ref analysis failed: _356 = _348->val_;
math/stan/math/prim/eigen_plugins.h:211:24: missed: couldn't vectorize loop
math/stan/math/prim/eigen_plugins.h:212:72: missed: versioning for alias not supported for: can't determine dependence between _348->val_ and *_355
Which I think indicates possible pointer overlap (where multiple pointers may point to the same location in memory), which inhibits auto-vectorisation (more info here).
That option -fopt-info-vec-all is super cool! Thanks for sharing.
I'd guess it means the pointers could overlap, not that they necessarily do, because that info probably isn't available through static (compile-time) analysis. The compiler's message seems to indicate that, too: "can't determine dependence between _348->val_ and *_355".
That's the most verbose of the flags; they have a bunch of others for specific optimizations as well: https://www.codingame.com/playgrounds/283/sse-avx-vectorization/autovectorization
@andrjohns I got your NullaryExpr example working here. I also removed redundant .derived() calls and added a single-argument operator() to the functor (Eigen will use this one if linear indexing is possible).
Since you already have a benchmark, I would ask you to also benchmark this implementation. It might be a bit faster, since in most cases no index arithmetic is required.
Nice! I hadn't thought of getting it to evaluate that way. I added it to the benchmark and there is a slight performance edge, but it's pretty small. However the NullaryExpr would likely generalise to expressions better than my version.
2020-04-14 16:15:30
Running ./EigCopyMethods
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.72, 0.67, 0.63
-----------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------
Matrix_EigMethod 3786747 ns 3786828 ns 178
Matrix_EigMethod_New 2529627 ns 2529648 ns 276
Matrix_EigMethod_Tadej 2466528 ns 2466532 ns 279
Matrix_LoopRMajor 18095283 ns 18094845 ns 38
Matrix_LoopCMajor 2496594 ns 2496625 ns 278
RowVec_EigMethod 168161 ns 168159 ns 4176
RowVec_EigMethod_New 136515 ns 136516 ns 5105
RowVec_EigMethod_Tadej 135394 ns 135397 ns 3931
RowVec_Loop 136652 ns 136653 ns 5057
ColVec_EigMethod 175110 ns 175113 ns 4000
ColVec_EigMethod_New 139783 ns 139783 ns 5099
ColVec_EigMethod_Tadej 138213 ns 138213 ns 4978
ColVec_Loop 132974 ns 132975 ns 5121
Nice! This seems ready for a PR. Will you do it, or should I?
Actually, looking at the results again, the differences between your and my implementations seem almost negligible (and a bit random). I still prefer NullaryExpr.
Yeah, I'm in favour of NullaryExpr as well, since there's also the possibility of future benefits if the Eigen team further optimises nullary expressions. I'll take care of the PR; the only thing to clarify is the number of member functions.
For matrices of var, headers generally need various combinations of val/adj/vari, rather than all three. So we could have member functions like the following:
mat.read_vari_val_adj(out1, out2, out3)
mat.read_vari_val(out1, out2)
mat.read_vari_adj(out1, out3)
mat.read_val_adj(out2, out3)
Or is there a better way to handle this?
I don't know of any case where we need vari. That would only mean mat.read_val_adj(out2, out3). But if you know of any, go ahead and implement all of those.
It gets used a bit in the rev matrix functions (see the multiply header, where the .vi() and .val() are used separately).
Which makes me think that the naming should be read_vi_* for consistency with the other member functions.
Description
As first raised in this issue, it would be useful to have an Eigen function for reading a matrix of fvar<T> into two matrices of type T. By implementing this through Eigen, it can be achieved by traversing the matrix only once, possibly taking advantage of Eigen's SIMD vectorisation facilities. This would be used like the following:
Current Version:
v3.1.0