stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org

Wrong derivatives for cholesky and symmetric eigen decomposition #1803

Open IvanYashchuk opened 4 years ago

IvanYashchuk commented 4 years ago

Description

Currently, the derivatives of cholesky_decompose and eigenvalues_sym come back as a lower triangular matrix. Finite differences return the same thing, because the implementation of those functions reads only part of the input matrix, and that is why this bug is not caught by the test system. I think the correct derivative matrix should be symmetric, not lower triangular. The error also affects any other function involved in the computation, so the calculated autodiff derivative is no longer correct.

Example

Consider calculating log det A when A is an invertible symmetric matrix. This can be done using log_determinant, or via cholesky_decompose or eigenvalues_sym. The derivative of log det A with respect to A is the inverse of A. The code below demonstrates that a.adj() is only the lower triangular part of the inverse of A, while the full symmetric matrix is expected.

#include <stan/math.hpp>
#include <iostream>

int main() {
    using stan::math::eigenvalues_sym;
    using stan::math::cholesky_decompose;
    using stan::math::determinant;
    using stan::math::log_determinant;
    using stan::math::diagonal;
    using stan::math::log;
    using stan::math::sum;

    stan::math::matrix_d a(4, 4);
    // Random symmetric matrix
    a << 1.8904,  0.7204, -0.1599,  1.2028,
         0.7204,  7.3394,  2.0895, -0.6151,
        -0.1599,  2.0895,  2.4601, -1.7219,
         1.2028, -0.6151, -1.7219,  4.5260;

    stan::math::matrix_v a1(a);
    stan::math::matrix_v a2(a);
    stan::math::matrix_v a3(a);
    stan::math::matrix_v a4(a);

    std::cout << "Here is the input matrix a:\n" << a << std::endl;

    // Way 1
    stan::math::matrix_v chol_mat = cholesky_decompose(a1);
    auto logdet1 = 2 * sum(log(diagonal(chol_mat)));

    // Way 2
    auto w = eigenvalues_sym(a2);
    auto logdet2 = sum(log(w));

    // Way 3
    auto det = determinant(a3);
    auto logdet3 = log(det);

    // Way 4
    auto logdet4 = log_determinant(a4);

    std::cout << "logdet1 = " << logdet1 << std::endl;
    std::cout << "logdet2 = " << logdet2 << std::endl;
    std::cout << "logdet3 = " << logdet3 << std::endl;
    std::cout << "logdet4 = " << logdet4 << std::endl;
    std::cout << "Are these all log(det(A))? " << ( (logdet1 - logdet2 < 1e-8) && (logdet2 - logdet3 < 1e-8) ? "True" : "False") << std::endl;

    stan::math::matrix_d a_inv = stan::math::inverse(a);
    std::cout << "Here is the matrix a_inv:\n" << a_inv << std::endl;

    stan::math::set_zero_all_adjoints();
    logdet1.grad();
    std::cout << "Here is the matrix a1_adj:\n" << a1.adj() << std::endl;
    bool way_1 = a_inv.val().isApprox(a1.adj());

    stan::math::set_zero_all_adjoints();
    logdet2.grad();
    std::cout << "Here is the matrix a2_adj:\n" << a2.adj() << std::endl;
    bool way_2 = a_inv.val().isApprox(a2.adj());

    stan::math::set_zero_all_adjoints();
    logdet3.grad();
    std::cout << "Here is the matrix a3_adj:\n" << a3.adj() << std::endl;
    bool way_3 = a_inv.val().isApprox(a3.adj());

    stan::math::set_zero_all_adjoints();
    logdet4.grad();
    std::cout << "Here is the matrix a4_adj:\n" << a4.adj() << std::endl;
    bool way_4 = a_inv.val().isApprox(a4.adj());

    std::cout << "Is way 1 correct? " << (way_1 ? "True" : "False") << std::endl;
    std::cout << "Is way 2 correct? " << (way_2 ? "True" : "False") << std::endl;
    std::cout << "Is way 3 correct? " << (way_3 ? "True" : "False") << std::endl;
    std::cout << "Is way 4 correct? " << (way_4 ? "True" : "False") << std::endl;
}

Current Version:

v3.1.0

bob-carpenter commented 4 years ago

Thanks much for the careful report. This indeed looks like a problem that we should fix. I recall going over all of this when we were first building our autodiff and thinking that we didn't want to double the derivatives by copying them into the upper triangle. So I think this may require some clever design to get to work in all contexts.

bbbales2 commented 4 years ago

I recall going over all of this when we were first building our autodiff and thinking that we didn't want to double the derivatives by copying them into the upper triangle.

Oof wow, this is difficult. But it's also a really compelling example. I don't even know where to start.

Is it possible we could fix this by doing an X = (A^T + A) / 2 thing to everything that comes in (after we check the A symmetry) and then only autodiffing the lower triangle of that?

I guess that's the same as just putting half the adjoint in the upper triangle and half in the lower triangle which seems rather ad-hoc.
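A minimal sketch of that idea (a hypothetical helper, not existing Stan Math code): symmetrize the input before handing it to the lower-triangle-only routine, so that in the reverse pass the adjoint flowing back to A is split evenly between the (i, j) and (j, i) entries.

#include <stan/math.hpp>

// Hypothetical helper: X = (A + A^T) / 2. For an exactly symmetric A this
// leaves the values unchanged, but the reverse pass through it gives
// adj(A) = (adj(X) + adj(X)^T) / 2, i.e. half of each adjoint in each triangle.
stan::math::matrix_v presymmetrize(const stan::math::matrix_v& A) {
  using stan::math::add;
  using stan::math::multiply;
  using stan::math::transpose;
  return multiply(0.5, add(A, transpose(A)));
}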

in all contexts.

Do you know how to enumerate the contexts?

bbbales2 commented 4 years ago

Alright I think what we're doing here is as right as we might possibly get with our current autodiff.

Maybe we can fix at least this symmetry thing with the var_type, whole matrix stuff coming along in the pipeline (I somehow doubt it).

I guess the fundamental weirdness is what @IvanYashchuk pointed out -- that somehow you get different adjoints from A when doing autodiff on log(det(A)) and sum(log(diag(cholesky(A)))).

I think the point is that in the first, A is unconstrained. In the second, A is constrained to be symmetric. So for the second, it does not make sense to compute individually the adjoint of A(i, j) (j != i) because that is a derivative in a direction out of the space in which A lives.

Because of how our autodiff is defined, we don't respect the input constraint, and so it's possible to ask for the adjoint of off diagonal A(i, j) and A(j, i) individually.

I'm not sure to what extent it's on the user to think through these constraints.

It's like, if you have f(a, b) = a + b in an unconstrained space, the adjoint of a is the adjoint of f and the adjoint of b is the adjoint of f.

If you define the constraint, a + b = 1, then f is just a constant and doesn't change (so its adjoint is zero) and so are the adjoints of a and b. So the domain matters for all this, and because these functions have different domains, things behave differently.
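For the unconstrained case, a quick standalone check (not from the thread, just illustrating the point above):

#include <stan/math.hpp>
#include <iostream>

int main() {
  // f(a, b) = a + b with a and b treated as free (unconstrained) variables.
  stan::math::var a = 0.3, b = 0.7;
  stan::math::var f = a + b;
  f.grad();
  // Both adjoints equal the adjoint of f, which grad() seeds to 1.
  std::cout << "a.adj() = " << a.adj() << ", b.adj() = " << b.adj() << std::endl;
}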

And because our autodiff is elementwise, even though we know the adjoint of A should be a symmetric matrix, we have no way to enforce that on the adjoints, so the best we can do is pretend it is parameterized by the lower triangle and roll with that.

I guess my conclusion here is that the behavior now is the correct behavior, and for the symmetric eigensolve we should keep doing lower triangle adjoints (though the custom autodiff is still sweet).

IvanYashchuk commented 4 years ago

It's an implementation detail that the routines taking symmetric input only read the lower/upper part of the matrix to produce the final result. Incidentally, the current implementation seems correct because you compare against numerical derivatives in the tests, and those also only pick up the triangular part of the symmetric thing. I don't think the current implementation would pass the adjoint identity test (*).

Reverse mode autodiff implements the adjoint operator. If intermediate computations involve symmetric routines and the corresponding derivatives in the backward pass are not symmetric, the whole computation is no longer mathematically correct. This is demonstrated in the example: log(det(A)) and sum(log(diag(cholesky(A)))) are mathematically equivalent computations, so their derivatives should also be equivalent, and the correct result in this case is inverse(A), not triangular_part(inverse(A)).

This behavior is fixable in reverse mode by implementing rules for the symmetric routines that return symmetric derivatives (as is done in #1878). For Cholesky, the derivative output should be symmetrized with (A^T + A) / 2. I don't know how the forward mode works, though.

Note, the same issue was discussed in PyTorch #18825, #22807. They've decided to symmetrize the derivative.

(*) Adjoint identity (or adjoint relation): for any linear operator M there exists an adjoint operator M* such that ∀x, ∀y: <x, My> = <M*x, y>.
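In matrix terms (my notation, not from the comment above), the same identity with the trace inner product <X, Y> = trace(X^T Y) reads: if dL = D(A)[dA] is the differential of the decomposition at A in direction dA, then the reverse-mode adjoint adjA is whatever matrix satisfies <adjL, D(A)[dA]> = <adjA, dA> for every admissible perturbation dA. The disagreement in this thread is essentially about whether dA ranges over all matrices or only over symmetric ones.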

bbbales2 commented 4 years ago

Yeah I think you're right. Thanks for the links and everything.

@bob-carpenter We don't have to be worried about double counting here.

Like if we're doing a simpler example, just:

A = [ 1.0, 0.5
      0.5, 1.0 ]

If the output is log(det(A)) computed with cholesky_decompose or eigenvalues_sym, then with our current autodiff the adjoint of A is:

A_adj = [ 1.33333, 0
         -1.33333, 1.33333 ]

And the proposed adjoint would be:

A_adj = [ 1.33333, -0.666667
          -0.666667, 1.33333]

Which matches the adjoints you get from computing this using log_determinant which has no constraints on the input.
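That unconstrained result is easy to reproduce (a standalone snippet along the lines of the issue example):

#include <stan/math.hpp>
#include <iostream>

int main() {
  stan::math::matrix_v A(2, 2);
  A << 1.0, 0.5,
       0.5, 1.0;
  // log_determinant places no symmetry constraint on its input.
  stan::math::var logdet = stan::math::log_determinant(A);
  logdet.grad();
  // Expected adjoint: A^-1 = [4/3, -2/3; -2/3, 4/3].
  std::cout << A.adj() << std::endl;
}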

So I believe what we had in place was just an optimization so that cholesky_decompose and eigenvalues_sym were using half the number of varis.

Because of the symmetry requirements in cholesky_decompose, we know that the off diagonals have to be the same thing everywhere, and so incrementing twice by -2/3 is the same as incrementing once by -4/3.

Conceivably someone could use this autodiff in some space where they do not respect this symmetry requirement, but then the instant they move to a region where the inputs to the function aren't symmetric, everything will blow up in the forward mode evaluation, which is the intended behavior.

So we're the ones doing the double counting now lol. I think what @IvanYashchuk is proposing makes sense. We'll end up with twice the varis which will be slower, but it'll protect everyone's sanity a bit more. Also twice as many varis is an O(N^2) thing and these are both O(N^3) functions.

bob-carpenter commented 4 years ago

And this is demonstrated in the example log(det(A)) and sum(log(diag(cholesky(A)))), two computations mathematically are equivalent, so the derivative should be also equivalent

Thanks, @IvanYashchuk, that's a great diagnosis of the problem.

We'll end up with twice the varis which will be slower, but it'll protect everyone's sanity a bit more.

I agree that that's the right decision.

And thanks everyone---this issue has been bugging me ever since we first added these functions.

bbbales2 commented 4 years ago

@IvanYashchuk One more question -- is there a similar test we can add for the eigenvectors_sym?

Something that would have failed with the previous implementation? I was trying to figure something out myself for the pull request and just got confused.

Like what's getting me now is the adjoints as written out here coming from the eigenvectors aren't necessarily symmetric.

I guess the real motivation wasn't necessarily the symmetry -- it was making sure we got the same thing with log(det(A)) and sum(log(diag(cholesky(A)))).

By the above logic (that adjA_{ij} == adjA_{ji} must always hold for us to be in the right space), it doesn't matter where we increment our adjoints: the actual space is symmetric by construction, and either solution is equivalent in a constrained space.

So I guess while we have the intuitive behavior for eigenvalues -- is there an intuitive behavior for eigenvectors that we should target?

bbbales2 commented 4 years ago

@IvanYashchuk thanks for the patience on this!

rtrangucci commented 4 years ago

I agree with @IvanYashchuk that the right behavior is probably to enforce symmetry in the adjoints, but I disagree with @bbbales2 that we're double counting on the off-diagonals. If you compare the output for the adjoints returned by cholesky_decompose through autodiff and that of finite-diff for the example log(det(A)) where

A = [1, 0.5,
     0.5, 1]

you'll see that the off-diagonal elements are -4/3. I think the correct derivative for log(det(A)) when A is symmetric is 2 A^(-1) - diagonal(A^(-1)) as shown in Minka's notes (see equation 22 on page 4):

[Screenshot of equation 22 from Minka's notes]
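Written out (my rendering of the formula quoted above; the screenshot itself is not reproduced), that is

d/dA log det A = 2 A^(-1) - diag(diag(A^(-1)))   for symmetric A,

where diag(diag(.)) keeps only the diagonal of A^(-1). For the 2x2 example, A^(-1) = [4/3, -2/3; -2/3, 4/3], so the formula gives [4/3, -4/3; -4/3, 4/3], matching the -4/3 off-diagonals above.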
bbbales2 commented 4 years ago

but I disagree with @bbbales2 that we're double counting on the off-diagonals

It was more of a joke to highlight a misunderstanding I had. What we had shouldn't be interpreted as the lower triangle of a symmetric matrix of adjoints, cause if you tried to do that you'd end up double counting.

rtrangucci commented 4 years ago

@bbbales2 haha sorry I misinterpreted! Just to be clear the current adjoint of A is

[4/3, 0
 -4/3, 4/3]

and the proposed adjoint should be:

[4/3, -4/3
 -4/3, 4/3]

?

bbbales2 commented 4 years ago

It should be A^-1:

[4/3, -2/3,
 -2/3, 4/3]

The most straightforward way to get this is if you trust our log determinant implementation :P: https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/log_determinant.hpp#L28

Otherwise: https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf

I don't understand the arguments for why this one in particular is the one you gotta go with, but I'm not exactly a mathematician. That's part of why I was asking the follow-up question about the eigenvectors. Is there some particularly correct way to handle those too that we might just be accidentally getting right otherwise?

The current one is:

[4/3, 0,
 -4/3, 4/3]

And that'll get the gradients right as long as the thing you're taking gradients with respect to respects the constraints of the system.

rtrangucci commented 4 years ago

@bbbales2 Right I agree with you there, it's just that cholesky_decompose isn't defined for nonsymmetric matrices, so how could you take gradients of an argument of the Cholesky that isn't symmetric?

I don't think the adjoints of the log-determinant calculation of A are the right meterstick to compare the adjoints of 2 * sum(log(diagonal(cholesky_decompose(A)))) against; it'd need to be something like log_determinant_sym(A).

bbbales2 commented 4 years ago

so how could you take gradients of an argument of the choleksy that isn't symmetric

Well you can code something up that doesn't respect the constraints but will run:

var a = 0.5, b = 0.5;
Eigen::Matrix<var, -1, -1> A(2, 2);
A << 1.0, a,
     b, 1.0;
var c = 2 * sum(log(diagonal(cholesky_decompose(A))));

Which is sort of implicitly what I think we're doing when we print out the adjoint of A like a matrix without saying anything about the domain on which things work.

I think if you print out the adjoints in a system parameterized to respect the constraints, those gradients will be right for any variety of ways of handling the adjoint intermediates. I don't know Math well enough to say if one should be preferred over another for these intermediates. I'm pretty sure we can also mess with the intermediates in a way to make them really hard to interpret too, while still respecting the symmetries. Something like:

var a = 0.5;
var b1 = a + 1;
var b2 = b1 - 1;
Eigen::Matrix<var, -1, -1> A(2, 2);
A << 1.0, a,
     b2, 1.0;
var c = 2 * sum(log(diagonal(cholesky_decompose(A))));

Now the adjoint of A will be funky to interpret cause of those intermediates, but a.adj() will be correct still.

rtrangucci commented 4 years ago

Ok, @bbbales2 that makes sense, as long as we still return the right gradient for:

var a = 0.5;
Eigen::Matrix<var, -1, -1> A(2, 2);
A << 1.0, a,
     a, 1.0;
var c = 2 * sum(log(diagonal(cholesky_decompose(A))));

Does any of the static matrix stuff over on #1884 allow us to define a symmetric matrix type (which'd only store the lower-triangular portion of the matrix)? Then we'd potentially be able to specialize cholesky_decompose for a symmetric matrix type?
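For concreteness, a rough sketch of what such a type might look like (hypothetical, not anything that exists in #1884): keep only the lower triangle as the source of truth and make reads symmetric, so a specialized cholesky_decompose would know exactly which entries carry adjoints.

#include <Eigen/Dense>
#include <utility>

// Hypothetical symmetric matrix wrapper; a real version would pack the
// lower-triangular storage instead of holding a full dense matrix.
template <typename T>
class symmetric_matrix {
  Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> m_;

 public:
  explicit symmetric_matrix(Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> m)
      : m_(std::move(m)) {}

  // Symmetric read access: (i, j) and (j, i) always resolve to the
  // lower-triangular entry, so the upper triangle never enters a computation.
  const T& operator()(Eigen::Index i, Eigen::Index j) const {
    return i >= j ? m_(i, j) : m_(j, i);
  }

  Eigen::Index rows() const { return m_.rows(); }
  Eigen::Index cols() const { return m_.cols(); }
};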

bbbales2 commented 4 years ago

Does any of the static matrix stuff over on #1884 allow us to define a symmetric matrix type

We haven't talked about it. If you want to do the thinking/coding for it, feel free to join the pull. It'll probably get left behind if nobody does it.

as long as we still return the right gradient for

Yeah the more I think on this the more conflicted I feel about the whole thing. We already had those gradients right. What this does is make the adjoints match between two situations which seem like they should be the same but kinda can't really be in the bigger scope (and by bigger scope I mean in an HMC or an optimization or something, and by can't really be I mean with regards to respecting symmetry constraints). We can change things up so they look the same at the individual gradient level at the cost of twice the varis :/.

Now I've kinda talked myself into thinking half the number of varis is better again, bleh. This feels like a roulette wheel. Custom autodiff for the symmetric eigenvalue problem is good either way and will be way faster than what is there, but I'm not sure we should change cholesky_decompose.

rtrangucci commented 4 years ago

Yeah, I've gone back and forth myself a few times as well, and it sounds like a similar path haha.

Custom autodiff for the symmetric eigenvalue problem is good either way and will be way faster than what is there, but I'm not sure we should change cholesky_decompose.

This is where I came down too.

We haven't talked about it. If you want to do the thinking/coding for it, feel free to join the pull. It'll probably get left behind if nobody does it.

Yeah that sounds great, I'll take a look! Just getting the static matrices would be a huge win, but it really opens up the possibility of structured matrix types, which is something I've been thinking about for a while (namely Toeplitz matrices).