Improving differentiation for algebraic solver

charlesm93 commented 5 years ago

Description

There are two ways to improve the way in which we compute the Jacobian matrix of an algebraic solver:

compute the Jacobian of the system with respect to the parameters and the unknown simultaneously, to use their shared expression tree.
compute the Jacobian-adjoint vector product directly, and thence only do one sweep of reverse-mode AD.

The two options are competing but should be fairly straightforward to implement and compare. Note the procedure here will work for any algebraic solver where the sensitivities are computed via the implicit function theorem.

Expected Output

Increased speed when solving algebraic equations and computing their sensitivities. Will test on classical PK steady state problem and find the mode of a Laplace approximation.

Discussion on forum

https://discourse.mc-stan.org/t/better-computation-of-the-jacobian-matrix-for-algebraic-solver/8593

Current Version:

v2.19.1

bob-carpenter commented 5 years ago

If f:R^N -> R^M, then we can compute a Jacobian-adjoint with N directional derivative calculations (forward mode) or with M gradient calculations (reverse mode) and an explicit Jacobian-adjoint multiply. You seem to be indicating it can be done in a single sweep of reverse mode, but each sweep of reverse mode only computes derivatives with respect to a single output component of f.

charlesm93 commented 5 years ago

The reason we require M gradient calculations in reverse mode is because of the initial cotangents we use. Suppose the output is y, with adjoint v = (a, b, c), and that the input is x.

With Jacobian(), the initial cotangents are (1, 0, 0), (0, 1, 0) and (0, 0, 1). This gives J, and then we multiply it by v, to get Jv, the adjoint of x. But instead, we could use one initial cotangent, (a, b, c), and directly get Jv in one sweep. This avoids redundant calculations. My understanding is that controlling the initial cotangent can be done with the adjoint-Jacobian method @bbbales2 developed.

charlesm93 commented 5 years ago

We have three options here.

Let f(x, y) the function for which we want the root, x in R^M is the unknown, and y parameters in R^N. Let v be the adjoint of the output. The implicit function theorem gives us dx/dy = - [df/dx]^-1 df/dy. The options are:

compute df/dx and df/dy separately --> 2M sweeps
compute df/d(x + y) together --> M sweeps
compute df/dx and df/dy * v --> M + 1 sweeps

According to @betanalpha, when N, the number of parameters, becomes large the third option becomes better. I've encountered both regimes (M > N, and M < N), but we'll need to pick. Option 2 is the simplest to implement.

bob-carpenter commented 5 years ago

The function grad(v) initialize's v's adjoint to 1, and then does the backward pass. If all we need to do is initialize other adjoints, it'd be easy to write a more general grad(v).

The adjoint-Jacobian thing from Ben is just a way to define a lazy chain() method for reverse mode. Rather than storing the Jacobian and multiplying it by the adjoint and adding it to the operand adjoints, we lazily evaluate incrementing the operand adjoints with the Jacobian-adjoint product. That allows us to get away with a lot less memory. It also saves virtual function calls on multivariate functions by managing non-propagating vari instances and putting all the work on a single vari.

wds15 commented 5 years ago

Have you, @charelsm93, considered to apply the same trick as in the ode speedup pr which you reviewed recently? I mean, it sounds as if you do similar things here which is repeated gradients to get the Jacobian. So Parameters stay always the same and you can take advantage of that. Probably this is an additional thing to do here? For non stiff ode integration this gave me 20% speed for the sie example.

charlesm93 commented 5 years ago

@bob-carpenter Indeed we would need a more general grad(v). I'm guessing it would have other applications too.

@wds15 That's a very good point. The trick applies to both options. For option (1) we use the fact the input x is constant and for option (2) the fact that both x and y remain constant. I still want to do this incrementally: first one of the above-described improvements, then your trick on top of it.

betanalpha commented 5 years ago

We don’t necessarily have to use grad(v) here — the Jacobian-adjoint product can be computed by implementing the nested sweeps by hand as I originally did in the ODE code.

On Jun 6, 2019, at 7:22 AM, Charles Margossian notifications@github.com wrote:

@bob-carpenter https://github.com/bob-carpenter Indeed we would need a more general grad(v). I'm guessing it would have other applications too.

@wds15 https://github.com/wds15 That's a very good point. The trick applies to both options. For option (1) we use the fact the input x is constant and for option (2) the fact that both x and y remain constant. I still want to do this incrementally: first one of the above-described improvements, then your trick on top of it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/stan-dev/math/issues/1257?email_source=notifications&email_token=AALU3FVC5DCROHM76O7DZ2TPZDXPDA5CNFSM4HS32JH2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXCRHAA#issuecomment-499454848, or mute the thread https://github.com/notifications/unsubscribe-auth/AALU3FU5VZKLP6OF26RTCU3PZDXPDANCNFSM4HS32JHQ.

charlesm93 commented 5 years ago

Per the discussion during yesterday's Stan meeting, I'm going with option (2).

bob-carpenter commented 5 years ago

On Jun 6, 2019, at 11:44 PM, Michael Betancourt notifications@github.com wrote:

We don’t necessarily have to use grad(v) here — the Jacobian-adjoint product can be computed by implementing the nested sweeps by hand as I originally did in the ODE code.

That's what I meant---writing a new function like grad(v) that computes an adjoint gradient. I don't think it's a good idea to try to generalize grad(v).

charlesm93 commented 5 years ago

That's what I meant---writing a new function like grad(v) that computes an adjoint gradient. I don't think it's a good idea to try to generalize grad(v).

You mean in general, or for this specific issue?

charlesm93 commented 5 years ago

I implemented option 2 -- fairly straightforward -- and ran some performance test.

The first test is based on pharma steady states, as described in the appendix of the autodiff review paper (https://arxiv.org/abs/1811.05031). The second test involves finding the mode of a conditional distribution, as required when doing INLA on a Poisson process with a latent Gaussian variable.

These tests show two things:

the time to compute the Jacobian is small next to the time required to solve the algebraic system. For the Steady State problem, solving + differentiating takes about 0.03 s for 200 states, while constructing the Jacobian matrix requires 0.0075 s. But the latter is constructed using the grad function and looping over the 200 outputs. In practice, the Jacobian matrix is only calculated once, and this takes on the order e-05 s. Solving the equations however has order e-02s. Same with the INLA problem: for 500 states, the solver needs 633 seconds, and the Jacobian calculation is e-06 s.
This is more intriguing: I did not observe any speed-up when looking at the time to calculate the Jacobian matrix. See the this graph for the steady state problem. The red line is the current solve, the blue one the new implementation. For the INLA problem I went down from 3e-6 to 2e-6, but I only ran the experiment once and the difference is likely due to noise.

Hmmm... did I do something wrong? The number of sweeps is halved and there is shared computation so there should be some speedup. In any case, this change provides at best a marginal improvement. The code is written and the PR review should not take very long, but I'm not sure that it's worth it.

bob-carpenter commented 5 years ago

I'd be happy to take a look at the code diff if you can point me to something manageable.

The trick to answering your second question is to figure out what's dominating computation. If you speed up an operation that only takes 5% of the computation, you won't see much overall gain. Often computation's dominated by memory or branch prediction (and how the code gets generated to exploit CPU features) rather than by sheer number of operations.

charlesm93 commented 5 years ago

@bob-carpenter See the code for the chain method on lines 66 - 73 in algebra_solver.hpp.

There are some changes in algebra_system, lines 61 - 74, where the () operator is modified to allow a vector containing both unknowns and parameters to be passed as the input variable. This is required for the call to Jacobian().

I think you're right: costs are dominated by memory and branch prediction, as opposed to number of operations. I'm running other tests that corroborate this conjecture.

bob-carpenter commented 5 years ago

One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code.

The bigger issue is whether you can automate a lot of the stuff you built by hand using adjoint-Jacobian apply. The solver's chain() method is just doing that:

  void chain() {
    for (int j = 0; j < y_size_; j++)
      for (int i = 0; i < x_size_; i++)
        y_[j]->adj_ += theta_[i]->adj_ * Jx_y_[j * x_size_ + i];
  }

The Jacobian itself is calculated using a map and stored in J_x_y_ until the reverse pass:

   Map<MatrixXd>(&Jx_y_[0], x_size_, y_size_)
        = -mdivide_left(fx.get_jacobian(theta_dbl),
                        f_y(fs, f, theta_dbl, value_of(y), dat, dat_int, msgs)
                            .get_jacobian(value_of(y)))

Can you wait to construct the Jacobian until the chain() method is called? It should be more efficient as it'll cut down on the preallocated memory. Then you can also define it using holistic matrix operations rather than in a loop.

charlesm93 commented 5 years ago

One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code.

Right... this came up last time I did a PR. Ok, I'll save that one for last.

Can you wait to construct the Jacobian until the chain() method is called?

There are actually a few complications.

First, more arguments of the constructor need to be stored in the class, since they are needed to construct the Jacobian matrix. Those elements are theta_dbl, the structures fs and f, and the arrays dat and dat_int. It is likely less memory than storing J, although I'm not sure what the cost of storing f and fs may be.
Secondly, aren't there issues running Jacobian() inside the chain() method? Based on the unit tests I conducted, this messes up the adjoints for the gradient. Specifically theta[i]->adj_ goes to 0 for all i's. I'm not sure how to best fix this, but using start_nested() doesn't seem to work.

Here's the prototype code, with some print statements I used for testing:

template <typename Fs, typename F, typename T>  //, typename Fx>
struct algebra_solver_vari : public vari {
  /** vector of parameters */
  vari** y_;
  /** number of parameters */
  int y_size_;
  /** number of unknowns */
  int x_size_;
  /** vector of solution */
  vari** theta_;
  /** double vector of solution */
  Eigen::VectorXd theta_dbl_;
  /** wrapper structure for computing Jacobians */
  Fs fs_;
  /** functor for system of algebraic equation */
  F f_;
  /** array with real data for the algebraic system */
  std::vector<double> dat_;
  /** array with integer data for the algebraic system */
  std::vector<int> dat_int_;

  Eigen::VectorXd y_dbl_;

  algebra_solver_vari(const Fs& fs, const F& f, const Eigen::VectorXd& x,
                      const Eigen::Matrix<T, Eigen::Dynamic, 1>& y,
                      const std::vector<double>& dat,
                      const std::vector<int>& dat_int,
                      const Eigen::VectorXd& theta_dbl, // Fx& fx,
                      std::ostream* msgs)
      : vari(theta_dbl(0)),
        y_(ChainableStack::instance().memalloc_.alloc_array<vari*>(y.size())),
        y_size_(y.size()),
        x_size_(x.size()),
        theta_(
            ChainableStack::instance().memalloc_.alloc_array<vari*>(x_size_)),
        theta_dbl_(theta_dbl), fs_(fs), f_(f), dat_(dat), dat_int_(dat_int),
        y_dbl_(value_of(y)) {
    using Eigen::Map;
    using Eigen::MatrixXd;
    for (int i = 0; i < y.size(); ++i)
      y_[i] = y(i).vi_;

    theta_[0] = this;
    for (int i = 1; i < x.size(); ++i)
      theta_[i] = new vari(theta_dbl(i), false);
  }

  void chain() {
    // compute Jacobian with respect to theta_dbl and y.
    typedef hybrj_functor_solver<Fs, F, double, double> f_all;
    // start_nested();
    Eigen::MatrixXd
      Jacobian_all = f_all(fs_, f_, theta_dbl_, y_dbl_,
                           dat_, dat_int_,
                           0).get_jacobian(append_row(theta_dbl_, y_dbl_));
    // recover_memory_nested();

    std::cout << "Marker 1" << std::endl;

    // start_nested();
    Eigen::MatrixXd sensitivities =
      - mdivide_left(block(Jacobian_all, 1, 1, x_size_, x_size_),
                       block(Jacobian_all, 1, x_size_ + 1,
                             x_size_, y_size_));
    // recover_memory_nested();

    std::cout << "Marker 2 " << std::endl;

    for (int i = 0; i < x_size_; i++)
      std::cout << theta_[i]->adj_ << " ";
    std::cout << std::endl;

    for (int j = 0; j < y_size_; j++)
      for (int i = 0; i < x_size_; i++)
        y_[j]->adj_ += theta_[i]->adj_ * sensitivities(i, j);
        // y_[j]->adj_ += theta_[i]->adj_ * Jx_y_[j * x_size_ + i];
  }
};

wds15 commented 5 years ago

Without too much knowledge about the details... have you considered looking into using IDAS from sundials? That package supports the forward sensitivity method and an adjoint sensitivity method. At least for ODE problems, the adjoint sensitivity approach is far better for problems with many parameters. I hope this comment is useful to point you in a useful direction...

bob-carpenter commented 5 years ago

On Jun 13, 2019, at 4:04 AM, Charles Margossian notifications@github.com wrote:

One thing you can do is avoid [i], which checks bounds, and use .coeffRef(i). We should be doing this in a lot of places in the code.

Right... this came up last time I did a PR. Ok, I'll save that one for last.

Can you wait to construct the Jacobian until the chain() method is called?

There are actually a few complications.

• First, more arguments of the constructor need to be stored in the class, since they are needed to construct the Jacobian matrix. Those elements are theta_dbl, the structures fs and f, and the arrays dat and datint. It is likely less memory than storing J, although I'm not sure what the cost of storing f and fs may be. • Secondly, aren't there issues running Jacobian() inside the chain() method? Based on the unit tests I conducted, this messes up the adjoints for the gradient. Specifically theta[i]->adj goes to 0 for all i's. I'm not sure how to best fix this, but using start_nested() doesn't seem to work.

You're right---you can't call autodiff functions during a call to chain(). Forward-mode over double values would work, but not reverse. But that's not fully general for Stan functins, even when I've finished testing.

Sorry for the confusion---I didn't realize you were using nested autodiff to compute that Jacobian.

Bob

charlesm93 commented 5 years ago

Forward-mode over double values would work, but not reverse.

That's good to know, I might use that for specialized solvers. I could also do the forward pass when instantiating the vari class if needed (the INLA problem is an example where forward diff would be much better than reverse mode, since the system is high dimensional but the model parameters are low dimensional).

@betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents.

@wds15 Thanks for the pointers. I haven't looked at IDAS yet.

betanalpha commented 5 years ago

@betanalpha https://github.com/betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents.

Did you mean to tag Bob here?

Take a look at how the grad function is implemented to see where you can access the adjoints after the forward sweep has been executed.

charlesm93 commented 5 years ago

Take a look at how the grad function is implemented to see where you can access the adjoints after the forward sweep has been executed.

Accessing the adjoints is not my problem here. Once I have these adjoints I use them to build the initial cotangent, with something like follows:

    void chain() {
      // compute the initial cotangent vector
      Eigen::vectorXd J_l_theta(theta_size_);
      for (int i = 0; i < theta_size_; i++) J_l_theta(i) = theta_[i]->adj_;
      Eigen::vectorXd
        init_cotangent = - mdivide_left(J_f_theta_, J_l_theta);

The problem is that I then want to do an autodiff call inside the chain method, but this doesn't seem possible, at least not for the reverse call. I detailed what I wanted to do on this post.

To be clear, I think this scheme is useful for the INLA solver, but I'm not convinced it is the way to go in general, i.e. for algebra_solver(), as it competes with method (2), or other approaches we might consider.

bob-carpenter commented 5 years ago

@betanalpha Can you implement nested sweeps inside the chain method? I'm guessing you'd need that, else you wouldn't have had access to y[j]->adj_ when initializing cotangents.

Given a var, you can get the adjoint of its vari. So you can initialize whatever you want. The var.grad() method initializes var's adjoint to 1 and then propagates back from there.

bob-carpenter commented 5 years ago

The problem is if you have nested autodiff inside the chain() method, you have to be careful not to pass any gradient information back to things outside the nested autodiff. So that means no simple functions of parameters for thing being computed. You'd have to pull their double values out, use those, then put everything back together.

More inline on your specifics.

On Jun 14, 2019, at 2:27 AM, Charles Margossian notifications@github.com wrote:

Take a look at how the grad function is implemented to see where you can access the adjoints after the forward sweep has been executed.

Accessing the adjoints is not my problem here. Once I have these adjoints I use them to build the initial cotangent, with something like follows:
void chain() {
  // compute the initial cotangent vector
  Eigen::vectorXd J_l_theta(theta_size_);

This is OK as it's a double.

  for (int i = 0; i < theta_size_; i++) J_l_theta(i) = theta_[i]->adj_;

I don't think theta[i]->adj_ has a value at this point if it's a component of the vari whose chain method is being defined. Only the parents have values at the point chain() is being called. That is, in reverse mode, you start from final result and work backward, at each step only calling chain() on a vari if all of its parents have already had their chain() methods called. For example, if you have y = f(x1, x2), then y is the parent of x1 and x2.

  Eigen::vectorXd
    init_cotangent = - mdivide_left(J_f_theta_, J_l_theta);
The problem is that I then want to do an autodiff call inside the chain method, but this doesn't seem possible, at least not for the reverse call.

I think you could get away with a nested autodiff call that started with fresh double values---you can't involve any existing vari in expressions or you'll wind up erroneously passing down to those in the nested evaluation.

charlesm93 commented 4 years ago

Linking to this relevant post on the forum: https://discourse.mc-stan.org/t/improved-differentiation-of-the-algebraic-solver/15040

jgaeb commented 3 years ago

@charlesm93, This issue has been resolved by #2421!

stan-dev / math