stan-dev / nomad

Fast autodiff.

Templated Function Signatures and Vectorization #6

Open betanalpha opened 10 years ago

betanalpha commented 10 years ago

One of the nice things about the new autodiff system is that it incorporates operands and partials directly, which will make implementing functions and implementing distributions identical.

An immediate benefit is the ability to construct all possible function signatures, as Stan does for the distributions. For example, a binary function taking in two autodiff vars would be written as

template <short autodiff_order>
inline var<autodiff_order> binary_function(const var<autodiff_order>& v1,
                                           const var<autodiff_order>& v2) {

  const short partials_order = 3;
  const unsigned int n_inputs = 2;

  next_inputs_delta = n_inputs;
  next_partials_delta =
    var_body<autodiff_order, partials_order>::n_partials(n_inputs);

  new var_body<autodiff_order, partials_order>(n_inputs);

  double x = v1.first_val();
  double y = v2.first_val();

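  // binary_function(double, double) here is the plain double
  // implementation of the function, assumed to exist.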
  push_dual_numbers<autodiff_order>(binary_function(x, y));

  push_inputs(v1.dual_numbers());
  push_inputs(v2.dual_numbers());

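  // df_dx, df2_dx2, etc. stand in for the hand-computed partial
  // derivatives of f evaluated at (x, y).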
  if (autodiff_order >= 1) {
    push_partials(df_dx);
    push_partials(df_dy);
  }
  if (autodiff_order >= 2) {
    push_partials(df2_dx2);
    push_partials(df2_dxdy);
    push_partials(df2_dy2);
  }
  if (autodiff_order >= 3) {
    push_partials(df3_dx3);
    push_partials(df3_dx2dy);
    push_partials(df3_dxdy2);
    push_partials(df3_dy3);
  }
  return var<autodiff_order>(next_body_idx_ - 1);
}

and we'd have to add separate implementations for the (double, var) and (var, double) signatures. But we should be able to template the arguments out and do something like

template <short autodiff_order, typename T1, typename T2>
inline var<autodiff_order> binary_function(const T1& v1,
                                           const T2& v2) {

  const short partials_order = 3;
  // is_var<T> is a compile-time trait: true for var types, false for
  // double (a minimal version is sketched below).
  const unsigned int n_inputs = is_var<T1>::value + is_var<T2>::value;

  next_inputs_delta = n_inputs;
  next_partials_delta =
    var_body<autodiff_order, partials_order>::n_partials(n_inputs);

  new var_body<autodiff_order, partials_order>(n_inputs);

  // first_val() would need a double overload (or a free function) so
  // that plain double arguments work here.
  double x = v1.first_val();
  double y = v2.first_val();

  push_dual_numbers<autodiff_order>(binary_function(x, y));

  // In a real implementation these pushes would be dispatched at
  // compile time, since double has no dual_numbers() to push.
  if (is_var<T1>::value) push_inputs(v1.dual_numbers());
  if (is_var<T2>::value) push_inputs(v2.dual_numbers());

  if (autodiff_order >= 1) {
    if (is_var<T1>::value) push_partials(df_dx);
    if (is_var<T2>::value) push_partials(df_dy);
  }
  if (autodiff_order >= 2) {
    if (is_var<T1>::value) push_partials(df2_dx2);
    if (is_var<T1>::value && is_var<T2>::value) push_partials(df2_dxdy);
    if (is_var<T2>::value) push_partials(df2_dy2);
  }
  if (autodiff_order >= 3) {
    if (is_var<T1>::value) push_partials(df3_dx3);
    if (is_var<T1>::value && is_var<T2>::value) push_partials(df3_dx2dy);
    if (is_var<T1>::value && is_var<T2>::value) push_partials(df3_dxdy2);
    if (is_var<T2>::value) push_partials(df3_dy3);
  }
  return var<autodiff_order>(next_body_idx_ - 1);
}

Of course we'd have to be careful regarding some optimizations (especially if we want to use specialized var_body implementations).

Vectorization should work the same way, too, if we had an n_vars method in a template array wrapper.
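
Something like this, say (is_var and array_view are illustrative names, not existing nomad code):

template <typename T> struct is_var { enum { value = 0 }; };
template <short order> struct is_var<var<order> > { enum { value = 1 }; };

// Wraps a contiguous block of arguments; n_vars() reports how many
// autodiff inputs the block contributes, so a vectorized signature can
// size its var_body the same way n_inputs does above.
template <typename T>
struct array_view {
  const T* vals_;
  unsigned int size_;

  array_view(const T* vals, unsigned int size) : vals_(vals), size_(size) {}

  unsigned int n_vars() const { return is_var<T>::value ? size_ : 0; }
  const T& operator[](unsigned int i) const { return vals_[i]; }
};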

Thoughts?

bob-carpenter commented 10 years ago

I'm confused about the level we're talking about here.

Is "binary_function" going to get filled in with something like "log_sum_exp" and will it need to be written four times: (double,double), (double,var), (var,double), and (var,var)?

If not, can you give a more complete example of how I write something like log_sum_exp(scalar,scalar) (or just use hypot(scalar,scalar), which is an easy C++ function, though not one with a lot going on in the derivatives).

The functional approach I was suggesting handles both cases at the expense of having to do

  • extra computation of partials that get ignored,
  • template traits as in our prob functions to decide what to compute,
  • writing the partials out separately

Of course, it all needs to push through to higher order stuff, which I haven't thought about at all.

P.S. I'm still waiting to hear about binary functions with higher order derivatives to do speed tests on.

betanalpha commented 10 years ago

You mean like https://github.com/stan-dev/nomad/blob/master/scalar/functions/log_sum_exp.hpp ?

In any case, yes, the idea is to have one implementation that defines all four signatures, or at least the option to do that. Hopefully vectorization will follow a similar pattern as well.
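
For example, with the templated signature above, all four argument combinations fall out of a single definition (a sketch; it assumes var<autodiff_order> is constructible from a double, and the order is given explicitly since it can't be deduced from plain doubles):

var<1> v1 = 0.5;
var<1> v2 = 2.0;

binary_function<1>(v1, v2);   // (var, var): partials pushed for both inputs
binary_function<1>(0.5, v2);  // (double, var): only the d/dy partials
binary_function<1>(v1, 2.0);  // (var, double): only the d/dx partials
binary_function<1>(0.5, 2.0); // (double, double): no partials pushed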

PeterLi2016 commented 10 years ago

I'd imagine you could do something similar to what we do now with VectorView for distributions in Stan, but for scalar functions would you want to have vectors of whatever push_partials does? And for the higher order derivatives for a distribution would you want all possible combinations of d^2/dx1dx2?
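
(For context, the VectorView idea is roughly this, much simplified from the actual Stan code:)

#include <vector>

// Present scalars and vectors through one operator[] so a single
// implementation covers both; a scalar returns the same value at
// every index.
template <typename T, bool is_vec>
struct VectorViewSketch;

template <typename T>
struct VectorViewSketch<T, false> {    // scalar case
  T& x_;
  explicit VectorViewSketch(T& x) : x_(x) {}
  T& operator[](int) { return x_; }
};

template <typename T>
struct VectorViewSketch<T, true> {     // vector case
  std::vector<T>& x_;
  explicit VectorViewSketch(std::vector<T>& x) : x_(x) {}
  T& operator[](int i) { return x_[i]; }
};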

betanalpha commented 10 years ago

Yeah, the current implementation in Stan was a motivation.

The interaction of higher-order derivatives and vector views is going to be very, very hard. At least if we don't assume that the inputs are all independent. What are you doing for the generic vectorization right now? Just letting the fvars chug through all of the derivatives automatically?

PeterLi2016 commented 10 years ago

Yeah. Currently, we only implement the first order derivatives but we do it generically enough so that the autodiff can compute the rest.
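
(Schematically, the pattern is to write the body once against a generic scalar and let instantiation at fvar types carry the higher orders through; a sketch, not the actual Stan source:)

#include <cmath>

// T may be double, var, fvar<double>, fvar<fvar<double> >, and so on.
// Only the first-order chain rule lives in the operators and functions;
// nesting does the rest.
template <typename T>
T hypot_generic(const T& x, const T& y) {
  using std::sqrt;  // unqualified call picks up fvar/var overloads via ADL
  return sqrt(x * x + y * y);
}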

bob-carpenter commented 10 years ago

On Jul 2, 2014, at 11:00 AM, Peter Li notifications@github.com wrote:

Yeah. Currently, we only implement the first order derivatives but we do it generically enough so that the autodiff can compute the rest.

Exactly.

And there's nothing specialized for the fvar or fvar<fvar> implementations --- they work just like reverse mode.

As Michael has demonstrated, there's a huge gain to be had from unpacking the way these work in conjunction with reverse mode at higher orders.

betanalpha commented 10 years ago

Sure, they work provided that you can autodiff through the fvar implementations. We won't have to hand-code all of the cross derivatives in the new stuff -- we just have to figure out the underlying functional mappings to implement the derivatives automatically. For example, if f(vector) is just shorthand for sum_i f(x_i), then we implement the latter, which is easy. Or we can use the autodiff methods directly to get automated derivatives as needed.
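
(A sketch of that reduction, assuming the scalar f over vars is already implemented:)

#include <cstddef>
#include <vector>

// f(vector) as shorthand for sum_i f(x_i): only the scalar partials
// ever need to be hand-coded.
template <short autodiff_order>
var<autodiff_order> f(const std::vector<var<autodiff_order> >& x) {
  var<autodiff_order> total = f(x[0]);
  for (std::size_t i = 1; i < x.size(); ++i)
    total = total + f(x[i]);
  return total;
}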

bob-carpenter commented 10 years ago

In the simplest case, in the current Stan implementation, if we have implemented var and fvar according to

exp'(x) = exp(x),

then we can use fvar<fvar> to compute exp''(x) and exp'''(x).

Is there going to be a way to do something similar in the new system? (I know it won't be as fast as hand coding everything.)

In the current approach, we do it by nesting fvar<fvar> to get third-order derivatives.
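
(Concretely, the nesting looks roughly like this; schematic, using fvar's public val_ / d_ members and an assumed exp overload for fvar:)

// Seed an fvar<fvar<double> > at x = a with unit tangents; after exp,
// the second-order component holds exp''(a).
double exp_second_deriv(double a) {
  fvar<fvar<double> > x;
  x.val_.val_ = a;    // point of evaluation
  x.val_.d_ = 1.0;    // inner tangent: dx/dx = 1
  x.d_.val_ = 1.0;    // outer tangent
  fvar<fvar<double> > fx = exp(x);
  return fx.d_.d_;    // exp''(a), which is exp(a) since exp' = exp
}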

betanalpha commented 10 years ago

Recall what happens when you nest fvar<fvar> — the adjoint of the outer fvar gets templated with the fvar which autodials through the function. I maintain that this is a bad idea(TM).

That said, it can be reproduced in the new approach. We just have to define a functional and then explicitly call Hessian, etc.
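
(Schematically, and assuming a functor-based hessian driver, which is not necessarily the final nomad interface:)

#include <vector>

// Wrap the target function as a functional over autodiff vars.
struct f_functional {
  template <short autodiff_order>
  var<autodiff_order>
  operator()(const std::vector<var<autodiff_order> >& x) const {
    return f(x[0], x[1]);
  }
};

// Where fvar nesting would have been used, explicitly request the
// higher-order object instead, e.g.
//   hessian(f_functional(), x, fx, grad, H);
// and read the second-order values out of H.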

bob-carpenter commented 10 years ago

On Jul 2, 2014, at 3:47 PM, Michael Betancourt notifications@github.com wrote:

Recall what happens when you nest fvar<fvar> — the adjoint of the outer fvar gets templated with the fvar which autodials through the function. I maintain that this is a bad idea(TM).

Autodials?

Yes, fvar does extra work. But it gets the right answer, at least for non-iterative algorithms.

The extra work could be mitigated to some extent with two template parameters, one for value and one for tangent. In most cases, the outer value can be double.
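
Something along these lines, say (hypothetical, just to illustrate the split):

// fvar with the value and tangent types split, so the outer value can
// stay a plain double while the tangent nests.
template <typename V, typename D>
struct fvar_split {
  V val_;  // value
  D d_;    // tangent
};

// Second order without redundant work on the value side:
// fvar_split<double, fvar_split<double, double> >
// in place of fvar<fvar<double> >.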

That said, it can be reproduced in the new approach. We just have to define a functional and then explicitly call Hessian, etc.

This is what I don't see how to do, because I don't see how to do the equivalent of nesting.

But since you (Michael) already plan to code all the derivatives by hand, it's not a big deal.

betanalpha commented 10 years ago

Recall what happens when you nest fvar<fvar> — the adjoint of the outer fvar gets templated with the fvar which autodials through the function. I maintain that this is a bad idea(TM).

Autodials?

Typo — meant autodiffs.

Yes, fvar does extra work. But it gets the right answer, at least for non-iterative algorithms.

Possibly not the most stable answer, etc.

The extra work could be mitigated to some extent with two template parameters, one for value and one for tangent. In most cases, the outer value can be double.

That said, it can be reproduced in the new approach. We just have to define a functional and then explicitly call Hessian, etc.

This is what I don't see how to do, because I don't see how to do the equivalent of nesting.

The idea is simple — when we have to push the higher-order partials, we compute them by calling Hessian, etc., on some predefined functional and then just push those values. It ends up being the same thing as the fvar nesting.

But since you (Michael) already plan to code all the derivatives by hand, it's not a big deal.

I think of it this way — right now all third-order derivatives are done with this nesting, and that means all special functions are computed by autodiffing through some iterative algorithm. I’d rather we be conservative and only have good implementations than be more careless and have lots of bad implementations.

Autodiff isn’t magic, but it will burn you if you’re not careful.