stan-dev / stan2tfp

Stan2TFP is a work-in-progress alternative backend for Stanc3 which targets TensorFlow Probability
BSD 3-Clause "New" or "Revised" License

TFP backend - supporting more bijectors #2

Closed adamhaber closed 2 years ago

adamhaber commented 4 years ago

Hi,

I'm interested in adding support for more bijectors in the TFP backend; as far as I understand, the current code supports lower-bounded variables, right? Does that "automatically" cover variances and standard deviations?

I think the easiest bijectors to add are probably:

Other options:

Code-wise, just to be sure: that means adding more patterns in this part of the code, right?

  let components =
    match trans with
    | Program.Identity -> []
    | Lower lb -> [("Exp", []); ("AffineScalar", [lb])]
    (* add more cases here *)
    | _ ->
        raise_s
          [%message
            "Unsupported " (trans : Expr.Typed.t Program.transformation)]

seantalts commented 4 years ago

Yep, that's the right part of the code.

Regarding the choices for how to implement each of these: I had been thinking that it would be best to match Stan's behavior and characteristics fairly closely. So it would be a good idea to find which Stan Math functions are called, look over the docs or code, and even run experiments comparing the underlying Stan Math function calls to TFP function calls to make sure. I haven't done the necessary legwork for these yet, so I can't promise that they are correct.

Let's take the LowerUpper as an example. You can see that the prefix for LowerUpper is "lub" here: https://github.com/stan-dev/stanc3/blob/master/src/frontend/Ast_to_Mir.ml#L173. This will get appended to "_constrain" and "_unconstrain" and looked up in Stan Math. So the constrain version is here: https://github.com/stan-dev/math/blob/develop/stan/math/prim/scal/fun/lub_constrain.hpp

So we can see there that it does a bunch of different, annoyingly complicated stuff depending on the lower and upper bounds and x.

Now I'm wondering - is it worth reproducing this? I assume most of that is in there for a reason and that doing something simpler would lose out on the painfully acquired years of knowledge wrapped up in Stan. Plus we'd like to run backend benchmarks and it would be nice if the models were computing the same thing.

adamhaber commented 4 years ago

I definitely agree that staying as close as possible to Stan's implementation is a good idea. I have no idea if there are any cases where the Stan Math library transforms variables in a way that's not currently implementable in TFP... I guess that at least in the simple cases it should be OK. For example, the lub case isn't really complicated: it simply separates into exactly the same cases (identity, just lower, just upper, and both); in each of them the transformation is quite simple and probably "implementable" in TFP.
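To make that case split concrete, here is a plain-Python sketch of the four branches (identity, lower, upper, both). The branch formulas follow the transforms in Stan's reference manual; check them against `lub_constrain` before relying on details:

```python
import math

def inv_logit(y):
    return 1.0 / (1.0 + math.exp(-y))

def constrain(y, lb=None, ub=None):
    """Sketch of the four-way case split for bound constraints."""
    if lb is None and ub is None:
        return y                           # identity
    if ub is None:
        return lb + math.exp(y)            # lower bound only
    if lb is None:
        return ub - math.exp(y)            # upper bound only
    return lb + (ub - lb) * inv_logit(y)   # both bounds

assert constrain(0.0) == 0.0
assert constrain(0.0, lb=1.0) == 2.0          # 1 + exp(0)
assert constrain(0.0, ub=1.0) == 0.0          # 1 - exp(0)
assert constrain(0.0, lb=0.0, ub=2.0) == 1.0  # midpoint at y = 0
```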

adamhaber commented 4 years ago

A somewhat more technical question: in `| Lower lb -> [("Exp", []); ("AffineScalar", [lb])]`, the type of `lb` is `Expr.Typed.t`. However, in order to implement `Upper ub` we'll need to pass two arguments to `tfb.AffineScalar`: the shift `ub` and the scale `-1`. My question is: how do I convert `-1` (or any other arithmetic quantity, for that matter) to something of type `Expr.Typed.t`?

seantalts commented 4 years ago

An interesting idea would be to implement these transforms in Stan language itself and start creating the Stan-in-Stan standard library... then we could compile those to custom bijectors in TFP...

There's a helper function for ints specifically: `Expr.Helpers.int`

adamhaber commented 4 years ago

Regarding the helper function - would it be OK to add

  val float : float -> Typed.t

to Expr.mli, and

  let float f =
    {Fixed.meta= Typed.Meta.empty; pattern= Lit (Real, string_of_float f)}

to Expr.ml? And then

  | Upper ub -> [("Exp", []); ("AffineScalar", [ub; Expr.Helpers.float (-1.)])]

This should be identical to the Stan Math implementation, f(x) = U - \exp(x), and

    | LowerUpper (lb, ub) ->  [("Sigmoid", []); 
                               ("AffineScalar", [lb; Expr.Helpers.binop ub Operator.Minus lb])] 

which should be almost identical to the Stan Math implementation, f(x) = L + (U - L) \mbox{logit}^{-1}(x). Almost identical, because the Stan Math code contains simple numerical checks (preventing f from returning exactly 1 or 0); I'll check how/if we can reproduce this behaviour with TFP.
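One concrete example of the kind of numerical issue those checks guard against: in double precision the naive inverse logit saturates to exactly 1.0 for moderately large y, so L + (U - L) \mbox{logit}^{-1}(y) would return exactly U. A quick illustration (this demonstrates the problem, not Stan Math's actual guard):

```python
import math

def inv_logit(y):
    return 1.0 / (1.0 + math.exp(-y))

# exp(-40) is about 4e-18, below double-precision epsilon relative to 1.0,
# so 1 + exp(-40) rounds to exactly 1.0 and the quotient is exactly 1.0.
assert inv_logit(40.0) == 1.0   # saturated: would return exactly the bound U
assert inv_logit(10.0) < 1.0    # not yet saturated
```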

adamhaber commented 4 years ago

Another thing I'm not sure how to implement (in TFP) is constraints of the form:

parameters {
  real a;
  real<lower = a> b;  // enforces a < b
}

What do you think? I'll go over the Stan manual and look for other interesting cases such as this one...

seantalts commented 4 years ago

Yep float sounds good. And yeah some of those will be tricky; for now it’s fine to stub them out and raise an exception when we see a case we haven’t figured out how to support yet.


bob-carpenter commented 4 years ago

Chaining the affine bijectors would be awesome. The question is then how to reflect that in the language. If we want an affine transform followed by a positivity constraint (for example, for a lognormal hierarchical prior), we could add bijector specs left-to-right:

vector<offset = mu, multiplier = sigma><lower = 0>[N] alpha;
...
alpha ~ lognormal(mu, sigma);

Or we could do it the other way around.

vector<lower = 0><offset = mu, multiplier = sigma>[N] alpha;

In some ways this is more natural, thinking of vector<lower = 0> as a unit.

bob-carpenter commented 4 years ago

Another thing I'm not sure how to implement (in TFP) is constraints of the form...

We allow constraints to depend on other parameters. It complicates extracting constrained values because we need the value of all the parameters as they might play a role in transforms (aka bijectors).
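A sketch of the complication: with real<lower = a> b, constraining b's unconstrained value needs the already-constrained value of a, so parameters have to be constrained in declaration order. (The names and the dict-based representation here are purely illustrative.)

```python
import math

def constrain_all(unc):
    # Constrain in declaration order, so that b's bound can read the
    # already-constrained value of a (illustrative sketch only).
    a = unc["a_unc"]                 # real a: identity transform
    b = a + math.exp(unc["b_unc"])   # real<lower = a> b: exp then shift by a
    return {"a": a, "b": b}

params = constrain_all({"a_unc": 1.5, "b_unc": -2.0})
assert params["a"] < params["b"]   # a < b holds by construction
```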

The other case that might not work with bijectors is unit_vector, as it has a constraint that's not bijective---it's controlled by a prior.

bob-carpenter commented 4 years ago

Regarding the choices for how to implement each of these: I had been thinking that it would be best to match Stan behavior and characteristics fairly closely. So finding which Stan Math functions are called and looking over the doc

The reference manual documents all the transforms in a dedicated chapter. That might help understand the code, which is, of course, the final arbiter of how it's actually being done.

So the constrain version is here: https://github.com/stan-dev/math/blob/develop/stan/math/prim/scal/fun/lub_constrain.hpp

These should all have custom derivatives in rev to reduce all the virtual function calls, but I don't think they do.

So we can see there that it does a bunch of different, annoyingly complicated stuff depending on the lower and upper bounds and x.

Now I'm wondering - is it worth reproducing this?

Reproducing what? Allowing constraints to depend on parameters?

Not implementing these things forces the users who want to do it to write their own transforms and Jacobian adjustment.

I assume most of that is in there for a reason

Feel free to ask on a case-by-case basis. I can probably explain why things are there, or at least what I was thinking when we added all the transforms.

and that doing something simpler would lose out on the painfully acquired years of knowledge wrapped up in Stan. Plus we'd like to run backend benchmarks and it would be nice if the models were computing the same thing.

I think compile-time errors or run-time exceptions would be better than differing behavior.

seantalts commented 4 years ago

Reproducing what? Allowing constraints to depend on parameters?

I was wondering aloud whether the first version of the TFP backend needs to be coded to run the exact same transformations that Stan does with Stan Math, given that it will be numerically quite different in many other ways and that TF will have its own numerical issues to work around (e.g., Matt Hoffman found a recent mismatch between tf's log and its derivative that made him switch to another constraining function pair).

Feel free to ask on a case-by-case basis. I can probably explain why things are there, or at least what I was thinking when we added all the transforms.

Sweet :D This will be very helpful indeed, thanks.

bob-carpenter commented 4 years ago

To implement transforms, we'd need to allow a

jacobian += ...

statement in transformed parameters. Then we could recode

parameters {
  real<lower = 0> alpha;
  ...

as

parameters {
  real alpha_unc;
  ...
transformed parameters {
  real alpha = exp(alpha_unc);
  jacobian += alpha_unc;
  ...

Without that, we'd have to use

target += alpha_unc;

in the model block, and then it wouldn't work the way the built-in types do w.r.t. optimization (no Jacobian) or MCMC (yes Jacobian).
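The Jacobian term in that recoding can be sanity-checked numerically: for x = exp(y), the log absolute derivative is log(exp(y)) = y, which is exactly the quantity added to the Jacobian accumulator above. A finite-difference check (illustrative only):

```python
import math

def log_abs_deriv_exp(y, h=1e-6):
    # Central finite-difference estimate of log |d exp(y) / dy|
    deriv = (math.exp(y + h) - math.exp(y - h)) / (2.0 * h)
    return math.log(abs(deriv))

y = 0.7
# Analytically, log |d exp(y) / dy| = log(exp(y)) = y
assert abs(log_abs_deriv_exp(y) - y) < 1e-6
```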

bob-carpenter commented 4 years ago

I wouldn't think you'd need to use the exact same transforms as Stan. As long as the Jacobians are right, alternatives should be OK. I've thought it would be nice to explore some alternative transforms to what we use now in some cases. There's been a lot of discussion around the difficult unit_vector case and also around simplex.

JoshLipschultz commented 4 years ago

Not sure if this is the right issue to leave this note on, but I didn't think filing a separate one was necessary. I noticed that the bijectors generated for simple <lower, upper>-bounded parameters get tfb.Chain-ed together via the shift, scale, then sigmoid bijectors. There's a note in the tfb.Sigmoid documentation saying it's more numerically stable to do such a bijection using the low and high parameters of that bijector directly, rather than via this chaining. Was there a reason to do it manually?
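For intuition on why the direct low/high parameterisation can be more stable: an algebraically equivalent way to write low + (high - low) * sigmoid(y) is the convex combination high * sigmoid(y) + low * sigmoid(-y), which weights each endpoint directly and skips the intermediate (high - low) arithmetic. (Whether this is exactly what tfb.Sigmoid does internally is an assumption here; the two forms do agree algebraically.)

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def chained(y, low, high):
    # shift/scale/sigmoid chain: low + (high - low) * sigmoid(y)
    return low + (high - low) * sigmoid(y)

def convex(y, low, high):
    # Convex-combination form: each endpoint weighted directly
    return high * sigmoid(y) + low * sigmoid(-y)

for y in (-3.0, 0.0, 3.0):
    assert abs(chained(y, -1.0, 2.0) - convex(y, -1.0, 2.0)) < 1e-12
```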

adamhaber commented 4 years ago

I probably missed the comment about numerical stability - will have a look. Thanks!

bob-carpenter commented 4 years ago

In Stan internally, the constraining transform for lower = L, upper = U from unconstrained y to constrained x is

  x = L + (U - L) * inv_logit(y)

Is there a more stable way to compute that? The partials are just computed as follows (where inv_logit(-y) = 1 - inv_logit(y)):

  d.x/d.L = inv_logit(-y)
  d.x/d.U = inv_logit(y)
  d.x/d.y = (U - L) * inv_logit(y) * inv_logit(-y)

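Those partials are easy to verify with a finite-difference check (plain-Python sketch; constants are arbitrary):

```python
import math

def inv_logit(y):
    return 1.0 / (1.0 + math.exp(-y))

def constrain(y, L, U):
    return L + (U - L) * inv_logit(y)

L_, U_, y, h = -2.0, 5.0, 0.8, 1e-6
# Central finite difference vs. the stated analytic partial d.x/d.y
fd = (constrain(y + h, L_, U_) - constrain(y - h, L_, U_)) / (2.0 * h)
analytic = (U_ - L_) * inv_logit(y) * inv_logit(-y)
assert abs(fd - analytic) < 1e-7
```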
adamhaber commented 4 years ago

See @axch's answer in the issue above. AFAIU, it makes sense to leave Stan's bijector as it is, but change the code generated by the TFP backend to be more consistent with TFP's "best practices" (so low and high instead of chain+scale+shift).

adamhaber commented 4 years ago

I'll fix the sigmoid parameterisation; other than that, and considering the fact that adding more bijectors and distributions is more of an ongoing effort, I think we can close this.

rok-cesnovar commented 4 years ago

I agree with @adamhaber that we can close this and can open specific issues for additional bijectors when needed.