stan-dev / stanc3

The Stan transpiler (from Stan to C++ and beyond).
BSD 3-Clause "New" or "Revised" License

Can't mix offset/multiplier with upper/lower bounds #659

Open andrjohns opened 4 years ago

andrjohns commented 4 years ago

The following code:

data { 
  int<lower=0> N; 
  int<lower=0,upper=1> y[N];
} 
parameters {
  real<multiplier=1,lower=0,upper=1> theta;
} 
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}

Gives the error:

   -------------------------------------------------
     4:  } 
     5:  parameters {
     6:    real<multiplier=1,offset=0,lower=0,upper=1> theta;
                                     ^
     7:  } 
     8:  model {
   -------------------------------------------------

Expected '>' after multiplier expression.

Should this be legal syntax?

nhuurre commented 4 years ago

Related to #546. It would be nice to allow this, but we first need to decide whether offset/multiplier should be on the constrained or unconstrained scale. It probably makes the most sense to have a constrained value as the offset and an unconstrained-scale multiplier? IIRC correlation matrix unconstraining functions are not autodiffable so corr_matrix might not work.
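
To make the ambiguity concrete, here is a minimal sketch of the two readings for the bounded case, written with today's transformed parameters and assuming the usual logit-based lower/upper transform (the variable names are illustrative, and Jacobian terms for any prior placed on the transformed quantities are omitted):

data {
  real o;            // offset
  real<lower=0> m;   // multiplier
}
parameters {
  real x;            // unconstrained value the sampler works with
}
transformed parameters {
  // Reading A: offset/multiplier act on the unconstrained scale, bounds applied last
  real<lower=0, upper=1> theta_a = inv_logit(o + m * x);
  // Reading B: bounds applied first, offset/multiplier act on the constrained scale
  real theta_b = o + m * inv_logit(x);   // no longer guaranteed to lie in (0, 1)
}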

bob-carpenter commented 4 years ago

We definitely need to figure out how to compose transformations (what TensorFlow Probability calls "bijectors"). I'd prefer to have the transformations separated with:

real<multiplier = m, offset = o><lower = l, upper = u> theta;

being read left to right as: start with an unconstrained value, multiply by m and add o, then apply the lower/upper bound transform.

That way, the subsequences of the type expression denote reasonable types, and each additional transform gets added to the end.

@nhuurre --- any idea how hard this would be to do? It'd be really useful for things like non-centered lognormal parameterizations.

IIRC correlation matrix unconstraining functions are not autodiffable so corr_matrix might not work.

It's the constraining functions that need to be autodiffable, and they are here (it's an inverse logit plus stick breaking process to get each unit-length row with positive element on the diagonal; the details are in the reference manual). So we could compose an offset and correlation matrix constraint, but it'd only make sense doing the offset/multiplier first, as otherwise the result won't be a correlation matrix. But I don't see an application for it.

We could stack offset/multiplier for correlation matrices, but I don't think it'd gain us anything. I don't think we could easily stack lower/upper bounds.

bob-carpenter commented 4 years ago

Sorry, I forgot to answer @andrjohns's question. No, it should not be legal syntax now. But we want something like it in the future.

avehtari commented 3 years ago

I just had a model for which optimizing failed because the inits were far from sensible, but making the posterior closer to unit scale helps. I have these different length scale and magnitude scale parameters

  real<lower=0> rho;   // lengthscale of f1
  real<lower=0> alpha; // scale/magnitude of f1
  real<lower=0> sigma; // residual scale

and then the priors were

  rho ~ normal(0, 1000);
  alpha ~ normal(0, 500);
  sigma ~ normal(0, 500);

For me the most logical thing was to try

  real<lower=0,multiplier=1000> rho;   // lengthscale of f1
  real<lower=0,multiplier=500> alpha; // scale/magnitude of f1
  real<lower=0,multiplier=500> sigma; // residual scale

which, by the way, gives a different error than when multiplier comes before lower. @bob-carpenter suggested

real<multiplier = m, offset = o><lower = l, upper = u>

to mean "start with unconstrained, multiply by m then add o, then finally apply the inverse logit transform designated by <lower = l, upper = u>", but can you confirm whether

real<lower = l, upper = u><multiplier = m, offset = o>

would apply the inverse logit transform first and then multiply the constrained value by m and add o?

This combined feature would be useful for me: right now, to get optimizing to work for my model, I need to rescale the data to unit scale so that the parameters are also on unit scale.
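
As a point of comparison, here is a minimal workaround that gets roughly unit-scale parameters today without the combined constraint (a sketch only; the *_unit names are illustrative, and the rescaling is a constant Jacobian so the implied priors match):

parameters {
  real<lower=0> rho_unit;    // intended to be O(1)
  real<lower=0> alpha_unit;
  real<lower=0> sigma_unit;
}
transformed parameters {
  real<lower=0> rho = 1000 * rho_unit;    // lengthscale of f1
  real<lower=0> alpha = 500 * alpha_unit; // scale/magnitude of f1
  real<lower=0> sigma = 500 * sigma_unit; // residual scale
}
model {
  // equivalent to rho ~ normal(0, 1000), alpha ~ normal(0, 500), sigma ~ normal(0, 500)
  rho_unit ~ normal(0, 1);
  alpha_unit ~ normal(0, 1);
  sigma_unit ~ normal(0, 1);
}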

bob-carpenter commented 3 years ago

The natural way to set it up is from inside out, thinking of each < ... > as applying an inverse transform.

real<lower = l, upper = u><multiplier = m, offset = o> y;

So we start with a type real and unconstrained value x. Then we apply <lower = l, upper = u> to get a variable w of type real<lower = l, upper = u> with value

w = l + (u - l) * inv_logit(x)

Then you apply the final transform <multiplier = m, offset = o> to get

y = o + m * w

The resulting y will satisfy the constraint <lower = o + m * l, upper = o + m * u>, but there's nothing to do at compile time about that, so we don't have to worry about the algebra.

The other way around,

real<offset = o, multiplier = m><lower = l, upper = u> y;

will guarantee that y satisfies the constraint lower = l, upper = u.

avehtari commented 3 years ago

Could you use something other than u for the variable, so that u is not on both the left-hand and right-hand side?

u = l + (u - l) * inv_logit(x)

bob-carpenter commented 3 years ago

Could you use something other than u for the variable,

Oops. I edited the original comment.

betanalpha commented 3 years ago

I was reading through #947 but I think the right place to comment is here. Apologies for the late addition. I'm hesitant about the claimed utility of all of this machinery given how few transformations actually compose and the relative inaccessibility of those compositions to end users.

Right now all of our transformations map from a constrained space (a one-dimensional subspace of R in the scalar case or a lower-dimensional subspace of R^{N} in the multivariate case) to the same unconstrained space (all of R in the scalar case or all of R^{D} where D is the dimension of the subspace). This means that the only valid compositions would be with maps from constrained spaces to constrained spaces, of which we currently do not have any, and maps from unconstrained spaces to unconstrained spaces, of which we only have the R -> R multiply/offset.

This means that there's just one valid set of compositions (1D constrained -> 1D unconstrained then 1D unconstrained -> 1D unconstrained) and, critically, the behavior of that second map depends on how the first map is implemented. Right now users only see the bounds that define the constrained spaces, but the implementation of the constrained -> unconstrained maps is completely abstracted, which means that there are no guarantees in the language for what any prepended unconstrained -> unconstrained maps would actually do. In particular, if the constrained -> unconstrained maps were ever changed, a discussion which has happened in the past, then the effective action of any prepended maps would change!

I think this also highlights a problem with how the multiplier/offset was implemented in the language. Before multiplier/offset, all of the <> annotations to types simply defined a constraint, and hence a constrained space of valid values, but they did not define how the algorithms ensure that the constraint is satisfied. Again, that was all abstracted behind the user-facing language. In other words, the <> annotations did not define a map.

The multiplier/offset <> annotation, however, behaves fundamentally differently. Instead of defining a space it defines an explicit user-facing transformation. Because these two annotations technically define different concepts, they don't actually compose naturally from the user's perspective. Again one has to jump from the constraint to the constrained -> unconstrained map, which is abstracted from the user, before one can consider compositions with the transformation defined by the multiplier/offset <> annotation.

I doubt anyone will agree with me at this point, but personally I think this overloaded use of the <> annotation is too confusing to be useful, as demonstrated by the creation of this issue in the first place. For example, a more consistent usage would be to drop constraints entirely and instead require users to specify explicit constrained -> unconstrained maps. Then constrained -> unconstrained -> unconstrained maps could be defined without the need for a full compositional backend. To be clear, this is just an example, and not one that I think would be all that useful given the cognitive burden that understanding constrained -> unconstrained maps instead of just constraints would place on users. I personally would prefer to introduce additional transformations between the space defined by the parameters block and the space considered by the algorithms in a different way, for example with a new block or something.

bob-carpenter commented 3 years ago

@betanalpha I think the main issue for the language is whether to go with something that looks like composable transforms such as

real<offset=m, multiplier=s><lower = 0> a;

Or whether to just roll together the ones that work:

real<offset=m, multiplier=s, lower = 0> a;

The only other constrained types that are naturally scalable are ordered and positive_ordered.

I think Oryx, the probabilistic programming library that sits on top of JAX, does just what you're proposing in terms of specifying transforms more explicitly.

There's also a related issue of whether we let users define their own transforms somehow, but it's messy given the need for the transform, inverse transform, and inverse transform plus log absolute Jacobian determinant. We'd ideally want to validate their inverse and their derivatives as a general determinant calculation wouldn't be efficient.
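
For the simplest case, the three pieces look like this when written out by hand today (a sketch assuming the usual exp/log pair for a lower bound of zero; the variable names are illustrative):

parameters {
  real x_unc;                      // unconstrained parameter
}
transformed parameters {
  real<lower=0> x = exp(x_unc);    // inverse transform: unconstrained -> constrained
}
model {
  // log absolute Jacobian determinant of the inverse transform: log(d exp(x_unc) / d x_unc) = x_unc
  target += x_unc;
  x ~ lognormal(0, 1);             // prior on the constrained value
  // the forward transform (constrained -> unconstrained) would be log(x)
}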

betanalpha commented 3 years ago

My point is that <lower = 0> doesn't explicitly define a transformation! There are an infinite number of transformations from the positive real line to the full real line, and for <lower=0> to even implicitly define a transformation there would have to be a global convention for which transformation is chosen. Right now a transformation is chosen implicitly behind the scenes so that users don't need to be aware of that convention, but if a declaration like real<offset=m, multiplier=s><lower = 0> a; is allowed then that abstraction is broken and users would have to educate themselves on that convention. Implicit implementation assumptions that have user-facing effects are asking for confusion.

The problem holds for both real<offset=m, multiplier=s><lower = 0> a; and real<offset=m, multiplier=s, lower = 0> a;.

If one moved to explicit transformations then one could write something like real<log><offset=m><multiply=s> a where at least each annotation is the same kind of object. Again, I'm not advocating for this; I think it's a completely different feature from the original constraint annotations and, if added, should have its own syntax. This is one of the reasons why I complained about the original addition of offset/multiply, although I may not have been as clear about it.

WardBrian commented 3 years ago

I think @betanalpha brings up an interesting question that I had asked myself while working on the composable transforms, which is "why are offset and multiplier considered transforms and not <something else>?" The "composition" becomes a lot easier if there are transforms and, separately, some other "thing" that is not a transform. If instead of "every variable can have a transform" we had some concept of "every variable can have a transform and a Scale" (for lack of a better name), it would cover probably 90% of use cases with 10% of the complexity, because the transform can stay the basic non-list type it is now.

I have seen a few people request other compositions, like a positive simplex, but I suspect what they were hoping for would be the intersection of those constraints (the variable is both a simplex and has all positive elements) rather than the composition (a vector is stick-broken and then exp'd).

Edit: This doesn't fix the other point you bring up about needing to understand which transform is occurring, or the syntax concern, both of which I agree are confusing in this context

bob-carpenter commented 3 years ago

@WardBrian: Simplexes have all positive elements by definition. I think what people were asking for is an ordered simplex. But we don't have any way to declare an ordered simplex because both are types, not something we currently write in angle brackets.

@betanalpha: I'm less worried about implementation dependency than you. It happens in every programming language that you have to break the encapsulation to understand how something's implemented in order to write efficient code, such as understanding that Stan uses column-major order for matrices or that we use autodiff which can fail in places where algebra doesn't fail or that non-centered parameterizations are more efficient because of the sampler we happen to be using, and so on. More constructively, how would you propose someone code a non-centered, positive-constrained variable with prior alpha ~ lognormal(mu, sigma)? That's what I'd like to make easier for users.

WardBrian commented 3 years ago

You're right, I meant ordered simplex. And while those are treated as types in the syntax, in the implementation both of them just declare variables with the underlying type and a transformation, so in theory one could imagine syntax which (combined with #947) would declare a variable as both ordered and a simplex.

This was brought up when I was first discussing the composable transforms idea, and it can be very tricky to explain why the output is not actually something that is both ordered and a simplex. It's not clear to me how useful the resulting object actually is. Making the offset and multiplier options orthogonal to the transformation of a variable allows the compositions we'd want, but would not allow things like this.

betanalpha commented 3 years ago

I have seen a few people request other compositions, like a positive simplex, but I suspect what they were hoping for would be the intersection of those constraints (the variable is both a simplex and has all positive elements) rather than the composition (a vector is stick-broken and then exp'd).

As Bob mentioned I think you mean ordered simplex, but you’re correct in that the two constraints are not compatible with each other and instead people are actually looking for the intersection where both constraints are satisfied at the same time.

Again I think it helps to avoid confusion by not confounding constraints and maps but instead discussing them as the separate concepts they are (and ideally encoding that difference in the language!).

I'm less worried about implementation dependency than you. It happens in every programming language that you have to break the encapsulation to understand how something's implemented in order to write efficient code, such as understanding that Stan uses column-major order for matrices or that we use autodiff which can fail in places where algebra doesn't fail or that non-centered parameterizations are more efficient because of the sampler we happen to be using, and so on.

I agree with the need to peek inside encapsulations to enable more efficient code in general, but I don’t think it applies here.

When appending an unconstrained -> unconstrained transformation to a constrained -> unconstrained map, the transformation acts on a different space and hence behaves differently. This isn’t a question of performance or efficiency but rather of clearly understanding what the transformation is actually doing.

For example many use multiplier/offset to try to hack better initial conditions, but when that transformation is applied after an unconstraining transformation the user would have to know exactly how the unconstraining is implemented in order to set the right multiplier/offset.
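
To make that concrete with the current exp-based lower-bound transform: an "offset" o applied on the unconstrained scale acts multiplicatively, not additively, on the constrained value,

x = exp(x_raw + o) = exp(o) * exp(x_raw)

so the effective action of o depends entirely on the choice of unconstraining transform, and would change if that transform were ever changed.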

More constructively, how would you propose someone code a non-centered, positive-constrained variable with prior alpha ~ lognormal(mu, sigma)? That's what I'd like to make easier for users.

The full “non-centered parameterization” doesn’t apply to positively-constrained variables. The standard “non-centering” that reduces a latent population model to a unit population model with a deterministic scaling and translation applies only to the normal and cauchy families (i.e. the stable families available in closed form) defined over the entire real line. While one can brute-force this transformation onto other families of densities over the real line, the Jacobian will in general introduce unwelcome couplings, and we won’t have the same centered and non-centered behaviors.

For some families of densities over the positive real line (either gamma or inverse gamma, I can’t remember which, and the lognormal) we can non-center the scale while preserving the family. In this case one would need a pure multiplier transform, but — critically — it would have to be applied before the unconstraining transform to be interpretable! A side benefit of that ordering is that the implementation of the unconstraining transformation goes back to being abstracted from the user.
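
Concretely, for the lognormal case: if alpha_raw ~ lognormal(0, sigma), then c * alpha_raw ~ lognormal(log(c), sigma) for any c > 0, so a positive multiplier applied on the constrained scale stays within the family.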

Again to me this reinforces the fact that these helper transformations and constraints have fundamentally different uses and should be treated as such in the language. Yes in the final implementation there will be circumstances where things reduce to composable transforms, but even advanced users will have trouble understanding that final implementation and there will be only a few compositions at a time so they’re not hard to implement with the existing code base.

bob-carpenter commented 3 years ago

Thanks, @betanalpha. I get the conceptual difference you're trying to make between transforms and constraints. I'm more of a lumper, so I see a transform like the affine transform as a constraint that doesn't impose a constraint, i.e., it maps from reals to reals.

For non-centering, I currently do something like the following manually.

parameters {
  vector[N] alpha_std;
  real mu;
  real<lower = 0> sigma;
}
transformed parameters {
  vector<lower = 0>[N] alpha = exp(mu + sigma * alpha_std);
}
model {
  alpha_std ~ normal(0, 1);
  mu ~ ...;
  sigma ~ ...;
}

What I'd like to do is be able to write something like the following instead.

parameters {
  vector<offset = mu, multiplier = sigma><lower = 0>[N] alpha;
}
model {
  alpha ~ lognormal(mu, sigma);
}

With the understanding that the transform is log followed by z-score and the inverse transform is affine followed by exp.
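
Spelled out, that pair of maps would be

unconstrain:  x = (log(alpha) - mu) / sigma
constrain:    alpha = exp(mu + sigma * x)

matching the manual version in the first program.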

nhuurre commented 3 years ago

Right now a transformation is chosen implicitly behind the scenes so that users don't need to be aware of that convention, but if a declaration like real<offset=m, multiplier=s><lower = 0> a; is allowed then that abstraction is broken and users would have to educate themselves on that convention.

For example many use multiplier/offset to try to hack better initial conditions, but when that transformation is applied after an unconstraining transformation the user would have to know exactly how the unconstraining is implemented in order to set the right multiplier/offset.

This is an important point. Currently the (un)constraining functions are not exposed in the language (although maybe they should be?)

In my opinion the only interpretable abstraction is to automatically unconstrain the user-supplied offset value.

real<lower=low, offset=ofs> x;

would then transpile to

var x = lb_constrain(x_raw + lb_free(ofs, low), low);

This guarantees that the offset and the parameter are comparable despite the implicit transforms.
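
For the scalar lower-bound case, and assuming the current exp/log implementation (lb_constrain(z, low) = low + exp(z), lb_free(y, low) = log(y - low)), that proposal reduces to

x = low + exp(x_raw + log(ofs - low)) = low + (ofs - low) * exp(x_raw)

so x equals ofs exactly when x_raw = 0, which is what makes the user-supplied offset comparable to the parameter.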

It's not at all obvious how to handle the multiplier, though.

bob-carpenter commented 3 years ago

@nhuurre: Neither the constraining nor the unconstraining functions are exposed. Nevertheless, it's like any compiled language in that using it at maximal efficiency requires understanding what the compiler does under the hood. For example, C++ compilers make all kinds of decisions that aren't part of the spec, and part of programming efficient C++ is figuring out what those are.

The offset can be anything, so I'm not sure what you mean by "unconstraining the user-supplied offset". The multiplier needs to be positive. With our current constraints, we require the lower bound to be less than the upper bound, but otherwise don't constrain arguments.

My suggestion was to transpile as follows:

real<offset = mu, multiplier = sigma><lower = L> x;
if (sigma <= 0) throw ...;
var x_mid = offset_multiplier_constrain(x_raw, mu, sigma, lp__);
var x = lb_constrain(x_mid, L, lp__);

With vectors, we can formulate a natural lognormal hierarchical model as:

vector<offset = mu, multiplier = sigma><lower = 0>[N] tau;

tau[1:N] ~ lognormal(mu, sigma);

This is indeed relying on our understanding of how multiplier and offset are implemented as well as how lower bounds are implemented. If those things change, the program doesn't become wrong, it just becomes less efficient.

WardBrian commented 3 years ago

@bob-carpenter's suggested C++ code is very similar to what I implemented in #971 (the if (sigma <= 0) throw ... is in the math library, so it's actually a direct composition x = lb_constrain(OM_constrain(x_raw, mu, sigma, lp__), L, lp__);)

nhuurre commented 3 years ago

The offset can be anything, so I'm not sure what you mean by "unconstraining the user-supplied offset".

I mean typical Stan code should look like this

data {
  int K;
  simplex[K] init;
}
parameters {
  simplex<offset=init>[K] par;
}

and not like this

data {
  int K;
  simplex[K] init;
}
parameters {
  simplex<offset=simplex_free(init)>[K] par;
}

and definitely not like this

functions {
  vector calc_offset(vector v) {
    int k = size(v);
    vector[k-1] w;
    real l = v[k];
    for (i in 1:k-1) {
      l += v[k-i];
      w[k-i] = logit(v[k-i] / l) + log(i);
    }
    return w;
  }
}
data {
  int K;
  simplex[K] init;
}
parameters {
  simplex<offset=calc_offset(init)>[K] par;
}

betanalpha commented 2 years ago

For non-centering, I currently do something like the following manually.

parameters {
  vector[N] alpha_std;
  real mu;
  real<lower = 0> sigma;
}
transformed parameters {
  vector<lower = 0>[N] alpha = exp(mu + sigma * alpha_std);
}
model {
  alpha_std ~ normal(0, 1);
  mu ~ ...;
  sigma ~ ...;
}

What I'd like to do is be able to write something like the following instead.

parameters {
  vector<offset = mu, multiplier = sigma><lower = 0>[N] alpha;
}
model {
  alpha ~ lognormal(mu, sigma);
}

With the understanding that the transform is log followed by z-score and the inverse transform is affine followed by exp

There are many reasons why I don’t think users should be able to do this, let alone be encouraged to do it.

When building a hierarchical model we can’t take the population model for granted, and it’s that choice of population model that determines the natural reparameterizations. If a user wants to specify a population model on log(alpha) and not alpha, then the most interpretable Stan program follows that structure by specifying log_alpha in the transformed parameters block.

If instead a user wants to specify a lognormal model on alpha directly, then they need to be aware of the natural reparameterizations of that constrained population model, which are no longer translation and scaling but rather scaling and exponentiation. This would require something like

vector<lower=0><multiplier=mu, exponent=sigma>[N] alpha;

This notation is really burdened by the overloading of <> for types and transformations. It becomes much more clear when the type and any transformation are separated out, for example with

vector[N] alpha map map;

where the compiler can then verify that all transformations are compatible with the associated type, here all mapping R^{+} to R^{+}.

Either the user manages the constraints themselves, because it’s more natural to model the latent unconstrained space directly, or the user applies constraint-compatible transformations. In either case there’s no need to know how Stan handles constraints under the hood.

Finally, using transformations to implement hierarchical reparameterizations will always be too limited in practice given that the reparameterizations are defined for each parameter independently of the others. In most, if not all, hierarchical models the optimal parameterization will vary from individual context to individual context.

bob-carpenter commented 2 years ago

@betanalpha I'd like to be able to write this:

parameters {
  vector<lower=0><offset=mu, multiplier=sigma>[N] a;
}
model {
  a ~ lognormal(mu, sigma);
}

instead of what I currently do, which is this

parameters {
  vector<offset = mu, multiplier = sigma>[N] log_a;
}
transformed parameters {
  vector<lower=0>[N] a = exp(log_a);
}
model {
  log_a ~ normal(mu, sigma);
}

or even worse, what I'd have to do without offset/multiplier, which is this:

parameters {
  vector[N] log_std_a;
}
transformed parameters {
  vector<lower = 0>[N] a = exp(mu + sigma * log_std_a);
}
model {
  log_std_a ~ normal(0, 1);
}

Not to bring up another can of worms, but this is exactly the situation where I'd like to be able to write a one-liner to keep everything together for maximal readability,

parameters {
  vector<lower=0><offset=mu, multiplier=sigma>[N] a ~ lognormal(mu, sigma);
}

nhuurre commented 2 years ago

Not to bring up another can of worms, but this is exactly the situation where I'd like to be able to write a one-liner to keep everything together for maximal readability,

If we're going to open that can of worms may I propose that

parameters {
  vector[N] a ~ lognormal(mu, sigma);
}

means quantile function application

parameters {
  vector<lower=0, upper=1>[N] a_raw;
}
transformed parameters {
  vector[N] a;
  for (i in 1:N)
    a[i] = lognormal_qf(a_raw[i] | mu, sigma);
}

No need for additional transformations, and it's automatically sort of "non-centered": a parameter that is uniform on (0, 1) pushed through the lognormal quantile function ends up with the corresponding lognormal distribution.

betanalpha commented 2 years ago

@betanalpha I'd like to be able to write this:

parameters {
  vector<lower=0><offset=mu, multiplier=sigma>[N] a;
}
model {
  a ~ lognormal(mu, sigma);
}

Yes, I know that is what you want; my argument is that this is not a good idea for a variety of reasons.

If modeling a lognormal population model directly then one should implement the reparameterizations compatible with the lognormal family, which are scalings and exponentiations. In other words, the proper analogue to non-centering the location and scale of a normal population model would be

parameters {
  vector<lower=0><multiplier=mu, exponent=sigma>[N] a;
}
model {
  a ~ lognormal(mu, sigma);
}

although, as I wrote in my last email, I think this suggests a different notation for the transformations to disambiguate them from types.
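
Written out with today's syntax, that multiplier/exponent reparameterization would correspond to something like the following sketch (reading the multiplier as exp(mu) so that a ends up lognormal(mu, sigma); hyperpriors omitted):

data {
  int<lower=0> N;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<lower=0>[N] a_raw;
}
transformed parameters {
  // equivalent to exp(mu) * a_raw^sigma: scale by exp(mu), exponentiate by sigma
  vector<lower=0>[N] a = exp(mu + sigma * log(a_raw));
}
model {
  a_raw ~ lognormal(0, 1);  // implies a ~ lognormal(mu, sigma)
}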

If modeling a normal population model on some latent space then

parameters {
  vector<offset = mu, multiplier = sigma>[N] log_a;
}
transformed parameters {
  vector<lower=0>[N] a = exp(log_a);
}
model {
  log_a ~ normal(mu, sigma);
}

would be the most appropriate Stan program.

That said, either way the implicit transformation perspective only allows you to monolithically center or non-center all of the parameters, which is the optimal parameterization only in special circumstances.

I'd like to be able to write a one-liner to keep everything together for maximal readability,

I strongly disagree that this is more readable, let alone maximally readable. You’re cramming a wealth of assumptions into one line that is extremely difficult to unpack unless you know the specifics of each assumption.

nhuurre commented 2 years ago

That said, either way the implicit transformation perspective only allows you to monolithically center or non-center all of the parameters, which is the optimal parameterization only in special circumstances.

You can work around it with array-like offset/multipliers, for example

data {
  int N;
  int K;
  real y[N,K];
  int<lower=0, upper=1> is_centered[N];
}
transformed data {
  int centered_idx[N];
  for (n in 1:N)
    centered_idx[n] = is_centered[n] ? 1 : 2;
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector<offset=[0.0, mu]'[centered_idx], multiplier=[1.0, sigma]'[centered_idx]>[N] eta;
}
model {
  eta ~ normal(mu, sigma);  // centered where is_centered[n] == 1, non-centered elsewhere
  for (n in 1:N) {
    y[n] ~ normal(eta[n], 1.0);
  }
}