AFAIK this inconsistency was inherited from stanc2, see #767. Do you have a suggestion for what to do about it?
IMO we should allow target += in TParams. @bob-carpenter?
^ I agree
I also agree. My original intention was to allow target += in the transformed parameters block because I thought that'd be the natural place to drop Jacobians. Ideally, this should go along with exposing the boolean template parameter for whether the Jacobian is applied. That would give us enough power to implement transforms that match Stan's behavior in the language.
It was a quick change to allow target += (#980). The idea of a boolean to allow target adjustments to depend on a Jacobian parameter is probably a larger-scope change.
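For reference, a minimal sketch of the kind of program this change permits (the lognormal density and variable names here are just illustrative):

parameters {
  real theta;
}
transformed parameters {
  real eta = exp(theta);
  // the log Jacobian of the exp transform can now live next to it
  target += theta;
}
model {
  eta ~ lognormal(0, 1);
}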
@bob-carpenter, would you be willing to look into updating the doc accordingly? I'm not sure if the current behavior (where _lp functions allowed you to do this) is even documented.
Sure, I can update the reference manual. I'm not sure I want to start changing programs in the user's guide, but I'll check where we talk about blocks there, too.
Yeah, we don't need to update examples; I'm more just thinking that if there are any statements like "You can only use target += in the model block", they will need to change.
I opened https://github.com/stan-dev/docs/issues/413 for this.
I am strongly on team “target += only in the model block”.
Allowing the target to be modified within the transformed parameters block makes Stan programs harder to parse overall. While Jacobians might seem more natural next to transformed parameters, they're actually part of the transformed density function, not the transformation itself.
Mathematically a program like

transformed parameters {
  real eta = f(theta);
  target += jacobian(theta);
}
model {
  eta ~ prob(values);
}
doesn’t make much sense, because one doesn’t add a Jacobian for a transformation by itself, and prob_lpdf(eta | values) isn’t a well-defined density function on its own. It’s only in the context of trying to assign a density to the output value that a Jacobian is needed, so the proper encapsulation is
transformed parameters {
  real eta = f(theta);
}
model {
  eta ~ prob(values);        // Neither this line
  target += jacobian(theta); // nor this line
  // are well-defined on their own; it's the combination that is well-defined.
}
Allowing the target to be modified in the transformed parameters block makes Jacobians even harder to understand.
At the same time it also allows for ridiculous programs that don’t even approach making sense. What does
transformed parameters {
  theta ~ prob(values);
}
model {
}
mean? Especially given how interfaces will change behavior if the model block is empty.
Finally, allowing target += in the transformed parameters block breaks the symmetry between the transformed parameters and generated quantities blocks, which otherwise behave exactly the same, just implemented at different times.
Allowing the target to be modified within the transformed parameter block makes Stan programs harder to parse overall.
I take it you mean for a human reader, not stanc3. I think that's the main place we disagree. My intention all along was to allow target increments in the transformed parameters block for Jacobians. Abusing that usage is indeed a way to write a confusing program.
It’s only in the context of trying to assign a density on the output value that a Jacobian is needed, so the proper encapsulation is
My goal was to mirror how our constraints work for parameters. The constraints are applied as soon as they're declared regardless of how the variable is used. For example,
parameters {
real<lower=0> sigma;
}
always applies the Jacobian correction to make the distribution over sigma improper uniform on (0, infty). And if you apply a constraint down to a compact set, you don't need a sampling statement---just the transform will lead to a proper prior, as we see with
real<lower=0, upper=1> theta;
simplex[K] phi;
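To spell out the compact-set case, here is roughly the program the built-in transform stands in for with real<lower=0, upper=1> theta, assuming Stan's usual logit parameterization (the hand-written names are illustrative):

parameters {
  real theta_unc;
}
transformed parameters {
  real theta = inv_logit(theta_unc);
}
model {
  // log Jacobian of inv_logit; with no sampling statement at all,
  // the implied prior on theta is a proper uniform(0, 1)
  target += log_inv_logit(theta_unc) + log1m_inv_logit(theta_unc);
}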
Especially given how interfaces will change behavior if the model block is empty.
I don't see how that'd even be possible for an interface. Are you thinking about plans to run the fixed-param sampler when the parameter size is zero? That's not even going to require the parameter block to be empty, because we can declare size-zero parameters. For example, the following complete Stan program would run with HMC if N > 0 and with fixed-param if N = 0 (not that the fixed-param version would do anything).
data {
int<lower=0> N;
}
parameters {
vector<lower=0, upper=1>[N] alpha;
}
@betanalpha - would you propose that we disallow the calling of _lp functions in transformed parameters then? I think that would be a rather large breaking change if anyone was relying on the behavior.
It’s only in the context of trying to assign a density on the output value that a Jacobian is needed, so the proper encapsulation is
My goal was to mirror how our constraints work for parameters. The constraints are applied as soon as they're declared regardless of how the variable is used. For example,
parameters {
  real<lower=0> sigma;
}

always applies the Jacobian correction to make the distribution over sigma improper uniform on (0, infty). And if you apply a constraint down to a compact set, you don't need a sampling statement---just the transform will lead to a proper prior, as we see with

real<lower=0, upper=1> theta;
simplex[K] phi;
That kind of reorganization is possible only once the blocks have been collapsed by the compiler into a single C++ function that implements the entire target density function. At that point all of the expressions are associative and parts of transformed density functions can be pulled out and moved up closer to the constraints. In other words this is an implementation detail and does not respect the mathematical structure of the model captured by the Stan program.
A Stan program like
parameters {
  real<lower=0> sigma;
}
model { }
or more pedantically
parameters {
  real<lower=0> sigma;
}
model {
  sigma ~ unif;
}
doesn’t have any Jacobians. It directly defines a target density function over sigma. Behind the scenes, however, an automatic transformation is applied so that the algorithms can work on an unconstrained space, the equivalent of
parameters {
  real log_sigma;
}
model {
  exp(log_sigma) ~ unif;
  target += jacobian(log_sigma);
}
The final compiled C++ implements this reparameterized Stan program, not the original Stan program. Mathematically exp(log_sigma) ~ unif and target += jacobian(log_sigma); are two parts of the same object, and it's only when everything is collapsed to a single function that other organizations become viable.
Any intermediate representation that respects the block structure before that collapse should, in my opinion, respect the mathematical structure of the model the program is representing as well. I feel that many arguments for adding functionality to the transformed parameters block are really implicit arguments for the transformed parameters block just being part of the model block, in which case the blocks themselves should change!
Especially given how interfaces will change behavior if the model block is empty.
Sorry, I confused an empty parameter block, which is checked by various interfaces at the moment (whether it should be or not), with an empty model block, which isn’t at this point checked by anything.
@betanalpha - would you propose that we disallow the calling of _lp functions in transformed parameters then? I think that would be a rather large breaking change if anyone was relying on the behavior

Yes, I personally would prefer that, although we've had similar discussions before where few agreed with me.
One reason I like having _lp and target += in the transformed params is that it's the only scheme I can think of that would allow for user-defined constraints that can be saved. For instance, think about something like the below, where a user wants to write an ordered_simplex_constrain function. Supposing there was a global jacobian keyword (or whatever name/syntax we want for incrementing when we want a Jacobian correction), then the user writes a _constrain and _free function and applies it in transformed params:
functions {
  vector ordered_simplex_constrain(...) {
    // constrain stuff
    if (jacobian) {
      target += ...;
    }
  }
  vector ordered_simplex_free(...) {
    // inverse of constrain stuff
  }
}
...
parameters {
vector[N] blah;
}
transformed parameters {
vector[N] my_blah = ordered_simplex_constrain(blah);
}
ordered_simplex_constrain() could be done in the model block, but then the transformed param would not be saved. If there's some other way to do these user-defined constraints as a function then I'm open to it. Maybe there's some middle ground for user-defined constraints where these functions can access target when called from the transformed parameters block as long as it's surrounded by if (jacobian) {} (though that's kind of complicated logic and idk how intuitive that would be to users).
I think the problem here is a confounding of types, the transformations that might be implied by a type, and transforms within a Stan program.
parameters {
vector[N] blah;
}
transformed parameters {
vector[N] my_blah = ordered_simplex_constrain(blah);
}
doesn’t require a Jacobian correction unless a density function depending on my_blah is introduced. In that case the situation is no different from the current status quo. The Jacobian has nothing to do with constructing the my_blah variables; rather, it has to do with trying to define a proper density function that depends on the my_blah variables. Specifying density functions is wholly the scope of the model block.
Now the constrained Stan types are different. Constrained Stan types allow the user to build a Stan program directly on a constrained space, in which case no Jacobians are needed. Then before that Stan program is handed off to the algorithms the type information is used to transform the parameter space and target density function specified by that Stan program to an unconstrained space. When transforming the density function those Jacobians become necessary. In the Stan C++ implementation this is all implemented at once and many of the evaluations are mixed up, but that reorganization doesn’t correspond to the actual mathematical operations being performed.
Steve, it seems what you’re trying to do is allow for user-defined types. If there were a “types” block with
types {
my_constrained_type {
constraint(unconstrained values) {
...
}
free(constrained values) {
}
jacobian(unconstrained values) {
}
}
}
then one could write
parameters {
my_constrained_type alpha;
}
model {
alpha ~ whatever();
}
and let the compiler set up the transformations and add the Jacobian just as it does for all of the predefined constrained types.
Steve it seems what you’re trying to do is allow for user-defined types.
Yes, something like that. There's an issue of how the Jacobian and inverse can share computation and where it'd get declared. And then how to parameterize transforms (like lower = L) and account for size declarations. It's going to be tricky.
Steve it seems what you’re trying to do is allow for user-defined types.
Yes I do like that!
And then how to parameterize transforms (like lower = L) and account for size declarations. It's going to be tricky.
I think for what Michael is talking about here the user would write something like lb_vector.
There's an issue of how the Jacobian and inverse can share computation
I think I'd like a middle ground where if (jacobian) could be used in the constraint so computation can be shared. Though imo if a new type becomes very popular we can always backport it into Stan math / the language and have the more efficient computation then.
And then how to parameterize transforms (like lower = L) and account for size declarations
I was assuming that the scalar/vector/array declarations used for user-defined functions would be sufficient, with users responsible for implementing constrain and unconstrain functions with the right input/output dimensionalities. Any inconsistencies could be diagnosed by unit testing constrain \circ unconstrain = identity_{constrained space} and unconstrain \circ constrain = identity_{unconstrained space}.
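For example, such a round-trip check could be written in Stan itself. A sketch, using a hypothetical positive-ordered transform (cumulative sums of exponentials) as the user-defined constrain/free pair:

functions {
  // hypothetical user-defined constraining transform:
  // unconstrained vector -> positive ordered vector
  vector my_constrain(vector y) {
    return cumulative_sum(exp(y));
  }
  // its inverse, the "free" map back to the unconstrained space
  vector my_free(vector x) {
    int n = rows(x);
    vector[n] y;
    y[1] = log(x[1]);
    for (i in 2:n) {
      y[i] = log(x[i] - x[i - 1]);
    }
    return y;
  }
}
data {
  int<lower=1> N;
  vector[N] y;
}
generated quantities {
  // should be zero up to floating-point error if the pair is consistent
  real round_trip_error = max(fabs(my_free(my_constrain(y)) - y));
}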
There's an issue of how the Jacobian and inverse can share computation and where it'd get declared

I think I'd like a middle ground where if (jacobian) could be used in the constraint so computation can be shared.
Personally I think that sharing computation is most compatible with the direct implementation of
pi(theta_unc) = pi( constrain(theta_unc) ) * J(theta_unc)

instead of the indirect implementation in terms of types.
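For a concrete instance of the direct form, take the lower-bound transform constrain(theta_unc) = exp(theta_unc), with J the absolute Jacobian determinant as in the thread's notation:

pi(theta_unc) = pi( exp(theta_unc) ) * | d/dtheta_unc exp(theta_unc) |
             = pi( exp(theta_unc) ) * exp(theta_unc)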
I think the bigger abstractional issue here is whether

1. A Stan program defines a constrained space theta and a target density function pi(theta).
2. That space and density are transformed to an unconstrained space.
3. Computation occurs on that unconstrained space.
4. The computational output is mapped back to the constrained space.

or

A. A Stan program defines an unconstrained space theta_unc and a target density function pi(theta_unc).
B. Computation occurs on that given space.
C. The computational output is mapped to a constrained space.

To me (1) is how most users interpret a Stan program, with (2-4) all happening automatically, behind the scenes. A is more compatible with the current C++ code generation, and what is handed off to the algorithm library, but at some level that’s an implementation detail.
Absolutely. I designed the language so that users could focus on the constrained space (1) and algorithms could focus on the unconstrained space (3).
Unfortunately, that abstraction is leaky when it comes to performance. For example, users have to think about the unconstrained space in moving from a centered to non-centered parameterization.
I didn't understand the comments about indirect implementation in terms of types. The key issue for performance is that we want to be able to compute the constraining transform and log absolute Jacobian determinant together. However we do that, we absolutely have to implement the adjoint-Jacobian product form for efficiency. In the math library, our reverse-mode autodiff functions compute values in the forward pass and store whatever intermediates we might need later for the adjoint-Jacobian product computed in the reverse pass.
Absolutely. I designed the language so that users could focus on the constrained space (1) and algorithms could focus on the unconstrained space (3). Unfortunately, that abstraction is leaky when it comes to performance.
I wouldn’t say that it’s a leaky abstraction but rather that it’s just not the right abstraction when optimizing purely for performance. In my opinion, opening up the abstraction to allow for some special-case performance optimizations makes the language in general harder to read. If maxed-out performance is that much of a priority, which honestly I don’t think it is, then removing the transformed parameters block and introducing some annotation for which variables to save (which we had previously discussed many years ago) would be a much more consistent approach.
With a transformed parameters block, only the intermediate representations and final C++ generated by the compiler have access to the full unconstrained model evaluation, and hence only the compiler has the proper scope to consider optimizations. There’s certainly plenty of opportunity for the compiler to eliminate/consolidate redundant calculations without breaking the user-facing abstraction.
For example, users have to think about the unconstrained space in moving from a centered to non-centered parameterization.
I strongly disagree.
A Stan program defines a parameterization of some ambient space and specifies a target distribution through a probability density function on that space. The centered and non-centered parameterizations are different parameterizations and hence defined by different Stan programs.
The shift/multiply implementation makes this all implicit, and in my opinion confusing. As I mentioned previously, it also overloads the <> annotation to now represent both types (and implicit unconstraining transformations) and explicit transformations, which reinforces the confusion.
I didn't understand the comments about indirect implementation in terms of types. The key issue for performance is that we want to be able to compute the constraining transform and log absolute Jacobian determinant together. However we do that, we absolutely have to implement the adjoint-Jacobian product form for efficiency. In the math library, our reverse-mode autodiff functions compute values in the forward pass and store whatever intermediates we might need later for the adjoint-Jacobian product computed in the reverse pass.
I think that the adjoint-Jacobian comments here are a red herring. The redundancy between a transform and its log Jacobian determinant is the same whether one evaluates the full Jacobian of the unconstrained target density or direct contractions of that Jacobian with tangent or cotangent vectors. Similarly, that redundancy is independent of the redundancy between these calculations and their gradients, which motivates what should be stored in the forward pass.
Regardless, only a program implementing the target density on the final, unconstrained space has the scope to be able to avoid redundant calculations. Formally, if X is the constrained space, Y is the unconstrained space, and c : Y -> X is the constraining map, then the transformed log density function is given by

log pi( c(y) ) + log J_{c}(y).

If there are redundant terms in log pi \circ c(y) and log J_{c}(y) then we can save computation only when evaluating both terms together, which is how the Stan compiler is currently able to avoid redundancy with the pre-defined types. The redundancy isn’t in the transformation itself; it’s in the evaluation of the unconstrained target density function. In order for users to implement similar optimizations they would need access to that unconstrained target density evaluation.
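For a concrete instance of that redundancy, take c(y) = exp(y) with pi a lognormal density, so that log pi(x) = normal_lpdf(log x | mu, sigma) - log x. The log x term cancels against the log Jacobian, but only if both terms are evaluated together:

log pi( c(y) ) + log J_{c}(y)
  = [ normal_lpdf(y | mu, sigma) - y ] + y
  = normal_lpdf(y | mu, sigma)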
The current Stan language doesn’t provide that access when implementing a model on the constrained space. In this case the Stan program specifies only

log pi(x)

with c(y) and log J_{c}(y) both hidden under the language abstraction.
The only way users can get access to the unconstrained space is by building a Stan program directly on the unconstrained space. When an unconstrained target density is specified entirely in the model block users have full access to potential optimizations. When saving some terms in the transformed parameters block, however, intermediate calculations fall out of scope and introduce potential redundancy. This is a fundamental limitation of having multiple scopes and the only way to avoid that limitation is to break the scopes entirely in which case why even have them?
The centered and non-centered parameterizations are different parameterizations and hence defined by different Stan programs.
Exactly. So I don't see why you say the user doesn't have to think about the unconstrained spaces as those are the spaces over which the Stan programs operate for sampling.
The shift/multiply implementation makes this all implicit, and in my opinion confusing.
I agree that shift/multiply is confusing. The whole transformation from constrained to unconstrained is confusing for our users. I just think it's less confusing and far less error prone than recoding everything oneself. But it's a bit less efficient.
As I mentioned previously it also overloads the <> annotation
I don't see that as an overload so much as thinking about it as a general transform that's not necessarily constraining.
I think that the adjoint-Jacobian comments here are a red herring.
I just think of it as the way autodiff works. It's how our chain() methods get written for efficiency for multivariate or matrix functions.
only a program implementing the target density on the final, unconstrained space has the scope to be able to avoid redundant calculations.
I agree that gives you the most generality. It's one of the reasons I'd like to roll our transforms into the language. But it's not the only way to avoid redundancy. For example, it's better to implement a lognormal distribution on a parameter as a normal distribution on the log of the parameter, then exp() to transform back to positive. When you use lognormal, there's a bit of extra work making and then undoing the transform.
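For instance, these two programs define the same distribution over u, but the second avoids the internal log/exp round trip (mu and sigma are assumed to be data here):

data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real<lower=0> u;
}
model {
  // constrained version: the built-in transform takes log(u), and
  // lognormal_lpdf computes log(u) again internally
  u ~ lognormal(mu, sigma);
}

and

data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real log_u;
}
transformed parameters {
  real u = exp(log_u);
}
model {
  log_u ~ normal(mu, sigma);
}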
the Stan program specifies only log pi(x) with c(y) and log J_{c}(y) both hidden under the language abstraction.
Yes, that was intentional. But rather than "hidden", I'd say "encapsulated" :-). Of course, you can use only unconstrained types and apply all your own transforms and Jacobian adjustments. (For those following along at home, the J in @betanalpha's notation is the absolute determinant of the Jacobian.)
As I mentioned previously it also overloads the <> annotation
I don't see that as an overload so much as thinking about it as a general transform that's not necessarily constraining.
I believe that the design of the language introduces a fundamental difference between unconstraining transformations, X subset R^N -> R^M and 1-1 transformations R^M -> R^M.
The type system allows Stan programs to be defined over constrained parameter spaces X subset R^N. The Hamiltonian Monte Carlo implementation in Stan, however, is most efficient when working on R^M (at least for M >= 1…), and so to align the system defined by the Stan program with the sampler, the compiler has to map X -> R^M. Here the type system does double duty, defining not only the constrained parameter space X but also a canonical map X -> R^M that transforms the system before the sampler runs and then maps the sampler output back afterwards. The hiding/encapsulation of this transformation (the choice of a “canonical” unconstraining transformation and its implementation) abstracts away this process so that users can continue to interpret their model and the inferences derived from that model on the nominal parameter space X.
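Concretely, for real<lower=0> the declaration plays both roles at once, using the log/exp pair Stan actually employs for lower bounds:

X = (0, infty),   unconstrain = log : X -> R,   constrain = exp : R -> X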
1-1 transformations, however, cannot be encoded in the type system, as every variant of R^M is parameterized by the same types of variables (I’m using what I understand to be the conventional definition of a type, i.e. https://simple.wikipedia.org/wiki/Data_type). These transformations don’t change the structure of the parameter space, and there’s no “canonical” transformation, which is why they can’t be hidden away from users.
In other words, for me the typical user experiences the Stan type system as defining X; the unconstraining transformation is an abstracted implementation detail. real<lower=0> defines the constrained space X. Something like real<offset=mu, multiplier=sigma>, however, doesn’t define a constrained space at all; it specifies an explicit 1-1 transformation, which is exactly what can’t be hidden behind the type abstraction.
I agree that gives you the most generality. It's one of the reasons I'd like to roll our transforms into the language. But it's not the only way to avoid redundancy. For example, it's better to implement a lognormal distribution on a parameter as a normal distribution on the log of the parameter then exp() inverse transform back to positive. When you use lognormal, there's a bit of extra work making and then undoing the transform.
Right, and if one is dealing with performance considerations where small optimizations like that matter, then why is writing out the unconstrained model directly, where everything is in scope to optimize the implementation, so onerous? Hell, the compiler can automatically construct an initial unconstrained Stan program for an advanced user to optimize by hand, mirroring optimization by tweaking generated ASM.
This conversation has drifted a bit, although there are multiple interwoven issues at hand. To me the beauty of the Stan language is that the types and scopes are very explicit, and I think that convolving types with transforms and loosening the differences between the programming blocks only makes the structure of Stan programs more implicit and more confusing for users.
I believe that the design of the language introduces a fundamental difference between unconstraining transformations, X subset R^N -> R^M and 1-1 transformations R^M -> R^M.
Both the language syntax and the math back end are agnostic about the content of transforms. There's nothing at all built into the language or the math back end that requires a range different from the domain.
why is writing out the unconstrained model directly, where everything is in scope to optimize the implementation, so onerous?
(1) Because it's inconveniently split over blocks and requires double declaration, and (2) the intended log density never shows up directly in the model.
parameters {
vector<offset=mu, multiplier=sigma>[N] alpha;
}
model {
alpha ~ normal(mu, sigma);
}
and
parameters {
vector[N] alpha_std;
}
transformed parameters {
vector[N] alpha = mu + sigma * alpha_std;
}
model {
alpha_std ~ normal(0, 1);
}
Then there's the secondary problem of getting double the output in CmdStan or having to explicitly filter it in RStan, etc.
Consider two Stan programs, prog1.stan and prog2.stan: the first increments target directly in the transformed parameters block, and the second does the same thing through an _lp function. prog1.stan produces a type error, while prog2.stan compiles just fine.
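The program bodies didn't survive here, but given the checks cited below, a minimal pair exhibiting the inconsistency would look something like this (a hypothetical reconstruction, not the original programs):

// prog1.stan: rejected, because target += statements are checked
// against the enclosing block
parameters {
  real x;
}
transformed parameters {
  real y = exp(x);
  target += x; // type error here
}

// prog2.stan: accepted, because _lp functions may be called from
// transformed parameters
functions {
  real exp_jacobian_lp(real x) {
    target += x;
    return exp(x);
  }
}
parameters {
  real x;
}
transformed parameters {
  real y = exp_jacobian_lp(x);
}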
This is because we have this check for target+= statements: https://github.com/stan-dev/stanc3/blob/095b8ef6e93c3fe029f77ad56c88d650fbd60019/src/frontend/Semantic_check.ml#L1112-L1113
And this check for the calling of _lp functions: https://github.com/stan-dev/stanc3/blob/095b8ef6e93c3fe029f77ad56c88d650fbd60019/src/frontend/Semantic_check.ml#L238-L246