stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

During optimization the Jacobian determinant terms should not be added to lp__ #112

Closed mbrubake closed 11 years ago

mbrubake commented 11 years ago

When doing optimization, the log determinant terms that result from transformed parameters are unnecessary and will actually move the mode. We need to be able to turn off these terms when computing log_prob() and grad_log_prob() for optimization methods.

mbrubake commented 11 years ago

@bob-carpenter this has implications for issue #110 since a user will want to differentiate between increments due to a parameter transform and increments expressing a part of the posterior.

bob-carpenter commented 11 years ago

On 6/8/13 12:15 PM, Marcus Brubaker wrote:

> When doing optimization, the log determinant terms that result from transformed parameters are unnecessary and will actually move the mode. We need to be able to turn off these terms when computing log_prob() and grad_log_prob() for optimization methods.

I don't understand. They're part of the log probability functions. If you don't want uniform priors on parameters, that can be changed in the model.

Let me work through an example to see if I'm following. If I write

```stan
parameters {
  real<lower=0, upper=1> theta;  // constrained, so Stan works on theta_unc = logit(theta)
}
model {
  theta ~ beta(3.9, 17.2);
}
```

then the density over the unconstrained parameter theta_unc in (-inf, inf) defined by the log prob function is log p(theta_unc | 3.9, 17.2), where

p(theta_unc | 3.9, 17.2)

  = inv_logit(theta_unc) * (1 - inv_logit(theta_unc))   // Jacobian

    * beta(inv_logit(theta_unc) | 3.9, 17.2)

This seems right to me. If I optimize and set

theta_unc* = ARGMAX_theta_unc p(theta_unc | 3.9, 17.2)

isn't this the same as optimizing with respect to the model:

theta* = ARGMAX_theta beta(theta | 3.9, 17.2)

       =?? inv_logit(theta_unc*)
mbrubake commented 11 years ago

The problem is that, in your example, theta* != inv_logit(theta_unc*) (in general).

Consider it analytically. If you look for zeros of the gradient of the log probability, you get an extra term due to the Jacobian. That is, if

lpt(x_unc) = lp(x(x_unc)) + log_det_J(x_unc)

is the log probability of the unconstrained parameter and lp(x) is the constrained log probability, then

dlpt/dx_unc = dlp/dx * dx/dx_unc + dlog_det_J/dx_unc

The point where dlpt/dx_unc = 0 is not (necessarily) the point where dlp/dx = 0.

bob-carpenter commented 11 years ago

On 6/9/13 9:43 PM, Marcus Brubaker wrote:

> The problem is that, in your example, theta* != inv_logit(theta_unc*) (in general).
>
> Consider it analytically. If you look for zeros of the gradient of the log probability, you get an extra term due to the Jacobian. That is, if
>
> lpt(x_unc) = lp(x(x_unc)) + log_det_J(x_unc)

OK, I see the problem now after working through it on my own in a concrete one-dimensional case.

It means our current approach to optimization is not optimizing the model users expect.

The change is doable. The transform functions are already in place.

We don't want to incur runtime overhead or lock in decisions when the model is compiled. That leaves two options I can see:

  1. writing a single function with a boolean template parameter indicating whether to calculate Jacobians or not, or
  2. writing two functions that have different initial segments for the transforms but call the same function for the body of the model itself.

- Bob
mbrubake commented 11 years ago

I think option 1 would be more elegant. Also, relating to #110: once we have an increment_log_prob() function, we could disable it for calls inside the transformed parameters block, so that users can express that some lp__ increments are part of the model and others are part of the parameterization.