pbs-assess / renewassess

Discussion board for stock assessment renewal group
0 stars 0 forks source link

When to use a Jacobian transformation #5

Open paul-vdb opened 4 months ago

paul-vdb commented 4 months ago

@catarinawor and I have been in discussion with Cole M at NOAA about when to apply the Jacobian correction after reading about it in his new publication.

I've done a bit of a deeper dive on this than I meant for my own reasons but here is the when/why.

For an example here is Nimble code for the log likelihood used for Laplace. Because they have a built in parameter transformation system you don't have to worry about it like in RMTB. See how they define the log likelihood as model$calculate(innerCalcNodes) and then the additional reTrans$logDetJacobian(reTransform) term. In this case for normally distributed random effects, reTrans$logDetJacobian(reTransform) = 0.

    inner_logLik = function(reTransform = double(1)) {
      re <- reTrans$inverseTransform(reTransform)
      values(model, randomEffectsNodes) <<- re
      ans <- model$calculate(innerCalcNodes) + reTrans$logDetJacobian(reTransform)
      return(ans)
      returnType(double())
    }

https://github.com/nimble-dev/nimble/blob/devel/packages/nimble/R/Laplace.R

Let's have a conversation if people have more to say about this or strong opinions! :-) or want to talk more about Jacobians.

seananderson commented 4 months ago

I love a good change of variables discussion. :)

I had a bunch of back and forth about this with Anders Nielsen during the RTMB course (prompted by his tmbstan example not having a Jacobian adjustment). Here is some demonstration code I worked up followed by a version where Anders added a Metropolis–Hastings example at the end, which shows this happens with non-Hamiltonian MCMC algorithms too.

This approach can also be helpful if you want to make sure you've got your Jacobian adjustment entered correctly. For one, an easy mistake is to mess up the addition/subtraction.

This code illustrates that without the Jacobian adjustment the prior is not what it appears to be. It's a prior, but it's not the desired prior. The example is a prior predictive distribution---no data---for simplicity.

Also see: https://mc-stan.org/docs/stan-users-guide/reparameterization.html#changes-of-variables https://users.aalto.fi/~ave/casestudies/Jacobian/jacobian.html https://discourse.mc-stan.org/t/do-i-need-jacobian-adjustment-here/19875/9 https://rstudio-pubs-static.s3.amazonaws.com/486816_440106f76c944734a7d4c84761e37388.html https://jsocolar.github.io/jacobians/#:~:text=Jacobian%20adjustments%20are%20necessary%20when,sampling%20statements%20in%20our%20model).

library(ggplot2)
library(RTMB)
library(tmbstan)
set.seed(1)
dat <- list(x = NULL) # not used
par <- list(logsigma = 0)
ITER <- 3000
ctrl <- list(adapt_delta = 0.98, max_treedepth = 12)

# Skipping Jacobian adjustment
nll <- function(par) {
  getAll(par, dat)
  sigma <- exp(logsigma)
  -sum(dnorm(sigma, 0, 1, TRUE))
}
obj <- MakeADFun(nll, par, silent = TRUE)
m <- tmbstan(obj, chains = 1, iter = ITER, control = ctrl)
e <- extract(m)
# With Jacobian adjustment:
nll2 <- function(par) {
  getAll(par, dat)
  sigma <- exp(logsigma)
  nll <- 0
  # https://mc-stan.org/docs/stan-users-guide/changes-of-variables.html
  # "log absolute determinant of the Jacobian of the transform"
  # log absolute derivative of the transform if univariate
  # for `exp(x)` is `log(abs(D(exp(x))))`, i.e., `x`
  nll <- nll - logsigma
  nll - sum(dnorm(sigma, 0, 1, TRUE))
}
obj2 <- MakeADFun(nll2, par, silent = TRUE)
m2 <- tmbstan(obj2, chains = 1, iter = ITER, control = ctrl)
e2 <- extract(m2)
# Standard version in Stan where parameter enters model and is not transformed:
nll3 <- function(par) {
  getAll(par, dat)
  -sum(dnorm(sigma, 0, 1, TRUE))
}
par <- list(sigma = 0)
obj3 <- MakeADFun(nll3, par, silent = TRUE)
m3 <- tmbstan(obj3,
  chains = 1, iter = ITER,
  lower = c(0), upper = c(Inf), control = ctrl
)
e3 <- extract(m3)
# half-normal(0, 1) distribution for reference:
true_prior <- rnorm(1e6, 0, 1)
true_prior <- true_prior[true_prior>0]

g <- dplyr::bind_rows(
  data.frame(sd = exp(e$logsigma), type = "No Jacobian adjustment"),
  data.frame(sd = exp(e2$logsigma), type = "With Jacobian adjustment"),
  data.frame(sd = e3$sigma, type = "Untransformed with lower bound"),
  data.frame(sd = true_prior, type = "Desired 'true' half-normal(0, 1) prior"),

) |>
  ggplot(aes(sd, colour = type)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 4)) +
  scale_colour_brewer(palette = "Dark2") +
  theme_light() +
  ggtitle("Prior predictive distribution")

print(g)

Code from Anders:

library(RTMB)
library(tmbstan)
set.seed(1)
dat <- list(x = rnorm(8, 0, 2))
par <- list(mu = 0, logsigma = 0)
ITER <- 2e4

# Skipping Jacobian adjustment
nll <- function(par) {
  getAll(par, dat)
  sigma <- exp(logsigma)
  nll <- 0
  nll - sum(dnorm(x, mu, sigma, TRUE))
}
obj <- RTMB:::MakeADFun(nll, par, silent = TRUE)
m <- tmbstan(obj, chains = 6, iter = ITER)
e <- extract(m)
# With Jacobian adjustment:
nll2 <- function(par) {
  getAll(par, dat)
  sigma <- exp(logsigma)
  nll <- 0
  # https://mc-stan.org/docs/stan-users-guide/changes-of-variables.html
  # "absolute determinant of the Jacobian of the transform"
  # i.e., absolute derivative of the transform if univariate
  # for `exp(logsigma)` is `logsigma`
  nll <- nll - logsigma
  nll - sum(dnorm(x, mu, sigma, TRUE))
}
obj2 <- RTMB::MakeADFun(nll2, par, silent = TRUE)
m2 <- tmbstan(obj2, chains = 6, iter = ITER)
e2 <- extract(m2)
# Standard version in Stan where parameter enters model and is not transformed:
nll3 <- function(par) {
  getAll(par, dat)
  nll <- 0
  nll - sum(dnorm(x, mu, sigma, TRUE))
}
par <- list(mu = 0, sigma = 0)
obj3 <- RTMB::MakeADFun(nll3, par, silent = TRUE)
m3 <- tmbstan(obj3,
  chains = 6, iter = ITER,
  lower = c(-Inf, 0), upper = c(Inf, Inf)
)
e3 <- extract(m3)
## primitive mcmc like old ADMB
opt <- nlminb(obj$par, obj$fn, obj$gr)
rep <- sdreport(obj)
sdr <- summary(rep)

est <- opt$par
npar <- length(est)
cov <- rep$cov.fixed # or just # diag(npar) #

nosim <- 50000
mcmc <- matrix(NA, nrow = nosim, ncol = npar)
mcmc[1, ] <- est
colnames(mcmc) <- names(est)

steps <- MASS::mvrnorm(nosim, mu = rep(0, npar), Sigma = cov)
U <- runif(nosim)

print(system.time({
  currentValue <- obj$fn(mcmc[1, ])
  for (i in 2:nosim) {
    proposal <- mcmc[i - 1, ] + steps[i, ]
    proposalValue <- obj$fn(proposal)
    if (U[i] < exp(currentValue - proposalValue)) {
      mcmc[i, ] <- proposal
      currentValue <- proposalValue
    } else {
      mcmc[i, ] <- mcmc[i - 1, ]
    }
  }
}))
#>    user  system elapsed 
#>   0.368   0.002   0.374
plot(density(e$logsigma), main = "Densities")
lines(density(mcmc[, 2]), col = "red")
lines(density(e2$logsigma), col = "green")
lines(density(log(e3$sigma)), col = "blue")
abline(v = log(sd(dat$x)), lwd = 3)
legend("topright",
  col = c("black", "red", "green", "blue"),
  legend = c(
    "no adjust", "no adjust & no stan", "adjust",
    "flat prior on sigma (not log Sigma)"
  ), bty = "n", lwd = 1
)
text(log(sd(dat$x)), 0, "  log(sd(dat$x))", adj = 0)

nickcfisch commented 3 months ago

Thank you guys for looking into this and sharing your thoughts (and providing code)! I had been wondering about this Jacobian business as well since Coles paper.

So, for practical consideration put into simple terms for luddites such as myself, we should: -Make a correction in our likelihood calculations for each parameter estimated using a transformation --Parameter being whatever the prior is specified for?

In the simple case shown for estimating a parameter on the log scale, subtract that log scale parameter from the negative log likelihood?

catarinawor commented 3 months ago

Hey Nick! I think it is important to highlight that we only need the Jacobian adjustment if we are sampling within a Bayesian framewor. We do not need the Jacobian for MLE, never. -- I still find that a bit hard to understand. -- Something about the fact that in an MLE set up, the parameters are not random variables, therefore they do not suffer from the same distortions that the posterior suffers from. -- @seananderson and @paul-vdb can likely explains things a bit better

I think that the key is that we want the declared parameter (in the par list) to be in the same units as the thing that goes into the sampling statement. If not, then you need the Jacobian adjustment, if you are doing some sort of MS sampling. I am not sure it is clear in this example, but in this case we would declare logsigma in the par list, but put a prior on sigma. i.e., in the sampling statement, the transformed variable would be on the left e.g. sigma ~ halfN(0,1)

Maybe all of that was already implicit in your comment, but just adding for completeness.

Catarina

paul-vdb commented 3 months ago

@catarinawor @nickcfisch

Let's be careful that for Frequentist we NEVER need a Jacobian. If it's a parameter and not a random variable, we do not need a Jacobian. An example in a Frequentist framework we need the Jacobian is when our random effect needs to be transformed (see my code in the first part of the chat).

Consider the case: X is a random variable that goes from 0-infinity. We want to treat X as a random effect in our model and integrate over it using RTMB and Laplace. In the case, $log(X) \sim normal(\mu, \tau)$ it is equivalent to assuming $Y \sim lognormal(\mu, \tau)$. If we look at the log normal density function, you'll see that this distribution actually contains the jacobian, highlighted in yellow. image

Not including the Jacobian might mean you aren't actually using the distribution you think you are for the variable on its original scale so that is where you need to be careful. A uniform distribution on the log scale is very much not uniform on the real scale.

So, the clear answer, if it's a random variable (i.e. has a distribution like a prior) then it needs to have a Jacobian, if it's a fixed parameter it doesn't. Where this gets confusing is that often when we use a prior as a penalty term but are being Frequentists, then this doesn't need to have a Jacobian added, as it's just an arbitrary penalty function on a fixed parameter and not actually a distribution.