jmsigner / amt


Conf. Int with RSS #64

Closed anjankatna closed 2 years ago

anjankatna commented 2 years ago

Hi, I wanted to clarify the significance/interpretation of the CI values in the log_RSS outputs.

My data has movement data from 5 individuals, and I used the nesting function while preparing the data for the SSF analysis. Do the CI values then represent the individual variation in RSS?

      log_rss           lwr          upr
1  -0.3016729    -0.9272444    0.3238985
2   0.2160458    -0.1960536    0.6281452
3  -1.5873831    -2.7426228   -0.4321434
4   0.2295385    -0.5318209    0.9908980
5 -12.0790214 -2733.3505573 2709.1925144

The data compares different landcover classes with respect to a reference class (grassland).

I am assuming there is no separate method in the amt workflow to account for individual-level effects (e.g., by including a random effect).

jmsigner commented 2 years ago

Unfortunately, there isn't. You can fit a population-level model using the approach of Muff et al. (). But you would have to calculate the log RSS manually. Maybe @bsmity13 has something to add here?
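For reference, the Muff et al. random-effects approach can be fit outside of amt with glmmTMB, using the Poisson-likelihood trick with the variance of the stratum-specific intercepts fixed at a large value. A rough, self-contained sketch on toy used/available data (the column names `case_`, `step_id_`, `id` and all values are made up here; see the paper's supplementary code for the canonical version):

```r
library(glmmTMB)

# Toy data: 100 strata (steps) of 1 used + 9 available points, 10 animals.
# The used point gets a slightly higher elevation so there is some selection.
set.seed(1)
dat <- do.call(rbind, lapply(1:100, function(s) {
  data.frame(step_id_  = s,
             id        = factor((s - 1) %/% 10 + 1),
             case_     = c(1, rep(0, 9)),
             elevation = rnorm(10) + c(0.5, rep(0, 9)))
}))

# Muff et al.: conditional logistic regression as a Poisson GLMM with a
# random intercept per stratum whose SD is fixed at a large value (1e3),
# plus a random slope for individual variation in selection.
m <- glmmTMB(case_ ~ elevation + (1 | step_id_) + (0 + elevation | id),
             family = poisson, data = dat,
             map   = list(theta = factor(c(NA, 1))),  # fix first variance
             start = list(theta = c(log(1e3), 0)))
summary(m)
```

The `map`/`start` arguments fix the first variance parameter (the stratum intercept) at its starting value, which is what makes this equivalent to conditional logistic regression.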

bsmity13 commented 2 years ago

Hi @anjankatna.

Yes, those CIs represent the uncertainty in the parameter estimates for each individual. The uncertainty in the betas is propagated through to uncertainty in log-RSS. And as @jmsigner said, there's no implementation of an approach with random effects in amt.
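For a single linear term, that propagation is easy to verify by hand: log-RSS is beta * (x1 - x2), so its standard error is |x1 - x2| * SE(beta). A minimal sketch on simulated data (made-up numbers, not the exact internals of `log_rss()`):

```r
# Sketch: how uncertainty in a beta becomes uncertainty in log-RSS
# (single covariate; simulated example)
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
m <- glm(y ~ x, family = binomial)

b    <- coef(m)[["x"]]
se_b <- sqrt(vcov(m)["x", "x"])

x1 <- 2; x2 <- 0                  # two hypothetical locations
log_rss <- b * (x1 - x2)          # point estimate
se_lr   <- abs(x1 - x2) * se_b    # SE scales linearly with the contrast
ci <- log_rss + c(-1.96, 1.96) * se_lr
```

With several covariates the same logic applies, but the SE involves the full variance-covariance matrix of the betas rather than a single SE.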

I would add that you can also use a two-step procedure for population-level inference instead of random effects. That is, you can treat the individual coefficients as the response variable in a regression, often with inverse-variance weighting (weights = 1/(std.error^2)) to account for the uncertainty in each individual's estimates.

Here's a very quick example. This is for an HSF, but the same idea works for an iSSF.

# Load packages
library(amt)
library(dplyr)
library(tidyr)  # for nest()
library(purrr)  # for map()
library(broom)  # for tidy()

# Attach example data from 'amt'
data("amt_fisher")
data("amt_fisher_covar")

# Use nested data.frame to handle individuals
dat <- amt_fisher %>% 
  nest(trk = x_:t_) 

# Fit HSF to each individual's track
mods <- dat %>% 
  mutate(hsf = map(trk, function(xx) {
    xx %>% 
      random_points() %>% 
      extract_covariates(amt_fisher_covar$elevation) %>% 
      extract_covariates(amt_fisher_covar$popden) %>% 
      # Give available points large weights when fitting HSF (not iSSF)
      mutate(weight = case_when(
        case_ ~ 1,
        !case_ ~ 1e5
      )) %>% 
      glm(formula = case_ ~ elevation + popden, data = ., family = binomial,
          weights = weight)
  }))

# Extract coefficients from each fitted model
covs <- mods %>%
  mutate(coef = map(hsf, tidy)) %>% 
  select(sex, id, name, coef) %>% 
  unnest(cols = coef) %>% 
  filter(term != "(Intercept)") %>% 
  mutate(inv_var = 1/(std.error^2))

# Model population-level elevation coefficient
# Intercept-only model = population (weighted) mean
elev <- lm(estimate ~ 1, data = covs, subset = term == "elevation",
           weights = inv_var)
summary(elev)

# Model population-level popden coefficient
# Intercept-only model = population (weighted) mean
popden <- lm(estimate ~ 1, data = covs, subset = term == "popden",
           weights = inv_var)
summary(popden)

# Now you need to calculate RSS by hand
# log-RSS is simply the difference in linear predictor between x1 and x2
x1 <- data.frame(elev = 120, popden = 1000)
x2 <- data.frame(elev = 80, popden = 500)

# Linear predictors (sometimes called g(x))
#   Note that this short code snippet only works with the intercept-only
#   models. Otherwise, you need to predict() the coefficients for
#   elev/popden given their covariates.
g1 <- coef(elev) * x1$elev + coef(popden) * x1$popden
g2 <- coef(elev) * x2$elev + coef(popden) * x2$popden

# log-RSS(x1, x2)
g1 - g2

# RSS
exp(g1 - g2)

# Note that this approach is also flexible enough to include covariates.
# E.g., do males select for elevation differently than females?
elev_sex <- lm(estimate ~ sex, data = covs, subset = term == "elevation",
           weights = inv_var)
summary(elev_sex)

# Now the beta for elevation is a function of sex.
predict(elev_sex, newdata = data.frame(sex = c("F", "M")))

Note that the best way to get CIs for log-RSS in this case is bootstrapping. Since you model the coefficients with separate linear models, you aren't estimating the covariance between them, so you can't use just the SEs to construct a confidence interval.
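To see why the covariance matters: log-RSS is a linear contrast of the betas, so Var(g1 - g2) = d' S d, where d = x1 - x2 and S is the covariance matrix of the coefficients. Separate lm() fits give you the diagonal of S but not the off-diagonal terms. For contrast, here is a sketch of the delta-method CI when you *do* have the full vcov from a single joint fit (simulated data, made-up covariate values):

```r
# Sketch: delta-method CI for log-RSS from one fitted glm,
# where vcov() supplies the full covariance matrix of the betas
set.seed(1)
dat <- data.frame(elev = rnorm(200), popden = rnorm(200))
dat$case_ <- rbinom(200, 1, plogis(0.8 * dat$elev - 0.3 * dat$popden))
m <- glm(case_ ~ elev + popden, data = dat, family = binomial)

d <- c(0, 120 - 80, 1000 - 500)           # x1 - x2 (intercept slot = 0)
log_rss <- sum(coef(m) * d)               # point estimate
se <- sqrt(t(d) %*% vcov(m) %*% d)[1, 1]  # uses the full covariance
ci <- log_rss + c(-1.96, 1.96) * se
```

In the two-step workflow that off-diagonal information isn't available, which is why bootstrapping is the safer route.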

Hope the brevity isn't misleading; happy to answer any follow-up questions. Hope that helps!

Imthiazz commented 2 years ago

Hello, @bsmity13.

I'm asking here because my question also concerns log-RSS and confidence intervals.

In my study, I used the two-step technique (iSSF) to calculate the coefficients for population-level inference. I was able to obtain the log-RSS manually, but I'm struggling with how to determine the confidence interval.

Since you indicate that it's better to use bootstrapping to get CIs with the two-step technique, I would be grateful if you could share a bit of code that demonstrates how to implement it.

Thanks in advance.

bsmity13 commented 2 years ago

Hi, @Imthiazz.

Here's a quick demo. I mostly followed the example above through fitting the models. From there, we start with the bootstrap.

I am using a parametric bootstrap, where we resample the coefficients from each individual model from a multivariate normal distribution with mean given by the fitted coefficients and Sigma given by the variance-covariance matrix from each model. A similar approach would be possible with some kind of empirical bootstrap where you resample the data with replacement.

For each iteration, we calculate the mean population-level coefficient. Notice I dropped the inverse-variance weighting because the bootstrapping will take care of that parametric uncertainty for us. The "statistic of interest" is log-RSS, but we don't have to calculate it inside our function (we could) because it's a deterministic function of the mean coefficient that we are resampling. So that means we can just calculate log-RSS after the bootstrapping is done.

Note that the mean of the bootstrap distribution is not the same as the maximum likelihood estimate of log-RSS. We still want to keep our MLE, so we use the bootstrap distribution to estimate the size of the CI, not the actual bounds. Then we apply that size to the MLE to get our confidence interval.

Here's the code.

# Load packages
library(amt)
library(dplyr)
library(tidyr)   # to pivot longer
library(purrr)   # for map()
library(mvtnorm) # for multivariate normal random number generator

# Load sample data
dat <- amt_fisher
hab <- amt_fisher_covar

# Fit individual models
mods <- dat %>% 
  nest(trk = x_:t_) %>% 
  mutate(hsf = map(trk, function(xx) {
    xx %>% 
      random_points() %>% 
      extract_covariates(amt_fisher_covar$elevation) %>% 
      extract_covariates(amt_fisher_covar$popden) %>% 
      # Give available points large weights when fitting HSF (not iSSF)
      mutate(weight = case_when(
        case_ ~ 1,
        !case_ ~ 1e5
      )) %>% 
      glm(formula = case_ ~ elevation + popden, data = ., family = binomial,
          weights = weight)
  }))

## Basically the same as above until here; now we write a function we can use
## when bootstrapping. There are probably countless ways to go about this
## that would all give identical results. This is one example.

# Function to return population-level coefficients for popden and elev
# We will pass it the list of individual models so we can access the
# coefficients and variance-covariance matrix from each model object.

hsa_coefs <- function(model_list) {

  # Resample the individual coefficients from each model
  new_coef_list <- lapply(model_list, function(l) {
    # Point estimate
    b <- coef(l)
    # Variance-covariance matrix
    S <- vcov(l)
    # Return the resampled coefficients
    mvtnorm::rmvnorm(n = 1, mean = b, sigma = S)
  })

  # Combine in a data.frame
  new_coef <- as.data.frame(do.call(rbind, new_coef_list)) %>% 
    pivot_longer(everything(), names_to = "term", values_to = "estimate")

  # Model population-level elevation coefficient
  # Dropping the weights because bootstrapping is propagating the uncertainty
  # Intercept-only model = population mean
  # Would be equivalent to use mean(), but this allows addition of covariates
  elev <- lm(estimate ~ 1, data = new_coef, subset = term == "elevation")

  # Model population-level popden coefficient
  # Dropping the weights because bootstrapping is propagating the uncertainty
  # Intercept-only model = population mean
  # Would be equivalent to use mean(), but this allows addition of covariates
  popden <- lm(estimate ~ 1, data = new_coef, subset = term == "popden")

  # Return coefficients in data.frame
  return(data.frame(elev = coef(elev)[[1]], popden = coef(popden)[[1]]))
}

# (If you wanted to, you could calculate log-RSS inside that function)

# Now bootstrap to get many sets of new coefficients.
# Using 100 iterations for speed, but you probably want 2,000+ for inference
set.seed(1)
boot <- lapply(1:100, function(i) {
  df <- hsa_coefs(mods$hsf)
  df$iter <- i
  return(df)
}) %>% 
  bind_rows()

# Now that we're done sampling, we can calculate log-RSS by hand
x1 <- data.frame(elev = 120, popden = 1000)
x2 <- data.frame(elev = 80, popden = 500)

# Still calculate the point estimate for log-RSS using the original model fits.
# 'elev' and 'popden' are the weighted, intercept-only population-level models
# from the earlier comment; refit them here so this script is self-contained.
covs <- mods %>%
  mutate(coef = purrr::map(hsf, broom::tidy)) %>% 
  select(id, coef) %>% 
  unnest(cols = coef) %>% 
  filter(term != "(Intercept)") %>% 
  mutate(inv_var = 1/(std.error^2))

elev <- lm(estimate ~ 1, data = covs, subset = term == "elevation",
           weights = inv_var)
popden <- lm(estimate ~ 1, data = covs, subset = term == "popden",
             weights = inv_var)

# Linear predictors (sometimes called g(x))
g1 <- coef(elev)[[1]] * x1$elev + coef(popden)[[1]] * x1$popden
g2 <- coef(elev)[[1]] * x2$elev + coef(popden)[[1]] * x2$popden

# log-RSS
log_rss <- g1 - g2

# Now get confidence interval from bootstrap sample

# Linear predictors for bootstrap iterations
#   Note that this short code snippet only works with nrow(x1) == 1
#   Otherwise, iterate over rows of x1.
boot$g1 <- boot$elev * x1$elev + boot$popden * x1$popden
boot$g2 <- boot$elev * x2$elev + boot$popden * x2$popden

# log-RSS for bootstrap iterations
boot$log_rss <- boot$g1 - boot$g2

# Mean of bootstrap sample
mean_boot <- mean(boot$log_rss)

# For 95% confidence interval:
# 97.5th quantile
upr_boot <- quantile(boot$log_rss, 0.975)
# 2.5th quantile
lwr_boot <- quantile(boot$log_rss, 0.025)

# Distance from bootstrap mean to lwr/upr bounds gives *size* of the CI
# (Note that mean_boot != log_rss)
upr_dist <- upr_boot - mean_boot
lwr_dist <- lwr_boot - mean_boot

# Construct confidence interval by adding distances to MLE ('log_rss')
c(log_rss + lwr_dist, 
  est = log_rss,
  log_rss + upr_dist)

Hope it helps!