statdivlab / corncob

Count Regression for Correlated Observations with the Beta-binomial
102 stars 22 forks source link

Use of `offset()` in formulas to specify fixed coefficients #83

Open mikemc opened 4 years ago

mikemc commented 4 years ago

Background: R fitting functions like lm() and nnet::multinom()) support wrapping a term in offset() to indicate that the term should have a fixed coefficient of 1. I would like to use such an offset in corncob in order to implement the "frequency-based" decontamination method of Davis et al 2018.

I have DNA concentrations in a variable called dna_conc in a phyloseq object. If I run

mod <- bbdml(
  formula = ASV1 ~ 1 + offset(-log(dna_conc)),
  phi.formula = ~ 1 + log(dna_conc),
  data = physeq
)
mod_null <- bbdml(
  formula = ASV1 ~ 1,
  phi.formula = ~ 1 + log(dna_conc),
  data = physeq
)

corncob does not complain about the offset terms, but the fitted models are identical in mod and mod_null, which makes me think the offset term is being silently dropped before model fitting. Is that correct, and is there an alternate way to fit a model with a fixed coefficient?

bryandmartin commented 4 years ago

Hi Mike,

Thanks for this issue. I'll need to check how offset works within corncob right now. I definitely will need to do some testing with offset, this is not something I had considered before.

bryandmartin commented 4 years ago

Hi @mikemc ,

I've been reading about this, and actually now I believe that an offset term would not be possible for this model using a logit link. One possibility is to add a different link that will enable an offset, such as a log link, though I'm not sure that would be ideal for this type of data.

A summary of why this doesn't work for logit links is found here. I found this after I was doing some of the math to figure this out and realized it is difficult (or impossible?) to get a constant offset term with a logit link.

Does this make sense to you? Do you agree? If there is an alternative here I'm happy to discuss it but my current understanding is that this won't be possible. I'm going to keep reading about this however because I'm still a bit confused as to how multinom is handling the offset term and would like to understand that better before I make any conclusions for certain.

Bryan

mikemc commented 3 years ago

Thanks for looking into this @bryandmartin. I think I might be asking for something simpler than what you're thinking - I just want to be able to add offset terms to the right-hand-side of Equation 3.9 for g(u_i) in the Corncob manuscript (before the inverse logit is applied), so that g(mu_i) = \beta_0 + X_i^T(\beta) + offset_i, where offset_i is the value of the offset term for sample i. By my understanding, offsets would typically be added like this to the linear equation before the link function is applied and so implementing support for offset() should be independent of which link is used, though the link would change the interpretation of the offset. Perhaps it would help if I give an example for binomial regression with glm() for the use case I'm interested in. Let me know if this doesn't help and it still seems impossible to have something like this with corncob.

Simulate 100 samples sequenced to read depth of 1000 reads, where each sample has a varying concentration, to which a constant concentration of a contaminant ASV of 0.1 is added during sample processing. Suppose that the number of sequenced reads of the contaminant is multinomial given its expected frequency in the sample.

library(tidyverse)
set.seed(1234)
n <- 100
read_depth <- 1e3
tb <- tibble(sample_conc = 10^rnorm(n), contam_conc = 0.1) %>%
  mutate(
    total_conc = sample_conc + contam_conc,
    contam_freq = contam_conc / total_conc,
    contam_count_obs = rbinom(n, read_depth, contam_freq),
    contam_freq_obs = contam_count_obs / read_depth,
    log_sample_conc = log(sample_conc)
  )
# matrix of contam and non-contam counts for glm()
obs_mat <- cbind(tb$contam_count_obs, read_depth - tb$contam_count_obs)

Can see that logit(contam_freq_obs) goes like -log(sample_conc)

qplot(tb$log_sample_conc, qlogis(tb$contam_freq_obs))

Goal is to compare the following two models, each with 1 coefficient,

fit0 <- glm(obs_mat ~ 1, 
  family = binomial(link = "logit"), data = tb)
fit1 <- glm(obs_mat ~ 1 + offset(-log_sample_conc), 
  family = binomial(link = "logit"), data = tb)
deviance(fit0)
deviance(fit1)

Can verify that the offset term worked as expected by comparing the results of predict() with the actual contaminant frequencies,

tb0 <- add_column(tb, predicted = predict(fit1, tb))
tb0 %>%
  ggplot(aes(log_sample_conc)) +
  geom_point(aes(y = qlogis(contam_freq))) +
  geom_point(aes(y = predicted), color = "darkred")