statdivlab / corncob

Count Regression for Correlated Observations with the Beta-binomial
102 stars 22 forks source link

Error with nesting of abundance models #112

Closed mirpie closed 3 years ago

mirpie commented 3 years ago

Hello!

First of all, thanks so much for designing such a wonderful package for differential abundance analysis :) I was hoping to get some clarification regarding an error I am encountering with my data

I am using as input a subsetted phyloseq object that upon inspection seems to have the same structure as the subsetted soil_phylo object implemented in the vignette. My code below is attempting to test for differential abundance across Genotype, controlling for the effect of Age on abundance:

ex1 <- differentialTest(formula = ~ Genotype, phi.formula = ~ 1, formula_null = ~ Age, phi.formula_null = ~ 1, data = dat_pr, test = "Wald", boot = FALSE, fdr_cutoff = 0.05) ex1$significant_taxa

However this returns an error as follows:

Error in checkNested(mod, mod_null) : Models for abundance are not nested!

I am unsure what this error is trying to tell me, seeing as Age and Genotype are both variables present in my Sample data and there are members of both genotypes at each age timepoint (nested design).

When I run the code without any additional controls on dispersion or abundance it seems to work, but I want to create a model that can control for the age of the mouse as well as the sex when making a comparison across genotypes.

Any help is appreciated!

bryandmartin commented 3 years ago

Hi @mirpie ,

Thanks for the kind words and detailed write up. Good news: your fix here is a very quick one! To test for differential abundance across genotype while controlling for the effect of Age, you just need:

ex1 <- differentialTest(formula = ~ Genotype + Age, phi.formula = ~ 1, formula_null = ~ Age, phi.formula_null = ~ 1, data = dat_pr, test = "Wald", boot = FALSE, fdr_cutoff = 0.05) ex1$significant_taxa

The only difference is that I added the Age parameter to the formula object. For differentialTest(), if something is in the *_null parameter, it needs to also be included in the main parameter.

Does this make sense? Bryan

mirpie commented 3 years ago

Amazing thank you so much! Yes that makes sense- so the "nested" characteristic essentially requires that the parameter being accounted for as a covariate in the initial model? I suppose my only other confusion now is regarding the theory behind setting up the model. In the vignette it is stated that:

Normally, when testing for differential variability across a covariate, we recommend always controlling for the effect of that covariate on the abundance

Is this to say that if I want to look at differential taxa across genotypes while controlling for other covariates like Age, I should also be controlling for the effects of these covariates on dispersion even if I am not testing for differential dispersion? Furthermore, should I be controlling for the effect of Genotype on dispersion if I want to ensure the taxa that are significant are truly differentially abundant between groups?

Thanks so much again!

bryandmartin commented 3 years ago

@mirpie

Is this to say that if I want to look at differential taxa across genotypes while controlling for other covariates like Age, I should also be controlling for the effects of these covariates on dispersion even if I am not testing for differential dispersion?

Actually, the opposite. If you were interested in differential variability (which it sounds like you are not), then you should also control for the effect of your covariate(s) of interest in differential abundance model. Otherwise, any difference in the mean will likely be absorbed into the variability model, and you will likely have several false positives for differential variability that are actually just measuring the impact of not controlling for the covariate on abundance.

Furthermore, should I be controlling for the effect of Genotype on dispersion if I want to ensure the taxa that are significant are truly differentially abundant between groups?

Not necessarily, the relationship mentioned above doesn't go the other way. i.e. In order to measure differential abundance, you do not need to control for the effect on dispersion. However, you may still want to! Especially if you have reason to believe that Genotype has a relationship with dispersion in general.

Cheers, Bryan

mirpie commented 3 years ago

@bryandmartin

Okay great, thanks for the clarification! In that case, what explains the difference in the output from the corrected code above and the following code:

ex5 <- differentialTest(formula = ~ Genotype + Age , phi.formula = ~ Genotype, formula_null = ~ Age, phi.formula_null = ~ Genotype, data = dat_pr_cecal, test = "Wald", boot = FALSE, fdr_cutoff = 0.05)

This code only returns ~90 differential taxa, while the code chunk from earlier (which does not test or correct for variability) returns ~190 differentially abundant taxa. What exactly is the key difference causing the discrepancy in the output of these models? Is the first model ignoring differences in dispersion and thus returning "false positives", whereas the second is removing the effects of differential dispersion and thus more stringent in a sense? I understand what you said above that you shouldn't have to control for the effects of dispersion to test differential abundance, but given it seems to make such a large difference in this case I would appreciate some insight into why!

Thanks again for all your help!

Best,

Miranda

bryandmartin commented 3 years ago

Hi @mirpie ,

Sorry I missed this one!

The key difference here is whether you model the dispersion as a function of Genotype. In the previous code, you are not, in this code, you are.

It's not that the first model are "false positives" and the latter are correct, it's that the two models make different underlying assumptions about the population. This is where context knowledge is key, because both of these models are perfectly valid statistically. In terms of choosing between them, what you need to decide is do you want to model the dispersion as a function of Genotype or not? In other words, do you have reason to believe the dispersion of any given taxa will vary according to Genotype?

Does this all make sense? The key takeaway is that you shouldn't consider one set of results "right" and the other "wrong". Both are right given the underlying assumptions. At a very high-level these two results you can interpret as telling you that when you allow for dispersion to vary be Genotype, this new parameter absorbs some of what was previously attributed to differential abundance across Genotype. I would want to investigate a few of the taxa that dropped off of the significant list to better understand why, and what the actual abundances look like. You may find, for example, that most of these taxa don't actually appear to be differentially abundant and the new model seems like a better fit.