statdivlab / corncob

Count Regression for Correlated Observations with the Beta-binomial
102 stars 22 forks source link

Model failing to converge -- small sample size #148

Closed cbahern57 closed 10 months ago

cbahern57 commented 1 year ago

Hello!

I am trying to use corncob to look at 16S differential abundance in compost samples that have been exposed to different kinds of substrates. I have an NMDS plot that looks like this:

qiimeforumnmds

I am most interested in looking at the difference between Group 6 Day 90 (pink triangles) and Groups 2 Day 90, 4 Day 90, and 5 Day 90. This is because Group 6 represents a substrate that showed low levels of biodegradation in compost, whereas Groups 4 and 5 represent substrates that did not biodegrade at all (Group 2 represents compost with no substrate added). Ideally, I would like to see comparisons of each of these groups to each other; however, I think that my small sample size (3 replicates per group) is causing issues with the model.

I created a phyloseq object named "pswd" from my QIIME2 output files. pswd@sam_data gives me information about the Substrate (1-6) and Day (0, 90) for each sample that I tested.

pswd <- qza_to_phyloseq(features="Analysis2/16S-table-noplant-rarefied-10000_filtered.qza",
                        tree="Analysis2/rooted-16S-tree-filteredSVs.qza", 
                        taxonomy = "Analysis2/16S-rep-seqs-taxonomy.qza", 
                        metadata= "Analysis2/metadata2.txt")

pswd_corncobgenus <- pswd %>% tax_glom("Genus")

To run corncob, I set the following conditions:

set.seed(1)
difftest_genus <- differentialTest(formula = ~ Substrate*Day, 
                                   formula_null = ~ Day, 
                                   phi_formula = ~ Substrate*Day, 
                                   phi.formula_null = ~ Substrate*Day, 
                                   data = pswd_corncobgenus, 
                                   test = "LRT", 
                                   boot = FALSE, 
                                   fdr_cutoff = 0.05)

However, I received the following error when it ran:

Error in differentialTest(formula = ~Day, formula_null = ~1, phi_formula = ~1,  : 
  All models failed to converge! 

           If you are seeing this, it is likely that your model is overspecified. This occurs when your sample size is not large enough to estimate all the parameters of your model. This is most commonly due to categorical variables that include many categories. 

           Alternatively, double-check your values for the arguments `link`, `phi.link`, and `method` to makes sure that they follow the specified options. 

           To confirm you have fixed the issue, try running a model for a single taxon with bbdml.

I tried to ease the conditions I set for the test:

set.seed(1)
difftest_genus <- differentialTest(formula = ~ Substrate, 
                                   formula_null = ~ 1, 
                                   phi_formula = ~ 1, 
                                   phi.formula_null = ~ 1, 
                                   data = pswd_corncobgenus, 
                                   test = "LRT", 
                                   boot = FALSE, 
                                   fdr_cutoff = 0.05)

However, the same error message appears.

Is this an issue solely because I only have 3 samples per group? Is there any way to circumnavigate this? Thank you for any help you can provide!

Stats: corncob version: 0.3.1 R version: 4.2.0 (2022-04-22)

adw96 commented 11 months ago

Hi @cbahern57 -- thank you for the detailed issue, which allowed me to diagnose the issue fairly easily.

You have misspelled the argument phi.formula as phi_formula, which is causing all models to fail. I believe everything should work as intended once this is corrected.

adw96 commented 11 months ago

@svteichman - would you be able to add some sort of check to flag an issue like this? I expect that because arguments from differentialTest are passed straight to bbdml, and they all fail, the error message that bbdml would normally return to help the user debug is bypassed.

I leave it to you how best to implement this check -- but two options that occur to me include an explicit name-matching check (less general), or a closer look at the error message for bbdml (which I believe should distinguish between overparametrization, and argument-type errors)

Another option is to return to the user a single bbdml error output along with the "All models failed to converge!" doomsday message 😸 This may have helped @cbahern57 find the issue more easily

adw96 commented 11 months ago

Oh, and I wrote a MWE for you, @svteichman --

set.seed(1)
ncat <- 10
nreps <- 3
nn <- ncat*nreps
ww <- data.frame("tax1" = rpois(nn, 3000), "tax2" = rpois(nn, 3000), "tax3" = rpois(nn, 3000))
df <- data.frame("Substrate" = rep(paste("subs", 1:ncat), nreps))

ps <- phyloseq(otu_table(ww, taxa_are_rows=FALSE), 
               sample_data(df))

## fails in a hard-to-diagnose way
difftest_genus <- differentialTest(formula = ~ Substrate, 
                                   formula_null = ~ 1, 
                                   phi_formula = ~ 1, 
                                   phi.formula_null = ~ 1, 
                                   data = ps, 
                                   test = "LRT", 
                                   boot = FALSE, 
                                   fdr_cutoff = 0.05)

difftest_genus <- differentialTest(formula = ~ Substrate, 
                                   formula_null = ~ 1, 
                                   phi.formula = ~ 1, ### note!
                                   phi.formula_null = ~ 1, 
                                   data = ps, 
                                   test = "LRT", 
                                   boot = FALSE, 
                                   fdr_cutoff = 0.05)
svteichman commented 10 months ago

I'm closing this issue because pull request #151 has been merged.