statdivlab / radEmu

Other
49 stars 8 forks source link

Too many questions... :) #49

Closed tappituffi closed 5 months ago

tappituffi commented 5 months ago

Hi statdivlab,

With great interest, I followed the recent seminar on the EMBL platform: Statistical thinking of microbial ecology First of all, thank you all very much for developing tools for microbiome research and also creating vignettes and tutorials. That's awesome, and we need more of it :)

I have a couple of questions and issues I ran into when trying to use radEmu:

1) I use my MacBook Air with an M2 chip and the most basic setup (8Gb RAM, 8-core). -> when I run the emuFit on my dataset it takes ages to calculate

the code I use:

all_fit <- emuFit(formula = ~ PN_cat ,
                  cluster = DA_meta$num_sub,
                  data = DA_meta, 
                  Y = filtered_data)

PN_cat: categorical predictor with low/high based on a continuous measure cluster: number for each subject of the study to account for correlated data -> i.e. the same column that I would use to set a random intercept for subject DA_meta: Meta data table with samples in rows filtered_data: 359 species as counts in columns and 69 samples in rows

-> in total, I have 69 samples coming from 18 subjects (it is a longitudinal study) and 359 species for which I want to do differential abundance analysis -> I want to find species that are higher in abundance in samples with PN_cat = high compared to PN_cat = low and account for samples coming from the same subject

-> when I tried to parallelize, I used this code:

emuTest <- function(category) {
  score_res <- emuFit(formula = ~ PN_cat ,
                  cluster = DA_meta$num_sub,
                  data = DA_meta, 
                  Y = filtered_data)  
return(score_res)
}

if (.Platform$OS.type != "windows") {
  # run if we are on a Mac or Linux machine
  score_res <- mclapply(1:length(filtered_data),
                      emuTest,
                      mc.cores = ncores)
} else {
  # don't run if we are on a Windows machine
  score_res <- NULL
}

-> first, I set the ncores to 6, after waiting 2h I set it to 5, then 4, etc. Whenever I tested the parallel calculation I did not work properly. I see in the activity monitor that it works, but the calculation is way slower than when I don't use paralllel calculation. I tested it with a subset of my data (only 10 species and 40 samples). -> so I guess my question is: is there anything wrong with my code or is it just my laptop? I am not a bioinformatician, so it could very well be that I am missing something

2) I had to categorise my original continuous measure PN (goes from 0 to 100) to a categorical variable of low/high (PN_cat) in order to run emuFit for my data. The reason was that the emuFit gave me an error when I used PN as a continuous predictor. The error had something to do with a too small distance (another error I got: Error in .local(x, ...) : LU factorization of .gCMatrix failed: out of memory or near-singular). -> so my questions are: can emuFit be used with a continuous predictor or only for categorical? -> could the issue be that I have some subjects that have the same PN throughout the study (for example, subject 2 has PN values of only 100)

3) Does computing time increase when I add covariates in my model? I want to include two categorical variables with 2 and 3-levels and I am not sure if this will increase my computational time considerably.

4) How much sparsity can emuFit handle? My dataset is very sparse. The highest prevalence for a species is 80%, the next has only a prevalence of 50% and then it goes down to 20%. I guess the method does not work with this high amount of 0s?

5) For my species table, do I have to use the raw counts or could I also use the relative abundance instead? Also, would you advice to put the total counts within a sample as a covariate in the model or does emuFit considers this automatically? I also have pathway information as counts per million (CPM) values and I wondered if I could use emuFit on this data as well? This would be better as I don't have many 0s for the pathway data. I have around 400 pathways in my dataframe.

6) I have a second study which investigates the effect of a dietary intervention on the microbiome. However, the trial was not randomized so I want to include the baseline value as a covariate in my models. So, for differential abundance, I would use something like this: lm( clr-transformed-species_endline ~ intervention + clr-transformed-species_baseline) -> I would then write a loop for each individual species and then adjust p-values with Benjamini Hochberg

-> I wonder how I could code this for emuFit (i.e. how to incorporate my baseline counts in the formula for emuFit as this would be a matrix)? -> As far as I understood, emuFit needs the whole dataset for its calculation, or can I test each individual species separately?

Let me know if I need to clarify my questions or what would help you in order to answer them. Thank you very much in advance and looking forward to your response.

Best, Niklas

ailurophilia commented 5 months ago

Hi Niklas,

Thanks for your interest in radEmu! I'm going to take your questions in order in a few separate comments.

To address parallelization, it looks like you have perhaps taken a look at the parallelization vignette, but if not, that is a great place to start.

Here's the function defined to test a single null in that vignette:

emuTest <- function(category) {
   score_res <- emuFit(formula = ~ Group,
                        data = wirbel_sample[ch_study_obs, ],
                        fitted_model = ch_fit,
                        refit = FALSE,
                        test_kj = data.frame(k = covariate_to_test, 
                                             j = category), 
                        Y = as.matrix(wirbel_otu[ch_study_obs, restricted_mOTU_names]))
   return(score_res)
 }

The test_kj argument is very important -- it specifies what test you are running. If you don't pass test_kj to emuFit, by default emuFit will run tests on all parameters (save the intercepts), so my guess is that your parallel code may be attempting to test hypotheses for all 359 taxa you've included in your model in each parallel instance. That will take a long time!

ailurophilia commented 5 months ago

Continuing to question 2: radEmu can absolutely handle continuous predictors. I can't tell you what may have happened without more information about the error you received. If you're able to provide a minimal reproducible example in which this error occurs, that would be ideal -- thanks!

It should not be a problem that some of your study participants have constant values of your predictor -- clustering is taken into account in hypothesis testing (after the model is fit).

ailurophilia commented 5 months ago
  1. Yes, compute time will increase because the computational task is harder with more parameters in the model, but I wouldn't expect that fitting the main model (without score tests, which you can run in parallel) would take a prohibitively long amount of time in this case. Make sure, though that you specify not to run score tests when you fit the main model by passing run_score_tests = FALSE to emuFit -- otherwise, emuFit will attempt to run score tests (for all of your parameters, not just PN) sequentially, which could take a very long time potentially.
ailurophilia commented 5 months ago
  1. radEmu can handle a very high degree of sparsity -- the estimator is guaranteed to exist even when a taxon is not observed at all in one of your groups. With that said, if you don't observe a taxon at all in any group, I wouldn't include it in analysis.
ailurophilia commented 5 months ago
  1. I highly recommend you use raw read counts and not read relative abundances. radEmu will give you a result if you use read relative abundances, but you lose potentially important information about precision in this case.

I would need to know more about your pathway data and how it was generated to answer your question about whether radEmu is appropriate.

ailurophilia commented 5 months ago
  1. My recommendation in this situation would be to fit a radEmu model to your baseline and post-intervention data (i.e., for each participant, include a row of Y for baseline measurement and another row for post-intervention measurement -- I'm assuming each participant gets only a single intervention). You could then model change from baseline with something along the lines of
ch_fit <- [emuFit(formula = ~ timepoint*intervention, 
                 data = [[your covariate data here]],
                 Y = [[your sequencing count data here]],
                 cluster = [[vector with your study participant ids in it identifying which sample came from which participant]]], 
                 run_score_tests = FALSE)

Here, timepoint would be a categorical variable giving the timing (baseline or post-intervention) that each sample was taken. (I've set run_score_tests = FALSE so you can run them later in parallel to save time, not because it's not a good idea to run them.)

ailurophilia commented 5 months ago

Hope this is helpful!

tappituffi commented 5 months ago

Hi,

Thanks for all the responses. I really appreciate it.

Continuing to question 2: radEmu can absolutely handle continuous predictors. I can't tell you what may have happened without more information about the error you received. If you're able to provide a minimal reproducible example in which this error occurs, that would be ideal -- thanks!

It should not be a problem that some of your study participants have constant values of your predictor -- clustering is taken into account in hypothesis testing (after the model is fit).

-> this is the error I got: "Error in .solve.checkCondBound(a@U, tol) : \n 'a' is computationally singular, min(d)/max(d)=1.72472e-109, d=abs(diag(U))\n" attr(,"class") [1] "try-error" attr(,"condition") <simpleError in .solve.checkCondBound(a@U, tol): 'a' is computationally singular, min(d)/max(d)=1.72472e-109, d=abs(diag(U))>

So, for some taxa (categories), the calculation failed. Maybe it suggests that for those taxa I shouldn't include the clustering (i.e. not account for samples from the same subject)?


  1. I highly recommend you use raw read counts and not read relative abundances. radEmu will give you a result if you use read relative abundances, but you lose potentially important information about precision in this case.

I would need to know more about your pathway data and how it was generated to answer your question about whether radEmu is appropriate.

-> so when I use the raw counts, would I have the total raw counts (sequencing depths) as a covariate in the formula or does emuFit account for it automically?

For the pathway question: We used the humann3 pipeline to derive pathway information for our shotgun seq reads. https://github.com/biobakery/biobakery/wiki/humann3 -> see: 3. Manipulating HUMAnN output tables -> the result of this is a datatable with RPKM for each of the pathways -> so those reads would be normalized and not the raw counts anymore


For 6., oh yes, that should work. Thanks!

One follow-up question on the run_score_tests -> this gives me just the p-value and I need to adjust for multiple testing separately, correct?

Thank you very much in advance and best, Niklas

adw96 commented 5 months ago

Hi @tappituffi !

I agree with @ailurophilia on all of the above, including to use raw counts rather than proportions. I definitely recommend that you do not adjust for sequencing depth as a covariate – indeed the model accounts for variations in depth automatically 😇

Confirming p-values are unadjusted. If you'd like to provide false discovery or family wise error rate control, you'll have to adjust for them yourself.

I have no idea about the error you found. If you're able to strip your code & data down to the bare minimum needed to reproduce this error (even just once), we can debug it for you. You're welcome to email me your stripped-down code and data at adwillis@uw.edu and we'll see what we can do.

svteichman commented 5 months ago

Code to address the error caused from running all_fit <- emuFit(formula = ~ pred_5, cluster = meta$num_sub, data = meta, Y = test, run_score_tests = T) is addressed in PR #51. There is an uninvertible information matrix for the 5th column of the test matrix causing a NA score p-value for that category, but other categories can be run with the fix implemented in the PR.

adw96 commented 5 months ago

Thanks so much @svteichman ! I’ll try to review all the PRs this afternoon/night.

In the meantime, @tappituffi , you can pull the PR and work from there.