Closed tappituffi closed 5 months ago
Hi Niklas,
Thanks for your interest in radEmu! I'm going to take your questions in order in a few separate comments.
To address parallelization, it looks like you have perhaps taken a look at the parallelization vignette, but if not, that is a great place to start.
Here's the function defined to test a single null in that vignette:
```r
emuTest <- function(category) {
  score_res <- emuFit(formula = ~ Group,
                      data = wirbel_sample[ch_study_obs, ],
                      fitted_model = ch_fit,
                      refit = FALSE,
                      test_kj = data.frame(k = covariate_to_test,
                                           j = category),
                      Y = as.matrix(wirbel_otu[ch_study_obs, restricted_mOTU_names]))
  return(score_res)
}
```
The `test_kj` argument is very important -- it specifies which test you are running. If you don't pass `test_kj` to `emuFit`, by default `emuFit` will run tests on all parameters (save the intercepts), so my guess is that your parallel code may be attempting to test hypotheses for all 359 taxa you've included in your model in each parallel instance. That will take a long time!
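To run one test per parallel task instead, you can map a single-test function like `emuTest` over your taxon names. Here is a minimal sketch of that pattern using base R's `parallel` package; `slow_test` and `categories` are stand-ins (in your own session you would use `emuTest` and your taxon names, e.g. `restricted_mOTU_names` from the vignette):

```r
library(parallel)

## Sketch of parallelizing per-category score tests.
## `slow_test` is a stand-in for emuTest; it just returns a dummy result.
slow_test <- function(category) {
  list(category = category, p = runif(1))
}
categories <- paste0("taxon_", 1:8)

## parLapply works on all platforms (unlike fork-based mclapply,
## which is unavailable on Windows).
cl <- makeCluster(2)
score_results <- parLapply(cl, categories, slow_test)
stopCluster(cl)
```

Each element of `score_results` then holds the result for one taxon, so each worker only ever runs a single test at a time.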
Continuing to question 2: radEmu can absolutely handle continuous predictors. I can't tell you what may have happened without more information about the error you received. If you're able to provide a minimal reproducible example in which this error occurs, that would be ideal -- thanks!
It should not be a problem that some of your study participants have constant values of your predictor -- clustering is taken into account in hypothesis testing (after the model is fit).
When you first fit your model, I recommend passing `run_score_tests = FALSE` to `emuFit` -- otherwise, `emuFit` will attempt to run score tests (for all of your parameters, not just PN) sequentially, which could potentially take a very long time.

I would need to know more about your pathway data and how it was generated to answer your question about whether radEmu is appropriate.
```r
ch_fit <- emuFit(formula = ~ timepoint*intervention,
                 data = [[your covariate data here]],
                 Y = [[your sequencing count data here]],
                 cluster = [[vector of your study participant ids identifying which sample came from which participant]],
                 run_score_tests = FALSE)
```
Here, `timepoint` would be a categorical variable giving the timing (baseline or post-intervention) at which each sample was taken. (I've set `run_score_tests = FALSE` so you can run the tests later in parallel to save time, not because it's not a good idea to run them.)
Hope this is helpful!
Hi,
Thanks for all the responses. I really appreciate it.
> Continuing to question 2: radEmu can absolutely handle continuous predictors. I can't tell you what may have happened without more information about the error you received. If you're able to provide a minimal reproducible example in which this error occurs, that would be ideal -- thanks!
>
> It should not be a problem that some of your study participants have constant values of your predictor -- clustering is taken into account in hypothesis testing (after the model is fit).
-> This is the error I got (returned as a `try-error` object):

```
Error in .solve.checkCondBound(a@U, tol) :
  'a' is computationally singular, min(d)/max(d)=1.72472e-109, d=abs(diag(U))
```
So, for some taxa (categories), the calculation failed. Maybe it suggests that for those taxa I shouldn't include the clustering (i.e. not account for samples from the same subject)?
> I highly recommend you use raw read counts and not read relative abundances. radEmu will give you a result if you use read relative abundances, but you lose potentially important information about precision in this case.
>
> I would need to know more about your pathway data and how it was generated to answer your question about whether radEmu is appropriate.
-> So when I use the raw counts, would I include the total raw counts (sequencing depths) as a covariate in the formula, or does emuFit account for this automatically?
For the pathway question: we used the humann3 pipeline to derive pathway information for our shotgun sequencing reads (https://github.com/biobakery/biobakery/wiki/humann3, see "3. Manipulating HUMAnN output tables"). The result is a data table with RPKM values for each pathway, so those reads are normalized and no longer raw counts.
For 6., oh yes, that should work. Thanks!
One follow-up question on `run_score_tests`: this gives me just the p-values, and I need to adjust for multiple testing separately, correct?
Thank you very much in advance and best, Niklas
Hi @tappituffi !
I agree with @ailurophilia on all of the above, including to use raw counts rather than proportions. I definitely recommend that you do not adjust for sequencing depth as a covariate – indeed the model accounts for variations in depth automatically 😇
Confirming that the p-values are unadjusted. If you'd like false discovery rate or family-wise error rate control, you'll have to adjust for it yourself.
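In base R, for example, Benjamini-Hochberg (FDR) adjustment can be done with `p.adjust`; `pvals` below is a hypothetical stand-in for the p-value column of your score-test results:

```r
## Example: Benjamini-Hochberg adjustment of score-test p-values.
## `pvals` stands in for the p-value column of your emuFit results.
pvals <- c(0.001, 0.012, 0.030, 0.210, 0.740)
adjusted <- p.adjust(pvals, method = "BH")
adjusted
# [1] 0.0050 0.0300 0.0500 0.2625 0.7400
```

`p.adjust` also supports other methods (e.g. `"holm"` or `"bonferroni"`) if you prefer family-wise error rate control.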
I have no idea about the error you found. If you're able to strip your code & data down to the bare minimum needed to reproduce this error (even just once), we can debug it for you. You're welcome to email me your stripped-down code and data at adwillis@uw.edu and we'll see what we can do.
The error caused by running

```r
all_fit <- emuFit(formula = ~ pred_5, cluster = meta$num_sub,
                  data = meta, Y = test, run_score_tests = TRUE)
```

is addressed in PR #51. There is an uninvertible information matrix for the 5th column of the test matrix, causing an NA score p-value for that category, but the other categories can be run with the fix implemented in the PR.
Thanks so much @svteichman ! I’ll try to review all the PRs this afternoon/night.
In the meantime, @tappituffi , you can pull the PR and work from there.
Hi statdivlab,
With great interest, I followed the recent seminar on the EMBL platform, "Statistical thinking of microbial ecology". First of all, thank you all very much for developing tools for microbiome research and also creating vignettes and tutorials. That's awesome, and we need more of it :)
I have a couple of questions and issues I ran into when trying to use radEmu:
1) I use my MacBook Air with an M2 chip and the most basic setup (8 GB RAM, 8-core). When I run emuFit on my dataset, it takes ages to calculate.
the code I use:
- PN_cat: categorical predictor with low/high based on a continuous measure
- cluster: a number for each subject of the study, to account for correlated data (i.e. the same column that I would use to set a random intercept for subject)
- DA_meta: metadata table with samples in rows
- filtered_data: 359 species as counts in columns and 69 samples in rows

In total, I have 69 samples coming from 18 subjects (it is a longitudinal study) and 359 species for which I want to do differential abundance analysis. I want to find species that are higher in abundance in samples with PN_cat = high compared to PN_cat = low, and account for samples coming from the same subject.
When I tried to parallelize, I used this code:

First, I set ncores to 6; after waiting 2h I set it to 5, then 4, etc. Whenever I tested the parallel calculation, it did not work properly. I see in the activity monitor that it works, but the calculation is way slower than when I don't use parallel calculation. I tested it with a subset of my data (only 10 species and 40 samples). So I guess my question is: is there anything wrong with my code, or is it just my laptop? I am not a bioinformatician, so it could very well be that I am missing something.
2) I had to categorise my original continuous measure PN (which goes from 0 to 100) into a categorical variable of low/high (PN_cat) in order to run emuFit on my data. The reason was that emuFit gave me an error when I used PN as a continuous predictor. The error had something to do with a too-small distance (another error I got: `Error in .local(x, ...) : LU factorization of .gCMatrix failed: out of memory or near-singular`). So my questions are: can emuFit be used with a continuous predictor, or only with categorical ones? Could the issue be that I have some subjects that have the same PN throughout the study (for example, subject 2 has PN values of only 100)?
3) Does computing time increase when I add covariates in my model? I want to include two categorical variables with 2 and 3-levels and I am not sure if this will increase my computational time considerably.
4) How much sparsity can emuFit handle? My dataset is very sparse: the highest prevalence for a species is 80%, the next has only a prevalence of 50%, and then it goes down to 20%. I guess the method does not work with this many zeros?
5) For my species table, do I have to use the raw counts, or could I also use relative abundances instead? Also, would you advise putting the total counts within a sample as a covariate in the model, or does emuFit consider this automatically? I also have pathway information as counts per million (CPM) values, and I wondered if I could use emuFit on this data as well? This would be better, as I don't have many zeros for the pathway data. I have around 400 pathways in my dataframe.
6) I have a second study which investigates the effect of a dietary intervention on the microbiome. However, the trial was not randomized, so I want to include the baseline value as a covariate in my models. For differential abundance, I would use something like `lm(clr_transformed_species_endline ~ intervention + clr_transformed_species_baseline)`, write a loop over the individual species, and then adjust the p-values with Benjamini-Hochberg.

I wonder how I could code this for emuFit (i.e. how to incorporate my baseline counts in the formula for emuFit, as this would be a matrix)? As far as I understood, emuFit needs the whole dataset for its calculation, or can I test each individual species separately?
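For reference, the per-species baseline-adjusted lm loop described in question 6 could be sketched like this in base R. This is only an illustration of that approach, not an emuFit workflow; `clr_end`, `clr_base`, and `intervention` are hypothetical stand-ins filled with simulated data:

```r
## Sketch of a per-species baseline-adjusted lm loop with BH adjustment.
## clr_end / clr_base: hypothetical (samples x species) matrices of
## clr-transformed abundances at endline / baseline.
set.seed(1)
n_samp <- 20; n_spec <- 5
clr_end  <- matrix(rnorm(n_samp * n_spec), n_samp, n_spec)
clr_base <- matrix(rnorm(n_samp * n_spec), n_samp, n_spec)
intervention <- factor(rep(c("control", "diet"), each = n_samp / 2))

# One lm per species; extract the p-value for the intervention effect.
pvals <- vapply(seq_len(n_spec), function(j) {
  fit <- lm(clr_end[, j] ~ intervention + clr_base[, j])
  summary(fit)$coefficients["interventiondiet", "Pr(>|t|)"]
}, numeric(1))

# Benjamini-Hochberg adjustment across species.
padj <- p.adjust(pvals, method = "BH")
```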
Let me know if I need to clarify my questions or what would help you in order to answer them. Thank you very much in advance and looking forward to your response.
Best, Niklas