Christiane,
We do plan on adding more complex designs to ALDEx2, but we are constrained by manpower. If you have a particular analysis in mind, I could see whether we can incorporate it into the codebase. We just need to get the appropriate data out of the aldex.clr object.
g
On Oct 26, 2015, at 2:06 PM, Christiane Hassenrück notifications@github.com wrote:
Hi, I was wondering whether you are planning to include the option to work with more complicated glm models in aldex.glm, e.g. two factors, interaction terms, or numerical covariates? I have tried to find my own solution for this using anova() with test = "F" instead of drop1() to get the p-values for each term in the model, and adjusting the p-values per term over all genes. Do you think this approach is viable?
Thanks a lot!
Cheers, Christiane
Hi, I was using the same basic structure that you already set up for aldex.glm. Below is my solution to the problem; I was working on a dataset with a two-factor design.
Cheers, Christiane
aldex.glm2 <- function(data.clr, formulaGLM, env) {
  # set up output: one p-value (and one BH-adjusted p-value) per feature and model term
  mc.instances <- numMCInstances(data.clr)
  feature.number <- numFeatures(data.clr)
  vars <- attributes(terms(as.formula(formulaGLM)))$term.labels
  nvars <- length(vars)
  feature.names <- getFeatureNames(data.clr)
  output <- list(p.glm = data.frame(matrix(NA, feature.number, nvars)),
                 BH.glm = data.frame(matrix(NA, feature.number, nvars)))
  colnames(output$p.glm) <- vars
  colnames(output$BH.glm) <- vars
  rownames(output$p.glm) <- feature.names
  rownames(output$BH.glm) <- feature.names

  # one matrix of per-instance p-values (features x MC instances) for each model term
  output.mci <- vector("list", length = nvars)
  names(output.mci) <- vars
  for (i in 1:nvars) {
    output.mci[[i]] <- matrix(NA, feature.number, mc.instances)
  }

  # fit the glm to every feature in every Monte-Carlo instance
  for (mc.i in 1:mc.instances) {
    print(mc.i)  # progress
    t.input <- sapply(getMonteCarloInstances(data.clr), function(y) { y[, mc.i] })
    for (i in 1:nrow(t.input)) {
      Y <- t.input[i, ]
      F1 <- as.formula(paste("Y", formulaGLM))
      GLM <- glm(F1, data = env)
      AOV <- anova(GLM, test = "F")
      for (j in 2:nrow(AOV)) {
        output.mci[names(output.mci) == rownames(AOV)[j]][[1]][i, mc.i] <- AOV$"Pr(>F)"[j]
      }
    }
  }

  # average the p-values over the MC instances and adjust per term (BH)
  for (i in names(output.mci)) {
    output$p.glm[, colnames(output$p.glm) == i] <- apply(output.mci[[i]], 1, mean)
    output$BH.glm[, colnames(output$BH.glm) == i] <- p.adjust(output$p.glm[, colnames(output$p.glm) == i],
                                                              method = "BH")
  }
  return(output)
}
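For reference, a call might look like this (a minimal sketch only; the object name, formula, and metadata columns are made up for illustration, and x is assumed to be an aldex.clr object already built from the count table, with env a data.frame of per-sample metadata containing the factors named in the formula):

res <- aldex.glm2(x, "~ site * treatment", env)
head(res$p.glm)    # per-term p-values, averaged over the Monte-Carlo instances
head(res$BH.glm)   # BH-adjusted versions of those averaged p-values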
Sorry for the weird formatting... I am new to GitHub. I just copy-pasted the R script (including comments).
no worries, I'm learning as well. So I'm guessing your code did not give you the result you needed?
The code is working and the results seem to make sense. What I am worried about is whether the way I implemented the glm fits (and the p-value adjustment) is correct. Was there a specific (statistical) reason why you only allowed one categorical variable in aldex.glm and used drop1() to get the p-values? Do you think anova.glm with test = "F" will also give reliable p-values? And is it valid to adjust the p-values for each term in the model formula separately? I want to make sure that the genes (or, in my case, OTUs) detected as differentially abundant do not include an excess of false positives and that the test is robust.
Thanks!
Christiane
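To make the statistical question concrete, here is a toy comparison (made-up data and factor names) of the two ways of getting per-term p-values: anova() tests the terms sequentially, in the order they appear in the formula, while drop1() tests each term against the model that contains all the other terms.

set.seed(1)
env <- data.frame(site = gl(2, 10), treatment = gl(2, 5, 20))   # balanced two-factor toy design
Y   <- rnorm(20) + as.numeric(env$site)                         # one CLR value per sample, site effect only
fit <- glm(Y ~ site + treatment, data = env)
anova(fit, test = "F")   # sequential F-tests, one row per term
drop1(fit, test = "F")   # each term dropped from the full model

For a balanced design like this one the two approaches agree; with unbalanced data or interaction terms the per-term p-values can differ, which is the substance of the question above.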
Ah, now I see your question.
Great. There is no (statistical) reason why we implemented the glm with only one categorical variable, but I'm not sure why drop1 was used. Arianne Albert (Arianne.Albert@cw.bc.ca) wrote the glm function.
What I can say is that estimating technical variation with Dirichlet Monte-Carlo instances and converting all values to log-ratios does three things.
First, normalization of read counts is taken care of by the prior and the Dirichlet instances. Second, components that have a large stochastic contribution are weeded out. Third, the log-ratio makes the differences between components linear. Together these make the analysis much more robust than using point estimates.
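A minimal sketch of that idea for a single sample (illustration only, not the ALDEx2 code; the counts and the prior value are made up):

counts <- c(120, 5, 0, 43, 310)   # raw counts for one sample
alpha  <- counts + 0.5            # uniform prior of 0.5 per feature (deals with zero counts)
n.mc   <- 128
# each column is one Dirichlet Monte-Carlo instance of the underlying proportions
p.mc <- replicate(n.mc, {
  g <- rgamma(length(alpha), alpha, 1)
  g / sum(g)
})
# centred log-ratio transform of every instance
clr.mc <- apply(p.mc, 2, function(p) log2(p) - mean(log2(p)))
apply(clr.mc, 1, sd)              # low-count features show the largest between-instance variation

Features with only a handful of reads end up with wide clr distributions across the instances, which is what weeds out components dominated by stochastic counting noise.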
g
Hi again, I tried emailing Arianne, but I received a message that my email couldn't be delivered because her inbox was full. Do you have another way of contacting her?
Thanks!
Cheers, Christiane