ggloor / ALDEx2_dev

ALDEx tool to examine compositional high-throughput sequence data with Welch's t-test
GNU Affero General Public License v3.0
12 stars 6 forks source link

aldex.glm with more complicated glm models #3

Closed chassenr closed 8 years ago

chassenr commented 9 years ago

Hi, I was wondering whether you are planning on including the option to work with more complicated glm models in aldex.glm, e.g. 2 factors, interaction terms, numerical factors, etc.? I have tried to find my own solution for this issue using anova() with the option test = "F" instead of drop1 to get the p-values for each term in the model and adjusting the p-values per term over all genes. Do you think this approach is viable?

Thanks a lot!

Cheers, Christiane

ggloor commented 9 years ago

Christiane

We do plan on adding more complex designs to ALDEx2, but are constrained by manpower. If you have a particular analysis I could see if we could incorporate it into the codebase. We just need to get the appropriate data out of the aldex.clr object.

g

On Oct 26, 2015, at 2:06 PM, Christiane Hassenrück notifications@github.com wrote:

Hi, I was wondering whether you are planning on including the option to work with more complicated glm models in aldex.glm, e.g. 2 factors, interaction terms, numerical factors, etc.? I have tried to find my own solution for this issue using anova() with the option test = "F" instead of drop1 to get the p-values for each term in the model and adjusting the p-values per term over all genes. Do you think this approach is viable?

Thanks a lot!

Cheers, Christiane

— Reply to this email directly or view it on GitHub.

chassenr commented 9 years ago

Hi, I was using the same basic structure that you already set up for aldex.glm. Below is my solution to the problem. I was working on a dataset where I had a 2-factorial design.

Cheers, Christiane

aldex.glm2 <- function(data.clr, formulaGLM, env) {

input:

data.clr - output from aldex.clr

formulaGLM - glm formula to be used, of the form e.g. "~ var1 + var2" or "~ var1 * var2"

env - data.frame containig the input variables for formulaGLM (samples are rows)

mc.instances <- numMCInstances(data.clr) feature.number <- numFeatures(data.clr) vars <- attributes(terms(as.formula(formulaGLM)))$term.labels nvars <- length(vars) feature.names <- getFeatureNames(data.clr) output <- list(p.glm = data.frame(matrix(NA, feature.number, nvars)), BH.glm = data.frame(matrix(NA, feature.number, nvars))) colnames(output$p.glm) <- vars colnames(output$BH.glm) <- vars rownames(output$p.glm) <- feature.names rownames(output$BH.glm) <- feature.names output.mci <- vector("list", length = nvars) names(output.mci) <- vars

for (i in 1:nvars) { output.mci[[i]] <- matrix(NA, feature.number, mc.instances) } for (mc.i in 1:mc.instances) { print(mc.i) t.input <- sapply(getMonteCarloInstances(data.clr), function(y) { y[, mc.i] }) for (i in 1:nrow(t.input)) { Y <- t.input[i, ] F1 <- as.formula(paste("Y", formulaGLM)) GLM <- glm(F1, data = env) AOV <- anova(GLM,test = "F") for (j in 2:nrow(AOV)) { output.mci[names(output.mci) == rownames(AOV)[j]][[1]][i, mc.i] <- AOV$"Pr(>F)"[j] } } } for(i in names(output.mci)) { output$p.glm[, colnames(output$p.glm) == i] <- apply(output.mci[[i]],1,mean) output$BH.glm[, colnames(output$p.glm) == i] <- p.adjust(output$p.glm[, colnames(output$p.glm) == i], method = "BH") } return(output) }

output:

list with 2 data.frames containing p-values and BH-adjusted p-values for each term in the model formula

(p.adjust run separately for each term)

chassenr commented 9 years ago

sorry for the weird formatting... I am new to github. I just copy pasted the R script (including comments).

ggloor commented 9 years ago

no worries, I'm learning as well. So I'm guessing your code did not give you the result you needed?

chassenr commented 9 years ago

The code is working and the results seem to make sense. What I am worried about is whether the way I implemented the glm function (and p value adjustment) is correct. Was there a specific (statistical) reason why you only allowed for one categorical variable in aldex.glm and used drop1 to get the p value? Do you think that anova.glm with test = "F" will also give reliable p values? Is it valid to adjust the p values for each term in the model formula separately? I want to make sure that the genes (or in my case rather OTUs) that are detected as differentially abundant are not biased by false positives and that the test is robust.

Thanks!

ggloor commented 9 years ago

Christiane

Ah, now I see your question.

Great. There is no (statistical) reason why we implemented the glm with only one categorical variable. But, I’m not sure why drop1 was used. Arianne Albert (Arianne.Albert@cw.bc.ca) wrote the glm function.

What I can say, is that using the approach of estimating technical variation using Dirichlet Monte-Carlo instances, and converting all values to log ratios does three things.

First, normalization of read count is taken care of by the use of a prior and the Dir instances. Second, you are weeding out components that have a large stochastic contribution. Third, the log-ratio makes the differences between the components linear. These make the analyses much more robust than point estimates.

g

On Oct 29, 2015, at 4:56 PM, Christiane Hassenrück notifications@github.com wrote:

The code is working and the results seem to make sense. What I am worried about is whether the way I implemented the glm function (and p value adjustment) is correct. Was there a specific (statistical) reason why you only allowed for one categorical variable in aldex.glm and used drop1 to get the p value? Do you think that anova.glm with test = "F" will also give reliable p values? Is it valid to adjust the p values for each term in the model formula separately? I want to make sure that the genes (or in my case rather OTUs) that are detected as differentially abundant are not biased by false positives and that the test is robust.

Thanks!

— Reply to this email directly or view it on GitHub.

chassenr commented 9 years ago

Hi again, I tried emailing Arianne, but I received a message that my email couldn't be delivered because her inbox was full. Do you have another way of contacting her?

Thanks!

Cheers, Christiane