DMA- multiple testing - Githubissues

ChristinaSchmidt1 commented 1 year ago

One good point that was raised is that we do not have an option for a multiple comparsion test such as annova, cruckal walles or limma (like pearson correlation). Often a user has multiple conditions they want to compare against each other, so that could be a valuable functionality to add.

Especially usefull for timeseries

ChristinaSchmidt1 commented 1 year ago

Might be helpful: https://stat.ethz.ch/~meier/teaching/anova/random-and-mixed-effects-models.html

dprymidis commented 1 year ago

After a meeting we concluded to add 3 options for DMA. A) Comparison of 2 Conditions B) Multiple comparison all against one C) Multipe comparison all against all

What I thought we could do is the following: Add 'denominator_condition' and 'conditions' (If we go for C2) (or numerator_conditions if we go for C1) in Input_SettingsInfo

For: A) add the conditions as string vectors Input_SettingsInfo = c( denominator_condition="denominator_condition", conditions = "numerator_condition" ) ie. Input_SettingsInfo = c( denominator_condition='WT',conditions = 'KO')

B) add the denominator_condition as string vector and the conditions as lists of strings Input_SettingsInfo = c( denominator_condition="denominator_condition", conditions = "all_numerator_conditions" ) ie. Input_SettingsInfo = c( denominator_condition='WT', conditions = list('KO', 'DKO', 'TKO') )

C) Here I think 2 things: 1) add all conditions as lists of strings in both the denominator and numerator Input_SettingsInfo = c( denominator_condition= list("all_conditions"), conditions = list("all_conditions") ) ie. Input_SettingsInfo = c( denominator_condition=list('WT', 'KO', 'DKO', 'TKO') , conditions = list('WT', 'KO', 'DKO', 'TKO') )

2) or just a list of all conditions as strings Input_SettingsInfo = c( conditions= list("all_conditions") ) ie. Input_SettingsInfo = c( conditions=list('WT', 'KO', 'DKO', 'TKO') )

What do you think?

ChristinaSchmidt1 commented 1 year ago

A and B make sense, and It is the best option. We only should add another one, which will enable to pass the column name that includes the conditions: e.g. Input_SettingsInfo = c(column="Conditions", denominator_condition="denominator_condition", conditions = "numerator_condition" )

For option C: If the user does not specify denominator_condition and/or conditions we just use the column="Conditions" and compare everything with each other. Otherwise if the user does denominator_condition=NULL and conditions=NULL.

Would that makes sense?

dprymidis commented 1 year ago

Yes it does make sense. Now I am thinking:

1-vs-1 A) Input_SettingsInfo = c(column="Conditions", denominator_condition="denominator_condition", numerator_condition = "numerator_condition" ) 1-vs-all B) Input_SettingsInfo = c(column="Conditions", denominator_condition="denominator_condition") all-vs-all C) Input_SettingsInfo = c(column="Conditions")

Is this good then?

ChristinaSchmidt1 commented 1 year ago

yes :) Makes sense to me!

By default we could do: Input_SettingsInfo = c(column="Conditions", denominator_condition=NULL, conditions = NULL) or Input_SettingsInfo = c(column="Conditions") , as this is the same and just do multiple comparisons.

One thing we still need to consider: The choice of test. If the user runs default, but there are more than 2 conditions, we need to throw error for t-test/wilcox-test as the way we have them implemented is not for multiple comparsion.

dprymidis commented 1 year ago

The Syntax has been changed. Now the function has arguments Input_SettingsFile and Input_SettingsInfo = c(conditions="My_Conditions", denominator="786-O", numerator = "HK2") and 1-to-1 comparison is done

Multiple comparison all-vs-all added for both parametric and non parametric

dprymidis commented 1 year ago

For comparison of 1-vs-all and all-vs-all we implement first anova and the tukeys** test to get the "pairwise" pvalues and I will add the pairwise log2FCs. For non parametric we use first kruskal.test and then dunn.test for multiple pairwise comparisons.

So for 3 sample groups (A, B,C) we will have these columns,

all-vs-all:
Metabolite, p.adj_Anova, Log2FC_A-B, p.adj_A-B, t.val_A-B ,Log2FC_C-B, p.adj_C-B, t.val_C-B, Log2FC_C-A, p.adj_C-A. t.val_C-A

1-vs-all: if we used B as denominator: Metabolite, p.adj_Anova, Log2FC_A-B, p.adj_A-B, t.val_A-B , Log2FC_C-B, p.adj_C-B, t.val_C-B

**However, I found this which is interesting https://www.mdpi.com/2073-8994/13/8/1387 . It states that Anova and tukeys are not consistent with each other, which is troubling.

dprymidis commented 1 year ago

Issue No1.

with Anova. It doesnt work if the metabolite names contain any special characters or spaces. For now I am using this code to "bypass" this issue, but it changes the resulting names.

for (i in 1:length(colnames(Input_data))){ metabolite_name <- colnames(Input_data)[i] metabolite_name <- gsub("-", "", metabolite_name) metabolite_name <- gsub("/", "", metabolite_name) metabolite_name <- gsub(" ", "", metabolite_name) metabolite_name <- gsub("\*", "", metabolite_name) metabolite_name <- gsub("\+", "", metabolite_name) metabolite_name <- gsub(",", "", metabolite_name) metabolite_name <- gsub("$", "", metabolite_name) metabolite_name <- gsub("$", "", metabolite_name) colnames(Input_data)[i] <- metabolite_name }

No3. Dunn test doesnt work is a metabolite starts with number. I bypassed the problem by saving the name and then reintroducing it afte the test.

No3. Its interesting that I find that you correct for multiple comparison for Anova but not for kruskal test. I have to look at this a bit more. https://www.graphpad.com/guides/prism/latest/statistics/stat_correcting-the-main-anova-p-va.htm https://stats.stackexchange.com/questions/133444/bonferroni-correction-on-multiple-kruskal-wallis-tests

3b. I would like to note that for keeping a chunk of code the same I used only the p.adjusted for anova and kruskal and not report the p.value.

Note for me About the plots. For multiple comparison plot all the individual plots

ChristinaSchmidt1 commented 1 year ago

Thanks for the great summary! Maybe it would make sense that we discuss those points in a brief meeting as this might be easier.

dprymidis commented 1 year ago

After todays meeting we decided to do the following:

Make the equal variance/or not tests for anova assumptions. Levens or bartless test https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
Check output of Tukeys and Dunns. Do we get pvalues or p.adj and how we correct for multiple testing
Do not report the Anova pvalues
Check limma what they use. how do they do hypotheis testing? Do the consider data parameters? Do they need a certain amount of features?
About names make a column with Unique IDs and use that for the analysis and i nthe end map back the original names.

ChristinaSchmidt1 commented 1 year ago

I think this is quite usefull as well: https://lindeloev.github.io/tests-as-linear/

ChristinaSchmidt1 commented 1 year ago

We havent been able to get trough the whole discussion yet, but for sure limma is a great solution for us and Aurelien will share some code with us which we can adapt for the function. Limma is also working with missing values (NAs), so thats great. I will keep you posted on this part.

ChristinaSchmidt1 commented 1 year ago

I was just going trough the function and your latest changes (adding of multiple testing). Moreover, I got a lot of feedback over the past days, so I am summarising my thoughts and the feedback below:

I noticed that the new paramters are not defined yet and I started to do this. Maybe you could check if thats all ok now. Could you add a sentence about denominator and numerator?
If the user just has two conditions (KO and WT), the user can choose if the result is KO versus WT or WT versus KO. That works with the numerator/denominator, but we should ensure this is clear from the parameter definition. Are there maybe better names for this or why did we choose numerator/denominator (We can think about this)?
At the moment the shapiro test for data normality is done per feature across all samples, yet we need to change this to be done across the sample of the same condition. Meaning we need to do this test for each condition and report the percentages for each condition. Moreover, the feedback we give to the user can just be a parametric or non-paramteric test was choosen and your data has this distribution for C1: X% and Y%, for C2: X% and Y%, for C3: X% and Y%. Like a QC we report back. We can also consider a QC plot here with those information. Lastly, I have also gotten the feedback that some people tend to do data transformation prior to testing. For instance it is common that people would want to do log transformation to enforce data normality (we can debate of course how well this works) of the data prior to doing e.g. t.test. So the suggestion would be to add a parameter for data_transformation and if TRUE to perform log transformation prior to proceeding to Log2FC and stats.
The way annova is used at the moment is by checking the numerator and denumerator, but we should rather have the user choose pval="aov" or "anova".
Given that with the multiple testing we are adding, the function will get quite complex and long. Hence, we should start using helper functions for the stats options. For instance, I have moved the stats for a single comparison (t.test,...) into a helper function below (line 393 is the code within the DMA function and the helper function is below starting from line 569), which is not exported with the package (You could get it using MetaProViz:::Helper though). We should make another one for multiple testing (annova, kruskal walles,...) and a last one for limma. Here we can also add the specific tests for anova,... you mentioned in your previous comment. In terms of the metabolite names that cause an issue, we can make a the column with Unique IDs (M1 to Mn) and use that for the analysis and in the end map back the original names within the helper function of the test.
I had a good discussion about the CoRe calculation with the lab and I will implement some changes here.
If CoRe=TRUE, we should colour code the volcano plot for the CoRe info (Released, Consumed, Released-Consumed, Consumed-Released).
If CoRe=TRUE, we need to add CoRe into the file names. Otherwise we end up overwriting the files if someone analyses both intra and CoRe samples together.
The parameter Input_Pathways should be changed to something metadata, since any infromation available about the metabolites can be added (e.g. class, retention time,...)

As action points, maybe you could check the parameter definitions (1. and 2.) and move the multiple testing out of the main function into a helper function (5.). For the latter, check also point 4. . I will work on the CoRe related things and the limma helper function.

In regards to the multiple testing helper function:

In R the most basic and common functions we can use to perform ANOVA are aov() and lm(). Since ANOVA is a type of linear model, we also can use the lm() function. Here is a good summary of the https://bookdown.org/steve_midway/DAR/understanding-anova-in-r.html
As you mentioned, in ANOVA test, aov(), a significant p-value indicates that some of the group means are different, but we don’t know which pairs of groups are different. It’s possible to perform multiple pairwise-comparison, to determine if the mean difference between specific pairs of group are statistically significant using e.g. TukeyHSD().
Using lm() we can obtain p.val and lsmeans() we can obtain the p.adj. Below is some example code:
```
lmFormula<- "KO ~ WT" 
lmFormula<-as.formula(lmFormula)
```

pairwiseFormula<- "pairwise ~ KO" ) pairwiseFormula<-as.formula(pairwiseFormula)

result_p.val <- lsmeans::lsmeans((lm(lmFormula, data = DF)), pairwiseFormula , adjust=NULL)# adjust=NULL leads to the p-value! result_p.adj <- lsmeans::lsmeans((lm(lmFormula, data = DF)), pairwiseFormula, adjust="fdr") #adjust can be: statistical test used for multiple comparison "Tukey", "BH", "fdr",... result <- as.data.frame(result_p.val$contrasts) result$p.adj <- as.data.frame(result_p.adj$contrasts)[,"p.value"]



* Please pull from GitHub to see the latest changes in DMA I made.

dprymidis commented 1 year ago

about the points

Ah yeah I didnt make any changes to the @params indeed. I thought of doing that in the end since things are still changing.
I chose the numerator and denominator for 2 reasons. A) these names are also used somewhere else (in the Compound Discoverer), and B) it seemed pretty self explanatry to me with regards to know what we see on the volcano plot. ie. in the volcano plot on the right (positive side) you see what is higher in the numerator and on the left negative side what is high on the denominator.
Yes you are right, Shapiro test should be done by condition adn we can export a plot or a table. Also about the data transformation I am kinda against it because its not always the case that the log transformation does what it is expected to do. Since we are using ANOVA or kruskal test I wouldnt log transform the data in order to enforce the anova model. Also the main reason data transformations like log are applied is to correct the heteroscedacticity which is indeed a problem. Here, I would propose to use a variance stabilizing transformation always, or at least before using linear models like anova and kruskal.
this is partially correct. Multiple comparison is TRUE by checking the groups in the numerator. then we check if the data are parametric or no parametric and check the STAT_pval ="kruskal.test" or "anova" . So we have the STAT_pval which is either ttest wilcoxon anova or kruskal. SO regardless of the shapiro results the user can select anova or kruskal. Now if the user selected anova and there are 1 vs 1 samples when they will get an error but I think this is not added yet.
to 9. Ok

Regarding the code, I am using the same thing. I use aov to make the model and get a pvalue for the total and then tukeys for the pairwise. I can change it to lm but it should be the same.

ChristinaSchmidt1 commented 1 year ago

No need to change to lm(), I just wanted to mention it here as we had been talking about it in our meeting.

No need to change this, I was just wondering if this would also be clear to a biologist.
I agree and I also would not do log transformation. Yet I have received some feedback from people who would do it. So I just thought to give this functionality. On the contrary, we can also opt for discussing the matter in the vignette and leave it to the user to do data transformation prior to using the DMA function. Probably thats the better solution
Ok, I see. and we can add the option limma once I have made the helper function.

ChristinaSchmidt1 commented 1 year ago

I have started to work on some of the points (1-9 in my comment above) and thought to convert it into a task list, so we can tick what has been done:

[x] 1. I noticed that the new paramters are not defined yet and I started to do this. Maybe you could check if thats all ok now. Could you add a sentence about denominator and numerator?
[x] 2. If the user just has two conditions (KO and WT), the user can choose if the result is KO versus WT or WT versus KO. That works with the numerator/denominator, but we should ensure this is clear from the parameter definition. Are there maybe better names for this or why did we choose numerator/denominator (We can think about this)? Here we said to just add a good definition into the paramter explanation.
- Added #' @param Input_SettingsInfo \emph{Optional: } Named vector including the information about the conditions column c(conditions="ColumnName_Plot_SettingsFile"). Can additionally pass information on numerator or denominator c(numerator = "ColumnName_Plot_SettingsFile", denumerator = "ColumnName_Plot_SettingsFile") for specifying comparisons. Log2Fold changes are obtained by dividing the numerator by the denominator, thus positive log2FC values mean higher expression in the numerator and are presented in the right side on the Volcano plot. Using =NULL selects all the condition and performs multiple comparison. \strong{Default = c(conditions="Conditions", numerator = NULL, denumerator = NULL)}
[x] 3. Shapiro test:
[x] 3.1. At the moment the shapiro test for data normality is done per feature across all samples, yet we need to change this to be done across the sample of the same condition. Meaning we need to do this test for each condition and report the percentages for each condition. Moreover, the feedback we give to the user can just be a parametric or non-paramteric test was choosen and your data has this distribution for C1: X% and Y%, for C2: X% and Y%, for C3: X% and Y%. Like a QC we report back.
[x] 3.2. We need to add 3.1. results to be saved into the folder. At the moment its just returned into the global environment via invisible return.
- Added the results are returned and now there is a folder called Shapiro which also has the density plots
[x] 3.3. We can also consider a QC plot here with those information. i Added: For 3.3 I added Density plots for each group and QQplots for each metabolite in each sample. However, to make all the qq plots it takes significantly longer for the function to run so I silenced it for the moment.

[x] 3.4. Lastly, I have also gotten the feedback that some people tend to do data transformation prior to testing. For instance it is common that people would want to do log transformation to enforce data normality (we can debate of course how well this works) of the data prior to doing e.g. t.test. So the suggestion would be to add a parameter for data_transformation and if TRUE to perform log transformation prior to proceeding to Log2FC and stats. We opt for discussing the matter in the vignette and leave it to the user to do data transformation prior to using the DMA function.
[x] 4. The way annova is used at the moment is by checking the numerator and denumerator, but we should rather have the user choose pval="aov" or "anova".
[x] 5. Given that with the multiple testing we are adding, the function will get quite complex and long. Hence, we should start using helper functions for the stats options. For instance, I have moved the stats for a single comparison (t.test,...) into a helper function below (line 393 is the code within the DMA function and the helper function is below starting from line 569), which is not exported with the package (You could get it using MetaProViz:::Helper though).
[x] 5.1. We should make another one for multiple testing (annova, kruskal walles,...) and
[x] 5.2. One for limma. Here we can also add the specific tests for anova,... you mentioned in your previous comment. In terms of the metabolite names that cause an issue, we can make a the column with Unique IDs (M1 to Mn) and use that for the analysis and in the end map back the original names within the helper function of the test.
[x] 5.3. The formally cleaner way of calculating the tvalue (or t-statistics) is to take it from the base R function as it will be calculated prior to the p-value. So lets do this when writing the functions for 5.1 and 5.2!
[x] 6. I had a good discussion about the CoRe calculation with the lab and I will implement some changes here.
[x] 6.1. Now we need to update the MCA for CoRe to use distance to make groups.
[x] 7. If CoRe=TRUE, we should colour code the volcano plot for the CoRe info (Released, Consumed, Released-Consumed, Consumed-Released).

or we can use shapes to keep both the information of the thresholds and CoRe. I left it at color for now.

[x] 8. If CoRe=TRUE, we need to add CoRe into the file names. Otherwise we end up overwriting the files if someone analyses both intra and CoRe samples together.
[x] 9. The parameter Input_Pathways should be changed to something metadata, since any infromation available about the metabolites can be added (e.g. class, retention time,...)

ChristinaSchmidt1 commented 12 months ago

I have just learned that in fact the formally cleaner way of calculating the tvalue (or t-statistics) is to take it from the base R function as it will be calculated prior to the p-value (compares sample means to the null-hypothesis and incooporates both, the sample size and the sample variability in the data). This will not affect the t-value ranking, but may affect the spread compared to how we do it at the moment (using Log2FC and p-value). In limma this is done automatically as part of limma and for the other base R functions (like t.test or wilcox.test,...) we should also be able to extract this. So lets do this whilst working on the DMA helper functions anyways.

FYI: I have changed this already in the DMA_Stat_single() helper function and I added it into our list above as 5.3..

ChristinaSchmidt1 commented 11 months ago

[x] Dont plot or print the QQ plots as part of DMA since it takes a lot of computing time.
[x] Move to separate function with shapiro
[x] Christina: Check 2. point of the parameter addition.

dprymidis commented 11 months ago

[ ] Check variance assumptions for models. (equal variances, heteroscedasticity)
[ ] add limma
[ ] update the MCA for CoRe to use distance to make groups
[x] metabolite names are changed for e.g. aov - do this more systematically and add back the original metabolite names (In terms of the metabolite names that cause an issue, we can make a the column with Unique IDs (M1 to Mn) and use that for the analysis and in the end map back the original names within the helper function of the test before returning.)

ChristinaSchmidt1 commented 11 months ago

Thanks for all the updates! I am back and currently working on limma ;)

ChristinaSchmidt1 commented 11 months ago

I have added the limma function, but need to finish some small points (e.g. make contrast matrix 1 versus all) and retunr it into the main function. I also added some bullet points to your comment above. As this list has now all remaining points I ticked off everything above.

I have now done a list with several DFs named after the comparison that is performed and within each DF there are the classic columns like Log2FC, ....

ChristinaSchmidt1 commented 10 months ago

I notiuced that when we have CoRe=TRUE and we have multiple comparison, the addition of the extra columns (released, consumed,...) doesnt work. I am also wondering if we should add the input data columns into the results DF. Meaning additionally to metabolite name, Log2FC, stats and e.g. pathways, we have the Samples with their metabolite values. At this points it might be better to update the helper functions for multiple comparisons to return a list with DFs named after the comparison and to each list add the above information. What do you think?

I also noticed that if we do all versus all, we either have HK2 versus X or X versus HK2 as it is the same. Maybe something to comment on.

Here is what we sill need to do:

[x] Check variance assumptions for models. (equal variances, heteroscedasticity) --> moved to list below
[x] limma return for CoRe=TRUE
[x] CoRe=TRUE and multiple comparison: add the info of consumed, released,... to each compariosn
[x] Add the input sample values to the output DFs.
[x] update the MCA for CoRe to use distance to make groups --> moved to list below

Have a list of DFs, named after the comparison

[x] Log2FC calculations
[x] limma
[x] aov
[x] kruskal
[x] volcano plots
[x] return
[x] add previous metabolite names

Make sure that all_vs_all and one_vs_all returns the right comparisons:

[x] limma
[x] kruskal
[x] aov

dprymidis commented 10 months ago

I wouldnt put the input values on the result. In case of multiple comparison it will get very big. Users can get them from the excel. I dont know if having them adds something more.

About the all versus all. I would say show only 1 of these. They should be the same with opossite log2FC right? I think I checked this and it was only returning of of the 2 but i guess i am wrong. In any case, I would say give only 1 of the 2 comparissons.

ChristinaSchmidt1 commented 10 months ago

I have added the samples now, but we can remove them later as well. I only added them for the conditions that where compared.

For the comparison, at least for limma, it does it alphabetically. Meaning, even if we want HK2 versus X, we may get X versus HK2. So this is soemthing we will need to fix I think (especially for one_vs_one).

This is left now:

[x] update the MCA for CoRe to use distance to make groups
[ ] Check variance assumptions for models. (equal variances, heteroscedasticity)
[x] enable the right direction of comparison when denominator is provided for limma
[x] check/test the other stats (aov, kruskal) for the right direction of comparison when denominator is provided
[ ] kurskal.test do not work with denominator that is e.g. 786-M1A.
[x] aov do not work with denominator that is e.g. 786-M1A.

ChristinaSchmidt1 commented 10 months ago

I found that kurskal.test and aov do not work with denominator that is e.g. 786-M1A. This is because it has a special character in the name. Hence we need to fix this. I added it to the list above.

ChristinaSchmidt1 commented 10 months ago

Last to things that are open:

[x] kurskal.test do not work with denominator that is e.g. 786-M1A.
[ ] Check variance assumptions for models. (equal variances, heteroscedasticity)

ChristinaSchmidt1 commented 10 months ago

I will close this issue now and move the last point into a separate issue as this is one we can add also later with intermediate priority as this is about communicating to the user weather their choices of stats are ok in this regards.

saezlab / MetaProViz

DMA- multiple testing #45