saezlab / MetaProViz

R-package to perform metabolomics pre-processing, differential metabolite analysis, metabolite clustering and custom visualisations.
https://saezlab.github.io/MetaProViz/
GNU General Public License v3.0

Viz: Merge Bar/Violin/box plots and superplots #30

Closed ChristinaSchmidt1 closed 10 months ago

ChristinaSchmidt1 commented 1 year ago

As discussed yesterday please combine these four functions into one:

  1. Bargraph
  2. Boxplot
  3. Violinplot
  4. superplots

Give the user the parameter GraphStyle = "Bar", "Box" or "Violin". Give the user the parameter Superplots = TRUE or FALSE (if TRUE, the user needs to provide the column name they want to use to colour-code the superplots).
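A minimal sketch of how one merged function could dispatch on the plot style. The function name `plot_metabolite`, its arguments and the toy data are hypothetical and not the MetaProViz API; only the ggplot2 calls are real.

```r
library(ggplot2)

# Hypothetical merged plotting function (not the MetaProViz implementation).
plot_metabolite <- function(data, value, group,
                            style = c("Bar", "Box", "Violin"),
                            superplot = NULL) {
  style <- match.arg(style)
  p <- ggplot(data, aes(x = .data[[group]], y = .data[[value]]))
  # Pick the layer that matches the requested style.
  p <- switch(style,
    Bar    = p + stat_summary(fun = mean, geom = "bar"),
    Box    = p + geom_boxplot(),
    Violin = p + geom_violin()
  )
  # If a column name is given, overlay the points coloured by that column.
  if (!is.null(superplot)) {
    p <- p + geom_jitter(aes(colour = .data[[superplot]]), width = 0.1)
  }
  p
}

# Made-up example data.
toy <- data.frame(
  Conditions = rep(c("HK2", "786-O"), each = 6),
  Biological_Replicates = rep(c("1", "2", "3"), times = 4),
  Intensity = c(10, 11, 12, 10.5, 11.5, 12.5, 14, 15, 16, 14.5, 15.5, 16.5)
)
p_box <- plot_metabolite(toy, "Intensity", "Conditions",
                         style = "Box",
                         superplot = "Biological_Replicates")
```

Using a column name (or NULL) for `superplot` instead of TRUE/FALSE keeps the interface to a single argument.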

ChristinaSchmidt1 commented 1 year ago

One additional note: When saving the generated plots, we need to figure out how to:

  1. Keep the y-axis length always the same, independent of whether a title and subtitle are plotted and of how long the sample names are
  2. Adjust the x-axis size depending on the number of samples (e.g. if there are 4 bars it is 2 cm, if there are 6 bars it is 3 cm). Here we also need to ensure that the figure legend on the right side does not influence this.
  3. The margins the figure is saved with need to be adjusted automatically as well, so we can accommodate the above
  4. Let's use Arial for all fonts and adjust the font size as one would want it for a publication (10 for x-y legend, 11 for headline, 10 for sub headline, 11 for title of legend and 10 for legend names)
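
The sizing rule in points 1 to 3 can be sketched as a small calculation. The helper `plot_dimensions` is hypothetical; the 0.5 cm per bar follows the example in point 2, while the panel height, margin and legend allowances are assumed values.

```r
# Hypothetical helper: derive the saved figure size from the number of groups.
# 0.5 cm per bar follows the example (4 bars = 2 cm, 6 bars = 3 cm);
# panel height, margin and legend/title allowances are assumptions.
plot_dimensions <- function(n_groups,
                            cm_per_group = 0.5,
                            panel_height_cm = 4,
                            legend_width_cm = 0,
                            title_height_cm = 0,
                            margin_cm = 1) {
  panel_width_cm <- n_groups * cm_per_group
  list(
    panel_width_cm  = panel_width_cm,
    panel_height_cm = panel_height_cm,
    # Legend and title are added on top, so they do not shrink the panel.
    total_width_cm  = panel_width_cm + legend_width_cm + 2 * margin_cm,
    total_height_cm = panel_height_cm + title_height_cm + 2 * margin_cm
  )
}

dims_4 <- plot_dimensions(4)
dims_6 <- plot_dimensions(6, legend_width_cm = 3, title_height_cm = 1)
```

The totals could then be passed to `ggsave(width = ..., height = ..., units = "cm")`. This only computes the outer size; forcing the panel itself to a fixed size inside a ggplot is a separate step that is not covered here.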

dprymidis commented 1 year ago

The functions are merged. The new function is named VizPlots and it has:

parameters:

         Graprh_Style = "Bar", # options: Bar, Box, Violin
         Superplot = NULL,     # or a column name from the Experimental Design to be used for superplot

I used the above instead of Superplot = TRUE/FALSE because this way we do not need another parameter for the vector we want to use. Now the user has to input the name of a column they want to use from the Exp design, i.e. Superplot = "Biological_Replicates"

I still left Output_plots = "Together" or "individual", but I am thinking of changing this into a parameter Individual_plots = TRUE/FALSE with default = FALSE. What do you think?

There is also the parameter Selected_Conditions, where the user can input a vector of conditions they want to keep in the plot. All the other conditions are removed, i.e. Selected_Conditions = c("HK2", "786-M2A", "786-M1A")

and Selected_Comparisons, which is a list of vectors containing the names of the conditions we want to compare with a t-test, i.e. Selected_Comparisons = list(c("HK2", "786-M2A"), c("HK2", "786-M1A"))
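
A minimal sketch of what these two parameters imply, in base R. The data frame and its values are made up, and this is not the code inside VizPlots.

```r
set.seed(1)
# Toy data: one metabolite measured in four conditions (made-up values).
design <- data.frame(
  Conditions = rep(c("HK2", "786-O", "786-M1A", "786-M2A"), each = 5),
  Intensity  = rnorm(20, mean = rep(c(10, 12, 13, 14), each = 5))
)

Selected_Conditions  <- c("HK2", "786-M2A", "786-M1A")
Selected_Comparisons <- list(c("HK2", "786-M2A"), c("HK2", "786-M1A"))

# Keep only the selected conditions; everything else is dropped.
kept <- design[design$Conditions %in% Selected_Conditions, ]

# One t-test per selected pair of conditions.
p_values <- vapply(Selected_Comparisons, function(pair) {
  x <- kept$Intensity[kept$Conditions == pair[1]]
  y <- kept$Intensity[kept$Conditions == pair[2]]
  t.test(x, y)$p.value
}, numeric(1))
names(p_values) <- vapply(Selected_Comparisons, paste, character(1),
                          collapse = " vs ")
```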

The additional note is yet to be implemented.

ChristinaSchmidt1 commented 1 year ago

Amazing, nice job!

Some points: Superplot = NULL, # or a column name from the Experimental Design to be used for superplot

I think we should use the same syntax as for lollipop and volcano. So we could rename Experimental Design to Plot_SettingsFile and we can use Plot_SettingsInfo= c(SuperPlot="ColumnName_Plot_SettingsFile")

Individual_plots = TRUE/FALSE with default = FALSE. --> Yes, that's a great idea! This refers to saving them in a scrollable PDF document or as individual files, correct? Here we can also look into using faceting (https://ggplot2.tidyverse.org/reference/facet_grid.html), so plots are on one sheet, nicely ordered. This order could even be by pathways or clusters, so that metabolites of a pathway are printed on one sheet.
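
A minimal sketch of the faceting idea, with one sheet per pathway and one panel per metabolite. It uses `facet_wrap` rather than the linked `facet_grid`, and the metabolite-to-pathway assignment and the values are made up for illustration.

```r
library(ggplot2)

set.seed(1)
# Toy long-format data: four metabolites, two conditions, four replicates.
long <- expand.grid(
  Conditions = c("HK2", "786-O"),
  Replicate  = 1:4,
  Metabolite = c("Citrate", "Malate", "Serine", "Glycine"),
  stringsAsFactors = FALSE
)
# Illustrative pathway labels (an assumption, not taken from MetaProViz).
long$Pathway <- ifelse(long$Metabolite %in% c("Citrate", "Malate"),
                       "TCA cycle", "Serine metabolism")
long$Intensity <- rnorm(nrow(long), mean = 10)

# One sheet per pathway, one panel per metabolite on that sheet.
sheets <- lapply(split(long, long$Pathway), function(d) {
  ggplot(d, aes(x = Conditions, y = Intensity)) +
    geom_boxplot() +
    facet_wrap(~ Metabolite, scales = "free_y") +
    ggtitle(unique(d$Pathway))
})
```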

For Selected_Conditions and Selected_Comparisons we need a column Condition at the moment. I think we could add this as well to the Plot_SettingsInfo=c(Conditions ="ColumnName_Plot_SettingsFile").

For Selected_Comparisons, we should still do an ANOVA multiple comparison test (at least if there are more than two conditions on the plot), but then just label the selected comparisons on the plot; the information would still come from the multiple comparison test. Also, if we go down this route we probably need to offer both parametric and non-parametric tests?

About the additional note on saving the figures: This is something we will need to do for all the plots. I started doing this for the heatmaps, but this is a different graph object (pheatmap object), whilst most other graphs will be ggplot. So the syntax will be a bit different, I think. But I guess once we have figured it out for one, it's applicable to most of the plots. Moreover, we should return a plot object to the environment including all the plots. I have added this, for example, to the pre-processing function. In this way, someone can still add things or make changes using ggplot syntax in most cases. Lastly, we should add save_as=NULL, in which case the figures are not saved but only the list of plot objects is returned.
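
A minimal sketch of the save_as=NULL behaviour. The helper name `save_or_return`, the fixed 8 x 6 cm size and the output folder are hypothetical; `ggsave` is the real ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: write files only when a file type is requested,
# and always hand the list of plot objects back to the caller.
save_or_return <- function(plot_list, save_as = NULL, folder = tempdir()) {
  if (!is.null(save_as)) {
    for (name in names(plot_list)) {
      ggsave(
        filename = file.path(folder, paste0(name, ".", save_as)),
        plot = plot_list[[name]],
        width = 8, height = 6, units = "cm"  # assumed size
      )
    }
  }
  invisible(plot_list)
}

plots <- list(
  demo = ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
)
returned <- save_or_return(plots, save_as = NULL)  # nothing is written

# The returned object can still be changed with ggplot syntax.
modified <- returned$demo + theme_classic() + ggtitle("mpg by cylinders")
```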

dprymidis commented 1 year ago

Yes, I will add the PlotSettingsFile and Info. Initially, I tried to make it like this, but since there was the Experimental_design as a parameter it didn't make sense to me to also add the PlotSettingInfo. But of course the Experimental_design will be renamed to PlotSettingFile and I will add the SuperPlot to the plotSettingsInfo.

The Selected_Conditions is a subset of the total Conditions. Therefore, it cannot be something like a column name in the Plot_SettingsInfo. But it could be a vector of condition names in the PlotSettingInfo like this: Plot_SettingsInfo=c(Selected_Conditions = c("HK2", "786-M2A","786-M1A" )). The same goes for Selected_Comparisons which should be a subset of the Conditions Selected. Now that I am thinking about it we could remove the Selected_Comparisons completely and always do t-tests or anova between the Selected_Conditions.

Regarding ANOVA and the parametric/non-parametric tests, you have a point, but doesn't that go too far? I say this because in my head the idea was to make the plots and maybe add a t-test for some statistics on the side. On the other hand, checking distributions and doing parametric or non-parametric tests is not that difficult since we already did this in the DMA.

About the additional note, yeah, I agree that if we manage to do this for one plot then we should be ok, since most plots are ggplots. I tried some things yesterday but didn't manage a lot.

ChristinaSchmidt1 commented 1 year ago

Sorry if it wasn't clear, I did not mean to change Selected_Conditions, but I meant to pass the column name that includes the information on the conditions, in case someone did not label it conditions, but samples or patients or tissue. Does this make sense? For Selected_Comparisons, yes, we could always do the test, but this would be relevant to decide what should be shown on the plot in terms of statistics - or is the result of the stats not on the plot?

Yes, with all the tests that's a lot of work. We can log this as an enhancement issue for the future, but it is nothing we implement in the first package version. Btw. there is a nice shiny app with those bargraph arrangements and stats, which I think does a great job. https://cancerandmetabolism.biomedcentral.com/articles/10.1186/s40170-020-00220-x

Additional note: Yeah I think this part will inevitably take up some time, but in the end makes a huge difference for usability.

dprymidis commented 1 year ago

For the function to work we must have a column named "Conditions" in the Experimental_design now renamed to PlotSettingsInfo. We could add a Conditions in the PlotSettingsInfo and make it ="Conditions" as a default, so the user can change it.

The Selected_Conditions selects only those conditions specified to be on the plot. The Selected_Comparison makes and plots statistics only for those conditions specified in the Selected_Comparison.

My remark was to remove the Selected_Comparison and, when we have Selected_Conditions = "Something", also do and plot the statistics between the "something". Now this does not happen. The user has to specifically select the comparisons they want to do through Selected_Comparison.

Yes, I agree adding the different tests for the plot would be super cool, we can put this as an enhancement for the future. Also, the app looks very nice. I see you can select specific dimensions for the plots. Maybe I will try to find their code and see how they do this.

ChristinaSchmidt1 commented 1 year ago

Lets discuss the point about conditions later in person :)

Yeah I know the app is great - would be nice if you can find the source code. This one could also be helpful for this: https://www.cedricscherer.com/2019/08/05/a-ggplot2-tutorial-for-beautiful-plotting-in-r/#axes

ChristinaSchmidt1 commented 1 year ago

Thanks for the great discussion, I think we have a good plan now :) Could you add the points here that you noted down on your to do list?

dprymidis commented 1 year ago

After today's meeting we decided to do the following (I feel something is wrong here, please correct me where needed):

  1. do a t-test only if we have 2 selected_Conditions
  2. Keep the Selected_Comparison?? See below
  3. add the test used as a caption in the plot (bottom right)
  4. finish the visuals of the plot (the additional note above, axis length, labels etc.)
  5. As an enhancement, add the multiple comparison testing (ANOVA) along with the non-parametric one as in DMA.

Now that I look at it again, I think the Selected_Comparison is not needed. I don't remember the reason we decided to keep it.

I think this is how it should be: Generally we will plot all the conditions and no stats. If someone wants to plot only some conditions, then they select those conditions in Selected_Conditions. If they also want the stats, then they could set (a new param) add_stats = TRUE and the stats would be on the plot. If there are 2 Selected_Conditions then a t-test, if more then the ANOVA. Like this we don't need the Selected_Comparison, and a TRUE or FALSE parameter is easier than the Selected_Comparison. Does this make sense? I know it's different from what we said, but I don't remember the reasoning for keeping the Selected_Comparison since we will add the ANOVA.

ChristinaSchmidt1 commented 1 year ago

What we discussed was the following: If there are only 2 selected_Conditions, we can ignore Selected_Comparison as we can just add the results. Yet, if the user selected 3 or more conditions, we need to do a multiple comparison test (like ANOVA). In this case we may not want to plot all the results, as the plot could be super crowded, but only the ones of interest based on Selected_Comparison (I think this point was the one that led to confusion). Yet, what you describe makes sense. If we can fit all the stats nicely on the plot (?), we don't need the parameter Selected_Comparison since we always plot all comparison stats on the plot.
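
A minimal sketch of this rule in base R: a t-test for two conditions, otherwise an ANOVA followed by pairwise comparisons. The helper `compare_conditions` is hypothetical, and Tukey's HSD is an assumed choice of post-hoc test; the thread only says "multiple comparison test".

```r
# Hypothetical helper: pick the test from the number of conditions.
compare_conditions <- function(values, groups) {
  groups <- droplevels(factor(groups))
  if (nlevels(groups) < 2) stop("Need at least two conditions.")
  if (nlevels(groups) == 2) {
    return(list(test = "t-test", p = t.test(values ~ groups)$p.value))
  }
  # Three or more conditions: ANOVA, then all pairwise adjusted p-values.
  fit <- aov(values ~ groups)
  list(test = "ANOVA + Tukey HSD", p = TukeyHSD(fit)$groups[, "p adj"])
}

set.seed(1)
# Made-up intensities for three conditions, five samples each.
vals <- rnorm(15, mean = rep(c(10, 12, 14), each = 5))
grp  <- rep(c("HK2", "786-O", "786-M1A"), each = 5)

two   <- compare_conditions(vals[1:10], grp[1:10])  # two conditions
three <- compare_conditions(vals, grp)              # three conditions
```

With three conditions the result holds one adjusted p-value per pair, so a Selected_Comparison-style filter would only decide which of them are drawn on the plot.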

One note: Let's always plot the exact value, e.g. p=0.06 or 0.9. If the values become too small we can plot p=7E-15.
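
A minimal sketch of such a label formatter in base R. The function name `format_p` and the cut-off of 0.001 for switching to scientific notation are assumptions.

```r
# Hypothetical formatter: exact value for ordinary p-values,
# scientific notation once the value drops below an assumed threshold.
format_p <- function(p, threshold = 0.001) {
  if (p < threshold) {
    paste0("p=", toupper(formatC(p, format = "e", digits = 0)))
  } else {
    paste0("p=", signif(p, 2))
  }
}

labels <- c(format_p(0.06), format_p(0.9), format_p(7e-15))
```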

ChristinaSchmidt1 commented 1 year ago

Here we need to:

  1. Check that everything works and implement the testing of the input parameters
  2. Add into the vignette
  3. Add the helper function for nice plotting and saving
  4. Here it could be relevant to enable patchwork::wrap_plots, to save plots in a panel.
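
A minimal sketch of point 4, combining a list of ggplot objects into one panel with `patchwork::wrap_plots`. The plots are built from the built-in `mtcars` data purely for illustration.

```r
library(ggplot2)
library(patchwork)

# Build a small list of plots, one per variable (illustrative data only).
plot_list <- lapply(c("mpg", "hp", "wt", "qsec"), function(v) {
  ggplot(mtcars, aes(x = factor(cyl), y = .data[[v]])) +
    geom_boxplot() +
    labs(x = "cyl", y = v)
})

# Arrange the four plots in a 2-column panel.
panel <- wrap_plots(plot_list, ncol = 2)
```

The combined object can be passed to `ggsave` like a single plot.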

dprymidis commented 1 year ago

Ok, so it was indeed not working. There was a double comma somewhere :'( Anyway, now it's working like this.

VizSuperplot(Input_data = Intra_Preprocessed[,-c(1:3)],
         Input_SettingsFile = Intra_Preprocessed[,c(1:2)],
         Graprh_Style = "Box", # Bar, Box, Violin
         Superplot = NULL,
         OutputPlotName = "Box",
         Output_plots = "Individual",
         Selected_Conditions = NULL, # not added yet
         Selected_Comparisons = NULL, # not added yet
         Theme = theme_classic(),
         Save_as_Plot = "svg") # for together it is always pdf

Note 1: There seems to be an issue with the error bars. I think that in ggplot they use interquartile ranges around the median instead of the mean. We encountered this before but I don't remember exactly.
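
If the intent is error bars around the mean, one option is to compute them explicitly with `stat_summary`. This is a sketch on made-up data, assuming mean plus/minus one standard deviation is what is wanted; it does not diagnose the actual cause in VizSuperplot. For reference, `geom_boxplot` summarises by median and quartiles.

```r
library(ggplot2)

# Made-up values for two conditions.
toy <- data.frame(
  Conditions = rep(c("HK2", "786-O"), each = 4),
  Intensity  = c(9, 10, 11, 14, 13, 15, 16, 20)
)

# Bar height = mean, error bars = mean +/- one standard deviation.
p_bar <- ggplot(toy, aes(x = Conditions, y = Intensity)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(
    fun = mean,
    fun.min = function(x) mean(x) - sd(x),
    fun.max = function(x) mean(x) + sd(x),
    geom = "errorbar", width = 0.2
  )

# For comparison: the quartiles a box plot is based on.
box_stats <- tapply(toy$Intensity, toy$Conditions, quantile,
                    probs = c(0.25, 0.5, 0.75))
```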

Note 2 : Stats are not added yet

ChristinaSchmidt1 commented 1 year ago

I just had a look at the function and some points I noticed:

dprymidis commented 1 year ago

You call VizSuperplot like this:

VizSuperplot(Input_data = Intra_Preprocessed[,-c(1:3, 30:182)],
         Input_SettingsFile = Intra_Preprocessed[,c(1:2)],
         Input_SettingsInfo = c(conditions="Conditions", superplot = "Biological_Replicates"),
         Graph_Style = "Box", # Bar, Box, Violin
         Superplot = NULL,
         OutputPlotName = "",
         Individual_plots = TRUE,
         Selected_Conditions = c("786-M1A", "786-O", "HK2"),
         Selected_Comparisons = list(c(1,2), c(1,3), c(2,3)),
         Theme = theme_classic(),
         Save_as_Plot = "svg") # for together it is always pdf

Now about Selected_Conditions: if NULL, then all groups are plotted. If some are selected, then only those are plotted, in the same order as in the Selected_Conditions vector. About Selected_Comparisons: if NULL, then no stats are added. If one pair is given, then a t.test is done; if more than one pair is given, then an ANOVA.
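
A minimal sketch of this behaviour in base R. It assumes the index pairs in Selected_Comparisons refer to positions in the Selected_Conditions vector, which the call above suggests but does not state; the settings table is made up.

```r
Selected_Conditions  <- c("786-M1A", "786-O", "HK2")
Selected_Comparisons <- list(c(1, 2), c(1, 3), c(2, 3))

# Made-up settings table with one row per sample.
settings <- data.frame(
  Conditions = c("HK2", "786-O", "786-M1A", "786-M2A", "HK2", "786-O")
)

# NULL keeps every group; otherwise filter and order as in the vector.
keep <- if (is.null(Selected_Conditions)) {
  unique(settings$Conditions)
} else {
  Selected_Conditions
}
subset_settings <- settings[settings$Conditions %in% keep, , drop = FALSE]
subset_settings$Conditions <- factor(subset_settings$Conditions, levels = keep)

# Translate index pairs into condition names (assumed meaning of the indices).
named_pairs <- lapply(Selected_Comparisons, function(idx) keep[idx])

# Test choice as described: none for NULL, t-test for one pair, ANOVA for more.
test_used <- if (is.null(Selected_Comparisons)) {
  "none"
} else if (length(Selected_Comparisons) == 1) {
  "t.test"
} else {
  "anova"
}
```

Setting the factor levels from the vector is what makes ggplot draw the groups in the user's order rather than alphabetically.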

ChristinaSchmidt1 commented 1 year ago

Amazing, thanks for the update. Shall I start with the helper function for the plotting or are you still working on some of the points?

dprymidis commented 1 year ago

I did not double check if everything is working as it should with no problems. But yes, you can start on the helper function

ChristinaSchmidt1 commented 12 months ago

Ok I had a look at the function and tested the different functionalities. A couple of points:

The above are the things I noticed when going through the function thus far.

ChristinaSchmidt1 commented 10 months ago

Done. ~The facet-grid is moved to the general function.