saezlab / MetaProViz

R-package to perform metabolomics pre-processing, differential metabolite analysis, metabolite clustering and custom visualisations.
https://saezlab.github.io/MetaProViz/
GNU General Public License v3.0

MetaProViz::Pool_Estimation - Add QC plots and invisible assign #60

Closed ChristinaSchmidt1 closed 11 months ago

ChristinaSchmidt1 commented 1 year ago

Some points that came up during discussions:

  1. Do invisible assign and don't return to the global environment
  2. Add save results (TRUE FALSE) and save .csv file with CVs
  3. Rename parameter Therhold_cv to threshold_cv
  4. Check threshold cutoff - I think it should be 10%, but definitely not more than 40%
  5. Add QC plots: More details below
  6. Add save plots (TRUE FALSE)

About the QC plots:

  1. Make a histogram plot (y-axis = Frequency and x-axis = CV). Ideally this should show low CVs. We can draw a dotted line where the threshold was chosen.
  2. Violin plot or box plot (y-axis = CV and x-axis = All Metabolites as one sample) --> The highest CVs above the chosen threshold can be labelled with metabolite names
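
The two QC plots described above could look roughly like the following. This is only a sketch with ggplot2 on simulated values: the `cv_table` data frame, its column names and the 0.3 cutoff are assumptions for illustration, not the actual MetaProViz implementation.

```r
library(ggplot2)

# Hypothetical CV table: one row per metabolite (simulated values, not real data)
set.seed(1)
cv_table <- data.frame(
  Metabolite = paste0("Met_", 1:50),
  CV = rgamma(50, shape = 2, rate = 10)
)
threshold_cv <- 0.3  # assumed cutoff, to be chosen by the user

# 1. Histogram of CVs with a dotted line at the chosen threshold
hist_plot <- ggplot(cv_table, aes(x = CV)) +
  geom_histogram(bins = 20) +
  geom_vline(xintercept = threshold_cv, linetype = "dotted") +
  labs(x = "CV", y = "Frequency")

# 2. Violin plot of all metabolites as one group;
#    metabolites above the threshold are labelled by name
high_cv <- cv_table[cv_table$CV > threshold_cv, ]
violin_plot <- ggplot(cv_table, aes(x = "All metabolites", y = CV)) +
  geom_violin() +
  geom_hline(yintercept = threshold_cv, linetype = "dotted") +
  geom_text(data = high_cv, aes(label = Metabolite), size = 3, hjust = -0.2) +
  labs(x = NULL, y = "CV")
```

Neither plot is printed until it is called explicitly, e.g. `print(hist_plot)`.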
ChristinaSchmidt1 commented 1 year ago

QC plots: (two images attached)

dprymidis commented 12 months ago

For 1. This means that we don't return anything and nothing is printed except the messages/warnings.

2 and 6. I am using Save_as_Results and Save_as_Plot, as we used these for other functions. The user can set them to NULL if they don't want the results saved. I will also add a folder named Pool_Estimation in the Preprocessing (?) folder, and if any of these parameters is not NULL I will put the results in there.

  3. Done
  4. Generally, I understood that 1 is quite OK. 0.1 (10%) seems very small, and there are always some metabolites above the threshold.
  5. Done
ChristinaSchmidt1 commented 12 months ago
  1. Yes, this is the main feedback I got. And if there are multiple results, we just return them in a list as we do with the PreProcessing_res.
  2. and 6. Yes, that's perfect. We just add the results to the preprocessing folder.
  3. Nice :)
  4. Wait, isn't 1 = 100%? I thought 0.3 (=30%) would be an accepted threshold for CV.
  5. Thanks :)
dprymidis commented 12 months ago

So for 1, there are two things. The first is to return the "filtered dataset" if Unstable_featre=TRUE, and the second is to maybe return the CV table and the 3 plots (PCA, Hist, Violin). How do we go about it? Do we return a list with the dataset first and then the plots?

  4. Yes, 1 is 100%. It's a bit arbitrary, and indeed 1 seems like a lot. But generally I find this: https://stats.stackexchange.com/questions/566534/what-does-it-mean-that-the-coefficient-of-variation-equals-100#:~:text=A%20CV%20at%20or%20above,equal%20to%20its%20standard%20deviation.
ChristinaSchmidt1 commented 12 months ago
  1. Yeah we could make a list with two lists (Plots and DFs) and then within the list of Plots we have the QC plots and within the list of DFs we have the result DFs.

  4. How can this get higher than 1, if 1 is supposed to be 100%? I think I am missing something.
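
The proposed return structure (one list holding a list of data frames and a list of plots, returned invisibly instead of being assigned to the global environment) might be sketched as follows. The function name `pool_estimation_sketch`, its arguments and the element names are hypothetical and only illustrate the idea.

```r
# Hypothetical helper, not the actual MetaProViz::Pool_Estimation code
pool_estimation_sketch <- function(cv_table, threshold_cv = 0.3) {
  # Keep only metabolites at or below the CV threshold
  filtered <- cv_table[cv_table$CV <= threshold_cv, , drop = FALSE]

  res <- list(
    DFs = list(CV = cv_table, Filtered = filtered),
    Plots = list()  # QC plots (PCA, histogram, violin) would be stored here
  )

  # invisible() suppresses auto-printing but still returns the value
  invisible(res)
}

cv_table <- data.frame(Metabolite = c("A", "B", "C"), CV = c(0.05, 0.45, 0.20))

pool_estimation_sketch(cv_table)         # nothing is printed
res <- pool_estimation_sketch(cv_table)  # the result is available when assigned
res$DFs$Filtered
```

With this pattern the user decides the name of the result object, and nothing is written to `.GlobalEnv` as a side effect.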

dprymidis commented 12 months ago
  1. Ok, understood
  4. CV = sd/mean

    set.seed(789)
    random_values <- runif(3, min = 100, max = 2000)
    mean_value <- mean(random_values)
    sd_value <- sd(random_values)
    cv_value <- sd_value / mean_value

    random_values
    [1] 1429.7993  277.6479  122.5850
    mean_value
    [1] 610.0107
    sd_value
    [1] 714.1786
    cv_value
    [1] 1.170764

It can get higher than 1 if one value is very different from the others. But I have now noticed that it also depends on the number of samples you have.
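
The dependence on sample number can be made concrete. For non-negative values, the sample CV (using `sd()`, which divides by n - 1) cannot exceed sqrt(n), and it reaches that bound when all but one value are zero. A minimal sketch on made-up numbers:

```r
# CV as defined above: standard deviation divided by the mean
cv <- function(x) sd(x) / mean(x)

# Most extreme non-negative case: one large value, all others zero
for (n in c(3, 5, 10)) {
  x <- c(1000, rep(0, n - 1))
  cat(sprintf("n = %2d  CV = %.4f  sqrt(n) = %.4f\n", n, cv(x), sqrt(n)))
}
```

So with only three pool samples the CV is capped at about 1.73, while with more samples a single outlier can push it higher.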

ChristinaSchmidt1 commented 12 months ago

So I was just checking on the CV, and indeed that can happen when the SD is greater than the mean value. In this case the CV will be more than 100%, which means that on average, data points are very distant from the mean. For the threshold I always thought people would use 0.3 = 30%, but in the end it's something the user can change. We can check the metabolites that are above 0.3 and see if they truly appear variable across samples. Also, we can check the mean value, as the CV may be high at extremely low concentrations and low at large values.
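
A check along these lines could put the mean next to the CV for each metabolite, so that a high CV driven by a very low mean is easy to spot. The sketch below uses simulated pool samples; the data frame layout (rows = pool injections, columns = metabolites), the metabolite names and the 0.3 threshold are assumptions.

```r
set.seed(42)

# Simulated pool samples: rows = pool injections, columns = metabolites
pool <- data.frame(
  Met_high = rnorm(5, mean = 1000, sd = 50),
  Met_low  = abs(rnorm(5, mean = 2, sd = 1.5)),
  Met_mid  = rnorm(5, mean = 200, sd = 20)
)
threshold_cv <- 0.3  # assumed cutoff

# Per-metabolite mean, SD and CV
cv_table <- data.frame(
  Metabolite = names(pool),
  Mean = vapply(pool, mean, numeric(1)),
  SD   = vapply(pool, sd, numeric(1)),
  row.names = NULL
)
cv_table$CV <- cv_table$SD / cv_table$Mean
cv_table$AboveThreshold <- cv_table$CV > threshold_cv

# Sort by CV so the most variable metabolites come first
cv_table[order(-cv_table$CV), ]
```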

dprymidis commented 12 months ago

This is done.

However, there is still an issue: when we run the PCA, the plot gets printed and the grid is saved. So in order to get the plot you have to call plot(PoolEstimation_res[["Plots"]][[1]]), and just running PoolEstimation_res[["Plots"]] gives the PCA plot grid and the other 2 plots. This needs fixing.

Also, as for preprocessing we do assign("PreProcessing_res", preprocessing_output_list, envir=.GlobalEnv), I did assign("PoolEstimation_res", Pool_Estimation_res_list, envir=.GlobalEnv) here. Maybe we have to agree on when to assign and when to return.

I did this here* Yes indeed this check has to be done

ChristinaSchmidt1 commented 12 months ago

Just wanted to let you know that the vignette throws this error:

Error in ggplot(Pool_Estimation_result, aes(CV)) : object 'Pool_Estimation_result' not found

dprymidis commented 12 months ago

Should be working now.. sorry

ChristinaSchmidt1 commented 12 months ago

no worries, just wanted to note it down. Thanks!

ChristinaSchmidt1 commented 11 months ago

I was just checking the parameters in the Pool_Estimation function. Could you please change these:

  1. Input_SettingsInfo: c(Conditions=ColumnNameConditions, Name=NamePoolCondition). If Input_SettingsInfo=NULL, we assume that the column is called Conditions.
  2. Input_SettingsFile: If provided, we do require Name, and if the Conditions column is not called Conditions, the user needs to provide the name.

In this way we make it more flexible and the user could pass any column name for Conditions as in the other functions.
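
One way to read this proposal is sketched below: the conditions column defaults to `Conditions`, can be overridden through `Input_SettingsInfo`, and `Name` is required to identify the pool samples. The helper `get_pool_rows` and the example data are hypothetical; the actual argument checks in MetaProViz may differ.

```r
# Hypothetical helper illustrating the proposed parameter handling
get_pool_rows <- function(Input_SettingsFile, Input_SettingsInfo = NULL) {
  info <- as.list(Input_SettingsInfo)

  # Name is required once a settings file is provided
  if (is.null(info[["Name"]])) {
    stop("Input_SettingsInfo must contain 'Name' (the pool condition).")
  }

  # Fall back to the default column name when none is given
  cond_col <- if (is.null(info[["Conditions"]])) "Conditions" else info[["Conditions"]]
  if (!cond_col %in% colnames(Input_SettingsFile)) {
    stop("Column '", cond_col, "' not found in Input_SettingsFile.")
  }

  Input_SettingsFile[Input_SettingsFile[[cond_col]] == info[["Name"]], , drop = FALSE]
}

# Custom column name for the conditions
settings <- data.frame(Group = c("Pool", "Ctrl", "Pool", "KO"))
pool_custom <- get_pool_rows(settings, c(Conditions = "Group", Name = "Pool"))

# Default column name
settings_default <- data.frame(Conditions = c("Pool", "Ctrl", "KO"))
pool_default <- get_pool_rows(settings_default, c(Name = "Pool"))
```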

dprymidis commented 11 months ago