Preprocessing - Add `Pool_Estimation`

dprymidis commented 1 year ago

Add a function to check the dispersion of each metabolite in the pooled samples. Report and possibly remove high-variance metabolites.

ChristinaSchmidt1 commented 1 year ago

We can return a vector with the high-variance metabolites, so the user can easily remove them :)

ChristinaSchmidt1 commented 1 year ago

I would probably call this: metabolite detection estimation by pool sample dispersion

ChristinaSchmidt1 commented 1 year ago

Let's use this name for the function: `Pool_Estimation`

dprymidis commented 1 year ago

After today's meeting we concluded to do the following:

1. Calculate the coefficient of variation (CV) for each metabolite.
2. Log-transform the data. Use the Shapiro test to check normality, and measure the standard error of the mean and/or of the median, the median being more robust to outliers. 2.5* Maybe check the other statistical measures available for a better option.
3. Make a table with a column of scores for each metabolite and a total column with the consensus result.
4. Add the table to the global environment.
5. The input data can be the whole dataset or only the pooled samples. If the input is just the pooled samples, then just do the above. If the input is the whole dataset, then also do the following:
   - Add a parameter, `unstable_feature_remove = T/F`, to automatically remove (or not) the metabolites found to have high variance. If the parameter is TRUE, the user has to save the object returned by the function in a variable in order to get the filtered input dataset, i.e. `x <- function(x)` (x is the filtered input dataset).
   - If the user passes the whole dataset as input, they also have to add the InputSettingsfile and pass the name of the pooled samples in the conditions, i.e. `InputSettingInfo = c(Conditions= "pooled Samples")`.
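The plan above can be sketched roughly as follows. This is a minimal illustration and not the MetaProViz implementation: the function name `pool_cv_sketch`, its arguments, the default `cv_threshold` and the toy data are all hypothetical, and only the CV score is computed (the normality check and the SEM would add further columns to the table).

```r
# Hypothetical helper for illustration only, not the package function.
# data:       numeric data frame, samples in rows, metabolites in columns
# conditions: character vector with one condition label per row of `data`
pool_cv_sketch <- function(data, conditions, pool_name,
                           cv_threshold = 1, remove = FALSE) {
  # Keep only the pooled samples
  pool <- data[conditions == pool_name, , drop = FALSE]

  # Coefficient of variation per metabolite, ignoring NAs
  cv <- vapply(pool,
               function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE),
               numeric(1))

  # Score table: one row per metabolite plus a high-variance flag
  scores <- data.frame(Metabolite = names(cv),
                       CV = unname(cv),
                       HighVariance = unname(cv > cv_threshold),
                       stringsAsFactors = FALSE)

  # which() drops metabolites whose flag is NA
  unstable <- scores$Metabolite[which(scores$HighVariance)]

  # Optionally drop the flagged metabolites from the whole dataset
  filtered <- if (remove) {
    data[, !(names(data) %in% unstable), drop = FALSE]
  } else {
    data
  }

  list(scores = scores, unstable = unstable, data = filtered)
}

# Toy data, made up for this example
toy <- data.frame(
  MetA = c(10, 11, 9, 10, 50, 52),
  MetB = c(1, 20, 2, 40, 5, 6)
)
cond <- c("Pool", "Pool", "Pool", "Pool", "Ctrl", "Ctrl")

res <- pool_cv_sketch(toy, cond, pool_name = "Pool", remove = TRUE)
res$scores    # score table for the pooled samples
res$unstable  # metabolites flagged as high variance
```

Returning the filtered dataset as part of the result is what makes the "save the returned object in a variable" step necessary when removal is switched on.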

dprymidis commented 1 year ago

Note: what do we do with NAs? NAs should not exist, as the pooled samples are used for metabolite identification (for example in Compound Discoverer). For the calculation of CV and SE we ignore the NAs just in case. Maybe not?

For 2.: in the vignette data (as an example), without log transformation 93.41% are normally distributed and 6.59% are not. After taking the log, 92.31% are normally distributed and 7.69% are not. In this case, taking the log actually makes the data "worse". However, by doing this we "ensure" the general normality.
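A comparison of this kind could be reproduced along these lines. This is only a sketch: the data are simulated rather than taken from the vignette, `frac_normal` is a hypothetical helper, and the 0.05 significance level is an assumption.

```r
set.seed(1)

# Simulated pooled-sample intensities (made up): 8 pool injections, 5 metabolites
pool <- as.data.frame(matrix(rlnorm(8 * 5, meanlog = 10, sdlog = 0.2), nrow = 8))

# Fraction of metabolites for which the Shapiro-Wilk test does not reject normality
frac_normal <- function(df, alpha = 0.05) {
  p <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))
  mean(p > alpha)
}

raw_frac <- frac_normal(pool)       # on the raw scale
log_frac <- frac_normal(log(pool))  # after log transformation
c(raw = raw_frac, log = log_frac)
```

Note that `shapiro.test` needs at least 3 non-missing values per metabolite, so very small pools cannot be tested this way.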

Again for 2.: SEM = sd/sqrt(sample_size). Depending on whether or not we take the log of the data, we get a different SEM, because it is affected by the sample mean. So we cannot have a standard threshold. By taking the ratio SEM/mean we get a value that does not depend on the sample mean, which means that we can have a standard threshold. I used this.
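To illustrate why the ratio allows a common threshold, here is a small sketch. The helper names `sem` and `sem_ratio` and the numbers are hypothetical; the point is only that rescaling the intensities changes the SEM but leaves SEM/mean unchanged.

```r
# Standard error of the mean, ignoring NAs
sem <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# SEM divided by the sample mean
sem_ratio <- function(x) {
  sem(x) / mean(x, na.rm = TRUE)
}

x        <- c(10, 11, 9, 10)  # made-up pooled intensities
x_scaled <- x * 1000          # same metabolite on a 1000x larger scale

sem(x); sem(x_scaled)              # differ by the scale factor
sem_ratio(x); sem_ratio(x_scaled)  # identical
```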

I added both SEMean and SEMedian. It turned out that SEMedian is a scaled version of SEMean, so SEMedian does not actually provide anything additional. It is just a little stricter than the mean. Also, it seems that SEMean and CV give "similar" results with thresholds of CV = 1 and SEM_ratio = 0.1. Again, the SEM_ratio is a scaled version of CV, so it makes sense. However, the SEM also takes the sample size into account.
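The scaling relations can be written out explicitly. Here $s$ is the sample standard deviation, $\bar{x}$ the sample mean and $n$ the number of pooled samples. The first line follows directly from the definitions used in this thread. The second line is the usual large-sample approximation for normally distributed data, and whether SEMedian was computed this way here is an assumption.

$$
\mathrm{SEM\_ratio} = \frac{\mathrm{SEM}}{\bar{x}} = \frac{s/\sqrt{n}}{\bar{x}} = \frac{\mathrm{CV}}{\sqrt{n}}
$$

$$
\mathrm{SE}_{\mathrm{median}} \approx \sqrt{\frac{\pi}{2}}\,\mathrm{SEM} \approx 1.2533\,\mathrm{SEM}
$$

So for a fixed $n$, thresholding SEM_ratio is equivalent to thresholding CV at $\sqrt{n}$ times that value. The thresholds CV = 1 and SEM_ratio = 0.1 would coincide exactly only for $\sqrt{n} = 10$.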

Also, assigning 2 "things" to the global environment worked for me. This worked:

I also found this for this issue: https://stackoverflow.com/questions/9726705/assign-multiple-objects-to-globalenv-from-within-a-function
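For reference, one common way to assign several objects to the global environment from inside a function is base R's `list2env`. This is a generic sketch and not the snippet originally posted here. The function and object names are hypothetical.

```r
# Hypothetical example: export two result objects from inside a function
export_to_global_sketch <- function() {
  results <- list(
    pool_scores_demo   = data.frame(Metabolite = "MetA", CV = 0.08),
    pool_unstable_demo = character(0)
  )
  # Each list element becomes a variable of the same name in the global environment
  list2env(results, envir = .GlobalEnv)
  invisible(names(results))
}

export_to_global_sketch()
exists("pool_scores_demo")
exists("pool_unstable_demo")
```

Calling `assign(name, value, envir = .GlobalEnv)` once per object achieves the same thing.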

dprymidis commented 1 year ago

This is done.

I need to check the parameter and output file names.
I still need to check the other statistical measures to maybe find a better option.

dprymidis commented 1 year ago

Compound Discoverer uses a group-wise coefficient of variation with a threshold of 20. I don't know exactly what it does with this yet.

dprymidis commented 1 year ago

This is done. I kept a personal list of papers that I am going through for the measures of dispersion. I will fill you in at some point.

ChristinaSchmidt1 commented 1 year ago

Thank you, that's great! You can also drop some comments/links into the vignette (the standard one).

saezlab / MetaProViz

Preprocessing - Add `Pool_Estimation` #25