saezlab / MetaProViz

R-package to perform metabolomics pre-processing, differential metabolite analysis, metabolite clustering and custom visualisations.
https://saezlab.github.io/MetaProViz/
GNU General Public License v3.0

pre-processing MVI #21

Closed ChristinaSchmidt1 closed 1 year ago

ChristinaSchmidt1 commented 1 year ago

I want to apply the missing value imputation (half minimum) per feature rather than on the whole DF, since the features span a huge range of numeric values in metabolomics.

Moreover, after discussing with Aurelien I noticed that we should use NA instead of 0 when returning the DF (if a user did not choose to do missing value imputation). He also raised a good point: we should add a disclaimer that missing value imputation using HM is only recommended if the user knows why the values are missing (missing not at random, with a biological reason). We should give them the option to choose whether to use half minimum (50%), or more or less of the smallest value detected for this feature.
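A minimal sketch of the per-feature idea, written in Python purely for illustration (this is not the MetaProViz R code). The function name `impute_half_min`, the argument `mvi_percentage`, and the toy values are all assumptions made for the example; `None` stands in for NA.

```python
def impute_half_min(table, mvi_percentage=50):
    """Impute missing values per feature with a percentage of that feature's minimum.

    `table` maps a feature name to its list of values; None marks a missing value.
    """
    factor = mvi_percentage / 100
    imputed = {}
    for feature, values in table.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # Nothing was detected for this feature, so there is no minimum to scale.
            imputed[feature] = list(values)
            continue
        fill = min(observed) * factor
        imputed[feature] = [fill if v is None else v for v in values]
    return imputed


# Toy data (hypothetical): two features on very different scales.
data = {
    "lactate": [1200000.0, None, 900000.0],
    "serine": [35.0, 20.0, None],
}
result = impute_half_min(data)                        # default: 50% of the feature minimum
result_25 = impute_half_min(data, mvi_percentage=25)  # a user-chosen percentage
```

With these toy values, a single half minimum taken over the whole table would be 10 (half of the smallest serine value), which would make little sense as a fill value for lactate. Taking the minimum per feature keeps each imputed value on the scale of its own feature.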

dprymidis commented 1 year ago

Yes, I agree, feature-wise HM seems the better approach. About using something other than 50%: I don't know for what reason anyone would select another percentage.

ChristinaSchmidt1 commented 1 year ago

Nice, so we are on the same page. The reason for the different percentages is that one needs them to test that the imputation is not random. This is not something I would do right away, but I would comment on it in the discussion.
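One possible reading of this is a sensitivity check: repeat the imputation with a few percentages and see how much a downstream value such as the Log2FC moves. The sketch below is in Python for illustration only; the helper names and the toy values are hypothetical, and it is not how MetaProViz computes anything.

```python
import math


def impute_feature(values, mvi_percentage):
    """Replace None with mvi_percentage % of the smallest observed value of the feature."""
    fill = min(v for v in values if v is not None) * mvi_percentage / 100
    return [fill if v is None else v for v in values]


def log2fc(control, treated):
    """Log2 fold change of the group means (treated over control)."""
    return math.log2((sum(treated) / len(treated)) / (sum(control) / len(control)))


# Toy values for one metabolite: three control samples, then three treated samples.
# The missing value sits in the treated group.
feature = [40.0, 44.0, 36.0, 12.0, None, 8.0]

sensitivity = {}
for pct in (25, 50, 75):
    imputed = impute_feature(feature, pct)  # the minimum is taken over the whole feature
    sensitivity[pct] = log2fc(imputed[:3], imputed[3:])
    print(f"MVI_Percentage={pct}: log2FC={sensitivity[pct]:.2f}")
```

If the conclusion for a metabolite flips between percentages, that is a hint that the result depends more on the imputation choice than on the measured values.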

ChristinaSchmidt1 commented 1 year ago

  1. implement MVI per feature
  2. implement parameter for MVI_Percentage=50 (default)

ChristinaSchmidt1 commented 1 year ago

I was pointed towards the package proDA (https://www.bioconductor.org/packages/release/bioc/html/proDA.html) for proteomics data, which does MVI as part of the Log2FC calculation. So it is worth checking how they have done this.

dprymidis commented 1 year ago

  1. Done.
  2. Done.
  3. When MVI=FALSE the NAs are re-added in the output processed data.

I checked proDA a little bit. It seems to use Bayesian statistics to fit models that take into consideration the probability of missing values occurring at a certain variable abundance. I have to go through the paper. I will put this on my list to check after the other issues are dealt with.
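To make the general idea concrete (the chance that a value is missing depends on its abundance), here is a toy Python illustration. It is only a sketch under my own assumptions: a decreasing sigmoid on a log-scale abundance, with hypothetical parameter names `midpoint` and `scale`. It is not proDA's API or its fitted model.

```python
import math


def dropout_probability(log_abundance, midpoint, scale):
    """Probability that a value is missing, as a decreasing sigmoid of abundance.

    Uses 1 - Phi(z), where Phi is the standard normal CDF and
    z = (log_abundance - midpoint) / scale. Both parameters are hypothetical.
    """
    z = (log_abundance - midpoint) / scale
    return 0.5 * math.erfc(z / math.sqrt(2))


# Low-abundance values are more likely to be missing than high-abundance ones.
low = dropout_probability(16.0, midpoint=20.0, scale=2.0)
mid = dropout_probability(20.0, midpoint=20.0, scale=2.0)
high = dropout_probability(24.0, midpoint=20.0, scale=2.0)
```

Under such a curve a missing value is informative: it suggests the true abundance was probably low, which is the "missing not at random" situation discussed above.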

ChristinaSchmidt1 commented 1 year ago

Amazing job, thanks!

About proDA: if I have understood this correctly, what you describe is actually done as part of the Log2FC calculation. So it is not MVI as such, but rather something for users who also want to calculate Log2FC and stats from metabolites where some missing values are present. If this is the case, we can move proDA to DMA as an enhancement.

Just double-checking: "When MVI=FALSE the NAs are re-added in the output processed data." Do you mean that we change NAs to 0 in the function and you change this back?

dprymidis commented 1 year ago

About MVI: yes. Initially we change all zeros to NAs. When MVI==FALSE I save the positions of the NAs in the dataset

image

and at the end, when we make the output processed dataset, I put NA in the stored positions.

image
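The save-and-restore step described here can be sketched roughly as follows. This is a Python illustration only, with hypothetical helper names and toy values; the intermediate "processing" is a stand-in, and `None` plays the role of NA.

```python
def zeros_to_missing(table):
    """Treat 0 as 'not detected' and mark it as missing (None)."""
    return {f: [None if v == 0 else v for v in vals] for f, vals in table.items()}


def missing_positions(table):
    """Record (feature, sample index) for every missing value."""
    return {(f, i) for f, vals in table.items() for i, v in enumerate(vals) if v is None}


def restore_missing(table, positions):
    """Put None back at the stored positions, leaving all other values untouched."""
    return {
        f: [None if (f, i) in positions else v for i, v in enumerate(vals)]
        for f, vals in table.items()
    }


# Toy input (hypothetical) where 0 means "not detected".
raw = {"lactate": [1200.0, 0, 900.0], "serine": [35.0, 20.0, 0]}

with_na = zeros_to_missing(raw)
na_positions = missing_positions(with_na)  # saved because MVI is switched off

# Stand-in for intermediate steps that need a complete numeric table.
filled = {f: [0.0 if v is None else v for v in vals] for f, vals in with_na.items()}
processed = {f: [v / 10 for v in vals] for f, vals in filled.items()}

output = restore_missing(processed, na_positions)
```

The point of storing positions rather than testing for 0 at the end is that a processed value could legitimately end up as 0, and it should not be turned into NA by mistake.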

About proDA: yes, it says this:

image

So it's not really doing imputation here. But here: https://support.bioconductor.org/p/122916/

I understood that they fit the model and then use that model to make a full matrix from one with missing values.

ChristinaSchmidt1 commented 1 year ago

Ok perfect, so then MVI is done :)

Let's make proDA its own issue and come back to it later, as other things have higher priority now. As I know some of the people who developed proDA, we can also try to talk to them if needed.