Closed ChristinaSchmidt1 closed 1 year ago
Yes, I agree, feature-wise HM seems the better approach. As for using something other than 50%, I don't know for what reason anyone would select another percentage.
Nice, so we are on the same page. The reason for offering different percentages is that you need them if you want to test that the imputation is not random. This is not something I would do right away, but rather comment on in the discussion.
I was pointed towards the package proDA (https://www.bioconductor.org/packages/release/bioc/html/proDA.html) for proteomics data, which does MVI as part of the Log2FC calculation. It would be good to check how they have done this.
I checked proDA a little bit. It seems to use Bayesian statistics to fit models that take into consideration the probability of missing values occurring at a certain variable abundance. I have to go through the paper. I will put this on my list to check after the other issues are dealt with.
Amazing job, thanks!
About proDA: if I have understood this correctly, what you describe is actually done as part of the Log2FC calculation. So it is not MVI as such, but rather something for users who also want to calculate Log2FC and stats for metabolites where some missing values are present. If this is the case, we can move proDA to DMA as an enhancement.
Just double checking: when MVI=FALSE, the NAs are re-added in the output processed data. You mean that we change NAs to 0 in the function and you change this back?!
About MVI: yes. Initially we change all zeros to NAs. When MVI==FALSE, I save the positions of the NAs in the dataset, and at the end, when we make the output processed dataset, I put NA in the stored positions.
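To make the bookkeeping described above concrete, here is a minimal sketch in plain Python. It is not the package's code: the function names (`zeros_to_na`, `na_positions`, `restore_na`) are hypothetical, `None` stands in for NA, and the "processing" step is a placeholder that fills NAs with 0 and doubles every value.

```python
from typing import Optional

Matrix = list[list[Optional[float]]]


def zeros_to_na(data: Matrix) -> Matrix:
    """Return a copy in which every 0 is replaced by None (standing in for NA)."""
    return [[None if value == 0 else value for value in row] for row in data]


def na_positions(data: Matrix) -> set[tuple[int, int]]:
    """Collect the (row, column) index of every NA so it can be restored later."""
    return {
        (i, j)
        for i, row in enumerate(data)
        for j, value in enumerate(row)
        if value is None
    }


def restore_na(processed: Matrix, positions: set[tuple[int, int]]) -> Matrix:
    """Put NA back at the stored positions of the processed data."""
    return [
        [None if (i, j) in positions else value for j, value in enumerate(row)]
        for i, row in enumerate(processed)
    ]


raw = [[10.0, 0, 3.0], [4.0, 5.0, None]]
with_na = zeros_to_na(raw)
positions = na_positions(with_na)

# Placeholder for the real processing: it needs numbers, so NAs become 0 here.
processed = [[0.0 if v is None else v * 2 for v in row] for row in with_na]

# MVI == FALSE: the user did not ask for imputation, so NAs are re-added.
restored = restore_na(processed, positions)
```

Storing the positions before processing means the output never reports a 0 that was really a missing value.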
About proDA, yes, it says this:
So it's not really doing imputation here. But from this thread: https://support.bioconductor.org/p/122916/
I understood that they build the model and then use that model to make a full matrix from one with missing values.
Ok perfect, so then MVI is done :)
Let's make proDA its own issue, and we can come back to it later, as other things have higher priority now. As I know some of the people who developed proDA, we can also try to talk to them if needed.
I want to apply the missing value imputation (half minimum) per feature and not on the whole DF, since the features have a huge range of numeric values in metabolomics.
Moreover, after discussing with Aurelien, I noticed that we should use NA instead of 0 when returning the DF (if the user did not choose to do missing value imputation). He also raised a good point: we should add a disclaimer that missing value imputation using HM is only recommended if the user is aware of why the features are missing (missing not at random, with a biological reason). We should give users the option to choose whether they want to use half minimum (50%), or more or less of the smallest value detected for the feature.
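As a rough illustration of the proposal, the sketch below imputes per feature with a configurable fraction of that feature's smallest detected value. It is written in plain Python under a few assumptions: rows are samples, columns are features, `None` stands in for NA, and the name `impute_min_fraction` and its `fraction` parameter are hypothetical.

```python
from typing import Optional

Matrix = list[list[Optional[float]]]


def impute_min_fraction(data: Matrix, fraction: float = 0.5) -> Matrix:
    """Fill each NA with `fraction` times the smallest observed value of its feature.

    fraction=0.5 corresponds to half-minimum (HM) imputation.
    A feature with no observed values at all is left as NA.
    """
    n_features = len(data[0]) if data else 0
    fills: list[Optional[float]] = []
    for j in range(n_features):
        observed = [row[j] for row in data if row[j] is not None]
        fills.append(min(observed) * fraction if observed else None)
    return [
        [fills[j] if value is None else value for j, value in enumerate(row)]
        for row in data
    ]


# Two features on very different scales, as is typical in metabolomics.
data = [[1000.0, 0.5], [None, 0.25], [4000.0, None]]

half_min = impute_min_fraction(data)           # default: 50% of the feature minimum
quarter_min = impute_min_fraction(data, 0.25)  # user-chosen percentage
```

With these toy values, the first feature is filled with 500.0 and the second with 0.125. A minimum taken over the whole DF would instead be 0.25, so the first feature would be filled with 0.125, far below its observed range of 1000 to 4000. That is the motivation for working per feature.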