lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

Mixed imputation #513

Closed Mahmoudhallal closed 6 months ago

Mahmoudhallal commented 4 years ago

I followed the mixed imputation example on naset and on my own dataset, and I have 2 questions/possible bugs:

1) In the following example:

x <- impute(naset, method = "mixed", randna = fData(naset)$randna, mar = "knn", mnar = "min")

the MNAR values are replaced by 0.029, which is not the dataset minimum (0.014). The value 0.029 is the minimum of the subset of rows flagged as MNAR only. This was not clear to me; could you elaborate please?

2) In the naset fData, randna is a logical vector that marks the missing values of every row as MAR (TRUE) or MNAR (FALSE); the rows that don't need imputation (no missing values) are also set to TRUE. I wonder why the rows with no missing values are included as TRUE? If you try imputing the MNAR values with MinProb, it fails with an error "[1] NA There were 16 warnings (use warnings() to see them)" and NaNs are introduced. Assigning FALSE to the rows with no missing values solves the problem. I was wondering what the logic is behind including these rows as MNAR or MAR.

Thank you.

lgatto commented 3 years ago
  1. Indeed, in the implementation, the data are split into two subsets based on randna. The min value used in the mnar subset is the minimum value in the naset[!fData(naset)$randna, ] subset, and not the min of the whole data.

  2. The randna argument is expected to have length equal to nrow(naset) and is used to split the data as described above, and I can indeed reproduce your example. There was no specific reason to set the features without missing values to TRUE, and I did not anticipate this specific issue with MinProb (and it's not clear whether there's a reason for this or whether it's a bug in imputeLCMD::impute.MinProb()). Setting randna to FALSE for the features without missing values doesn't trigger the buggy line in imputeLCMD::impute.MinProb().

    As your example indicates, the randna value for these proteins without missing values is relevant, and one could argue whether they should be considered in either, both or none of the mar and mnar subsets. I am open to comments.
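
To make the two points above concrete, here is a minimal base R sketch of the split. It is not the MSnbase implementation: the matrix x, the vector randna2 and all values are made up for illustration (0.014 and 0.029 are chosen to echo the numbers in the question), and no MSnbase or imputeLCMD function is called.

```r
# Toy intensity matrix: rows are features, columns are samples (hypothetical values).
x <- matrix(c(0.014, 0.50, 0.30,   # complete row, flagged TRUE as in naset
              NA,    0.40, 0.60,   # row with MAR missing values
              0.029, NA,   0.80,   # row with MNAR missing values
              NA,    0.90, 0.10),  # row with MNAR missing values
            nrow = 4, byrow = TRUE)
randna <- c(TRUE, TRUE, FALSE, FALSE)

# Split by randna; only the MNAR part is followed here.
mnar <- x[!randna, , drop = FALSE]

# "min"-style replacement restricted to the MNAR subset.
mnar_min <- min(mnar, na.rm = TRUE)
mnar[is.na(mnar)] <- mnar_min

# Minimum over the whole matrix, for comparison.
global_min <- min(x, na.rm = TRUE)

# Workaround from the question: flag rows without missing values as FALSE.
complete <- rowSums(is.na(x)) == 0
randna2 <- randna
randna2[complete] <- FALSE
```

In this toy case mnar_min is 0.029 while global_min is 0.014, because the smallest value sits in a row flagged TRUE. With randna2 the complete row moves into the MNAR subset, so the subset minimum there becomes 0.014; which subset the complete rows are assigned to therefore affects the imputed value as well.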