MLE for proteomics data imputation

rformassspectrometry / MsCoreUtils

Core Utils for Mass Spectrometry Data

https://rformassspectrometry.github.io/MsCoreUtils/

MLE for proteomics data imputation #109

Open ginnyintifa opened 1 year ago

ginnyintifa commented 1 year ago

Dear Team,

MLE is one of the imputation options; it calls the em.norm and imp.norm functions from the norm package and is applied with MARGIN = 2.

I think MARGIN = 2 is a reasonable setting, since the p × n original data matrix (features in rows and samples in columns) would be transposed before being sent to the EM algorithm. Therefore, when running EM, each feature would be an actual gene/protein/peptide.

But the issue is that proteomics data is always p >> n. For example, a TMT global proteome data set would have ~20000 proteins and a dozen samples. With that many features, the EM algorithm is very expensive.

I am trying this data set (10k * 24) with the impute_mle function and haven't got any results yet.

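library(data.table)  # provides fread()
# Read the TMT abundance table as a plain data.frame
dtmt = fread("ccRCC_prot_abundance_MD_3plex.tsv",
          stringsAsFactors = FALSE, data.table = FALSE)
# Drop the first five columns and keep the remaining ones as a matrix
dd = as.matrix(dtmt[, -c(1:5)])
dtmt_res = MsCoreUtils::impute_mle(dd)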

Do you have any insights on this issue?

Thank you very much!

lgatto commented 1 year ago

I don't have any suggestions in terms of speeding up the underlying implementation. You could possibly try to split your data into chunks and parallelise the imputation.
There's also the impute_mle2() function (see #100). I'll update the documentation, as I now see that it isn't explicitly mentioned in the MLE imputation paragraph.
Setting MARGIN == 2 imputes along the columns. If you want to impute along the features, you need to set it to 1. If you see a different behaviour, it's a bug and please do let me know. The discussion about the margins is actually more involved, I think, and will also depend on downstream applications.
As for imputation in general, I do think it's not straightforward, and my advice would be to (1) filter features that have too many missing values and (2) not to impute, unless you have to.
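
The two suggestions above (filtering features with too many missing values, and imputing in chunks) can be sketched in base R. This is only an illustration under stated assumptions: `filter_features_by_na()`, `impute_in_chunks()` and `row_mean_impute()` are hypothetical helpers, not part of MsCoreUtils, the 50% threshold and the chunk size are arbitrary, and the toy example uses row-mean imputation as a stand-in rather than MLE.

```r
# Hypothetical helper: keep features (rows) whose proportion of missing
# values does not exceed `max_na`.
filter_features_by_na <- function(x, max_na = 0.5) {
    stopifnot(is.matrix(x), max_na >= 0, max_na <= 1)
    x[rowMeans(is.na(x)) <= max_na, , drop = FALSE]
}

# Hypothetical helper: split the features (rows) into consecutive chunks
# and apply `impute_fun` to each chunk independently.
impute_in_chunks <- function(x, impute_fun, chunk_size = 1000L, ...) {
    stopifnot(is.matrix(x), is.function(impute_fun), chunk_size >= 1)
    if (nrow(x) == 0L) return(x)
    # Chunk index for every row: 1, 1, ..., 2, 2, ...
    chunk_id <- ceiling(seq_len(nrow(x)) / chunk_size)
    chunks <- lapply(split(seq_len(nrow(x)), chunk_id),
                     function(i) impute_fun(x[i, , drop = FALSE], ...))
    # Re-assemble the chunks in their original row order
    res <- do.call(rbind, unname(chunks))
    dimnames(res) <- dimnames(x)
    res
}

# Stand-in imputation (NOT MLE): replace each NA by the mean of its row.
row_mean_impute <- function(m) {
    na <- is.na(m)
    m[na] <- rowMeans(m, na.rm = TRUE)[row(m)[na]]
    m
}

# Toy data: 6 features x 4 samples with some missing values.
x <- matrix(c( 1,  2, NA,  4,
              NA, NA, NA,  8,
               9, 10, 11, 12,
              13, NA, 15, 16,
              17, 18, 19, 20,
              NA, 22, 23, 24),
            nrow = 6, byrow = TRUE,
            dimnames = list(paste0("f", 1:6), paste0("s", 1:4)))

x_filt <- filter_features_by_na(x, max_na = 0.5)  # drops f2 (75% missing)
x_imp <- impute_in_chunks(x_filt, row_mean_impute, chunk_size = 2L)
```

To try this with the actual MLE imputation, `MsCoreUtils::impute_mle` could be passed as `impute_fun` (with `MARGIN` forwarded through `...`), assuming the norm package is installed. Each chunk would then get its own parameter estimates, so the result is not expected to match imputing the whole matrix at once. On Unix-like systems, replacing `lapply` with `parallel::mclapply` is one way to run the chunks in parallel.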

lgatto commented 1 year ago

By the way, if you are processing quantitative proteomics data, I highly recommend giving the QFeatures package a go.

hsiaoyi0504 commented 2 days ago

@lgatto Have there been any recent changes to MLE? We are using imputation from MSnbase in a class. We noticed that something seems to have changed between versions, and the data now takes forever to be imputed using MLE.

lgatto commented 2 days ago

@hsiaoyi0504 - there have been changes in the past, such as adding support for the norm2 package (about 2 years ago), and then dropping it again last year because it was removed from CRAN. About 2 years ago, we also added a MARGIN argument that defines whether row- or column-wise imputation should be done.
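
As a reminder of what a row- versus column-wise margin means, here is a minimal sketch following the base R `apply()` convention (1 = rows, 2 = columns). `impute_mean_by_margin()` is a hypothetical helper that uses mean imputation, not MLE; how MsCoreUtils applies MARGIN within each imputation method is not shown here and is best checked in the package documentation.

```r
# Hypothetical helper (mean imputation, NOT MLE): replace each NA by the
# mean of its row (MARGIN = 1) or of its column (MARGIN = 2).
impute_mean_by_margin <- function(x, MARGIN = 2L) {
    stopifnot(is.matrix(x), MARGIN %in% c(1L, 2L))
    means <- apply(x, MARGIN, mean, na.rm = TRUE)
    na <- is.na(x)
    # Look up the mean belonging to each missing cell's row or column
    idx <- if (MARGIN == 1L) row(x)[na] else col(x)[na]
    x[na] <- means[idx]
    x
}

# Toy matrix: 2 features (rows) x 3 samples (columns), one missing value
x <- matrix(c(1, NA, 3,
              4,  5, 9), nrow = 2, byrow = TRUE)

by_row <- impute_mean_by_margin(x, MARGIN = 1L)  # uses the other values in row 1
by_col <- impute_mean_by_margin(x, MARGIN = 2L)  # uses the other values in column 2
```

With the same missing cell, the two margins borrow information from different values, which is why the choice matters for downstream analysis.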