rformassspectrometry / MsCoreUtils

Core Utils for Mass Spectrometry Data
https://rformassspectrometry.github.io/MsCoreUtils/

MLE for proteomics data imputation #109

Open ginnyintifa opened 1 year ago

ginnyintifa commented 1 year ago

Dear Team,

MLE is one of the imputation options. It calls the em.norm and imp.norm functions from the norm package and is run with MARGIN = 2.

I think MARGIN = 2 is a reasonable setting, since the p*n original data matrix (features in rows and samples in columns) would be transposed before being sent to the EM algorithm. Therefore, when running EM, each feature would be an actual gene/protein/peptide.

But the issue is that proteomics data is always p >> n. For example, a TMT global proteome data set would have ~20,000 proteins and a dozen samples. With such a large number of features, the EM algorithm is very expensive.
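
To put rough numbers on that cost, here is a back-of-the-envelope sketch in base R. It is not taken from the MsCoreUtils or norm source: the helper names `n_params` and `cov_gb` are made up for illustration, and the only assumption is that EM fits a multivariate normal with a full, dense covariance matrix over whichever dimension is treated as the variables.

```r
# Number of free parameters of a k-variate normal model:
# k means plus k * (k + 1) / 2 distinct covariance entries.
# (Hypothetical helper, for illustration only.)
n_params <- function(k) k + k * (k + 1) / 2

# Memory in GB (1e9 bytes) for one dense k x k covariance matrix
# stored as 8-byte doubles. (Hypothetical helper.)
cov_gb <- function(k) k^2 * 8 / 1e9

# Compare treating the 24 samples vs. the 10,000 features as variables
dims <- c(samples = 24, features = 10000)
data.frame(variables     = dims,
           parameters    = n_params(dims),
           cov_matrix_gb = cov_gb(dims))
```

With 24 variables the model has a few hundred parameters. With 10,000 variables it has about 50 million, estimated from only 24 observations, and a single covariance matrix already takes roughly 0.8 GB.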

I am trying the impute_mle function on this data set (10k * 24) and haven't got any results yet.

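library(data.table)  # provides fread()
# Read the TMT protein abundance table as a plain data.frame
dtmt <- fread("ccRCC_prot_abundance_MD_3plex.tsv",
              stringsAsFactors = FALSE, data.table = FALSE)
# Drop the first five columns and keep the rest as a numeric matrix
dd <- as.matrix(dtmt[, -c(1:5)])
dtmt_res <- MsCoreUtils::impute_mle(dd)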

Do you have any insights on this issue?

Thank you very much!

lgatto commented 1 year ago
lgatto commented 1 year ago

By the way, if you are processing quantitative proteomics data, I highly recommend giving the QFeatures package a go.

hsiaoyi0504 commented 2 days ago

@lgatto Have there been any recent changes to the MLE imputation? We are using imputation from MSnbase in a class. We noticed that something seems to have changed between versions, and the data now takes forever to impute using MLE.

lgatto commented 2 days ago

@hsiaoyi0504 - there have been changes in the past, such as adding support for the norm2 package (about 2 years ago), and then dropping it again last year because it was removed from CRAN. About 2 years ago, we also added a MARGIN argument that defines whether row-wise or column-wise imputation should be done.