kreutz-lab / DIMAR

Data-driven selection of an imputation algorithm in R

Steps to recreate learning of missingness pattern #1

Closed lincoln-harris closed 2 years ago

lincoln-harris commented 3 years ago

Hi there! Very cool paper! I'm curious about the details of how you learned the pattern of missingness in the initial peptides-by-samples matrices. I'm referring to this step of the implementation outline:

Can you perhaps provide detailed pseudocode for how you accomplished this? We're looking to take a similar approach for a related project. Due credit will certainly be given.

Thanks!

clemenskreutz commented 3 years ago

I can explain it in words: The occurrence of a missing value can be coded as a binary observation (1 = missing, 0 = non-missing). Logistic regression can be used to learn how the probability of observing a 1 depends on predictor variables. In our case, the column, the row, and the average intensity over rows are used as predictor variables (row and column as factorial variables). Learning corresponds to fitting the parameters of this model.

Functions for logistic regression typically require an array of observations and a design matrix. The observations are obtained by reformatting the proteomics data matrix into a single column (dimension: nrow*ncol x 1), where nrow and ncol are the dimensions of the proteomics data matrix. You also have to properly define the design matrix (dimension: (nrow*ncol) x nparameters) with nparameters = nrow + ncol + 1. Each column of the design matrix corresponds to one parameter of the logistic regression model. The first nrow + ncol columns of the design matrix contain 0s and 1s (the entries indicate the row and the column that the i-th observation comes from). The last column contains the average of the log2-intensities.
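To make that concrete, here is a minimal R sketch of the procedure described above (assumed variable names and toy data, not the DIMAR source code). It assumes a peptides-by-samples intensity matrix `X` with missing values coded as `NA`, and it lets `glm()` expand the row and column factors into the 0/1 indicator columns of the design matrix instead of building them by hand.

```r
# A minimal sketch (assumed names, not the DIMAR source): learn the pattern of
# missing values in a peptides-by-samples intensity matrix X (NA = missing).

set.seed(1)
nr <- 40; nc <- 12
X <- matrix(2^rnorm(nr * nc, mean = 25, sd = 2), nrow = nr)  # toy intensities
X[sample(length(X), round(0.3 * nr * nc))] <- NA             # toy missing values

# Observations: the matrix reshaped into one column, coded 1 = missing, 0 = observed
y <- as.numeric(is.na(as.vector(X)))                # length nr * nc (column-major)

# Predictors: row and column as factors, plus the row-wise mean log2-intensity
row_id  <- factor(rep(seq_len(nr), times = nc))
col_id  <- factor(rep(seq_len(nc), each  = nr))
row_int <- rep(rowMeans(log2(X), na.rm = TRUE), times = nc)

# Logistic regression; glm() expands the two factors into 0/1 indicator columns
# of the design matrix (see model.matrix(fit)) and appends the intensity column
fit <- glm(y ~ row_id + col_id + row_int, family = binomial())

coef(fit)                                      # fitted parameters = learned pattern
p_missing <- predict(fit, type = "response")   # probability of being missing, per entry
```

If you want the design matrix in exactly the form described above (explicit indicator columns plus the intensity column), `model.matrix(~ row_id + col_id + row_int)` returns it and can be passed to any logistic-regression routine.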