hredestig / pcaMethods

Perform PCA on data with missing values in R
GNU General Public License v2.0

What are observations (rows) and what variables (columns) #25

Closed Tobias314 closed 11 months ago

Tobias314 commented 11 months ago

Hello,

If I interpret the documentation correctly it is assumed that the input matrix to any of the PCA methods contains samples/observations as rows and variables/genes/proteins as columns. Is this correct?

Further, there is the "metaboliteData" dataset usually used for examples. The documentation (?metaboliteData) says about this dataset: "A matrix containing 154 observations (rows) and 52 metabolites (columns)."

Running dim(metaboliteData) gives the stated dimensions (154 rows, 52 columns). However, looking at the dataset, it seems that rows are metabolites and columns are samples. I am no expert, but row names such as "Tyramine", "Glycine", and "Glycerol" sound to me a lot like metabolites and not like names of samples/observations.

Am I missing something?

For my use case, I am interested in imputing proteomics data (abundance values for several thousand proteins measured across a small number of samples). Given the documentation, I would assume that the matrix used as input to pcaMethods should have samples as rows and proteins as columns. However, so far I have mostly seen pcaMethods being used the other way around (matrix with proteins as rows and samples as columns). Could you shed some light on this?

hredestig commented 11 months ago

Hi, yes, the metabolite data is given in "ExpressionSet" form, where metabolites (genes) are in rows. In the examples where it is used, the matrix is transposed first. E.g., from ?bpca:

    library(pcaMethods)
    ## Load a sample metabolite dataset with 5% missing values (metaboliteData)
    data(metaboliteData)
    ## Perform Bayesian PCA with 2 components; t() puts observations in rows
    pc <- pca(t(metaboliteData), method="bpca", nPcs=2)

So yes, variables: columns, observations: rows.
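For anyone who wants to check the orientation before imputing, here is a minimal sketch. It assumes pcaMethods is installed and uses only `pca()` and `completeObs()` from the package. The variable names `X` and `imputed` are illustrative, and the final `t()` is only needed if you want the result back in the metabolites-in-rows layout.

```r
library(pcaMethods)

data(metaboliteData)

# Inspect the layout as shipped: this thread reports 154 rows x 52 columns,
# with metabolite names as row names.
dim(metaboliteData)
head(rownames(metaboliteData))

# Put observations in rows and variables in columns, as pca() expects.
X <- t(metaboliteData)

# Bayesian PCA with 2 components, as in the ?bpca example above.
pc <- pca(X, method = "bpca", nPcs = 2)

# completeObs() returns the input with missing values replaced by estimates.
# Transpose back so the result has the same layout as metaboliteData.
imputed <- t(completeObs(pc))
```

After this, `imputed` should have the same dimensions as `metaboliteData` and contain no `NA` entries.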

Tobias314 commented 11 months ago

Thank you for the fast response!

Given what you said, would it make sense to change the documentation of metaboliteData? Especially here and here. So change it from "A matrix containing 154 observations (rows) and 52 metabolites (columns)" to "A matrix containing 154 metabolites (rows) and 52 observations (columns)".

In addition, there is this vignette specifically focused on missing value imputation which also does not transpose metaboliteData. So I think it might also make sense to change it there (see here) since people interested in missing value imputation might look at this vignette first.

hredestig commented 10 months ago

Sure, I might get to that, but I unfortunately have very little time to spare for this project, so it might take a while - happy to review a merge request if you care to give it a stab! Note that for missing value imputation you can get different results depending on whether you transpose or not, and there is no strong theoretical case for either orientation - whichever gives the lower error wins.
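One way to act on the "lower error wins" advice is to hide some known values, impute in both orientations, and compare the error on the hidden entries. The sketch below is an illustration under stated assumptions, not a recommended protocol: the toy matrix, the 5% masking rate, the choice of `method = "ppca"`, and the helper `rmse()` are all hypothetical choices made for the example. On real data you would mask entries that were actually observed.

```r
library(pcaMethods)

set.seed(1)

# Hypothetical toy data: 20 samples (rows) x 100 features (columns),
# built as a rank-2 signal plus a little noise.
n <- 20; p <- 100; k <- 2
truth <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * p), k, p) +
  matrix(rnorm(n * p, sd = 0.1), n, p)

# Hide 5% of the entries so the imputation error can be measured.
mask <- sample(length(truth), size = round(0.05 * length(truth)))
observed <- truth
observed[mask] <- NA

# Root mean squared error on the hidden entries only.
rmse <- function(est) sqrt(mean((est[mask] - truth[mask])^2))

# Orientation A: samples in rows (the documented convention).
fit_rows <- pca(observed, method = "ppca", nPcs = k)
err_rows <- rmse(completeObs(fit_rows))

# Orientation B: features in rows; transpose back before scoring
# so the masked positions line up with the original matrix.
fit_cols <- pca(t(observed), method = "ppca", nPcs = k)
err_cols <- rmse(t(completeObs(fit_cols)))

c(samples_in_rows = err_rows, features_in_rows = err_cols)
```

Whichever orientation yields the smaller value on your own data is the one to prefer. Repeating the masking with several seeds gives a more stable comparison than a single run.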