kevinblighe / PCAtools

PCAtools: everything Principal Components Analysis
328 stars 67 forks source link

Setting scale=TRUE produced error from svd() #39

Closed chainorato closed 3 years ago

chainorato commented 3 years ago

Hi, Thank you very much for developing this package.

I have just set up PCAtools today and I tried to make a PCA plot of expression profiles (transformed by vst function in DESeq2) by the following code.

vsd <- assay(vst(dds, blind=TRUE))
p <- pca(vsd, metadata=samples, center=TRUE, scale=FALSE)
biplot(p, colby="condition", shape="condition", labSize = 0, pointSize = 6)

It smoothly produced the following nice plot.

Screen Shot 2021-05-19 at 0 16 48

However, when I set the scale parameter to TRUE for standardization. p <- pca(vsd, metadata=samples, center=TRUE, scale=TRUE)

There is an error.

Error in svd(x, nu = nu, nv = nv) : infinite or missing values in 'x'

The documentation (?svd) states that it is a numeric or complex matrix whose SVD decomposition is to be computed. However, I could not understand how infinite or missing values popped up in 'x' which I am unsure which matrix it is.

My questions are

  1. It seems that data points already vary around (0,0). Does this mean they are already scaled by DESeq vst() function?
  2. If the parameter scale also has to be set as TRUE, could you please help me figure out what is wrong here? Could I send you the data through email if it is necessary?
kevinblighe commented 3 years ago

Hi, scaling variance-stabilised data from DESeq2 should not be necessary for PCA. However, if you simple try t(scale(t(vsd))), does it produce an error? Have you pre-filtered your raw counts for genes with 0 counts across all samples?

chainorato commented 3 years ago

Hi, Thank you for your suggestion.

Setting t(scale(t(vsd))) still produced the same error.

However, I didn't pre-filtered raw counts for genes with 0 counts across all samples. So I pre-filtered them by following codes.

keep <- rowSums(counts(dds)) > 0
dds <- dds[keep,]

Then I created the plot again. Setting scale=False returned the exact same plot, and setting scale=TRUE became worked. Here is the plot (scale=TRUE) that was produced.

Screen Shot 2021-05-20 at 20 20 16

% variation of PC1 decreased After that, I tried to use t(scale(t(vsd))) instead of vsd. The resulting plot was exactly same whether I set scale=TRUE or scale=FALSE.

Could you confirm for me whether scaling variance-stabilised data from DESeq2 is still necessary for PCA? Because scaling decreased % variation of PC1 from 57.5% to 36.71%, even though the plot looks the same.

kevinblighe commented 3 years ago

Cool, so, the issue was related to genes of constant variance, i.e., all zeros. I don't think that scaling the data is necessary when coming via the DESeq2::vst()route. If the data was more 'difficult', like, e.g., proteomics data, then I would scale it. It is a question with no real concrete answer, though. It may have been asked a few times before on, e.g., StackExchange. Please re-open the issue if problems still occur!