satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.23k stars 900 forks source link

PCA looks different when plot on Seurat vs prcomp, even after adjusting for gene list input #751

Closed MeganS92 closed 5 years ago

MeganS92 commented 6 years ago

Hi there (especially shouting out to @satijalab),

First of all, thanks to the Seurat community and the builders of Seurat- it has been a useful toolkit.

I have been trying to compare the usage of Seurat to plot out PCA on my dataset generated using SMART-Seq2, versus just using prcomp.

I am very confused of the different results I am getting, even after adjusting for different things like gene input.

When using prcomp, this is my code:

> pcaRes <-  prcomp(t(log(TPM2+1)), center = T, scale. = T) 

Note: my dataset is in terms of raw TPM, no log-transformation etc, until prcomp.

When I run this on Seurat,

I first normalised my TPM matrix, based on:

> TPM2<- NormalizeData(object=TPM2, normalization.method="LogNormalize", scale.factor= 10000)

And then scale it for percent of mito:

> TPM2<- ScaleData(TPM2, vars.to.regress="percent.mito")

Then run my PCA this way:

> TPM2<- RunPCA(object = TPM2, pc.genes = LIST_TPM2$V1, do.print = TRUE, pcs.print = 1:5, genes.print = 5)

** Note that to account for the fact that my prcomp has no "variable gene detection", I therefore pulled out the same gene list used for input into prcomp, and use it as input in Seurat.

My PCA from prcomp looks like this: screen shot 2018-09-04 at 12 29 55 pm

Meanwhile, my Seurat PCA looks like this: screen shot 2018-09-04 at 12 30 09 pm

One of my guess is maybe due to the fact that my Seurat object has regressed out for percentage of mitochondria, but I don't think that should make much of a difference due to the fact that the amount of mitochondrial genes are quite consistent in all my cells that is used for this analysis.

Can someone explain to me the discrepancy in how Seurat calculates the PCA within the pacakage? I am highly confused as I do not know which data to follow.

Thanks so much for your time!

Kind regards, Megan.

benslack19 commented 5 years ago

Hi Megan,

Thanks for opening this issue as I'm trying to do something similar (comparing Seurat PCA with prcomp). If you want to make sure the mitochondrial percentage isn't a factor in why they're different, I think you could simply not include the "vars.to.regress" parameter in your ScaleData call. Other than that, I'm still trying to understand all the normalizing and scaling steps but I'll follow along with other responses.

Best, Ben

leonfodoulian commented 5 years ago

Hi @MeganS92 and @benslack19,

While waiting for an answer from @satijalab, you can maybe try the following while computing PCA via stats::prcomp().

# Use Seurat's scaled data to compare prcomp and irlba outputs 
# using the same input matrix
pcaRes <-  prcomp(t(x = object@scale.data), center = F, scale. = F) 

Also, Seurat uses irlba::irlba() to compute PCA (links to paper and package vignette). Typical output difference between prcomp and irlba with my data is that irlba results in PCs with lower cell embedding values (and a flip in the sign for some PCs) compared to prcomp.

Best, Leon

satijalab commented 5 years ago

As Leon suggested, if you regress out mitochondrial percentage, you will see a different PCA result. You can remove this and you should get a (nearly) identical result to using prcomp.

MeganS92 commented 5 years ago

Hi Leon, Ben and Team Satija,

Thanks for all of this input. They were super useful... I thought I had try re-scaling my data without mitochondrial content regression. But I think I did not do it right prior to this. Anyway, I re-ran my data without the regression and compared it using the RunPCA function in Seurat vs the prcomp code that Leon suggested, and I did get a near identical plot.

Thanks also to Leon to explain the differences you normally see... I agree, I saw the flipped sign and slightly lower cell embeddings.

I am curious over one last thing.. I compared my gene loadings (ordered from top to bottom using absolute values) pulled out from prcomp, vs IRLBA from Seurat, and saw that they were quite different. For instance, PC1 had 36 genes similar, PC2 had 29 and PC3 had none, out of a list of 50. @leonfodoulian have you compared this before, and what are your thoughts regarding this matter? If you could quickly comment on this, I would be extremely grateful.

Thank you, Megan.

leonfodoulian commented 5 years ago

Hi @MeganS92,

I haven't seen much of a difference in the gene loadings between stats::prcomp() and irlba::irlba() (only slight differences at higher decimal place values). Do you have an idea of how different they are?

n.pc <-  seq_len(length.out = ncol(x = object@dr$pca@gene.loadings)) # Get total number of PCs computed with Seurat
x <- pcaRes$rotation[, seq_along(along.with = n.pc)]
y <- object@dr$pca@gene.loadings
rel.diff <- (abs(x = x) - abs(x = y)) / abs(x = y) # Get relative differences in gene loadings
apply(X = rel.diff, MARGIN = 2, FUN = max) # Get maximum values of relative differences per PC

Best, Leon

leonfodoulian commented 5 years ago

Edited code:

n.pc <- seq_len(length.out = ncol(x = object@dr$pca@gene.loadings)) # Get total number of PCs computed with Seurat
x <- pcaRes$rotation[, n.pc]
y <- object@dr$pca@gene.loadings
rel.diff <- (abs(x = x) - abs(x = y)) / abs(x = y) # Get relative differences in gene loadings
apply(X = rel.diff, MARGIN = 2, FUN = max) # Get maximum values of relative differences per PC
MeganS92 commented 5 years ago

Thanks Leon for this. Will run this ASAP and let you know!

MeganS92 commented 5 years ago

Hi @leonfodoulian doulian,

Thanks for your tip. These were the codes I ran when trying:

First, reversing any regression.. object <- ScaleData(object, do.scale= TRUE, do.center = TRUE) Then, running prcomp as you advised.. pcaRes <- prcomp(t(x=object@scale.data), center= F, scale. = F) #no centering or scaling, since we have already done it using scaled data. Then, running PCA using the IRLBA package: object <- RunPCA(object = object, pc.genes = rownames(object@data), do.print = TRUE, pcs.print = 1:5, genes.print = 5) #using all genes to compare, since prcomp in this case took all genes as well.

Then, I ran the codes to compare PC loadings as you previously commented, however when I reached the "rel.diff" bit, it says that I have non-conformable arrays. I then explored the dimensions of both "x" and "y" and found that "x" had 16541 rows, 20 columns; "y" had 16379 rows, 20 columns; which I am still uncertain why, and am concerned as to why there are fewer genes when using RunPCA. Also, does this code only take the first 20 PCs to compare, because my understanding is that I should 1798 PCs from a matrix of 16541 genes x 1798 cells of mine.

I tried forcing them into the equation by some tips I found online, using this code: rel.diff <- (abs(x= x)- abs(x= c(y))) / abs(x= c(y)) Obviously, I got a warning output:

In (abs(x = x) - abs(x = c(y)))/abs(x = c(y)) : longer object length is not a multiple of shorter object length

But if I ignored that, and finished running the code, with: value <- apply(X= rel.diff, MARGIN=2, FUN= max)

I get this sort of output: screen shot 2018-09-11 at 10 38 00 am in which I am uncertain how to interpret... Is this saying that my biggest differences are in the first 20PCs? Also, how does this difference compare to what you have seen?

One last thing, is there a way to input the gene selection into the prcomp code you have? Usually I filter it directly from the gene expression matrix, BEFORE running prcomp... so for example: I filter my matrix for a number of genes that are expressed above 100TPM, in 10 cells: TPM2 <- TPM_SALINE[which(rowSums(TPM_SALINE >=100) >=10),] Then I run PRCOMP using this.

But since I am using the scaled matrix data within the Seurat object to run prcomp directly within Seurat, I am uncertain how to do that. I want to test this out because the differences in gene loadings are a lot more apparent when I restrict the gene list to a certain list.

Thank you so much for your time and patience in answering all my questions...

I am very grateful for this help since I think some of these questions require users who actually build/ use Seurat often to help solve.

@satijalab I don't know if this is appropriate, but I don't think this issue is fully solved yet. Getting a similar plot is just a start, but I am hoping to resolve even aspects of the loading genes pulled out using both prcomp and IRLBA. Perhaps you should re-open this issue.

Kind regards, Megan.

leonfodoulian commented 5 years ago

Hi @MeganS92,

Then, I ran the codes to compare PC loadings as you previously commented, however when I reached the "rel.diff" bit, it says that I have non-conformable arrays. I then explored the dimensions of both "x" and "y" and found that "x" had 16541 rows, 20 columns; "y" had 16379 rows, 20 columns; which I am still uncertain why, and am concerned as to why there are fewer genes when using RunPCA.

It seems that object@scale.data and object@data have different numbers of rows (i.e. genes), which I find strange since it seems that object@scale.data has more rows than object@data. Can you check the following:

dim(x = object@data)
dim(x = object@scale.data)

When running Seurat::RunPCA(), try replacing pc.genes = rownames(object@data) by pc.genes = rownames(object@scale.data). Although this shouldn't change at all if your scaled data contains all the genes in your dataset, but for some reason this seems not to be the case.

Also, does this code only take the first 20 PCs to compare, because my understanding is that I should 1798 PCs from a matrix of 16541 genes x 1798 cells of mine.

By default, Seurat::RunPCA() computes the first 20 PCs. You can compute more PCs by changing the pcs.compute argument input. For more details, please check the documentation of the function (?Seurat::RunPCA).

Best, Leon

MeganS92 commented 5 years ago

Hi Leon,

I have ran the two quick codes to look at the dimension, but that was also the first thing I looked when I saw that the differences in PCs previously. My raw.data, data and scale.data has exactly the same number of dimensions: 16541 rows and 1798 columns.

Also, I agree.. it is true that since my scaled data has the same number of genes, that shouldn't change when I run the code of RunPCA using pc.genes=rownames(object@scale.data). i just tried this- and this was the same, the RunPCA code decreases my number of genes to 16379, that is 162 genes missing, and I am unsure why. @satijalab any idea? (EDIT: I will try to find out which are these missing genes soon, perhaps that will give us a clue as to why Seurat takes them out)

Thanks so much again for your time and for trying to help me out here. I appreciate it a lot. Let me know if anything pops up to your attention...

Thanks, Megan.

leonfodoulian commented 5 years ago

Hi @MeganS92,

Maybe provide the first few lines of each matrix:

head(x = x)
head(x = y)

It is quite hard to troubleshoot without having a reproducible example, but this should give a hint. Maybe also you can share your full code as a bulk.

Best, Leon

MeganS92 commented 5 years ago

Hi @leonfodoulian,

Sorry for the late response. I was kept very busy in the lab last few days.

This is what I get for my my data, for x (PRCOMP): screen shot 2018-09-14 at 2 20 01 pm

For y (IRLBA RunPCA) screen shot 2018-09-14 at 2 20 23 pm

Sorry I don't have a reproducible example, I am not even sure how to provide that as I think it's the dataset itself that is complicated.

But what I can see even from the first 5 lines is that genes are being reordered/ excluded in the IRLBA PCA output (y).

This is my code used to reach this point:

library(Seurat)
library(dplyr)
object <- readRDS("object_all.rds")

##### SUBSETTING DATA: ONLY WANT ALL SALINE GROUPS ############
OBJECT_saline <- SubsetData(object,cells.use=rownames(object@meta.data[object@meta.data$sampleDes %in% c("D0", "D7", "D10-(saline)", "D14-(saline)", "D17-(saline)", "D21-(saline)","D28-(saline)"),]))

object_filtered<- FilterCells(object=OBJECT_saline, subset.names = c("mappedcount"), low.thresholds = c(100000), high.thresholds = c(Inf))

#### scale and center the data again, without any regression of variables... Making sure scaling and centering is done!
object_filtered <- ScaleData(object_filtered, do.scale= TRUE, do.center = TRUE)

#using seurat's scaled data to compare prcomp vs irlba output.. 
pcaRes <- prcomp(t(x=object_filtered@scale.data), center= F, scale. = F) 

object_filtered <- RunPCA(object = object_filtered, pc.genes = rownames(object_filtered@scale.data), do.print = TRUE, pcs.print = 1:5, genes.print = 5)     #using all genes to compare, since prcomp took all genes as well. 

### COMPARING FULL GENE LOADINGS AS COMMENTED BY LEON... 
n.pc <- seq_len(length.out= ncol(x=object_filtered@dr$pca@gene.loadings)) #Get the total number of PCs computed with Seurat
x <- pcaRes$rotation[, n.pc]
y <- object_filtered@dr$pca@gene.loadings

dim(x)   #16541 rows, 20 columns
dim(y)   #16379 rows, 20 columns      #TRY USING SETDIFF TO FIND THE DIFFERENT 162 GENES!
head(x=x)
head(x=y)

Thanks again so much for your time to help troubleshooting this! Appreciate it!

Kind regards, Megan.

leonfodoulian commented 5 years ago

Hi Megan,

The genes that are removed when computing Seurat::RunPCA() are genes that have 0 variance. The code that prepares the data before running PCA is the following: https://github.com/satijalab/seurat/blob/7618ac8798e2aefc509529f07af9d2836b96a1be/R/dimensional_reduction_internal.R#L44-L66

You can therefore prepare your data by selecting the genes that have passed the filtering before computing PCA using stats::prcomp().

n.pc <- seq_len(length.out = ncol(x = object@dr$pca@gene.loadings)) # Get total number of PCs computed with Seurat
pcaRes.irlab <- object@dr$pca@gene.loadings
genes.used <- rownames(x = object@dr$pca@gene.loadings)
data.used <- object@scale.data[genes.used, ] # Select genes that have passed filtering before computing PCA using prcomp
pcaRes.prcomp <-  prcomp(x = t(x = data.used), center = F, scale. = F)$rotation[, n.pc]
rel.diff <- (abs(x = pcaRes.prcomp) - abs(x = pcaRes.irlab)) / abs(x = pcaRes.irlab) # Get relative differences in gene loadings
apply(X = rel.diff, MARGIN = 2, FUN = max) # Get maximum values of relative differences per PC

I hope that this solves your issue.

Best, Leon