satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat

Why do low numbers of PCs give better results than high numbers? What should I choose? #3531

Closed Sophia409 closed 4 years ago

Sophia409 commented 4 years ago

Hi, Seurat team

I have a developing mouse brain dataset of 20,000 cells. When I performed downstream analyses with different numbers of PCs (Fig1), the results differed dramatically.

As you can see from Fig1, if I choose 3-4 PCs, the two IPC populations (IPC1 and IPC2) cluster together and connect with RGC. The best result is with 5 PCs: IPC1 and IPC2 are separated and both sit close to RGC, which is in line with our expectation. Since UMAP better resolves the global and continuous structure of the differentiation manifold, we use it to visualize the developmental trajectories of the cells. In this UMAP representation, we can see a biologically meaningful trajectory: RGC > IPC1 and IPC2 > Neuron.

But your tutorial notes that performing downstream analyses with only 5 PCs does significantly and adversely affect results. It's obvious that 5 PCs are not enough to explain the variance of 20,000 cells. Both JackStrawPlot and ElbowPlot also suggest that 40-50 PCs may be an appropriate choice for our dataset (Fig2).

However, if I choose more than 5 PCs, IPC2 somehow jumps out and stays away from IPC1 and RGC. This really puzzles me, because the trajectory cannot be RGC > IPC1 > Neuron > IPC2, as shown in Fig3. One reviewer of our paper also raised this question and suspected that the mapping of the IPC2 cluster is somewhat flawed. But I did follow the guided tutorial and repeated the procedure many times, only to get similar results. Even though I raised min.dist (to 0.5) and n.neighbors (to 50) in order to capture more global structure, the IPC2 cluster still doesn't link to the RGC cells.

I really don't know how to explain this. The only explanation I can think of is distortion introduced by UMAP. See this paper on the extent to which non-linear dimension reduction methods distort the data.

So how many PCs should I use for my analysis? 5 PCs seem to make biological sense but are not enough to explain the variance, while more PCs put IPC2 in the wrong place. Do you have any advice on my analysis or on how to reply to the reviewers? I would much appreciate it if you could help me with this.

Sophia

dlmatera commented 4 years ago

Not a member of the Seurat team, so I'm sure they can give a better answer, but the techniques used for clustering and normalization make a variety of assumptions, and those obviously carry forward to the final result. You may want to try SCTransform to see if that helps at all (it should lead to a more accurate normalization with less technical noise). If there is a lot of technical noise in your dataset, that could explain why your clusters make little biological sense (you are overfitting and capturing noise). The ElbowPlot, at least to me, implies that most of the big drops in standard deviation stop after ~PC6.
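To make that elbow reading a bit less subjective, here is a minimal sketch in plain Python (not Seurat code). It assumes you have exported the per-PC standard deviations to a list; in Seurat these would presumably come from `Stdev(obj, reduction = "pca")`. The function names, the `min_drop` threshold of 0.05, and the example numbers are all made up for illustration and are not taken from this dataset.

```python
def variance_fractions(stdevs):
    """Fraction of variance captured by each PC, relative to the PCs supplied."""
    variances = [s * s for s in stdevs]
    total = sum(variances)
    return [v / total for v in variances]


def last_big_drop(stdevs, min_drop=0.05):
    """Return the 1-based PC number reached by the last 'big' drop.

    A drop from PC k to PC k+1 counts as big when it is at least
    `min_drop` times the standard deviation of PC1 (an arbitrary,
    hypothetical rule of thumb, not a Seurat default).
    """
    first = stdevs[0]
    elbow = 1
    for k in range(1, len(stdevs)):
        drop = (stdevs[k - 1] - stdevs[k]) / first
        if drop >= min_drop:
            elbow = k + 1  # PC number (1-based) that this drop lands on
    return elbow


# Made-up standard deviations that flatten out after PC6.
example_stdevs = [12.0, 9.0, 7.0, 5.5, 4.5, 3.8, 3.6, 3.5, 3.4, 3.35]

elbow = last_big_drop(example_stdevs)      # -> 6
fractions = variance_fractions(example_stdevs)
print("elbow at PC", elbow)
print("variance share of PCs up to the elbow:", round(sum(fractions[:elbow]), 3))
```

A heuristic like this only says where the scree curve flattens. It does not say whether the later PCs carry biology or noise, so it is worth checking the result against a few different thresholds.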

Assuming your biological hypothesis is the correct one (which I can't comment on because I'm not familiar with neural biology), an alternative explanation is that the differentiation program you are expecting is driven by a small number of genes, and the clustering algorithm is therefore being biased by the other ~20k genes that are changing simultaneously. You may be able to fiddle with which genes (variable features) are fed into the clustering, but then you are of course inserting bias into the result.

You can also subset the data to only the cell types of interest (I see 5 clusters but you only mention 4 populations of interest) and recluster; this may give you something more in line with what you are expecting.

Sophia409 commented 4 years ago

@timoast Hi, since the problem still hasn't been solved, could you reopen this issue? Many thanks.