lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Trajectory can't be reconstructed because of distortions introduced by UMAP #494

Open Sophia409 opened 4 years ago

Sophia409 commented 4 years ago

Hi,

I have a developing mouse brain dataset of 20,000 cells. When I performed downstream analyses with different numbers of PCs (Fig1), the UMAP results differed dramatically. [image]

As you can see from Fig1, if I choose 3-4 PCs, the two IPC populations (IPC1 and IPC2) cluster together and connect with RGC. The best result is with 5 PCs: IPC2 and IPC1 are separated and close to RGC, which is in line with our expectation. Since UMAP better resolves the global and continuous structure of the differentiation manifold, it is used for visualizing the developmental trajectories of cells. In this UMAP representation, we can see a trajectory of great biological significance: RGC > IPC1 and IPC2 > Neuron.

But the Seurat tutorial notes that performing downstream analyses with only 5 PCs does significantly and adversely affect results. It seems clear that 5 PCs are not enough to explain the variance of 20,000 cells. Both JackStrawPlot and ElbowPlot also showed that 40-50 PCs may be an appropriate choice for our dataset. [Figure2]

However, if I choose more than 5 PCs, IPC2 somehow jumps away and stays far from IPC1 and RGC. This really puzzles me, because the trajectory cannot be RGC > IPC1 > Neuron > IPC2 as shown in Fig3. One reviewer of our paper also raised this question and suspected that the mapping of the IPC2 cluster is somewhat flawed. But I did follow the guided tutorial and repeated the procedure many times, only to get similar results. If the IPC2 cluster doesn't link to RGC cells in the UMAP, how can Monocle3 recognize the lineage relationship between them and reconstruct the trajectory? [Figure3]

I really don't know how to explain this. The only explanation I can think of is distortion introduced by UMAP. See this paper for the extent to which non-linear dimension reduction methods distort the data.

Do you have any advice on my analysis or on how to reply to the reviewers? I would much appreciate it if you could help me with this.

Sophia

lmcinnes commented 4 years ago

Unfortunately I have little expertise in the relevant biology, but I would agree with the observation that 5 PCs looks to be too few based on the elbow plot. As to what is going astray -- I really can't say for sure. One possibility is that there is some level of connectivity/relationship, but the n_neighbors value is too low to see it. You could try increasing n_neighbors and see if that helps. You could also try connectivity plots (https://umap-learn.readthedocs.io/en/latest/plotting.html#plotting-connectivity) as a diagnostic to see if there is any apparent connectivity in the UMAP graph. Sorry I can't be more help.
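
As a rough illustration of why n_neighbors could matter here: UMAP lays out points from a k-nearest-neighbour graph, so if no neighbour edges link two groups at a given n_neighbors, nothing in the graph pulls them together. The sketch below is not umap-learn's implementation (which uses approximate neighbours and weighted edges). It is a dependency-free toy with made-up data, and the helper names `knn_indices` and `cross_cluster_edges` are hypothetical. It only counts how many kNN edges connect two labelled groups as k grows.

```python
import math


def knn_indices(points, k):
    """Brute-force k nearest neighbours of every point (the point itself excluded)."""
    neighbours = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(order[:k])
    return neighbours


def cross_cluster_edges(points, labels, k, a="RGC", b="IPC2"):
    """Count directed kNN edges that link group `a` and group `b` in either direction."""
    count = 0
    for i, nbrs in enumerate(knn_indices(points, k)):
        for j in nbrs:
            if {labels[i], labels[j]} == {a, b}:
                count += 1
    return count


# Toy data (an assumption, not the real dataset): two dense groups with a gap between them.
rgc = [(0.1 * i, 0.0) for i in range(20)]
ipc2 = [(3.05 + 0.1 * i, 0.0) for i in range(20)]
points = rgc + ipc2
labels = ["RGC"] * 20 + ["IPC2"] * 20

# Small k keeps every neighbourhood inside its own group; larger k starts to bridge the gap.
for k in (5, 15, 30):
    print(k, cross_cluster_edges(points, labels, k))
```

On real data the same question is better answered with the connectivity plot linked above, computed on the fitted UMAP model rather than on a brute-force graph like this one.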