kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.

Clustering discrepancy with Slingshot and Seurat #202

Closed SirKuikka closed 1 year ago

SirKuikka commented 2 years ago

Hi!

I have an issue related to Slingshot. I'm using Seurat's graph-based clustering and Seurat's PCA matrix as input for Slingshot to infer a trajectory. Based on previous discussions in this issue tracker, this seems to be a common approach. However, when I overlay the clustering on Seurat's UMAP visualization before and after trajectory inference, I sometimes get very dissimilar clustering results. The clustering obtained after trajectory inference is generated using the dynplot R package ("Cells are coloured according their position in the trajectory"). I have attached an example that shows the two clustering results (Seurat clusters and dynplot milestones).

You can see that cluster 9 (purple in the top visualization, cyan in the bottom one) has somehow spread into cluster 4 of the original clustering (blue-green in the top visualization).

Do you have any suggestions on what can be done to mitigate this issue? Shouldn't the visualization and the trajectory be generally compatible? If that's the case, would it be better to use the UMAP directly for Slingshot? But then the clustering would be based on PCA and the trajectory on UMAP. Would that be problematic?

image

image

SirKuikka commented 2 years ago

It can be seen better in these visualizations: Seurat clusters vs. Slingshot milestones.

image

image

kstreet13 commented 2 years ago

Hi @SirKuikka,

I don't think I fully understand what you've done here, but it doesn't look like you used Slingshot at all. You said that your "trajectory inference is generated using the dynplot R package", which would explain why the curves shown in the first two figures don't look like Slingshot output. Also, Slingshot takes cluster labels as input but does not generate new ones, so I think you must be using a different method.

To answer your more specific questions, Seurat's PCA coordinates should work as an input to Slingshot. Their graph-based clustering results can sometimes cause issues, due to one or two weird clusters that appear in multiple places on a UMAP/tSNE visualization (in your fourth plot, clusters 5 and 9 might be a little tricky, but these are not the worst examples I've seen). And if you're interested in a plot where cells are "coloured according their position in the trajectory", I would recommend coloring by the pseudotime values, rather than clusters.

Hope this helps and let me know if I've missed something! Best, Kelly

SirKuikka commented 2 years ago

Hi @kstreet13,

Sorry that one of the sentences was unclear. I used dynverse (https://github.com/dynverse/dynmethods) to run Slingshot, which also makes it possible to visualize the trajectory and perform DE analysis on it. Cluster labels can then be derived from each cell's position in the trajectory.

This is from the documentation of dynplot::plot_dimred.

"Cells are coloured according their position in the trajectory. The positioning of the cells are determined by parameter milestone_percentages or else by trajectory$milestone_percentages."

It feels quite logical to me that you can determine the cluster labels from the trajectory based on the positions in the trajectory.
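As a rough illustration of how discrete labels could be derived from trajectory positions, here is a minimal Python sketch. It assumes each cell carries milestone percentages that sum to 1 and labels the cell with the milestone holding the largest share. The cell data, the function name `nearest_milestone`, and the largest-share rule are all assumptions made for illustration; this is not dynplot's actual implementation.

```python
def nearest_milestone(percentages):
    """Return the milestone with the largest share for one cell (hypothetical rule)."""
    return max(percentages, key=percentages.get)


# Hypothetical cells: the original cluster label plus milestone percentages.
cells = {
    "cell_a": {"cluster": "4", "milestones": {"M4": 0.8, "M9": 0.2}},
    "cell_b": {"cluster": "4", "milestones": {"M4": 0.3, "M9": 0.7}},
    "cell_c": {"cluster": "9", "milestones": {"M4": 0.1, "M9": 0.9}},
}

# Label every cell by its dominant milestone.
milestone_labels = {
    name: nearest_milestone(cell["milestones"]) for name, cell in cells.items()
}

# Cells whose milestone label no longer matches their original cluster.
relabelled = [
    name
    for name, cell in cells.items()
    if milestone_labels[name] != "M" + cell["cluster"]
]

print(milestone_labels)
print(relabelled)
```

In this toy data, `cell_b` belongs to cluster 4 but lies closer to milestone M9 along the trajectory, so a position-based label differs from its cluster label. That is the kind of spread visible in the plots above.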

But I guess these questions are more related to dynplot and dynverse than Slingshot, in which case I can understand that you can't help much.

Thanks for the comments on the graph-based clustering. I guess a different clustering algorithm (or a different number of clusters), or a different data matrix as input (e.g. UMAP), might work better.

SirKuikka commented 2 years ago

But maybe one more question. Do you see any problems with using UMAP as input to Slingshot?

kstreet13 commented 2 years ago

Yeah, sorry I can't be more help, but it sounds like most of the discrepancies you're seeing are introduced by dynverse and not Slingshot. Personally, I don't see the utility of clustering after trajectory inference; that sounds like taking something continuous (pseudotime) and artificially separating it into categorical groups.
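To make that concrete, here is a minimal Python sketch of cutting a continuous pseudotime into groups. The pseudotime values, the helper name `bin_pseudotime`, and the choice of three equal-width bins are hypothetical and not part of Slingshot or dynverse.

```python
def bin_pseudotime(values, n_bins):
    """Assign each pseudotime value to one of n_bins equal-width groups."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All cells share one pseudotime, so there is only one group.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls into the last bin rather than past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical pseudotime values for six cells.
pseudotime = [0.0, 1.9, 2.1, 4.0, 5.9, 6.0]
groups = bin_pseudotime(pseudotime, 3)
print(groups)
```

The cells at pseudotime 1.9 and 2.1 are nearly identical on the continuous scale, yet they land in different groups because a bin boundary falls between them.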

As for your second question, I would recommend PCA over UMAP. Like tSNE, UMAP is non-deterministic, so analyses built on top of it will be harder to reproduce. That said, UMAP is a nice visualization tool, which is why we provided the embedCurves function for embedding Slingshot results in a different dimensionality reduction.