pseudotime distribution is uneven when run slingshot on a subset of dimensionality reduction

kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.

259 stars 42 forks source link

pseudotime distribution is uneven when run slingshot on a subset of dimensionality reduction #199

Closed altairwei closed 1 year ago

altairwei commented 1 year ago

The datasets from different conditions had been integrated by harmony. I focused on 4 cell populations which can form an expected lineage, so I run slingshot on a subset of harmony cell embeddings. However, the pseudotime estimated by slingshot looks a little bit weird. The distribution of pseudotime shows a sharp peak in the middle, I'm afraid that this will influence the {tradeSeq} analysis. The DEGs identified by tradeSeq::associationTest() are mainly on the beginning side of pseudotime.

unnamed-chunk-12-4

unnamed-chunk-12-6

unnamed-chunk-20-1 (1)

kstreet13 commented 1 year ago

Hi @altairwei ,

I'm not sure why the tradeSeq results seem to favor genes that are highly expressed at the beginning, that's pretty interesting. But the strange distribution of pseudotime values is a direct result of the dimensionality reduction. Diffusion maps often make "stringy" shapes like this, where cells form thin strands extending from a central core, which is how you ended up with the distributions you showed. It might be worth exploring other dimensionality reduction methods, such as PCA.

Best, Kelly

altairwei commented 1 year ago

Hi @kstreet13,

Thank you for your reply! One thing I forget to mention is that Diffusion map is only used for visualization and is not used as input to {slingshot}. The dimensionality reduction used to calculate pseudotime is the corrected PCA (produced by {harmony}). However, I just use a subset of corrected PCA to infer trajectories in the case of a very complex dataset. Is this a correct practice?

Best, Altair

kstreet13 commented 1 year ago

Oh, that's cool! That sounds very similar to how I have been doing things. I typically use fastMNN to get corrected PCs and then UMAP for visualization, but harmony and diffusion map seem like equally valid choices.

It's odd that this is happening, then. My best guess is that when you subset the cells, you end up with a lot of "extra" dimensions that no longer explain much biological variance (because the cell types they helped separate out have been removed). Something that may be worth trying would be running (regular) PCA on the corrected PCs after subsetting, just to "re-focus" your dimensionality reduction on the cell types you're actually interested in. Theoretically, I think you wouldn't lose any information, this should just be like a high-dimensional rotation of the data.

Let me know if that doesn't make sense or if I've missed something.

altairwei commented 1 year ago

@kstreet13 Sorry for the late reply. I tried your suggestion, and here is the code:

sce <- scater::runPCA(sce, ncomponents = 10, dimred = "HARMONY", name = "SubPCA")
sce <- slingshot::slingshot(sce,
    clusterLabels = sce$cluster_id, start.clus = "14",
    reducedDim = "SubPCA", approx_points = FALSE)

However, it appears that "re-focusing" had no effect on the distribution of pseudotime.

kstreet13 commented 1 year ago

Ok, well I guess that's reassuring in some sense, but not very helpful. Hm, is it possible that this is being caused by some artificial effect, like sequencing depth? I know that that can sometimes cause issues in PCA-like dimensionality reductions, and it may explain why so many of the genes go from high to low in the tradeSeq plots.

Otherwise, I don't see anything wrong with the methods you're using, so this might just be what the distribution looks like. The tradeSeq results would still be pretty strange, though.

altairwei commented 1 year ago

After experimenting with various trajectory inference and pseudotime estimation methods, I decided to return to {slingshot} because of its good interoperability with {tradeSeq}. The pseudotime distribution issue I mentioned above was resolved by excluding outliers (generally dozens of cells) manually. I noticed that {slingshot} calculates pseudotime by orthogonal projection onto curves, so I examined pseudotime values in Harmony space and discovered that a small portion of cells located far from the main cell populations were assigned extremely large or small pseudotime values. I believe that cells in the leading/tailing region of pseudotime distribution will influence the association test of {tradeSeq}. Consequently,I believe that removing these outliers would be a viable solution to the issue.