Clarification on specifying unclustered cells

apekshasingh commented 2 years ago

Hi there,

Under the clusterLabels argument it says "Either representation may optionally include a "-1" group meaning "unclustered"". For my work, I'm specifying starting and ending clusters but am leaving the remaining cells unclustered. Is my understanding correct then that these unclustered cells would not factor into the initial MST step, but would contribute to the fitting of the principal curves (that is the functionality I would like)? I'm finding that my unclustered cells are not being assigned to the constructed lineages, perhaps since they never fall below the distance thresholds? Would any changes in the default parameters address this issue?

Greatly appreciate your help!

kstreet13 commented 2 years ago

Hi @apekshasingh ,

That is a very interesting question and it's been a while since I've thought about how the unclustered cells are handled.

In general (as you've noticed), Slingshot mostly just ignores any unclustered cells, meaning they don't contribute to the MST or to the final curves. Our thinking was that this would be a way to handle "outliers" and prevent them from bending the curves in weird ways.

For your purposes, it sounds like you basically have three groups: starting cluster, ending cluster(s), and everything else. Rather than calling this last group "unclustered", I'm wondering if it would work to make it one big cluster? This might achieve the behavior you're looking for. Assuming the "everything else" cluster is more spread out than the others, it would probably be highly connected in the MST, so it seems likely that the start and end clusters would connect to it rather than each other. And then all the cells would be used for fitting curves, like you wanted.
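The "one big cluster" suggestion above can be sketched as follows. This is a minimal illustration on the package's built-in `slingshotExample` data, not a recipe for any particular dataset. Treating cluster 1 as the start, clusters 4 and 5 as the ends, and clusters 2 and 3 as "everything else" is an assumption made for the example, as is the label name `"other"`.

```r
library(slingshot)

data("slingshotExample")
rd <- slingshotExample$rd   # reduced-dimension coordinates
cl <- slingshotExample$cl   # original cluster labels

# Assumption for illustration: clusters 2 and 3 play the role of the
# cells that would otherwise be labeled -1 (unclustered).
cl.big <- as.character(cl)
cl.big[cl %in% 2:3] <- "other"

# Every cell now carries a real cluster label, so none are set aside
# as unclustered.
pto <- slingshot(rd, cl.big, start.clus = "1", end.clus = c("4", "5"))

# One pseudotime column per lineage, one row per cell
pt <- slingPseudotime(pto)
dim(pt)
```

Whether the start and end clusters actually connect through the `"other"` cluster depends on the data, so it is worth inspecting the resulting lineages before trusting the curves.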

If that doesn't work or if I'm misunderstanding something, please let me know!

Kelly

apekshasingh commented 2 years ago

Thanks so much for your reply, Kelly! I have also run slingshot with the "unclustered" cells as another specified cluster and it does give reasonable results. If I'm understanding correctly, in this case the initial paths from the MST run through the center of the "unclustered" cells and then the principal curve fitting starts from there. I do wonder how the results might change, if at all, if the initial paths were defined only by the starting and ending cluster centers (but the remaining cells were still included in the principal curve fitting). Might it impact where the lineages branch/diverge?

kstreet13 commented 2 years ago

Sorry for the delay, I thought this was an interesting question and wanted to explore it a bit, since I had never really tried this approach before. Ultimately, I stand by my earlier comment recommending a single, large "everything else" cluster, but you know your data better than I do, so here's how you can achieve the behavior you wanted:

Hiding this because it's dark magic and, in general, I don't recommend it

First, here's how to pull off the "unclustered middle" approach. It requires running `getLineages` twice: once with all cells clustered (however you want) and once with the middle cells unclustered (to get the MST you want). Then you just have to replace the metadata of the first result with that of the second and run `getCurves`:

```
library(slingshot)

data("slingshotExample")
rd <- slingshotExample$rd
cl <- slingshotExample$cl

# alternate cluster labels with missing middle clusters
cl.miss <- cl
cl.miss[cl %in% 2:3] <- -1

# lineages from the full labels, and from the labels with the middle unclustered
pto <- getLineages(rd, cl, start.clus = '1')
pto2 <- getLineages(rd, cl.miss, start.clus = '1')

# replace the metadata of the first result with that of the second, then fit curves
metadata(pto) <- metadata(pto2)
pto <- getCurves(pto)
```

This works with PseudotimeOrdering objects, but not SlingshotDataSet objects, which means it's harder to plot the results. It may break other downstream things as well (I haven't tried it with tradeSeq, but I could imagine it causing issues there).

Reassuringly, the two methods do seem to produce similar results, at least on a simple test case.

This may be an overly simplistic example, though. In general, it is important that the MST gets the right overall shape and I think you will almost always get a better fit with the "big cluster" approach or (better yet) a full set of cluster labels.

apekshasingh commented 2 years ago

Thanks so much for your very detailed reply and investigation, I really appreciate it! In my example, the two methods also gave qualitatively very similar results. Thanks again!

apekshasingh commented 2 years ago

I had one quick follow-up question that I decided would be best to include in this thread since I've already shared a bit about the structure of my data above. I actually have this data for both a healthy and diseased state where I have the analogous starting/ending clusters defining my lineages. I'm interested in comparisons between these two datasets. If I apply the same dimensionality reduction to both datasets and then perform the pseudotime reconstruction on each dataset independently, can I compare pseudotime values for each lineage across the datasets? I guess alternatives would be to perform the pseudotime reconstruction on the datasets combined or perform it on the healthy dataset and then project the diseased data onto it, however I think I would prefer to do the reconstruction independently to better capture any differences. Thanks again so much for all of your help!

kstreet13 commented 2 years ago

This is a great question and one we've thought about quite a bit. To start, here are the materials for the last workshop we did exploring these questions, including one example (TGF-beta) where we did the trajectory inference on both conditions at once and another (KRAS) where we did it separately for each condition.

To answer your first question: no, you can't really compare pseudotime values that come from different dimensionality reductions. The units are only meaningful in that context, so they can't be directly compared. You could try finding some sort of mapping, such as scaling the pseudotimes so that they start at 0 and end at 1, and then comparing on that scale, but that is a strong assumption and it could end up masking differences between lineages from the same condition.
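The rescaling idea mentioned above (pseudotimes that start at 0 and end at 1) could look like the sketch below. The helper name `rescale01` and the pseudotime values are made up for illustration. It uses base R only.

```r
# Rescale a pseudotime vector to the [0, 1] range.
# NA values (cells not assigned to the lineage) stay NA.
rescale01 <- function(pt) {
  rng <- range(pt, na.rm = TRUE)
  (pt - rng[1]) / (rng[2] - rng[1])
}

# Made-up pseudotime values for one lineage in each condition
pt_healthy  <- c(0, 2.5, 5, NA, 10)
pt_diseased <- c(0, 10, 20, 40)

rescale01(pt_healthy)   # 0.00 0.25 0.50   NA 1.00
rescale01(pt_diseased)  # 0.00 0.25 0.50 1.00
```

Note how the two made-up lineages, which span very different raw ranges, become indistinguishable after rescaling. That is the kind of masking the strong assumption can introduce.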

In general, I think it is best to run Slingshot on both conditions at once, in a shared dimensionality reduction. This way, both conditions are on the same trajectory/in the same space, so pseudotime values are directly comparable. It also allows you to build a "complete" trajectory, even if one condition is missing some element(s). For example, if your healthy state has 2 diverging lineages, but your diseased state only has 1, it would be hard to justify a manual mapping between these two trajectory structures. But a single trajectory structure with both lineages would show pretty clearly that one is made up exclusively of cells from the healthy condition.
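As a rough sketch of the joint approach described above, the example below fits one trajectory to all cells and then splits the pseudotime values by condition. It reuses the built-in `slingshotExample` data as a stand-in for a shared embedding of both conditions. The `condition` labels are invented for illustration and carry no biological meaning.

```r
library(slingshot)

data("slingshotExample")
rd <- slingshotExample$rd   # stands in for a shared embedding of both conditions
cl <- slingshotExample$cl

# Made-up condition labels, for illustration only
condition <- rep(c("healthy", "diseased"), length.out = nrow(rd))

# One trajectory fitted to all cells from both conditions
pto <- slingshot(rd, cl, start.clus = "1")
pt <- slingPseudotime(pto)

# Pseudotime along the first lineage, split by condition; NA marks
# cells that were not assigned to that lineage
pt_by_condition <- split(pt[, 1], condition)
lapply(pt_by_condition, summary)
```

Because both groups of cells are placed on the same curves, the values in `pt_by_condition` share one pseudotime axis and can be compared directly.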

apekshasingh commented 2 years ago

Thanks so much for your quick reply! Just to clarify, if the same dimensionality reduction is applied to the datasets and then the pseudotime is inferred independently, the values are still not comparable? Thanks again so much!

kstreet13 commented 2 years ago

That is correct. For example, applying PCA to two different datasets will yield two different versions of PC-1 that are not directly comparable.
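A small base R sketch of this point, using simulated data (the matrices, gene names, and variances below are assumptions chosen so the effect is easy to see): when PCA is run separately on each dataset, the two PC-1 axes can point in quite different directions.

```r
set.seed(1)
n <- 200

# Simulated data, for illustration: three "genes" per cell.
# Healthy cells vary mostly along g1, diseased cells mostly along g2.
healthy  <- cbind(g1 = rnorm(n, sd = 5), g2 = rnorm(n, sd = 1), g3 = rnorm(n, sd = 1))
diseased <- cbind(g1 = rnorm(n, sd = 1), g2 = rnorm(n, sd = 5), g3 = rnorm(n, sd = 1))

# Separate PCA for each dataset
pca_h <- prcomp(healthy)
pca_d <- prcomp(diseased)

# PC-1 loadings from each dataset, side by side
pc1_h <- pca_h$rotation[, 1]
pc1_d <- pca_d$rotation[, 1]
round(cbind(healthy = pc1_h, diseased = pc1_d), 2)

# Absolute cosine between the two PC-1 directions; a small value means
# the two axes are close to orthogonal
abs(sum(pc1_h * pc1_d))
```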

apekshasingh commented 2 years ago

Thanks for your reply! Sorry I wasn't clear, what I was trying to ask is if the same transformation is applied to both datasets (so for example PCA on the combined dataset or applying the PCA loadings from one dataset to project the other) and then the pseudotime construction is done independently, are the pseudotime values still not comparable?

kstreet13 commented 2 years ago

Ah, I see. In that case, the units would probably be comparable, but you would still need some sort of mapping between the different pseudotime variables in order to do any statistical comparisons. Otherwise, minor differences in the trajectory can appear like distributional differences between conditions. As long as the pseudotimes are on different axes, it's very difficult to compare them.
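For reference, the "apply the PCA loadings from one dataset to project the other" setup discussed in this exchange can be sketched in base R with `prcomp` and `predict`. The expression matrices below are simulated and the object names are illustrative.

```r
set.seed(1)
genes <- paste0("g", 1:5)

# Simulated expression matrices (cells x genes), for illustration only
healthy  <- matrix(rnorm(100 * 5), ncol = 5, dimnames = list(NULL, genes))
diseased <- matrix(rnorm(80 * 5, mean = 0.5), ncol = 5, dimnames = list(NULL, genes))

# Fit PCA on the healthy cells only
pca <- prcomp(healthy, center = TRUE, scale. = FALSE)

# Coordinates of both datasets in the healthy PCA space
rd_healthy  <- pca$x[, 1:2]
rd_diseased <- predict(pca, newdata = diseased)[, 1:2]

# predict() centers with the healthy means and multiplies by the
# healthy loadings; the manual version gives the same coordinates
manual <- scale(diseased, center = pca$center, scale = FALSE) %*% pca$rotation[, 1:2]
all.equal(unname(rd_diseased), unname(manual))
```

This puts both datasets in the same coordinate system, which is why the units would probably be comparable. Pseudotimes fitted independently on `rd_healthy` and `rd_diseased` would still come from two different sets of curves, so the mapping caveat above still applies.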

apekshasingh commented 2 years ago

Thanks for your responses, really appreciate it!

kstreet13 / slingshot

Clarification on specifying unclustered cells #183