kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.
269 stars 43 forks source link

Comparison of two trajectories? #41

Closed jeremymsimon closed 5 years ago

jeremymsimon commented 5 years ago

I have separate trajectories built for two conditions, and the paths are visually different. I'm curious if you have recommendations on how to make quantitative claims about how the two trajectories differ. Neither are a clean line/curve though- there are some branches. We'd be interested in extracting the proportion of cells falling along a certain path and showing how those differ, and perhaps extracting full distance matrices or something like that so we can look at correlations. Do you have any experience or a vignette on how to do something like this?

kstreet13 commented 5 years ago

Hi @jeremymsimon

This is a very interesting question and although I don't currently have one, I'm beginning to think I should put together a vignette to summarize my thoughts on the topic. I've performed this sort of analysis before by combining the two conditions and only producing one set of trajectories (which we could do because the two conditions shared a common starting cell type). This way, we could directly compare, e.g. "Condition A along Lineage 1" to "Condition B along Lineage 1" and know that these things were comparable. This could allow you to detect certain topological differences (if Lineage 1 ends in a cell fate that cannot be reached by cells in Condition A, then we would see that all cells after a certain pseudotime were from Condition B). And if you wanted to assign a p-value to this, you could try something like a rank sum test (comparing the ranks of the Lineage 1 pseudotime values in Condition A vs. Condition B) or logistic regression (where the predictor is pseudotime and the outcome is a binary indicator of condition).

If that paradigm doesn't work for your data, you could also estimate the percentages of cells that fall along each lineage for a given dataset. The slingCurveWeights matrix provides values for weighting cells in downstream analysis of individual lineages, but if you divide this by the rowSums, you could interpret it as probabilistic assignment of cells to lineages and get an overall percentage of cells assigned to each lineage. However, this assumes that you can directly map the lineages in one condition to the lineages in another, in order to compare their percentages.

Let me know if that doesn't make sense or if you have any other thoughts!

jeremymsimon commented 5 years ago

Thanks for the reply, this is a helpful place to start. Can you clarify how we would quantify "Condition A/B along Lineage 1" using the data stored in these various objects? It seems pretty complex. I think in our case, we likely can combine both conditions and call just one trajectory, but it will only make sense if we can do some comparisons of our conditions from there.

Thanks!

kstreet13 commented 5 years ago

Of course, sorry that was a bit vague. The idea is that you would use all the cells to construct the lineages and then compare the conditions based on their distributions of pseudotime values. For the example I described above, I'm picturing something like this, where the two conditions share a starting point and have one similar lineage and another that is only seen in Condition B:

image

image

Then for each lineage, you could do some sort of test for independence between pseudotime and condition (I think a rank sum test makes sense here, but it isn't perfect because it doesn't incorporate the weights assigned by Slingshot).

# test lineage 1
wilcox.test(slingPseudotime(sds)[condition=='A',1], slingPseudotime(sds)[condition=='B',1])
# test lineage 2
wilcox.test(slingPseudotime(sds)[condition=='A',2], slingPseudotime(sds)[condition=='B',2])

There are definitely other ways you could approach this last part, but I think this makes sense if your goal is to detect large, topological differences between the conditions.

jeremymsimon commented 5 years ago

Hmm so maybe this is a feature unique to our dataset, but our trajectories seemingly make this tricky. Essentially we have 3 trajectories, and the first 4-5 points on the curves are shared among all 3 (ie it's a late fork). The one we care about most (traj2) has just one unique-ish offshoot as the endpoint. By making a boxplot of the 3 like you did above, the values in slingPseudotime(x) don't distinguish our two conditions, and actually, show the opposite trend. We also can't just find cells unique to traj2 since that'll just give us cells from that offshoot cluster (which we already know), because the rest of the cells will be shared with traj1 and 3.

Do you have any other suggestions on how we might see this? Happy to share some plots if any of this is unclear

kstreet13 commented 5 years ago

Yeah, I'm not sure if I understand your situation, sorry. Would you mind sharing some plots?

jeremymsimon commented 5 years ago

My apologies. Here is our merged trajectory. So there are 2 conditions here, and the trajectory as you can see starts on the bottom left, proceeding through the arc towards the right. Each node here is a different cluster we identified, but they fall into 3 broader cell types (blue/cyan = progenitors, green/yellow = differentiated cells, and red = possible intermediate). What I was attempting to explain above is that the red cluster itself has a significantly different abundance between our conditions, and as you can see, the path to get to the red cluster is the mostly held in common with the other 2 trajectories.

2019-07-29

Unless we're doing something wrong, we can quantify the trajectories like you described above using boxplots:

2019-07-29-2

however you can see that condition 2 (the one with more red cells) actually goes down in the 2nd trajectory (the one including the red cluster) compared to condition 1. We expected the opposite trend since again there are more cells along the trajectory (mainly in the red cluster) in condition 2.

Does that make sense?

kstreet13 commented 5 years ago

Yes, thanks that was very helpful!

The first thing that comes to mind is that, judging by the plot of the top two PCs, it's not clear that there even is a branching point (if it weren't for the colors, it would look like a single lineage). Is the distinction between the various endpoints more obvious in higher dimensions? And relatedly, how many dimensions are you using as input to Slingshot? It looks like the red lineage has a very sharp angle, which can sometimes lead to weird results (specifically, it may be bending back on itself and picking up some of the blue cells as "late" rather than "intermediate"). A similar plot of the smooth curves could potentially help to diagnose this.

jeremymsimon commented 5 years ago

Oooookay well, it turns out we were incorrectly using the $rotation rather than $x output of prcomp so this whole thing is different now- we only get one trajectory instead of 3 and everything looks pretty clean.

Screen Shot 2019-07-30 at 3 32 00 PM

So first off I'll thank you for your help and apologize for now pivoting to the following related question: aside from looking just at the proportions of cells in each of these clusters between condition 1 vs condition 2, do the weights or any other information from Slingshot's output tell us how the differentiation trajectory may be different across conditions? Again, we know that the blue/cyan clusters are fewer in number in condition 2, whereas the red cluster has a significantly higher proportion in condition 2. So if there's something here that tells us that condition 2 are "more differentiated" than condition 1, that would be helpful.

We're also looking at temporally-regulated genes and how those may differ but I see that as a somewhat separate question.

Thanks again!

kstreet13 commented 5 years ago

Ok, no worries! I think the same kinds of ideas apply as from our conversation above, just the differences will be less obvious. The weights are probably no longer relevant, since there is only one lineage, so every cell should have a weight of one along it.

Basically, it sounds like we want to test for a relationship between the binary condition variable and the continuous pseudotime variable. Or alternatively, a difference between the distributions of pseudotime values between the two conditions. Taking this second approach (and apologies for revising my answer from above), something like a Kolmogorov-Smirnov Test might be most appropriate. This should theoretically allow you to detect any difference in the distributions of pseudotimes, not just mean/median shifts. Implementation would be similar to the Wilcox test above, but with the ks.test function.