kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.

trajectory bootstrapping and uncertainty #192

Closed gdagstn closed 1 year ago

gdagstn commented 2 years ago

Hi, first of all thank you for this really beautiful package.

Apologies if this is not a bug; if necessary I will repost elsewhere.

I was wondering if you had any input or advice on bootstrapping trajectory estimation for large/noisy datasets.

I am working on an integration of two large fetal hematopoiesis datasets. To speed things up, I have been uniformly sampling 10% of each cell label and running slingshot on each subsample, for about 100 iterations in parallel.

I understand that slingshot already performs several steps of averaging and shrinkage, and that it incorporates a measure of uncertainty when assigning cell weights. However, for such a large dataset I find it easier to parallelize smaller instances of slingshot and use some aggregate statistic; moreover, my naive intuition is that bootstrapping will reveal different degrees of instability in more or less well-defined parts of the clusters, which I can use for downstream inference.
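Here is roughly what the sampling step looks like (a minimal sketch; `rd`, `labels`, and the helper name `bootstrapSlingshot` are placeholders for my own objects):

```r
library(slingshot)
library(BiocParallel)

# Sketch of the stratified subsampling + slingshot bootstrap described above.
# `rd` is the reduced-dimension matrix (e.g. integrated PCA), `labels` the
# per-cell labels; both are placeholders.
bootstrapSlingshot <- function(rd, labels, prop = 0.1, n_iter = 100,
                               BPPARAM = MulticoreParam(workers = 10)) {
  bplapply(seq_len(n_iter), function(i) {
    # keep ~10% of the cells of each label
    keep <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
      sample(idx, max(2, round(prop * length(idx))))
    }))
    slingshot(rd[keep, ], clusterLabels = labels[keep], approx_points = 150)
  }, BPPARAM = BPPARAM)
}

# boot_fits <- bootstrapSlingshot(rd, labels)
```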

Here's an example with a quite stable trajectory: [plot attached]

Here's a less stable one on the same dataset: [plot attached]

It is quite easy to get a "median trajectory" (the red line in the plots): take the s matrix from the curve of each iteration and calculate the median position across iterations, point by point, in each PC. Similarly, it is easy to calculate a median absolute deviation for every point. Once the median trajectory has been obtained, princurve::project_to_curve() can be used to project all cells of the full dataset that belong to the lineage onto the median curve, and the new lambda values become the pseudotime for those cells.
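Concretely, something along these lines, assuming every iteration uses `approx_points = 150` (so the s matrices are directly comparable), that the curves for the lineage of interest all run in the same direction, and that `boot_fits`, `rd_full`, and `cells_in_lineage` are placeholder objects in the spirit of the sketch above:

```r
library(slingshot)
library(princurve)

# one curve per iteration for the lineage of interest (lineage name assumed)
curve_list <- lapply(boot_fits, function(sds) slingCurves(sds)[["Lineage1"]])

# stack the 150 x nPC s matrices into a 150 x nPC x N array
s_array <- simplify2array(lapply(curve_list, function(crv) crv$s))

median_curve <- apply(s_array, c(1, 2), median)  # the red line in the plots
point_mad    <- apply(s_array, c(1, 2), mad)     # per-point, per-PC deviation

# project all cells of the full dataset that belong to this lineage
proj <- project_to_curve(rd_full[cells_in_lineage, ], s = median_curve)
median_pseudotime <- proj$lambda                 # "median" pseudotime per cell
```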

What is less clear to me is how I could make use of the uncertainty (given by the median absolute deviation of the bootstrapped trajectories at every point of the curve) to weight the following steps, i.e. differential gene expression analysis along trajectories. In fact, there is no guarantee that cell weights (as calculated by slingshot) will be present for all cells when using this bootstrapping approach (although this could be enforced at the sampling stage), and even if they were, they may change from iteration to iteration.

I was thinking one could instead use the sum of deviations (per point, across dimensions) multiplied by the distance from the median curve as a vector of cell weights in a model (such as the ones used by tradeSeq), but I don't know whether this is statistically legitimate, or whether it makes sense at all. I know these weights would not mean the same thing as the original ones, but could they still add useful information?
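In code, the weight I have in mind would look roughly like this (purely a sketch of the idea, reusing `point_mad`, `median_curve`, `proj`, `rd_full`, and `cells_in_lineage` from the snippet above; whether this is a valid model weight is exactly my question):

```r
# total deviation per median-curve point (sum of MADs across dimensions)
point_dev <- rowSums(point_mad)

# index of the nearest median-curve point for each cell
nearest_pt <- apply(rd_full[cells_in_lineage, ], 1, function(x) {
  which.min(colSums((t(median_curve) - x)^2))
})

# distance of each cell to its projection on the median curve
cell_dist <- sqrt(rowSums((rd_full[cells_in_lineage, ] - proj$s)^2))

cell_weights <- point_dev[nearest_pt] * cell_dist
```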

Is this something that makes sense, or am I completely off track?

Thanks again!

kstreet13 commented 2 years ago

Hi @gdagstn ,

Thank you for the interesting and thoughtful questions! This is something we've thought about quite a lot and have been revisiting recently, so I have several thoughts. In general, I think this is a promising approach, though!

First, I'm curious whether you could expand on your experience with Slingshot's run time. With the update that added the approx_points argument, I was under the impression that the run time scaled linearly with the number of cells, so I would think your bootstrap approach would only be faster than basic Slingshot on the full dataset if you used >10 cores.
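For reference, the baseline I have in mind is a single run on the full data with a fixed number of curve points, something like this sketch (`rd_full` and `labels_full` are placeholders):

```r
library(slingshot)

# single run on the full dataset; with approx_points set, the curve-fitting
# step should scale roughly linearly in the number of cells
system.time(
  sds_full <- slingshot(rd_full, clusterLabels = labels_full,
                        approx_points = 150)
)
```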

Even more crucially, how did you map the trajectories onto each other between iterations? With the cluster-based subsampling, I understand that you are guaranteed to have representation of all clusters, but there is no guarantee that Slingshot will produce the same MST. Did you somehow force the MST to be the same each time?

The median curve you described is very interesting and I'm curious about some of the exact parameter choices. For instance, stretch and extend both affect total curve length, so setting them to 0 and 'n' (respectively) might make the curves more comparable between iterations. Otherwise, two curves might be quite similar up until one of them ends and the other extends further, meaning that the best mapping between them is not a simple 1-to-1.
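In code, the settings I'm suggesting would look something like this (a sketch; both arguments are passed through to the curve-fitting step, and `rd_sub` / `labels_sub` stand in for one bootstrap subsample):

```r
library(slingshot)

# stretch = 0 and extend = 'n' keep the curves from being stretched or
# extended past the data, which should make curve lengths more comparable
# across bootstrap iterations
sds <- slingshot(rd_sub, clusterLabels = labels_sub,
                 approx_points = 150, stretch = 0, extend = 'n')
```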

And I very much like the idea of creating some sort of weights to be used by tradeSeq (as we typically do with the slingCurveWeights). However, I'm not sure what you mean by "sum of deviations (per point, across dimensions)". Isn't each point fixed and it's only the curves that have deviations? Similarly, I don't know if "distance from the median curve" is an appropriate metric here. Just because a point is far from the curve, that doesn't necessarily mean we are uncertain about its pseudotime or which lineage it belongs to.
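For context, the standard pairing I'm referring to looks roughly like this (`counts` is a placeholder for the expression matrix):

```r
library(slingshot)
library(tradeSeq)

# typical slingshot -> tradeSeq handoff: per-cell pseudotimes plus per-cell,
# per-lineage curve weights
pt  <- slingPseudotime(sds, na = FALSE)
cw  <- slingCurveWeights(sds)
fit <- fitGAM(counts = counts, pseudotime = pt, cellWeights = cw)
```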

Sorry for having more questions than answers, but I really do appreciate all the thought that went into this and would be happy to continue discussing it!

Best, Kelly

gdagstn commented 2 years ago

Hi Kelly,

Thank you for the quick and detailed response. I will try to address everything.

1) runtime: I am running slingshot using 40 cores so the parallelization does help. I am working with ~200K cells at the moment, but possibly more.

2) MST: this is a very good point; I naively assumed that 10% sampling would keep the cluster centroids in roughly the same positions, but as you say there is no guarantee, and I should at the very least check that this is the case. In my case I am not interested in all lineages, and the ones I do care about seem to correspond across iterations, but to make this a general approach the mapping should always be 1:1 across all iterations. I will modify my code accordingly.

3) parameters: for this run I kept the defaults, but I do take your point and will try with these two values.

4) deviations: let me elaborate further!

For every lineage, at every sampling + slingshot iteration, we will have 150 curve points (the s matrix) and however many cell weights (w) there are for that lineage. My understanding is that w gives the uncertainty of lineage assignment, and thus it is very iteration-specific. After N iterations I can hope to have at least one weight per cell in a lineage, and often more.

The point of bootstrapping, then, is not so much to get a distribution of weights per lineage per cell (whose size can vary), but rather a distribution of curve positions (of total size M = N * 150). This means that at every one of the 150 points we have N positions, from which we take the median, and the median absolute deviation as a measure of the uncertainty of those positions.

Once we have the median curve, we project all cells of the dataset that are in the lineage of interest onto the median curve, obtaining a "median pseudotime" ordering.

So, at the end, while we no longer have the original w weights (which tell us about the uncertainty of lineage assignment), I was wondering whether we could introduce a different measure thanks to the bootstrap, i.e. the uncertainty of the pseudotime ordering of a cell. This is why I was thinking about the distance of every cell to the closest point on the curve, multiplied by the sum of deviations across dimensions: the deviations belong to one of the 150 curve points, while the distances are cell-specific. The point about distance not necessarily being a good indicator of uncertainty is well taken, though, so it may make sense to just weight each cell by the deviation of its nearest point. Or not to do this at all!
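In code, that simplified version would just drop the distance term from the earlier sketch:

```r
# weight each cell only by the total deviation of its nearest median-curve
# point (`point_dev` and `nearest_pt` as in the earlier sketch); how, or
# whether, this should become a proper model weight is the open question
cell_weights_alt <- point_dev[nearest_pt]
```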

Thanks for the discussion! Cheers, Giuseppe

kstreet13 commented 2 years ago

Hi @gdagstn,

Thanks for the explanation, that makes more sense now!

Regarding the cell weights, are you only considering the non-zero weights? As it is, every cell will have a weight for every lineage, it's just that those weights can often be zero. I'm not sure why you would need a non-zero value for each cell, though, if you ultimately only use the cells "that are in the lineage of interest", so I may still be missing something.
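To illustrate what I mean about the weights (a standard accessor; the lineage name is just an example):

```r
# slingCurveWeights returns one weight per cell per lineage; cells that do
# not belong to a lineage simply get a weight at or near zero
w <- slingCurveWeights(sds)
dim(w)                              # n_cells x n_lineages
summary(w[, "Lineage1"])
in_lineage1 <- which(w[, "Lineage1"] > 0)
```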

That said, if you're interested in a median pseudotime for each cell and the deviations around that, I think there might be another approach worth considering. The predict method in the slingshot package is basically a wrapper around princurve::project_to_curve, which you could use to get a pseudotime for every cell (not just the 10% subset) in every bootstrap iteration. It's a little more computationally intensive, but certainly not as bad as running slingshot on the full dataset, and it might be the most straightforward way to get a "median pseudotime" per cell.
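A sketch of what that could look like, assuming the lineage of interest has been matched across iterations (per the MST point above) and reusing `boot_fits` and `rd_full` from your earlier sketch; the `newdata` argument name is from memory:

```r
library(slingshot)

# project the full dataset onto each bootstrap iteration's curves, then
# aggregate the per-cell pseudotimes across iterations
pt_boot <- lapply(boot_fits, function(sds) {
  sds_full <- predict(sds, newdata = rd_full)   # all cells, not just the 10%
  slingPseudotime(sds_full, na = FALSE)[, "Lineage1"]
})

pt_mat    <- do.call(cbind, pt_boot)            # n_cells x N iterations
median_pt <- apply(pt_mat, 1, median)
pt_mad    <- apply(pt_mat, 1, mad)              # per-cell pseudotime spread
```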

Perhaps that could be used as a way of quantifying pseudotime uncertainty and, combined (multiplied?) with a measure of lineage uncertainty, might make an appropriate vector of weights for tradeSeq? I need to think about the assumptions there a little bit more, but it sounds plausible.
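Purely as a sketch of the kind of combination I mean (the rescaling of the MAD into a confidence score is arbitrary, and `lineage_conf` is a hypothetical per-cell lineage-uncertainty measure, e.g. curve weights aggregated across iterations):

```r
# combine pseudotime uncertainty with lineage uncertainty into one weight
pt_conf      <- 1 / (1 + pt_mad)        # arbitrary rescaling of the MAD
cell_weights <- lineage_conf * pt_conf  # candidate cellWeights for fitGAM
```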

Best, Kelly