kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.
270 stars 43 forks source link

MNN-based MST? #96

Closed LTLA closed 3 years ago

LTLA commented 4 years ago

I recently added a MNN-based MST to TSCAN and I wonder whether the code would be of interest here.

http://bioconductor.org/books/release/OSCA/trajectory-analysis.html#tweaking-the-mst http://bioconductor.org/books/release/OSCA/merged-hsc.html#combined-analyses

The basic idea is that the MST is constructed using distances between MNN pairs between clusters rather than between cluster centroids. This allows the MST to focus on the closeness of the boundaries, especially when clusters are rather elongated.

kstreet13 commented 4 years ago

Yeah, that would great! I've definitely seen a few cases where that would have helped.

Right now, the distance metric used for calculating the MST is actually an argument (dist.fun), so it should be possible to write a function (in the style of .dist_clusters_full) that returns an MNN-based distance, instead.

LTLA commented 4 years ago

Perhaps the big question here is whether it is worth creating some lightweight package that will do the MST calculation for both TSCAN and slingshot; I believe that we are currently computing exactly the same thing, i.e., MST from centroids (I also ~stole~ was inspired by your OMEGA idea for what I call the outgroup= argument). This would give us the ability to share improvements and new ideas in a single implementation without requiring the higher-level packages to depend on each other.

That said, creating a new package just for that is a bit of a pain, and I wonder whether there are any other shared concepts or data structures we could add to the package. For example, I already ~stole~ re-used your idea of multiple paths as columns of the pseudotime matrix. Perhaps if the slingshot object (can't remember its name) was generalized, it could serve as a more general container for "trajectory analysis results".

kstreet13 commented 4 years ago

Yeah, I definitely think it would help to have some of these common things in a smaller package and I very much like the idea of a more general data structure (it's sometimes nice to look at trajectory results without requiring a full SingleCellExperiment object). Maybe something like a SummarizedExperiment based around the pseudotime matrix? Assignment weights (this might be Slingshot-specific) could be a secondary assay, cluster labels are rowData, and additional lineage information is colData?

LTLA commented 4 years ago

Yeah, guess we should start thinking of a name for the package. Maybe something boring like TrajectoryUtils?

Agree with the class structure, though we should make sure that it isn't presented to the user as an SE derivative, as the sample/feature interpretation is flipped around. We could probably use an SE internally and add wrappers on top.

LTLA commented 4 years ago

It has begun: https://github.com/LTLA/TrajectoryUtils

LTLA commented 4 years ago

I'm out of ideas for general utilities, so any ideas on how to make createClusterMST() less lonely would be welcome.

kstreet13 commented 4 years ago

As soon as I have time, I will add the distance metrics that Slingshot uses (both are just scaled versions of distance between cluster centroids).

As for additional functionality, would it make sense to include some basic plotting abilities? Maybe something like this? It would be nice to have a lightweight way of looking at particular genes, without fitting a model to every gene (as tradeSeq would do).

LTLA commented 4 years ago

would it make sense to include some basic plotting abilities?

I'm glad you brought that up, because I was thinking along the same lines. Though in my mind, the plotting functionality would live in a separate package building on top of TrajectoryUtils; it might even serve as place for all tradeSeq's plotting functions, possibly alleviating the dependency problems that @koenvandenberge and I discussed some time ago.

Certainly the relevant chapter of the OSCA book would benefit from easier ways to make certain plots. I wouldn't want the plotting functions to live in TrajectoryUtils itself, though, as this would bloat the dependencies for all downstream packages.

kstreet13 commented 3 years ago

Yeah, I really like this idea, especially if it helps reduce the number of dependencies for tradeSeq.

Another thought I had for TrajectoryUtils: what if we had some way to incorporate RNA velocity information? It could definitely help with guessMSTRoots and maybe with MST edge confidence. I was thinking if we have a velocity estimate for each cell, we could try to project those onto the MST and get an estimate of each edge's direction.

LTLA commented 3 years ago

I'd be open to that, if there's a relatively general way we could do it, e.g., that allows for both velocity-derived pseudotime or real time (e.g., from a time course). The trick is that now each tree might have multiple roots, e.g., in the case of convergence of distinct progenitors to the same cell type. Downstream tools would need to be able to deal with that; basically the paths through the MST would be defined as those from all local minima in time to the neighboring local maxima in time.

Guess there could well be another function to define paths from the MST. In the absence of external timing information, this would be just what we do now, based on the guestimated roots. In the presence of timings, we would identify local max/min in the tree and create paths to go to/from them. That sounds generally useful and avoids tying us explicitly to RNA velocity.

If you want to coordinate with @koenvandenberge on the plotting, I can do what I mentioned above.

koenvandenberge commented 3 years ago

These are great suggestions! I'll be happy to migrate tradeSeq plotting functions to a common package that has all the plotting functionalities relevant for the trajectory inference and downstream analysis. Let me know which package I should contribute to. Also adding @HectorRDB to the thread.

LTLA commented 3 years ago

That package doesn't exist yet, so the hardest task still awaits you... that's right, choosing a good name. I would have chosen traffic (trajectory functions) for this package, but lower-level infrastructure packages should probably have more sober/boring names.

koenvandenberge commented 3 years ago

We decided we couldn't come up with anything better for a package name, so here goes: https://github.com/koenvandenberge/traffic

I added @LTLA @kstreet13 and @HectorRDB as collaborators, let me know if anyone else should be added.

For now I'm adding the visualization functions from tradeSeq. I wonder if we should also be adding the functions that support input from monocle, too.

LTLA commented 3 years ago

@kstreet13 I can't remember where we talked about this, but you can now pass a matrix of soft assignment weights in clusters.

@koenvandenberge sounds good. I don't have any plotting functions to contribute but if you can make some nicer/easier alternatives to the function calls in http://bioconductor.org/books/release/OSCA/trajectory-analysis.html, I'm open to swapping them out.

LTLA commented 3 years ago

@kstreet13 are we go, no-go on the current functionality for TrajectoryUtils? If so, I'll get the ball rolling on BioC submission.

kstreet13 commented 3 years ago

Depends on what you think about using velocity estimates to infer MST paths. I think it could be useful for guessing the root node and defining paths, but it does assume that the user has done some additional work.

kstreet13 commented 3 years ago

I found this vignette, so I was thinking something that takes the output of embedVelocity and projects it onto the MST. Then each branch can be assigned a best-guess directionality.

LTLA commented 3 years ago

We do have a defineMSTPaths() that takes a times=; this can use the velocity_pseudotime to determine direction. It'll avoid having to do any explicit projection of our own.

kstreet13 commented 3 years ago

Ok then yeah, I think it's good to go. I saw that comment in the documentation and wasn't sure about using pseudotime to predict pseudotime, but I'm also not super familiar with RNA velocity methods.

kstreet13 commented 3 years ago

Actually, there's one other thing that I've found useful for slingshot that we might want to include in createClusterMST, which is the ability to specify known endpoint clusters. This imposes a constraint on the MST that those clusters must have degree 1 (which is the same as constructing the MST on all other clusters and then connecting the specified cluster to its nearest neighboring cluster).

LTLA commented 3 years ago

Done. This was implemented by just increasing the distances to all non-closest points, which plays nice with the downstream confidence calculations and provides consistent behavior with outgroup=TRUE, i.e., a selected endpoint can also be sliced off into its own component of the graph rather than being forced to connect with another node.

On that note, I'm trying to figure out why on earth I divide omega by 2. Seems like a bug.

kstreet13 commented 3 years ago

Ok, that makes sense! I think there may be a slight issue, which is that the "closest" cluster might also be a pre-specified endpoint, so we probably want that to be the "closest non-endpoint cluster," in order to avoid a weird diad or something.

Also, while I was exploring this possibility, I found some differences between getLineages and createClusterMST that don't make sense to me. Theoretically, the distances found by createClusterMST should always be the square root of those found by getLineages, meaning they should always produce the same MST, but that doesn't seem to be the case. Here's the example I've been using, and the results:

library(TrajectoryUtils)
library(slingshot)
data("slingshotExample")
rd <- slingshotExample$rd
cl <- slingshotExample$cl
set.seed(1)
clus6 <- cbind(rnorm(25, 10),rnorm(25, 1.5))
clus7 <- cbind(rnorm(25, 10),rnorm(25, -1.5))
rd <- rbind(rd, clus6, clus7)
cl <- c(cl, rep(6:7,each=25))

sds <- getLineages(rd, cl)
g <- createClusterMST(rd, clusters = cl, dist.method = 'slingshot')

image

Things get even weirder when you add supervision (I was expecting to see the weird scenario described above, but instead got something even weirder?):

sds <- getLineages(rd, cl, end.clus = 6:7)
g <- createClusterMST(rd, clusters = cl, dist.method = 'slingshot', endpoints = 6:7)

image

I'll keep looking into it, but wanted to check if you knew where these differences might be coming from.

LTLA commented 3 years ago

Re endpoints: I ended up implementing an exhaustive recursive algorithm that looks through all possible ways that endpoint nodes can be connected to non-endpoint nodes, and picks the configuration that minimizes the extra distance. The greedy approach that we were using earlier may not yield the optimal configuration in terms of the total distance. It also has the finesse to allow dyads but only when outgroup=TRUE (and thus multiple components are acceptable); otherwise it forces everyone to be together.

Unfortunately, the recursive logic is pretty complex; I've been picking out bugs for hours. Keep an eye on the results when you use endpoints=. You can also review the code itself - it's in the .enforce_endpoints() function in createClusterMST.R.

Anyway. The first discrepancy is because it wasn't using the full covariance matrices; this should be fixed.

The second discrepancy - I'm not sure how this happened, but it goes away with the new algorithm, which is good, I suppose.

kstreet13 commented 3 years ago

The greedy algorithm does produce the optimal tree, at least in all cases where outgroup=FALSE (I couldn't leave that statement unjustified, so I found a proof). And I think that's good enough, so long as we're clear about what the outgroup argument does (I'm imagining a potentially obscure case where someone specifies cluster k as an endpoint, but the other cluster it gets attached to is the OMEGA cluster, so it ends up as a singleton rather than a leaf).

kstreet13 commented 3 years ago

Oh, ok I see the point you were making about dyad subtrees, now. Yes, when two endpoint clusters are far enough away from everything else to get split off, but close enough to each other to be connected, then I think it makes sense to have them form a two-cluster tree. So we do need a little extra logic in there (like picking the right connections if this happens with 3+ clusters).

kstreet13 commented 3 years ago

Yeah, sorry for being slow, but yes, I think this behavior makes sense and is an improvement over getLineages (moving clusters 6 and 7 a little further out to the right and adding an outgroup): image

LTLA commented 3 years ago

Yeah, I should have clarified that the exhaustive algorithm was intended to handle the endpoint-endpoint dyad case. Otherwise the greedy algorithm would have been more than satisfactory. And faster, too.

kstreet13 commented 3 years ago

Ok, so following up on that, I think there may be cases where the dyad is prioritized over more plausible graphs, but it has to do with the interaction between outgroup and endpoints (without cluster 5, both methods produce the same graph):

library(TrajectoryUtils)
library(slingshot)
set.seed(1)
clus5 <- cbind(rnorm(25, 0), rnorm(25, 0))
clus6 <- cbind(rnorm(25, 10),rnorm(25, 1.5))
clus7 <- cbind(rnorm(25, 10),rnorm(25, -1.5))
clus8 <- cbind(rnorm(25, 13),rnorm(25, 0))
rd <- rbind(clus5, clus6, clus7, clus8)
cl <- c(rep(5:8,each=25))

sds <- getLineages(rd, cl, end.clus = 6:7, omega = 10)
g <- createClusterMST(rd, clusters = cl, dist.method = 'slingshot', endpoints = 6:7, outgroup = sqrt(10))

image

Which kind of makes sense, given the way the search works (only adding one edge will presumably be cheaper than adding two), but I don't know if it's the best graph. Pushing it a little further, if 6 and 7 are far enough apart that outgroup says they can't be connected, you end up with no edges, which seems unhelpful:

clus6 <- cbind(rnorm(25, 10),rnorm(25, 2))
clus7 <- cbind(rnorm(25, 10),rnorm(25, -2))

image

LTLA commented 3 years ago

Hm. The immediate problem can be solved by just counting the edge width for the dyads twice.

More generally, it might be better to move the exhaustive search just above minimum.spanning.tree, so it gets repeated when the outgroup has been added (and thus the decision to form a dyad can consider the presence of the outgroup).

LTLA commented 3 years ago

Also, can you make a separate test- file with some small MST examples like those above? And hard-code the expected edges?

LTLA commented 3 years ago

Check out the repeated branch, which implements what I said above. It solves your immediate problem, but the dyad interactions are still pretty difficult to predict, so keep an eye out.

kstreet13 commented 3 years ago

Ok, I've added tests for these cases (and a few similar ones). They aren't all passing, so I'll try to figure out why that is. This one is especially odd:

    y <- rbind(A=c(0, 0), B=c(1, 1), C=c(1, -1)) 
    mst <- createClusterMST(y, cluster=NULL, endpoints = c("B", "C"))

This throws an error, saying the tree is unsolvable. But if you add outgroup = TRUE, it behaves as expected.

LTLA commented 3 years ago

Oh damn. That's the edge confidence estimation; it checks each edge by deleting it and seeing what happens. In that case, the tree is impossible if the edge A-B edge is lost, so I should probably intercept it and give it a confidence of Inf.

LTLA commented 3 years ago

I think I have it. The solution is to force one member of the dyad to have a connection to the outgroup. This retains the penalty that one would receive in any other context involving the creation of a new subcomponent. It is also more natural than creating two edges within the dyad. All your tests pass, though I had to edit some of them a bit; have a look and see if it makes sense.

LTLA commented 3 years ago

Forgot about this. What are the opinions on the latest state of affairs?

kstreet13 commented 3 years ago

I think it's looking pretty good. I threw all the weird scenarios I could think of into tests, so it should be fairly robust, at this point.

LTLA commented 3 years ago

Here we go: Bioconductor/Contributions#1865.

LTLA commented 3 years ago

And it is done: https://bioconductor.org/packages/devel/bioc/html/TrajectoryUtils.html

kstreet13 commented 3 years ago

Awesome! Congrats and thanks for suggesting this!

kstreet13 commented 3 years ago

ok, I got the update done and it passes BiocCheck on my machine, but GitHub Actions doesn't like it (and we have to wait 'til tomorrow to find out if the bioconductor devel branch likes it). My guess is that this is related to TrajectoryUtils/DelayedMatrixStats, since that's the thing that has changed, but I'm not sure. (and yes, I could just get rid of this vignette, since the condiments package is a thing now, but I get these same errors on the main vignette).

Error: E> --- re-building ‘conditionsVignette.Rmd’ using rmarkdown E> The magick package is required to crop "/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/RtmpCCRQtr/Rbuild12154d22b53f/slingshot/vignettes/conditionsVignette_files/figure-html/unnamed-chunk-1-1.png" but not available. E> Quitting from lines 68-69 (conditionsVignette.Rmd) E> Error: processing vignette 'conditionsVignette.Rmd' failed with diagnostics: E> there is no package called 'DelayedMatrixStats' E> --- failed re-building ‘conditionsVignette.Rmd’

LTLA commented 3 years ago

Okay. Not sure why GHA doesn't like it; I'm guessing it's a caching issue?

kstreet13 commented 3 years ago

yeah, that would make sense. I tried the cache-related tricks listed in the check-bioc.yml comments, but neither seemed to work. Biocondcutor devel doesn't seem to mind, though.

LTLA commented 3 years ago

Yep, looks good on devel. Seems like we can finally close this; looking forward to seeing what you guys come up with for the book.