Closed: LTLA closed this issue 4 years ago
Hey! Yes, this is something we've considered and something I would very much like to have, but it's not currently implemented. Mainly because I don't know how it would handle the sort of discontinuities that tSNE tends to create (and I don't want it making wild curves that try to connect all the disjoint parts). I usually recommend just coloring the cells by pseudotime, but I'm definitely open to suggestions and I can try putting something together based on local averages.
- and I don't want it making wild curves that try to connect all the disjoint parts
I thought about that as well, and I think people would be fairly understanding of what went wrong if that did happen. Just mention in the docs that, if that does happen, users should up the relevant connectivity parameters for the embedding (perplexity for t-SNE, number of neighbors for UMAP).
- I can try putting something together based on local averages.
Sounds like it could be a good first shot. The bandwidth of the local average could also serve to smooth over any distortions introduced in the low-dimensional embedding.
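The local-average idea can be sketched in a few lines of base R. Everything below is illustrative — `local_average_curve`, the toy embedding, and the bandwidth are made up for the example, not part of slingshot:

```r
# Sketch: for a grid of pseudotime values, average the embedding
# coordinates of all cells whose pseudotime falls within a bandwidth.
set.seed(1)
n <- 200
pseudotime <- sort(runif(n))
embedding <- cbind(x = cos(pi * pseudotime) + rnorm(n, sd = 0.1),
                   y = sin(pi * pseudotime) + rnorm(n, sd = 0.1))

local_average_curve <- function(embedding, pseudotime, bandwidth = 0.1,
                                n_points = 50) {
  grid <- seq(min(pseudotime), max(pseudotime), length.out = n_points)
  t(sapply(grid, function(t0) {
    keep <- abs(pseudotime - t0) <= bandwidth
    colMeans(embedding[keep, , drop = FALSE])
  }))
}

curve_xy <- local_average_curve(embedding, pseudotime)
```

A wider bandwidth gives a smoother curve at the cost of cutting corners in the embedding; discontinuities show up as grid windows that straddle two disjoint clumps of cells.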
Alright, I have something up on the `develop` branch (`embedCurves`) that does basically what you suggested (a single iteration of the procedure used to construct the curves). I think using loess as the smoother might be more philosophically aligned with the idea of a local "average coordinate," but I left smoothing splines as the default for now because that's the principal curves default.
Works about as well as one might expect.
And it is, if not completely pretty, sensible enough for branches:
I'm pretty happy with that.
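For the smoother choice, here's a toy base-R comparison of the two options mentioned (this is just the generic smoothers on made-up data, not `embedCurves` itself; the `span` value is arbitrary):

```r
set.seed(42)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)

fit_loess  <- loess(y ~ x, span = 0.3)  # local (weighted-average) regression
fit_spline <- smooth.spline(x, y)       # penalized smoothing spline

y_loess  <- predict(fit_loess, x)       # fitted values at the original x
y_spline <- predict(fit_spline, x)$y
```

Both give a smooth estimate of y as a function of x; loess does it by explicit local averaging (hence the "philosophical" fit), while the smoothing spline does it by a global penalized fit.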
Insofar as I'm talking art, I'd also like:

- A single pseudotime for each cell across all branches, i.e., replacing each `NA` with a non-`NA` value. Imagine! You can get consistent colors across multiple branches, much like a blooming flower! It would be beautiful.
- An easy way to draw the curves as a `geom_line` layer; it would be nice to give me the `data.frame` that I can just plug-n-play into that layer (given scater uses ggplot2 everywhere).

Not sure there's a perfect solution to this, but `rowMeans(slingPseudotime(x), na.rm = TRUE)` might be the best imperfect solution. The main problem is that lineages with different lengths (which seem to come up a lot with UMAP) will end in different colors. But I think that's better than first scaling each lineage to be between 0 and 1, which would cause some weird backward jumps in the longer lineages:
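As a toy illustration of that `rowMeans()` trick (the matrix below is a made-up stand-in for `slingPseudotime(x)`, with rows as cells and columns as lineages):

```r
# NA marks cells not assigned to that lineage; lineage 1 is the longer one.
pt <- rbind(c(1,  1),   # shared early cell
            c(5,  5),   # shared cell near the branch point
            c(10, NA),  # cell only on lineage 1
            c(NA, 6),   # endpoint of the short lineage 2
            c(20, NA))  # endpoint of the long lineage 1

single_pt <- rowMeans(pt, na.rm = TRUE)  # one pseudotime per cell
single_pt
#> 1 5 10 6 20
```

Scaling each lineage to [0, 1] first would instead give the short lineage's endpoint (6) the same value as the long lineage's endpoint (20), which is where the backward jumps come from.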
It's not pretty, but `slingCurves(x)[[1]]$s[slingCurves(x)[[1]]$ord, ]` will pull out an appropriate matrix for plotting the first lineage. Unfortunately I'm not a ggplotter, but @HectorRDB has said he could make an appropriate `geom` function for this.
Are these things that we should be providing functions for?
- The main problem is that lineages with different lengths (which seem to come up a lot with UMAP) will end in different colors.
That's fine. Preferable, in fact.
- Unfortunately I'm not a ggplotter, but @HectorRDB has said he could make an appropriate `geom` function for this.
While that would be cool, I'm also happy to do the `geom` part myself (probably). That would save you from including ggplot2 in your dependencies.
Are these things that we should be providing functions for?
For the single pseudotime, I would say "probably not", now that I think about it. Worth documenting, though, to provide a guide on how to color the plots when there are branches.
(I do have some thoughts about the extraction of the pseudotime columns from the `SingleCellExperiment` via the `SlingshotDataset` class, but that's for another issue.)
For the lines, if you just give me coordinates and their groupings (i.e., branch), I'll be fine. Just a dataframe with one row for each cell/branch combination containing x, y and group columns.
Yeah, it would be good to limit the number of dependencies.
And I'm not sure what you're asking for in the last paragraph. All cell/branch combinations would be quite a lot in some cases, and what do you mean by "group?"
In order to make these lines in ggplot2 via `geom_line`, I would need a data.frame that contains the coordinates along the line for each curve. This data.frame is most easily constructed by turning each curve's coordinates into a data.frame with a column for the curve number, then `rbind`ing all of the data.frames together. Basically, you'd now have three columns: the X, Y and curve number (i.e., group). The number of rows would be somewhere between the number of cells (if no cells are shared across curves) and the product of the number of cells and curves (if all cells are shared across curves). Technically, the shared cells only need to show up once, but having them repeated across multiple curves is helpful if I want to highlight one curve later by just subsetting the data.frame by the curve number.
Oh, ok. So it's fairly straightforward to do that when the curves didn't use the `approx_points` argument:
```r
dfs <- lapply(seq_along(slingCurves(sce)), function(l) {
  crv <- slingCurves(sce)[[l]]
  df <- data.frame(crv$s[crv$ord, ])  # curve coordinates, in path order
  df$curve <- l
  return(df)
})
do.call(rbind, dfs)
```
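The same pattern with toy coordinates in place of the `slingCurves()` output, so the shape of the result is easy to see (the coordinates below are made up):

```r
toy_curves <- list(cbind(X1 = 1:3, X2 = c(1, 2, 1)),
                   cbind(X1 = 1:4, X2 = c(0, 1, 3, 6)))
dfs <- lapply(seq_along(toy_curves), function(l) {
  df <- data.frame(toy_curves[[l]])
  df$curve <- l  # the group column for ggplot2
  df
})
line_df <- do.call(rbind, dfs)  # 7 rows, columns X1, X2, curve
```

This is the plug-n-play shape for `ggplot(line_df, aes(X1, X2, group = curve)) + geom_line()`.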
This will also work in cases where `approx_points` was used, but there won't be one point for every cell along the curve (i.e., if `approx_points = 100`, each curve will consist of 100 points rather than `n` points). If it's vital to have one point along the curve for each cell, we could do this with linear interpolation based on the pseudotime values, but it's a little messier.
```r
dfs <- lapply(seq_along(slingCurves(sds)), function(l) {
  crv <- slingCurves(sds)[[l]]
  if (length(crv$lambda) != nrow(crv$s)) {
    # approx_points was used: crv$s is stored on an evenly spaced
    # lambda grid, so interpolate back to one point per cell.
    approx_points <- nrow(crv$s)
    xout_lambda <- seq(min(crv$lambda), max(crv$lambda),
                       length.out = approx_points)
    s <- apply(crv$s, 2, function(sjj) {
      return(approx(x = xout_lambda, y = sjj,
                    xout = crv$lambda, ties = "ordered")$y)
    })
  } else {
    s <- crv$s
  }
  df <- data.frame(s[order(crv$lambda), ])
  df$curve <- l
  return(df)
})
do.call(rbind, dfs)
```
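The `approx()` step in isolation, with synthetic values (a deliberately linear "curve" so the interpolated values are easy to check by eye):

```r
cell_lambda <- c(0, 0.3, 0.5, 0.9, 1)     # one arc-length value per cell
grid_lambda <- seq(0, 1, length.out = 5)  # grid on which the curve is stored
curve_coord <- 2 * grid_lambda            # one stored coordinate per grid point

# Resample the stored coordinate at each cell's own lambda:
per_cell <- approx(x = grid_lambda, y = curve_coord,
                   xout = cell_lambda, ties = "ordered")$y
per_cell
#> 0.0 0.6 1.0 1.8 2.0
```

Because `crv$s` is only known at the grid points, the known values go in `x`/`y` and the per-cell lambdas go in `xout`.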
For the line drawing, I think any points are fine, they don't have to be one per cell.
If you can get this into BioC-devel, I'm happy to consider this case closed.
I've finally gotten around to writing the trajectory chapter and was thinking about how to get those nice `slingshot` curves in some other embedding. Maybe you have already considered this, but it seems like it would be pretty straightforward: using the same logic used to fit the principal curves in the first place, you would simply compute an average coordinate based on cells with similar pseudotime values, and then join those up in a (hopefully also smooth) curve of some sort.

The idea here is that the principal curves are still fitted in some higher-dimensional PC space where there haven't been too many distortions, but are then visualized in, say, t-SNE or UMAP instead of the PCA plot in the vignette. This would allow me to see some pretty lines on my t-SNE art.
Perhaps you already have a function to do this, but I couldn't find it.