Closed: LTLA closed this issue 4 years ago
Hey! Yes, this is something we've considered and something I would very much like to have, but it's not currently implemented. Mainly because I don't know how it would handle the sort of discontinuities that tSNE tends to create (and I don't want it making wild curves that try to connect all the disjoint parts). I usually recommend just coloring the cells by pseudotime, but I'm definitely open to suggestions and I can try putting something together based on local averages.
- and I don't want it making wild curves that try to connect all the disjoint parts
I thought about that as well, and I think people would be fairly understanding of what went wrong if that did happen. Just mention in the docs that, if that does happen, users should up the relevant connectivity parameters for the embedding (perplexity for t-SNE, number of neighbors for UMAP).
- I can try putting something together based on local averages.
Sounds like it could be a good first shot. The bandwidth of the local average could also serve to smooth over any distortions introduced in the low-dimensional embedding.
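The local-average idea can be sketched in a few lines of base R. Everything below is illustrative — `local_average_curve`, the toy embedding, and the bandwidth are made up for the example, not part of slingshot:

```r
# Sketch: for a grid of pseudotime values, average the embedding
# coordinates of all cells whose pseudotime falls within a bandwidth.
set.seed(1)
n <- 200
pseudotime <- sort(runif(n))
embedding <- cbind(x = cos(pi * pseudotime) + rnorm(n, sd = 0.1),
                   y = sin(pi * pseudotime) + rnorm(n, sd = 0.1))

local_average_curve <- function(embedding, pseudotime, bandwidth = 0.1,
                                n_points = 50) {
  grid <- seq(min(pseudotime), max(pseudotime), length.out = n_points)
  t(sapply(grid, function(t0) {
    keep <- abs(pseudotime - t0) <= bandwidth
    colMeans(embedding[keep, , drop = FALSE])
  }))
}

curve_xy <- local_average_curve(embedding, pseudotime)
```

A wider bandwidth gives a smoother curve at the cost of cutting corners in the embedding; discontinuities show up as grid windows that straddle two disjoint clumps of cells.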
Alright, I have something up on the `develop` branch (`embedCurves`) that does basically what you suggested (a single iteration of the procedure used to construct the curves). I think using loess as the smoother might be more philosophically aligned with the idea of a local "average coordinate," but I left smoothing splines as the default for now because that's the principal curves default.
Works about as well as one might expect.
And it is, if not completely pretty, sensible enough for branches:
I'm pretty happy with that.
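For the smoother choice, here's a toy base-R comparison of the two options mentioned (this is just the generic smoothers on made-up data, not `embedCurves` itself; the `span` value is arbitrary):

```r
set.seed(42)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.2)

fit_loess  <- loess(y ~ x, span = 0.3)  # local (weighted-average) regression
fit_spline <- smooth.spline(x, y)       # penalized smoothing spline

y_loess  <- predict(fit_loess, x)       # fitted values at the original x
y_spline <- predict(fit_spline, x)$y
```

Both give a smooth estimate of y as a function of x; loess does it by explicit local averaging (hence the "philosophical" fit), while the smoothing spline does it by a global penalized fit.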
Insofar as I'm talking art, I'd also like:

- A single pseudotime for each cell across all branches, i.e., replacing each `NA` with a non-`NA` value. Imagine! You can get consistent colors across multiple branches, much like a blooming flower! It would be beautiful.
- An easy way to draw the curves as a `geom_line` layer; it would be nice to give me the `data.frame` that I can just plug-n-play into that layer (given scater uses ggplot2 everywhere).

Not sure there's a perfect solution to this, but `rowMeans(slingPseudotime(x), na.rm = TRUE)` might be the best imperfect solution. The main problem is that lineages with different lengths (which seem to come up a lot with UMAP) will end in different colors. But I think that's better than first scaling each lineage to be between 0 and 1, which would cause some weird backward jumps in the longer lineages:
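As a toy illustration of that `rowMeans()` trick (the matrix below is a made-up stand-in for `slingPseudotime(x)`, with rows as cells and columns as lineages):

```r
# NA marks cells not assigned to that lineage; lineage 1 is the longer one.
pt <- rbind(c(1,  1),   # shared early cell
            c(5,  5),   # shared cell near the branch point
            c(10, NA),  # cell only on lineage 1
            c(NA, 6),   # endpoint of the short lineage 2
            c(20, NA))  # endpoint of the long lineage 1

single_pt <- rowMeans(pt, na.rm = TRUE)  # one pseudotime per cell
single_pt
#> 1 5 10 6 20
```

Scaling each lineage to [0, 1] first would instead give the short lineage's endpoint (6) the same value as the long lineage's endpoint (20), which is where the backward jumps come from.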
It's not pretty, but `slingCurves(x)[[1]]$s[slingCurves(x)[[1]]$ord, ]` will pull out an appropriate matrix for plotting the first lineage. Unfortunately I'm not a ggplotter, but @HectorRDB has said he could make an appropriate `geom` function for this.
Are these things that we should be providing functions for?
- The main problem is that lineages with different lengths (which seem to come up a lot with UMAP) will end in different colors.
That's fine. Preferable, in fact.
- Unfortunately I'm not a ggplotter, but @HectorRDB has said he could make an appropriate `geom` function for this.
While that would be cool, I'm also happy to do the `geom` part myself (probably). That would save you from including ggplot2 in your dependencies.
Are these things that we should be providing functions for?
For the single pseudotime, I would say "probably not", now that I think about it. Worth documenting, though, to provide a guide on how to color the plots when there are branches.
(I do have some thoughts about the extraction of the pseudotime columns from the `SingleCellExperiment` via the `SlingshotDataset` class, but that's for another issue.)
For the lines, if you just give me coordinates and their groupings (i.e., branch), I'll be fine. Just a dataframe with one row for each cell/branch combination containing x, y and group columns.
Yeah, it would be good to limit the number of dependencies.
And I'm not sure what you're asking for in the last paragraph. All cell/branch combinations would be quite a lot in some cases, and what do you mean by "group?"
In order to make these lines in ggplot2 via `geom_line`, I would need a data.frame that contains the coordinates along the line for each curve. This data.frame is most easily constructed by turning each curve's coordinates into a data.frame with a column for the curve number, then `rbind`ing all of the data.frames together. Basically, you'd now have three columns: the X, Y and curve number (i.e., group). The number of rows would be somewhere between the number of cells (if no cells are shared across curves) and the product of the number of cells and curves (if all cells are shared across curves). Technically, the shared cells only need to show up once, but having them repeated across multiple curves is helpful if I want to highlight one curve later by just subsetting the data.frame by the curve number.
Oh, ok. So it's fairly straightforward to do that when the curves didn't use the `approx_points` argument:
```r
dfs <- lapply(seq_along(slingCurves(sce)), function(l) {
  crv <- slingCurves(sce)[[l]]
  df <- data.frame(crv$s[crv$ord, ])  # curve coordinates, in path order
  df$curve <- l
  return(df)
})
do.call(rbind, dfs)
```
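The same pattern with toy coordinates in place of the `slingCurves()` output, so the shape of the result is easy to see (the coordinates below are made up):

```r
toy_curves <- list(cbind(X1 = 1:3, X2 = c(1, 2, 1)),
                   cbind(X1 = 1:4, X2 = c(0, 1, 3, 6)))
dfs <- lapply(seq_along(toy_curves), function(l) {
  df <- data.frame(toy_curves[[l]])
  df$curve <- l  # the group column for ggplot2
  df
})
line_df <- do.call(rbind, dfs)  # 7 rows, columns X1, X2, curve
```

This is the plug-n-play shape for `ggplot(line_df, aes(X1, X2, group = curve)) + geom_line()`.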
This will also work in cases where `approx_points` was used, but there won't be one point for every cell along the curve (i.e., if `approx_points = 100`, each curve will consist of 100 points rather than `n` points). If it's vital to have one point along the curve for each cell, we could do this with linear interpolation based on the pseudotime values, but it's a little messier.
```r
dfs <- lapply(seq_along(slingCurves(sds)), function(l) {
  crv <- slingCurves(sds)[[l]]
  if (length(crv$lambda) != nrow(crv$s)) {
    # approx_points was used: crv$s is stored on an evenly spaced
    # lambda grid, so interpolate back to one point per cell.
    approx_points <- nrow(crv$s)
    xout_lambda <- seq(min(crv$lambda), max(crv$lambda),
                       length.out = approx_points)
    s <- apply(crv$s, 2, function(sjj) {
      return(approx(x = xout_lambda, y = sjj,
                    xout = crv$lambda, ties = "ordered")$y)
    })
  } else {
    s <- crv$s
  }
  df <- data.frame(s[order(crv$lambda), ])
  df$curve <- l
  return(df)
})
do.call(rbind, dfs)
```
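The `approx()` step in isolation, with synthetic values (a deliberately linear "curve" so the interpolated values are easy to check by eye):

```r
cell_lambda <- c(0, 0.3, 0.5, 0.9, 1)     # one arc-length value per cell
grid_lambda <- seq(0, 1, length.out = 5)  # grid on which the curve is stored
curve_coord <- 2 * grid_lambda            # one stored coordinate per grid point

# Resample the stored coordinate at each cell's own lambda:
per_cell <- approx(x = grid_lambda, y = curve_coord,
                   xout = cell_lambda, ties = "ordered")$y
per_cell
#> 0.0 0.6 1.0 1.8 2.0
```

Because `crv$s` is only known at the grid points, the known values go in `x`/`y` and the per-cell lambdas go in `xout`.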
For the line drawing, I think any points are fine, they don't have to be one per cell.
If you can get this into BioC-devel, I'm happy to consider this case closed.
I've finally gotten around to writing the trajectory chapter and was thinking about how to get those nice `slingshot` curves in some other embedding. Maybe you have already considered this, but it seems like it would be pretty straightforward: using the same logic used to fit the principal curves in the first place, you would simply compute an average coordinate based on cells with similar pseudotime values, and then join those up in a (hopefully also smooth) curve of some sort.

The idea here is that the principal curves are still fitted in some higher-dimensional PC space where there haven't been too many distortions, but are then visualized in, say, t-SNE or UMAP instead of the PCA plot in the vignette. This would allow me to see some pretty lines on my t-SNE art.
Perhaps you already have a function to do this, but I couldn't find it.