kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.

The cluster in end.clus is not at the end of pseudotime #207

Closed shangguandong1996 closed 2 years ago

shangguandong1996 commented 2 years ago

Hi, dear developer

My clustering result is shown below: image

I ran the code below:

# approx_points does not influence the result
sce.sling <- slingshot(mnn_slingshot,
                       reducedDim='corrected',
                       cluster = colLabels(mnn.out),
                       approx_points = 100,
                       start.clus = "1",
                       end.clus = c("8", "4", "7", "10"))

and it did produce four lineages:

> SlingshotDataSet(sce.sling)
class: SlingshotDataSet 

 Samples Dimensions
   14379         49

lineages: 4 
Lineage1: 1  2  6  3  9  10  
Lineage2: 1  2  5  4  
Lineage3: 1  2  6  7  
Lineage4: 1  2  5  8  

curves: 4 
Curve1: Length: 2.0161  Samples: 6949.59
Curve2: Length: 1.7033  Samples: 8610.2
Curve3: Length: 1.7119  Samples: 5623.75
Curve4: Length: 1.4177  Samples: 7561.2

But the weird thing is Curve2. As you can see, for the other three lineages pseudotime ends in the cluster given in end.clus, but for Curve2 pseudotime does not end in cluster 4; it ends in clusters 6 and 3. Could you give me some advice? :)

image image image image

Best wishes

Guandong Shang

kstreet13 commented 2 years ago

Hi @shangguandong1996,

It might be a little easier to diagnose this issue if you showed a plot with the curves produced by Slingshot. But just from looking at this, I would say that cluster 4 appears to be fairly centrally located and well connected, so I can see why it would be difficult. If you really believe that cluster 4 should be an endpoint, you may want to use a higher dimensional embedding (eg. 10 PCs or 3D UMAP), as it can be hard to fully characterize 4 lineages in just two dimensions.

Best, Kelly
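
To check numerically which cluster sits at the end of each lineage, one option is to summarize pseudotime per cluster. The following is only a minimal sketch in base R: `median_pseudotime_by_cluster` is a hypothetical helper (not part of slingshot), and the toy matrix stands in for a cells-by-lineages pseudotime matrix with `NA` for cells that are not on a lineage.

```r
# Hypothetical helper (not a slingshot function): median pseudotime of each
# cluster on each lineage. `pt` is a cells x lineages matrix (NA = cell not
# on that lineage); `clusters` holds one label per cell.
median_pseudotime_by_cluster <- function(pt, clusters) {
    pt <- as.matrix(pt)
    clusters <- factor(clusters)
    stopifnot(nrow(pt) == length(clusters))
    out <- matrix(NA_real_, nrow = nlevels(clusters), ncol = ncol(pt),
                  dimnames = list(levels(clusters), colnames(pt)))
    for (j in seq_len(ncol(pt))) {
        for (cl in levels(clusters)) {
            vals <- pt[clusters == cl, j]
            vals <- vals[!is.na(vals)]
            if (length(vals) > 0) out[cl, j] <- median(vals)
        }
    }
    out
}

# Toy input (made-up numbers): 6 cells, 2 lineages
pt <- matrix(c(0.0, 0.1, 1.0, 1.2, NA, NA,
               0.0, 0.1, NA, NA, 2.0, 2.4),
             ncol = 2, dimnames = list(NULL, c("Lineage1", "Lineage2")))
clusters <- c("1", "1", "4", "4", "8", "8")

med <- median_pseudotime_by_cluster(pt, clusters)
# Cluster with the largest median pseudotime on each lineage
last_cluster <- apply(med, 2, function(x) names(which.max(x)))
```

With real data, the inputs would presumably be `slingPseudotime(sce.sling)` and `colLabels(mnn.out)`; if the cluster reported for a lineage differs from the one passed in end.clus, that matches the behavior described in this issue.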

shangguandong1996 commented 2 years ago

Thanks for your help, Kelly :)

I tried choosing less variable gene in the fastMNN integration, and I got a slightly better result. (I am sorry that the cluster numbers are not the same as above, which may be confusing.) I am also very sorry that I cannot reproduce the exact result above.

You can see that cluster 2 (which is the same as cluster 4 above) is less connected to the other clusters, but it is still mixed with some of them.

image image image image image

And I am confused by this suggestion:

a higher dimensional embedding (eg. 10 PCs or 3D UMAP), as it can be hard to fully characterize 4 lineages in just two dimensions.

Here is my whole code. I use fastMNN from batchelor to integrate the data, and then use the corrected values to run slingshot.

dec.sce <- modelGeneVar(mergeCell, block = mergeCell$batch)
mergeCell <- multiBatchNorm(mergeCell,batch = mergeCell$batch)

chosen.hvgs <- rownames(dec.sce)[dec.sce$bio > 0]
chosen.hvgs <- chosen.hvgs[!chosen.hvgs %in% protoplast_genes]

set.seed(20160806)
mnn.out <- fastMNN(mergeCell,
                   batch = mergeCell$batch,
                   merge.order = c("CIM0d", "CIM1d", "CIM3d", "CIM7d"),
                   subset.row = chosen.hvgs,
                   BSPARAM=BiocSingular::RandomParam())

set.seed(20160806)
mnn.out <- runUMAP(mnn.out, 
                   dimred="corrected", 
                   n_neighbors = 15)

clusters.mnn.50 <- clusterCells(mnn.out, 
                                use.dimred="corrected", 
                                BLUSPARAM=NNGraphParam(k = 50,
                                                       cluster.fun="louvain",
                                                       type="jaccard"))

colLabels(mnn.out) <- clusters.mnn.50
mnn_slingshot <- mnn.out
reducedDim(mnn_slingshot, "corrected") <- reducedDim(mnn_slingshot, "corrected")[, 1:49]
sce.sling <- slingshot(mnn_slingshot,
                       reducedDim='corrected',
                       cluster = colLabels(mnn.out),
                       start.clus = "1",
                       end.clus = c("7", "2", "8", "10"))

embedded <- slingCurves(embedCurves(sce.sling, "UMAP"))

gg <- plotUMAP(sce.sling,text_by = "label", 
               colour_by = "label", point_size = 0.5)
for (path in embedded) {
    # Curve points in UMAP space, ordered along the curve
    curve_df <- data.frame(path$s[path$ord,])
    gg <- gg + geom_path(data=curve_df, aes(x=Dim.1, y=Dim.2), size=1.2)
}

kstreet13 commented 2 years ago

Oh ok, I think that's a very good pipeline, then (using MNN and running Slingshot on the corrected coordinates). I thought you might be running Slingshot on the UMAP coordinates, which wouldn't be as appropriate. But using UMAP for visualization is fine.

Also, just to make sure, you said that you used the "less variable gene[s]", but your code uses hvg, which is a common abbreviation for "highly variable genes". Generally speaking, you probably want the highly variable genes, not the less variable genes (which will be mostly zeros).

shangguandong1996 commented 2 years ago

Sorry, that was my mistake. I should have said that I used "fewer HVGs".
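
The thread does not say how the smaller gene set was chosen. One possibility, sketched here purely as an assumption, is to keep only the top-ranked genes by the `bio` column instead of every gene with `bio > 0`. The data frame below is a made-up stand-in for the `modelGeneVar()` output, and `top_hvgs` is a hypothetical helper written in base R.

```r
# Made-up stand-in for a modelGeneVar() result: one row per gene with a
# `bio` column (biological component of the variance).
dec <- data.frame(bio = c(0.80, -0.10, 0.30, 0.05, 0.50),
                  row.names = c("geneA", "geneB", "geneC", "geneD", "geneE"))

# Hypothetical helper: keep at most `n` genes with the largest positive `bio`
top_hvgs <- function(dec, n) {
    keep <- dec[!is.na(dec$bio) & dec$bio > 0, , drop = FALSE]
    keep <- keep[order(keep$bio, decreasing = TRUE), , drop = FALSE]
    head(rownames(keep), n)
}

all_hvgs   <- rownames(dec)[dec$bio > 0]  # rule used in the pipeline above
fewer_hvgs <- top_hvgs(dec, n = 2)        # n = 2 is arbitrary for this toy data
```

In a real pipeline, scran's `getTopHVGs()` offers this kind of top-n selection directly, and the result could be passed to `subset.row` in `fastMNN()` in place of `chosen.hvgs`.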