kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.

sensitivity to cluster definition #47

Closed janinemelsen closed 4 years ago

janinemelsen commented 4 years ago

Hi,

I created a diffusion map on 123,000 cells, based on 9 parameters (it's a flow cytometry dataset). Now I am using Slingshot to calculate the pseudotime; however, the resulting curve is really dependent on the cluster definitions (which in turn are sensitive to the number of dimensions used as input). For instance:

2 diffusion components, clusters calculated by mclust (=9) [image]

4 diffusion components, clusters calculated by mclust (=9) [image]

4 diffusion components, clusters calculated manually (=10) [image]

4 diffusion components, clusters calculated by mclust (=10) [image]

I was wondering whether you could give me some advice on how many components and clusters to include for the Slingshot calculation, given 123,000 cells with 9 parameters?

Thanks!

kstreet13 commented 4 years ago

Hi @janinemelsen

I just want to start by saying that I don't think there is any "correct" answer for these sorts of questions, but I'll try to help out!

I think some of the weirder results you showed may be due to a quirk of Gaussian mixture modeling, particularly in the "4 diffusion components, clusters calculated by mclust (=9)" and "4 diffusion components, clusters calculated by mclust (=10)" plots. Both of these seem to include one very large, highly dispersed cluster (purple and orange, respectively), which tends to mess up the minimum spanning tree because many other clusters get connected to it. I haven't played around with mclust too much, but you might be able to avoid these sorts of clusters by setting the modelNames argument (maybe to something like "EVV", for "ellipsoidal, equal volume")? Alternatively, you could try other clustering methods, though I have generally had good results with mclust.
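Something like the following sketch, for instance (untested; `dc` stands in for your cells-by-components matrix of diffusion components):

```r
# Sketch: constrain mclust to equal-volume ellipsoidal clusters and pass
# the resulting labels to slingshot. `dc` is assumed to be a numeric
# matrix of diffusion components with cells in rows.
library(mclust)
library(slingshot)

fit <- Mclust(dc, G = 9, modelNames = "EVV")
sds <- slingshot(dc, clusterLabels = fit$classification)
```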

And I'm not too familiar with how diffusion maps work, but I have used them occasionally, via the destiny package. My understanding (largely informed by the destiny documentation) is that they are based on an eigendecomposition, where the diffusion components are the eigenvectors and the corresponding eigenvalues behave similarly to the variances in PCA (i.e. strictly positive and decreasing). So my guess is that selecting a particular number of diffusion components is similar to selecting a number of PCs, for which there are a lot of existing methods (qualitatively, I think looking for the "elbow" in the plot of eigenvalues would be a good starting point).
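As a rough sketch of that check (assuming `expr` is your cells-by-parameters matrix and the diffusion map comes from destiny):

```r
# Build the diffusion map and look for an "elbow" in the eigenvalue
# spectrum, analogous to a PCA scree plot.
library(destiny)

dm <- DiffusionMap(expr)
plot(eigenvalues(dm), type = "b",
     xlab = "Diffusion component", ylab = "Eigenvalue")
```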

Finally, I should mention that you don't strictly need to perform clustering before running Slingshot, if you believe that the data only contain one lineage (with no branching). I bring this up because, at least in two dimensions, that seems reasonable for your data. In this case, slingshot will just fit a principal curve and you may be better off using the princurve package directly.
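For the single-lineage case, that route might look roughly like this (again assuming `dc` holds the diffusion components as a numeric matrix):

```r
# Fit a single principal curve; lambda is each cell's arc-length position
# along the curve, which plays the role of pseudotime.
library(princurve)

pc <- principal_curve(as.matrix(dc))
pseudotime <- pc$lambda
```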

Hope this helps!

janinemelsen commented 4 years ago

Hi,

Thank you for the quick response! I adjusted the mclust model, and the clusters look much better; however, they are not reproducible. Each time I run mclust on the same number of diffusion components, the clusters are different. For instance (model is EVV, number of components is 5, number of clusters is 8):

[images: two runs with the same settings, giving different cluster assignments]

My guess is that this could be explained by the cells in the center of the plot (which seem to be outliers).

Unfortunately, the elbow plot is not very informative, since there is no elbow. According to the destiny paper (figure 1B), this can be explained by the data's large intrinsic dimensionality. [image]

Best, Janine

kstreet13 commented 4 years ago

That's interesting that you don't get the same clusters every time. My best guess is that it's caused by some sort of random initialization. You could probably make a particular set of results reproducible by setting the random seed, but that wouldn't actually make the algorithm any more stable. If you want to try out other methods, I know clusterExperiment::RSEC is specifically designed for stability (and it leaves some cells unclustered, which may prevent those points in the middle from causing issues). There's also graph-based Louvain clustering, which is fairly popular (available via scran::buildSNNGraph + igraph::cluster_louvain or Seurat::FindNeighbors + Seurat::FindClusters).
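A sketch of the scran/igraph route, assuming `expr` is the full cells-by-parameters matrix (names are illustrative):

```r
# Build a shared nearest-neighbor graph on the 9-parameter data and run
# Louvain clustering; set a seed so any randomized steps are repeatable.
library(scran)
library(igraph)

set.seed(1)
g      <- buildSNNGraph(expr, k = 20, transposed = TRUE)  # rows = cells
cl     <- cluster_louvain(g)
labels <- membership(cl)
table(labels)
```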

And yeah, I agree that that plot seems to indicate a high intrinsic dimensionality. Fortunately, most if not all of the methods I've mentioned can work in 10 dimensions without issue. I think it may be best to do the analysis on the full data and only use dimensionality reduction for visualization purposes.

janinemelsen commented 4 years ago

I was not able to set a resolution parameter with the igraph package (and without it, the number of clusters is way too high), so I used the Seurat package instead. And... the result looks much better and is reproducible!

[image]

The SNN graph and clusters were based on the full data, and Slingshot was run on the diffusion components. I have one question left: is it possible to run Slingshot on the full data (and clusters), and then plot the result on the diffusion map?

kstreet13 commented 4 years ago

That's great, glad you found something that works!

And yes, it is totally possible (and in this case, recommended) to run Slingshot on the full dataset. This makes plotting the results a bit more tricky, since there's no straightforward way to map the smooth curves onto the 2D diffusion map. However, you can get around this by plotting multiple versions of the diffusion map (or tSNE, UMAP, etc.) with cells colored by the pseudotime values along the different lineages (see example here).
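As a sketch of that plotting approach (assuming `sds` is the Slingshot result fit on the full data and `dc` holds the first two diffusion components):

```r
# Colour the 2D diffusion map by pseudotime, one panel per lineage; cells
# not assigned to a lineage have NA pseudotime and are left uncoloured.
library(slingshot)

pt  <- slingPseudotime(sds)                  # cells x lineages matrix
pal <- colorRampPalette(c("blue", "red"))(100)
par(mfrow = c(1, ncol(pt)))
for (l in seq_len(ncol(pt))) {
  cols <- pal[cut(pt[, l], breaks = 100)]
  plot(dc[, 1], dc[, 2], col = cols, pch = 16, cex = 0.3,
       xlab = "DC1", ylab = "DC2", main = colnames(pt)[l])
}
```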

janinemelsen commented 4 years ago

Thanks!

Unfortunately, Slingshot on the full dataset seems to have been influenced by some 'noise', I guess. Especially in the third lineage, red cells are visible in the blue zone, which does not look correct to me. According to the Louvain clustering these cells belong to the same cluster, so I don't understand how this can happen.

On the other hand, the color gradient is more gradual for Slingshot based on the full dataset than for Slingshot based on the diffusion components:

Slingshot based on the full dataset (3 lineages) [image]

Slingshot based on the diffusion components (3 lineages) [image]

Clustering (based on the full data) [image]

kstreet13 commented 4 years ago

Hmm, that is unfortunate, and I agree that it's probably a result of the intermediate cells in the middle. My guess is that those cells are even more ambiguous in 9 dimensions: some of them get assigned to the early stage and some to the later stage, and where we would draw that line on a 2-dimensional diffusion map isn't exactly where it gets drawn in the original space. Is it possible that some of these are doublets?

Otherwise, I haven't played around with this too much, but one thing you could try is defining a threshold for cells that are "well clustered" and only constructing the lineages based on those cells. For your purposes, I think something like silhouette width (via cluster::silhouette) would work. You could temporarily remove cells with silhouette scores below a certain threshold, such as 0, and then run Slingshot on the rest. Then you can assign the held-out cells to the lineages with the predict method. This would be analogous to how we handle unclustered cells from clustering methods that look to identify "stable" clusters, such as RSEC.
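A rough sketch of that idea (hypothetical names: `expr` is the full data matrix, `labels` the cluster assignments). Note that a full distance matrix on 123,000 cells is very large, so in practice you may need to approximate or subsample when computing silhouettes:

```r
# Keep only "well clustered" cells (silhouette width > 0), fit the
# lineages on them, then project the held-out cells onto the curves.
library(cluster)
library(slingshot)

sil  <- silhouette(as.integer(factor(labels)), dist(expr))
keep <- sil[, "sil_width"] > 0

sds     <- slingshot(expr[keep, ], clusterLabels = labels[keep])
sds_new <- predict(sds, newdata = expr[!keep, ])  # held-out cells projected onto the fitted curves
```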

janinemelsen commented 4 years ago

I think I will leave it like this (it's not that bad, I think). I only adjusted the plot visualization a bit compared to the previous plots ;)

[images: final plots]

Thanks for the help!

Janine

kstreet13 commented 4 years ago

Cool, glad you were able to find something that works!