scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.91k stars 599 forks source link

Visualizing Standalone Dendrogram and Merging Similar Nodes #362

Open biskra opened 5 years ago

biskra commented 5 years ago

I would like to visualize a dendrogram from scanpy.pl.heatmap() without the heatmap, grouped by louvain clusters. It would be nice to have the louvain clusters as the x-axis and the y-axis representative of OOBE or some other metric for cluster scores (similar to Seurat).

Is there some obvious method for doing this that I have not seen? I realize that adata.uns['dendrogram'] contains dendrogram information; however, I am unsure of how to use the information to generate a dendrogram within scanpy.

fidelram commented 5 years ago

It is possible to do what you want. Here is a working example:

from scanpy.plotting.anndata import _plot_dendrogram, _compute_dendrogram

adata = sc.datasets.pbmc68k_reduced()
res = _compute_dendrogram(adata, 'bulk_labels', use_raw=False, cor_method='pearson', linkage_method='ward')
fig, dendro_ax = plt.subplots(1, 1)
_plot_dendrogram(dendro_ax, adata, orientation='top', remove_labels=False)
labels  = [adata.obs.bulk_labels.cat.categories[x] for x in adata.uns['dendrogram']['categories_idx_ordered']]

image If this is useful for other people we may consider adding a dedicated dendrogram function that follows the scanpy convention of a tool to compute the dendrogram and a tool to plot the dendrogram. Furthermore, the current dendrogram uses either all genes in adata.var_names (or adata.raw.var_names), then averages the values by the chosen category and computes the correlation matrix. I think that Seurat's version has a different approach which we could also implement if someone is familiar on how it works.

biskra commented 5 years ago

That is a wonderful solution! I will give it a shot shortly. Thank you so much for your help!

Here's some example code of how Seurat handles cluster scoring and merging with random forests and OOBE:

pbmc <- ValidateClusters(pbmc, pc.use = 1:30, top.genes = 30)

pbmc <- BuildClusterTree(pbmc, 
                         do.reorder = T, 
                         reorder.numeric = T)

node.scores <- AssessNodes(pbmc)

node.scores[order(node.scores$oobe,decreasing = T),] -> node.scores

nodes.merge <- node.scores[which(node.scores[,2] > 0.1),]
nodes.to.merge <- sort(nodes.merge$node) 
pbmc.merged <- pbmc

for (n in nodes.to.merge)
{
  pbmc.merged <- MergeNode(pbmc.merged, n)
}

Here's an explanation, as this code was derived from this recent (and awesome) publication:

From page 6 of the Supplementary Methods of Plass et al 2018: http://science.sciencemag.org/content/early/2018/04/18/science.aaq1723

To prevent obtaining spurious clusters result of overclustering, the robustness of the clusters was calculated using the function AssessNodes from Seurat. For each cluster, the average expression of all variable genes (4910) is computed and a phylogenetic tree based on the distance matrix in gene expression space is computed. Next, it computes an Out of Bag Error for a random forest classifier trained on each internal node split of the tree. We recursively build a tree and assessed all its nodes, merging all clusters with an out of bag error bigger than 0.1 until no such nodes were found.