scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Neighbors with `metric='jaccard'` breaks clustering #177

Closed sidneymbell closed 6 years ago

sidneymbell commented 6 years ago

Hello!

I found an odd bug where computing sc.pp.neighbors() with metric='jaccard' results in random cluster assignments coming out of sc.tl.louvain(). Running with the euclidean distance metric yields appropriate cluster assignments from sc.tl.louvain().

To reproduce (adata is a bone marrow data set with "ground truth" cell type annotations in adata.obs['cell_type']):

import scanpy as sc
import matplotlib.pyplot as plt

sc.pl.tsne(adata, color='cell_type', title='Ground truth')  # color tSNE plot by ground-truth cell annotations
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='jaccard', random_state=2018)  # compute neighbor graph with jaccard metric
sc.tl.louvain(adata, random_state=2018)  # then use the Louvain algorithm to identify clusters
sc.pl.tsne(adata, color='louvain', title='Louvain + jaccard metric')
plt.show()
plt.clf()

sc.pp.neighbors(adata, metric='euclidean', random_state=2018)  # compute neighbor graph with euclidean distance metric
sc.tl.louvain(adata, random_state=2018)  # rerun cluster identification
sc.pl.tsne(adata, color='louvain', title='Louvain + default metric')
plt.show()
plt.clf()

[image: tSNE plots colored by ground-truth annotations, by Louvain clusters from the jaccard graph, and by Louvain clusters from the euclidean graph]

Thanks! Let me know if you need another set of eyes in tracking this one down :)

gokceneraslan commented 6 years ago

The current tSNE implementation unfortunately doesn't visualize the kNN graph constructed via sc.pp.neighbors() but rather reconstructs its own kNN graph via euclidean distance. Therefore the tSNE calls here ignore the provided jaccard metric.

Can you use another visualization method, e.g. sc.tl.draw_graph() or sc.tl.umap() together with the corresponding sc.pl.* function, to see what the Louvain groups correspond to? (It'll probably not look as nice.)
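For instance, something along these lines; a minimal sketch assuming the jaccard-based neighbors graph and Louvain clusters from your snippet above are already stored in adata:

sc.tl.umap(adata)  # UMAP embeds the graph computed by sc.pp.neighbors, so it respects the jaccard metric
sc.pl.umap(adata, color='louvain', title='Louvain + jaccard metric (UMAP)')

sc.tl.draw_graph(adata)  # force-directed layout of the same kNN graph
sc.pl.draw_graph(adata, color='louvain', title='Louvain + jaccard metric (graph layout)')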

sidneymbell commented 6 years ago

Hey @gokceneraslan, thanks for looking at this.

I wouldn't expect these to completely correspond. But I would expect some level of concordance between them, especially as these two methods (jaccard-based kNN clustering + tSNE visualization) were used in tandem to assign the "ground truth" annotations shown above a few months back. Seemed like odd enough behavior that it was worth flagging as a potential issue.

gokceneraslan commented 6 years ago

The Jaccard metric may lead to very different graphs, where Louvain communities are not really distinct compared to those in graphs built with the euclidean metric. For example:

Graph with euclidean metric and clusters:

[image: graph layout colored by Louvain clusters]

Graph with jaccard metric and clusters (or lack thereof):

[image: graph layout with no visible community structure]

So in this case, there are simply no communities in the Jaccard graph (at least with the default resolution). However, tSNE always uses the euclidean distance to compute a kNN graph from scratch (completely discarding the metric the user specifies in sc.pp.neighbors), and the clusters computed on the Jaccard graph are no longer representative due to the very different topology of the graph.

In your data, something similar is going on; therefore using sc.tl.draw_graph would show what the Louvain clusters really look like, because it uses the kNN graph built with the metric specified by the user.

@falexwolf does it make sense to print a warning in sc.tl.tsne if sc.pp.neighbors is called with a metric other than euclidean, like "a non-euclidean metric was used to construct the graph, Louvain clusters may not be representative in tSNE", or would it be too hacky? tSNE is apparently misleading in these cases...
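A rough sketch of what that warning could look like; this is purely hypothetical, the helper name and the assumption that the metric is recorded under adata.uns['neighbors']['params'] are mine and may not match the actual scanpy internals:

import warnings

# Hypothetical helper, not actual scanpy code: warn inside sc.tl.tsne when the
# neighbors graph was built with a non-euclidean metric. Assumes the metric is
# stored under adata.uns['neighbors']['params'], which may differ by version.
def _warn_if_non_euclidean_graph(adata):
    params = adata.uns.get('neighbors', {}).get('params', {})
    if params.get('metric', 'euclidean') != 'euclidean':
        warnings.warn(
            'A non-euclidean metric was used to construct the neighbors graph; '
            'clusters computed on that graph may not be representative in tSNE, '
            'which recomputes distances with the euclidean metric.'
        )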

falexwolf commented 6 years ago

Hi Gökcen: makes sense!

Hi Sidney: if I'm not completely mistaken, I don't think the Jaccard metric makes sense at all for continuous ordinal variables. It would make sense if one had boolean gene expression or something like this... I guess this is the reason why you get a meaningless graph with it. I only ever use euclidean distance; all other desired aspects of the metric are engineered in the preprocessing already.
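To illustrate the "boolean gene expression" case, here is a minimal sketch (not a recommendation): binarize the expression matrix and build the jaccard graph directly on it, bypassing PCA via use_rep='X'. The zero threshold and the separate AnnData are my own assumptions:

import numpy as np
import scanpy as sc

# binarize: detected vs. not detected
binary = (adata.X > 0).astype(np.float32)
bdata = sc.AnnData(binary, obs=adata.obs)

# jaccard graph on the boolean matrix itself, not on its PCA
sc.pp.neighbors(bdata, use_rep='X', metric='jaccard', random_state=2018)
sc.tl.louvain(bdata, random_state=2018)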

batson commented 6 years ago

The Jaccard metric, taken from umap, I believe, returns for a pair of vectors x and y the Jaccard distance between their sets of nonzeros. (Code.)

In the case that you feed the raw or logged gene expression matrix, this is reasonable: it's the fraction of genes with nonzero expression shared by the two cells.

If you compute this on PCA vectors, however, which are essentially all nonzero, you are getting some version of the complete graph, downsampled to k in some random way. This would explain why clustering and embedding based on that graph is garbage.
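To make the failure mode concrete, here is a small standalone illustration; my own sketch of the distance described above, not the actual umap-learn code:

import numpy as np

# Jaccard distance between the nonzero supports of two vectors
def jaccard_distance(x, y):
    x_nonzero = x != 0
    y_nonzero = y != 0
    union = np.sum(x_nonzero | y_nonzero)
    if union == 0:
        return 0.0
    intersection = np.sum(x_nonzero & y_nonzero)
    return 1.0 - intersection / union

# On raw counts this measures shared detected genes ...
counts_a = np.array([0, 3, 1, 0, 2])
counts_b = np.array([0, 5, 0, 0, 1])
print(jaccard_distance(counts_a, counts_b))  # 2 shared of 3 detected genes -> 1/3

# ... but PCA coordinates are essentially all nonzero, so the distance collapses
pca_a = np.array([0.3, -1.2, 0.7])
pca_b = np.array([2.1, 0.4, -0.9])
print(jaccard_distance(pca_a, pca_b))  # 0.0 for (almost) any pair of cells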

While investigating this, we (@sidneymbell and I) came across some behavior that may or may not be the desired default. If you call tools._utils.choose_representation with use_rep=None, n_pcs=None, then it will return X_pca (all columns), because X_pca[:, :None] is X_pca, which will be the top 50 PCs if one is running with defaults. I might have expected this to instead return X.
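A tiny standalone check of that slicing behavior, for reference (plain numpy, nothing scanpy-specific):

import numpy as np

# a stop of None means "slice to the end", so [:, :None] keeps all columns
X_pca = np.random.rand(5, 50)
assert X_pca[:, :None].shape == X_pca.shape  # (5, 50): all 50 PCs are kept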

falexwolf commented 6 years ago

Yes, Joshua, thank you... it makes sense that it takes the nonzero overlap - this is what I meant by "boolean gene expression". 🙂 And yes, on PCA this does not make sense at all.

OK, you dug out the private function tools._utils.choose_representation. This returns the PCA representation if the data matrix has more than 50 variables: See the documentation of the use_rep argument here and the code https://github.com/theislab/scanpy/blob/8e06ff6ecfab892240b58d2206e461685216a926/scanpy/tools/_utils.py#L22-L43. This behavior is intended as it is rarely advisable to compute distances on an uncompressed data matrix with more than 50 dimensions. Don't you think so? If .X is already a 100-dimensional compressed latent representation of another model, then, of course, a PCA on top of that could be nonsense - here I'd agree.
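For completeness: if one really wants a metric computed on .X, one can choose the representation explicitly rather than relying on the default. A small sketch, assuming X_pca is already computed; parameter values are just illustrative:

import scanpy as sc

# neighbors on .X itself (e.g. for a metric that only makes sense on the raw matrix)
sc.pp.neighbors(adata, use_rep='X', metric='jaccard', random_state=2018)

# roughly the default behavior for data with more than 50 variables
sc.pp.neighbors(adata, use_rep='X_pca', n_pcs=50, metric='euclidean', random_state=2018)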

batson commented 6 years ago

Thanks for the clarification, @falexwolf. The documentation is clear, not sure how I missed it. I agree that euclidean distance is well-approximated by PCA (as long as populations are sufficiently large). For other metrics, that may not be the case (and for Jaccard it bails hard), and so maybe a warning would be appropriate in those cases rather than changing the behavior of choose_representation.