Closed: sidneymbell closed this issue 6 years ago
The current tSNE implementation unfortunately doesn't visualize the kNN graph constructed via `sc.pp.neighbors()`, but instead reconstructs its own kNN graph using euclidean distance. The tSNE calls here therefore ignore the jaccard metric you provided. Can you use another visualization method to see what the Louvain groups correspond to, e.g. `sc.tl.draw_graph()` or `sc.tl.umap()` together with `sc.pl.*`? (It'll probably not look as nice.)
Hey @gokceneraslan, thanks for looking at this.
I wouldn't expect these to correspond completely. But I would expect some level of concordance between them, especially since these two methods (jaccard-based kNN clustering + tSNE visualization) were used in tandem a few months back to assign the "ground truth" annotations shown above. It seemed like odd enough behavior that it was worth flagging as a potential issue.
Jaccard metric may lead to very different graphs where Louvain communities are not really distinct compared to those in graphs built with euclidean metric. For example:
Graph with euclidean metric and clusters:
Graph with jaccard metric and clusters (or lack thereof):
So in this case, there are simply no communities in the Jaccard graph (at least at the default resolution). However, tSNE always uses euclidean distance to compute a kNN graph from scratch (completely discarding the metric the user specified in `sc.pp.neighbors`), so the clusters computed on the Jaccard graph are no longer representative, due to the very different topology of the two graphs.
Something similar is going on in your data, so `sc.tl.draw_graph` would show what the Louvain clusters really look like, because it uses the kNN graph built with the metric the user specified.
@falexwolf, does it make sense to print a warning in `sc.tl.tsne` if `sc.pp.neighbors` was called with a metric other than euclidean, e.g. "a non-euclidean metric was used to construct the graph; Louvain clusters may not be representative in tSNE"? Or would that be too hacky? tSNE is apparently misleading in these cases...
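As a sketch only (this is not scanpy's actual code; `maybe_warn_about_metric` is a hypothetical helper name), the proposed warning could look like:

```python
import warnings

def maybe_warn_about_metric(metric: str) -> None:
    """Warn when tSNE is run on data whose neighbor graph used a non-euclidean metric."""
    if metric != 'euclidean':
        warnings.warn(
            "A non-euclidean metric (%r) was used to construct the neighbor graph; "
            "Louvain clusters may not be representative in the tSNE embedding, "
            "because tSNE recomputes distances with the euclidean metric." % metric
        )

# Demonstrate that the warning fires for a jaccard graph.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    maybe_warn_about_metric('jaccard')
print(len(caught))  # 1
```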
Hi Gökcen: makes sense!
Hi Sidney: if I'm not completely mistaken, the Jaccard metric doesn't make sense at all for continuous ordinal variables. It would make sense if one had boolean gene expression or something like that... I guess this is the reason why you get a meaningless graph with it. I only ever use euclidean distance; all other desired aspects of the metric are already engineered into the preprocessing.
The Jaccard metric (taken from umap, I believe) returns, for a pair of vectors x and y, the Jaccard distance between their sets of nonzeros. (Code.)
In the case that you feed the raw or logged gene expression matrix, this is reasonable: it's the fraction of genes with nonzero expression shared by the two cells.
If you compute this on PCA vectors, however, which are essentially all nonzero, you are getting some version of the complete graph, downsampled to k in some random way. This would explain why clustering and embedding based on that graph is garbage.
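A quick numpy-only sanity check of this point (the vectors below are made up for illustration): on sparse count-like vectors the nonzero-overlap Jaccard distance is informative, while on dense PCA-like vectors every pair collapses to distance zero, so neighbor selection degenerates into arbitrary tie-breaking.

```python
import numpy as np

def jaccard_on_nonzeros(x, y):
    """Jaccard distance between the nonzero index sets of x and y."""
    a, b = set(np.flatnonzero(x)), set(np.flatnonzero(y))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Sparse count-like vectors: a meaningful distance.
x = np.array([0.0, 3.0, 0.0, 1.0])  # nonzeros at {1, 3}
y = np.array([2.0, 3.0, 0.0, 0.0])  # nonzeros at {0, 1}
print(jaccard_on_nonzeros(x, y))    # 1 - 1/3 ≈ 0.667

# Dense PCA-like vectors: everything is nonzero, so every pair has distance 0.
p = np.array([0.12, -0.5, 0.33])
q = np.array([-0.9, 0.27, 0.01])
print(jaccard_on_nonzeros(p, q))    # 0.0
```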
While investigating this, we (@sidneymbell and I) came across some behavior that may or may not be the desired default. If you call `tools._utils.choose_representation` with `use_rep=None, n_pcs=None`, it returns `X_pca` (all columns), because `X_pca[:, :None] == X_pca`; with defaults this is the top 50 PCs. I might have expected this to instead return `X`.
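The slicing behavior in question can be checked directly in numpy (a toy array standing in for `X_pca`): a stop of `None` means "to the end", so `arr[:, :None]` is the whole array, not an empty slice.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)  # stand-in for X_pca

# Slicing with stop=None keeps every column.
assert arr[:, :None].shape == (3, 4)
assert np.array_equal(arr[:, :None], arr)
print(arr[:, :None].shape)  # (3, 4)
```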
Yes, Joshua, thank you... it makes sense that it takes the nonzero overlap; that is what I meant by "boolean gene expression". 🙂 And yes, on PCA this does not make sense at all.
OK, you dug out the private function `tools._utils.choose_representation`. It returns the PCA representation if the data matrix has more than 50 variables: see the documentation of the `use_rep` argument here and the code https://github.com/theislab/scanpy/blob/8e06ff6ecfab892240b58d2206e461685216a926/scanpy/tools/_utils.py#L22-L43. This behavior is intended, as it is rarely advisable to compute distances on an uncompressed data matrix with more than 50 dimensions. Don't you think so? If `.X` is already a 100-dimensional compressed latent representation from another model, then of course a PCA on top of that could be nonsense; there I'd agree.
Thanks for the clarification, @falexwolf. The documentation is clear; I'm not sure how I missed it. I agree that euclidean distance is well-approximated by PCA (as long as populations are sufficiently large). For other metrics that may not be the case (and for Jaccard it fails badly), so perhaps a warning would be appropriate in those cases, rather than changing the behavior of `choose_representation`.
Hello!
I found an odd bug where computing `sc.pp.neighbors()` with `metric='jaccard'` results in random cluster assignments coming out of `sc.tl.louvain()`. Running with the `euclidean` distance metric yields appropriate cluster assignments from `sc.tl.louvain()`.
Reproduce (`adata` is a bone marrow data set, with "ground truth" cell type annotations in `adata.obs['cell_type']`):
Thanks! Let me know if you need another set of eyes in tracking this one down :)