scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars · 597 forks

Switch t-SNE in scanpy to openTSNE #1233

Open dkobak opened 4 years ago

dkobak commented 4 years ago

I'm not sure what t-SNE implementation is currently used in scanpy, but would it make sense to switch it to openTSNE? It's a Cython re-implementation of FIt-SNE, it's available on conda and should be very easy to depend on.

As far as I understand the scanpy architecture, it builds a kNN graph and then runs downstream analysis (like UMAP or Louvain) on this kNN graph. Is that right? I suppose this is currently not implemented for t-SNE? With openTSNE it'd be easy to use the pre-built kNN graph and run t-SNE directly on that.

Also, the default parameters of t-SNE in scanpy could IMHO be improved, see https://www.nature.com/articles/s41467-019-13056-x. Some of these recommendations (learning rate, initialization) are now default in openTSNE.
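To make that recommendation concrete, the learning-rate heuristic from the paper can be sketched as follows. This is an illustrative snippet, not scanpy or openTSNE code:

```python
# Learning-rate heuristic from Kobak & Berens (Nat Comms, 2019):
# eta = n / early_exaggeration, with the classic default of 200 as a floor.
# Illustrative sketch only -- not scanpy or openTSNE code.
def recommended_learning_rate(n_samples, early_exaggeration=12):
    return max(200.0, n_samples / early_exaggeration)

# For a million cells this is far above scanpy's fixed learning rate of 1000.
eta = recommended_learning_rate(1_000_000)
```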

There are some open issues at scanpy related to t-SNE such as https://github.com/theislab/scanpy/issues/1150 and https://github.com/theislab/scanpy/issues/996 but I think this suggestion would supersede them. We had a brief discussion of this at openTSNE here https://github.com/pavlin-policar/openTSNE/issues/102.

I can see four somewhat separate suggestions:

  1. switch scanpy to using openTSNE for tSNE, using already constructed kNN graph
  2. add tSNE support for ingest using openTSNE functionality.
  3. change default tSNE parameters (n_iter, learning rate, initialization) following openTSNE defaults.
  4. add some tSNE "recipes" based on https://www.nature.com/articles/s41467-019-13056-x

What of this, if any, makes sense from the scanpy point of view?

ivirshup commented 4 years ago

I'm not sure what t-SNE implementation is currently used in scanpy, but would it make sense to switch it to openTSNE? It's a Cython re-implementation of FIt-SNE, it's available on conda and should be very easy to depend on.

We use MulticoreTSNE if it's installed, but fall back to sklearn.
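That fallback amounts to a standard optional-import pattern (a sketch of the idea, not scanpy's actual code):

```python
# Prefer the faster MulticoreTSNE backend when it is installed,
# otherwise fall back to scikit-learn's implementation.
try:
    from MulticoreTSNE import MulticoreTSNE as TSNE  # optional dependency
except ImportError:
    from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)  # both backends accept this constructor argument
```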

As far as I understand the scanpy architecture, it builds a kNN graph and then runs downstream analysis

Right now, we tend to use a connectivity graph built by UMAP, but are working on making this more generic. We're thinking about allowing the UMAP embedding to be generated on graphs we provide as well.

  1. switch scanpy to using openTSNE for tSNE, using already constructed kNN graph

I think I'd like to see this. That package is much more actively maintained than our current backend, and looks interesting. I would like it if the t-SNE implementation were flexible about which graph is used.

I'm not sure that I'll get to this, but a PR would be welcome. I'd have to see some performance/results before thinking about changing the defaults, or about whether this would go into a major or minor version change.

  2. add tSNE support for ingest using openTSNE functionality.

@Koncopd do you have any thoughts on this?

  3. change default tSNE parameters (n_iter, learning rate, initialization) following openTSNE defaults.

Again, I'd have to think about backwards compatibility. Maybe this could start as a sc.tl.opentsne function?

  4. add some tSNE "recipes"

I'd be interested in this. Skimming that paper now, I really like the idea of showing regions of uncertainty for the projection; that would be very useful. I'd also be interested in how these "recipes" could be wrapped in a function.

dkobak commented 4 years ago

Thanks! Can I ask for two clarifications before replying:

Right now, we tend to use a connectivity graph built by UMAP ...

UMAP uses Pynndescent to construct the kNN graph. So does that mean that you depend on Pynndescent to construct the kNN graph, and then, if the user calls UMAP, it's run on this previously constructed kNN graph?

By the way, openTSNE uses Annoy instead of Pynndescent for non-sparse input data and simple metrics that are supported by Annoy (like Euclidean or cosine). It seems to work quite a bit faster. For sparse input data and/or fancy metrics, it uses Pynndescent.
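The backend choice described here can be sketched roughly like this (the metric set and function name are illustrative; openTSNE's real selection logic lives in its affinity code):

```python
import scipy.sparse as sp

# Metrics with native Annoy support (illustrative subset).
ANNOY_METRICS = {"euclidean", "cosine", "manhattan", "hamming", "dot"}

def pick_knn_backend(X, metric="euclidean"):
    """Roughly mirror the rule described above: Annoy for dense data with
    simple metrics, pynndescent for sparse data or fancy metrics."""
    if sp.issparse(X) or metric not in ANNOY_METRICS:
        return "pynndescent"
    return "annoy"
```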

... but are working on making this more generic. We're thinking about allowing the UMAP embedding to be generated on graphs we provide as well.

What are the use cases here that you're thinking of?

ivirshup commented 4 years ago

That sounds mostly right. We use the nearest_neighbors function from umap, which uses pynndescent if it's installed:

https://github.com/theislab/scanpy/blob/5bc37a2b10f40463f1d90ea1d61dc599bbea2cd0/scanpy/neighbors/__init__.py#L270-L280

By the way, openTSNE uses Annoy instead of Pynndescent for non-sparse input data and simple metrics that are supported by Annoy

I'm curious about how much the backend changes the runtime and results of nearest neighbors methods. I'm definitely for being more generic about how the neighbors graph is generated and weighted. I haven't seen anything yet which looks at the character of the inaccuracies for each method, something that's probably important when they're used for classification.

What are the use cases here that you're thinking of?

Mainly cases of merged graphs, like when you have multiple datasets or modalities.

dkobak commented 4 years ago

I'm curious about how much the backend changes the runtime and results of nearest neighbors methods.

You can see some quick comparisons between Pynndescent and Annoy here: https://github.com/pavlin-policar/openTSNE/issues/101#issuecomment-597178379. But I have not investigated it very thoroughly.

Anyway, returning to the main conversation:

I think switching to openTSNE makes sense even if nothing else that we are discussing is implemented. It's A LOT faster than Multicore t-SNE for large datasets: https://opentsne.readthedocs.io/en/latest/benchmarks.html. It is also more flexible, actively supported, conveniently packaged/distributed, etc. I don't see any possible disadvantage. You could potentially keep all the default parameters as you have now in scanpy (even though I would not recommend it, see below).

However, what I said about using a pre-built kNN graph requires some thinking. t-SNE uses perplexity=30 by default and a kNN graph with k=3*perplexity, so that's 90 by default. UMAP uses k=15, and that's what you use in scanpy by default too. I can see three options here:

i) Let openTSNE do its own thing and ignore the kNN graph built in scanpy. Advantage: that's what you do now. Disadvantage: not very consistent architecture IMHO.

ii) Use the kNN graph built in scanpy and query() it to get 90 neighbors. Disadvantage: can be a bit slow. But I think it's better than (i).

iii) Run t-SNE using 15 neighbors. It turns out that t-SNE with uniform affinities across 15 neighbors is extremely similar to t-SNE with perplexity 30. Evidence: https://twitter.com/hippopedoid/status/1232698023253303298. So you could run this version of t-SNE with a uniform kernel. This will be very fast.
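Option (iii) can be sketched with plain scipy: give each of the k neighbors uniform weight, then symmetrize and normalize the matrix the way t-SNE expects. This illustrates the idea, not openTSNE's implementation:

```python
import numpy as np
import scipy.sparse as sp

def uniform_affinities(knn_indices):
    """Build t-SNE affinities from an (n_samples, k) kNN index array,
    giving every neighbor the same weight (option iii above)."""
    n, k = knn_indices.shape
    rows = np.repeat(np.arange(n), k)
    P = sp.csr_matrix(
        (np.full(n * k, 1.0 / k), (rows, knn_indices.ravel())), shape=(n, n)
    )
    P = (P + P.T) / 2.0  # symmetrize, as t-SNE requires
    return P / P.sum()   # normalize so all affinities sum to 1
```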

Regarding default parameters: the learning rate of 1000 that you use by default is simply not enough for large data (sample sizes in the millions), as shown in detail in that Nat Comms paper. If you want to keep it for compatibility reasons, that's your choice, but be aware that you are getting suboptimal tSNE embeddings. The same goes for initialization: UMAP smartly uses Laplacian Eigenmaps to initialize, but sklearn/Multicore tSNE use random init, which is simply a bad choice (as again shown in that paper). openTSNE now uses PCA init by default, which is much more sensible.
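The PCA initialization mentioned here is easy to sketch with scikit-learn. Rescaling the first component to a small standard deviation (1e-4) follows the convention described in the t-SNE literature; the helper below is a hypothetical illustration, not openTSNE code:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_init(X, scale=1e-4, random_state=0):
    """PCA initialization for t-SNE, rescaled so the first component
    has standard deviation `scale`. Sketch only, not openTSNE code."""
    Y = PCA(n_components=2, random_state=random_state).fit_transform(X)
    return Y / np.std(Y[:, 0]) * scale
```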

pavlin-policar commented 4 years ago

Glad to see this discussion going on. Integrating openTSNE into scanpy should be fairly straightforward but may require some thought. I think Dmitry has already pointed out the most important things, such as the improved defaults, which other t-SNE implementations are lagging behind on. Apart from that, we package prebuilt binaries, so adding openTSNE as a dependency would be incredibly easy.

The main thing we'd have to agree on is how to deal with the kNN graphs. UMAP by default calculates 15 nearest neighbors, and from what I can tell, louvain and leiden clustering both use those 15 neighbors as well by default. t-SNE, on the other hand, calculates 90 nearest neighbors by default. This is true of every single t-SNE implementation, not just openTSNE. Dmitry suggested using a uniform kernel with 15 neighbors, which would fit elegantly, but then again, this isn't truly t-SNE anymore, but rather something very close to it.

The same goes for the ingest functionality. openTSNE does something similar to UMAP for adding new samples to existing embeddings, and then we'd again have to figure out how to calculate nearest neighbors from the new data to the reference data. I don't know how you do this currently for UMAP.

I'm not exactly sure how these neighbors are meant to be used in scanpy, since there are several different algorithms that use them. Graph-based clustering uses the kNN graph, UMAP uses it, forceatlas2 uses it, and PAGA probably does as well. Is relying on a single k=15 from UMAP for everything really OK? For example, Seurat defaults to using 30 neighbors for clustering.

dkobak commented 4 years ago

Pavlin made good points. @ivirshup What are your thoughts?

ivirshup commented 4 years ago

UMAP by default calculates 15 nearest neighbors, and from what I can tell, louvain and leiden clustering both use those 15 neighbors as well by default.

Both of those clustering algorithms just use whatever graph is passed, so this shouldn't be an issue.

t-SNE, on the other hand, calculates 90 nearest neighbors by default. openTSNE does something similar to UMAP for adding new samples to existing embeddings

Could there just be a separate function for computing neighbors for tsne? sc.pp.neighbors can be considered to be "compute nearest neighbors and connectivity as expected by UMAP", while a separate function could use methods and defaults appropriate to openTSNE.
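Such a separate function might look roughly like this; the name and return format are hypothetical, not scanpy API:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbors_for_tsne(X, perplexity=30):
    """Compute the k = 3 * perplexity nearest neighbors that standard
    t-SNE expects. Hypothetical sketch of a t-SNE-specific neighbors step."""
    k = 3 * perplexity
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)    # first column is the point itself
    return dist[:, 1:], idx[:, 1:]  # drop self-neighbors
```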

@Koncopd would have more to say on how this should work w.r.t. ingest.

Is relying on a single k=15 from UMAP for everything really ok?

Ultimately, up to the user. There is an element of consistency and simplicity to using the same representation of the data for multiple parts of the analysis. I think there would have to be a good reason for using a different connectivity matrix for the 2d embedding and for the clustering.

Koncopd commented 4 years ago

Yep, UMAP can use a graph with any number of neighbors, so can ingest.

dkobak commented 3 years ago

@ivirshup @pavlin-policar I'd like to get back to this issue. I think maybe we should postpone discussing ingest until later (as well as "recipes" based on our Nat Comms paper, and other topics raised above) and focus on switching to openTSNE first.

Scanpy's architecture is to compute the kNN graph by calling sc.pp.neighbors and then run dimensionality reduction (UMAP) and clustering (Leiden) on this graph. I think this is very neat, makes everything consistent, and other functions should follow this approach as much as possible. So IMHO, if it's possible to run t-SNE on the kNN graph with k=15, then that's what we should do.

And luckily it is possible! I can even see two approaches. (1) Either use the k=15 kNN graph with a uniform similarity kernel. As I said, and as Pavlin knows, this yields results that are very similar to using perplexity=30. (2) Or use the k=15 kNN graph with UMAP weights, normalize it as t-SNE expects, and use that. My expectation is that it would yield very similar results, but I haven't actually tried it.

Admittedly, this is not the "vanilla" t-SNE. But it's very close. And I think the advantages outweigh the disadvantages. Actually, this is quite a bit faster than standard t-SNE, because it only uses k=15 instead of k=90 (3 times perplexity=30).
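Approach (2) would amount to a renormalization of the UMAP-weighted graph. A minimal sketch, assuming the connectivity matrix is already symmetric or nearly so, and untested against openTSNE itself:

```python
import scipy.sparse as sp

def tsne_affinities_from_connectivities(C):
    """Rescale a UMAP-style connectivity matrix into a t-SNE affinity
    matrix: enforce symmetry, then normalize so all entries sum to 1."""
    P = sp.csr_matrix(C, dtype=float)
    P = (P + P.T) / 2.0
    return P / P.sum()
```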

Moreover, we could make the standard t-SNE available by extending sc.pp.neighbors with method="tsne" (there are several methods there already). What I mean is that

sc.pp.neighbors()
sc.tl.tsne()

would use, say, a uniform kernel on the k=15 kNN graph (and maybe print a warning about it, though I am not even sure that's needed), while

sc.pp.neighbors(method="tsne", perplexity=30)
sc.tl.tsne()

would construct k=90 weighted kNN graph as standard t-SNE does and then use that.

Either way, sc.tl.tsne() runs openTSNE with the pre-defined affinity matrix.

Thoughts?

pavlin-policar commented 3 years ago

I think it would make a lot of sense to at least switch the t-SNE implementation to openTSNE. For the recipes, there's already something similar in the preprocessing module. So I'd imagine calling standard t-SNE with sc.tl.tsne and the recipes with something like sc.tl.tsne.recipe_multiscale.

And luckily it is possible! I can even see two approaches. (1) Either use k=15 kNN graph with the uniform similarity kernel. As I said, and as Pavlin knows, this yields result that is very similar to using perplexity=30. (2) Or use k=15 kNN graph with UMAP weights, normalize it as t-SNE expects it to be normalized and use that. My expectation is that it would yield very similar results, but I haven't actually tried it.

I very much prefer option 1. If I understand option 2 correctly, we would normalize the 15 neighbors to essentially perplexity=5. I've never once found a case where that is useful, so having this as the default behaviour in scanpy seems like a really bad idea (I foresee a lot of issues in the style "why is t-SNE not working?"). Using a uniform kernel produces results that are virtually indistinguishable from vanilla t-SNE, so that's fine IMO, and it's faster as well. It's still less than the default perplexity=30, but this seems like the best option. Whatever we agree on, the same can be applied to the ingest functionality, so adding that would also be straightforward.

Moreover, we could make the standard t-SNE available by extending sc.pp.neighbors with method="tsne" (there are several methods there already).

I don't understand this; why would this belong in sc.pp.neighbors? The graph weighting should go into the sc.tl.tsne call. Are the UMAP weights assigned to the graph in sc.pp.neighbors? That seems questionable. I would expect the output to be a directed, unweighted graph, and let each method take care of weighting it. If anything, I'd expect it to be weighted using the Jaccard index of shared nearest neighbors, which seems to me like pretty much the standard thing to do in single-cell analysis.

dkobak commented 3 years ago

I don't understand this; why would this belong in sc.pp.neighbors? The graph weighting should go into the sc.tl.tsne call. Are the UMAP weights assigned to the graph in sc.pp.neighbors?

Yes, this is my understanding of how it works in scanpy. See https://scanpy.readthedocs.io/en/stable/api/scanpy.pp.neighbors.html:

method : {‘umap’, ‘gauss’, ‘rapids’}, None (default: ‘umap’)

    Use ‘umap’ [McInnes18] or ‘gauss’ (Gauss kernel following [Coifman05] with
    adaptive width [Haghverdi16]) for computing connectivities. Use ‘rapids’
    for the RAPIDS implementation of UMAP (experimental, GPU only).

If I understand option 2 correctly, we would normalize the 15 neighbors to essentially perplexity=5.

That's not what I meant. I meant taking UMAP's weights for k=15 and normalizing the matrix so that it sums to 1, as in t-SNE. That said, I would also prefer option 1 because I don't want anything that sounds like it's a UMAP/t-SNE hybrid.

pavlin-policar commented 3 years ago

Yes, this is my understanding of how it works in scanpy.

Thinking about this a bit further, yes, that does make perfect sense. I suppose we'd want to add a "tsne" option there as well then.

That's not what I meant. I meant taking UMAP's weights for k=15 and normalizing the matrix so that it sums to 1, as in t-SNE. That said, I would also prefer option 1 because I don't want anything that sounds like it's a UMAP/t-SNE hybrid.

I see now, and I completely agree with you: I don't really like this. It would be some strange mix of UMAP and t-SNE. It feels messy.