Implement `sc.tl.mde` - Githubissues

adamgayoso commented 2 years ago

[ ] Additional function parameters / changed functionality / changed defaults?
[x] New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
[ ] New plotting function: A kind of plot you would like to seein sc.pl?
[ ] External tools: Do you know an existing package that should go into sc.external.*?
[ ] Other?

PyMDE is a nice visualization method that to me seems to effectively serve the same purpose as UMAP in analyses (discussion about appropriateness of these methods can be in another issue :) ). It's super fast because after running pynndescent it puts the graph on the GPU optionally (using pytorch). I would love to see this in scanpy. There might be a way to use the scanpy neighbors graph from sc.pp.neighbors directly in pymde as the function below is a wrapper of some internal classes.

ivirshup commented 2 years ago

I'm hesitant to add another embedding method without clear benefits over existing implementations.

Could you give some detail on benefits here, ideally with direct comparisons?

adamgayoso commented 2 years ago

See colab notebook.

With random initialization it's about 10x faster than UMAP on this system. The quadratic init (default) is as fast as UMAP, but there's an opportunity to optimize that code to use the GPU.

akshayka commented 2 years ago

Hi @ivirshup; I am the creator of PyMDE. I'm happy to provide some context. I'd like to say up front though that I am not very familiar with Scanpy (which seems like a fantastic library!), so please bear with me if some of the things I mention are not relevant to Scanpy.

PyMDE (documentation here: https://pymde.org/) has a few benefits:

as @adamgayoso mentioned, PyMDE supports computing embeddings on GPU. This makes it possible to compute very large embeddings quickly (often 4-10x faster than CPU).
PyMDE is a very general embedding library. It is based on a general framework for embedding, and this framework includes many well-known methods --- such as UMAP, PCA, Laplacian embedding, multi-dimensional scaling, and more --- as special cases. This makes it easy to compare different methods using a single framework.
PyMDE also supports creating entirely new types of embeddings, as custom instances of our framework.
PyMDE provides ways to reason about how much an embedding distorts the original neighborhood graph.

There are some comparisons to UMAP & openTSNE in the third part of our manuscript, which has been published in Foundations & Trends in Machine Learning and is available here: https://web.stanford.edu/~boyd/papers/pdf/min_dist_emb.pdf

on CPU, UMAP and PyMDE are comparable in speed, with UMAP often having a slight edge; on GPU PyMDE can be much faster
unlike UMAP/openTSNE, PyMDE allows users to fit constrained embeddings. Right now the supported constraints are standardization (zero mean, unit covariance; this forces embeddings to spread out, but not too much, and as a result standardized embeddings are typically similarly scaled), centering, and anchoring (pre-specifying the coordinates of a subset of the items)
PyMDE allows for more types of embeddings, in addition to UMAP-style embeddings

On the other hand, PyMDE is young software. If you do depend on it, I would recommend including it as an optional dependency, not a required one.

Happy to chat more, to answer any questions, and to help with integration, if that is something you are ultimately interested in.

adamgayoso commented 2 years ago

I just updated the notebook linked at the top of the PR. I have a PR at pymde to improve the initialization speed using the GPU (https://github.com/cvxgrp/pymde/pull/55). Using these changes, pymde takes 20 seconds and umap takes around 200 seconds (150k cells). Most of the time of pymde I believe is from the initial pynndescent call. Therefore, if implemented well here and therefore using a precomputed neighbors graph, pymde would take no more than a few seconds for most use cases.

ivirshup commented 2 years ago

@akshayka thanks for contributing to the conversation! The package does indeed look interesting. A couple questions about the tool:

Can we get a weighted graph out of the fit embedding object? For context, we use the UMAP weighted connectivity graph for a number of downstream tasks. This seems related to distortions, but maybe not quite what they are.

I'm also wondering about just how early the package is. I would like to be able to take advantage of any new features, and wouldn't want an early API decision to lock us out of those.

akshayka commented 2 years ago

Can we get a weighted graph out of the fit embedding object?

Yes, PyMDE can do that. The longer answer is that it depends on the type of embedding problem you set up --- some are specified using weighted graphs, others are not. But most embedding problems (including all problems specified using the preserve_neighbors function, which is the most commonly used recipe) have associated weighted graphs.

I'm also wondering about just how early the package is. I would like to be able to take advantage of any new features, and wouldn't want an early API decision to lock us out of those.

Great question. The internals will very likely change over the coming months/year. But the interface to the MDE class, which is the central object in PyMDE, will likely be stable.

ivirshup commented 2 years ago

The internals will very likely change over the coming months/year.

Good to know, thanks! Any hints about what will change here? In particular, I'm wondering if there might be a jax implementation as I'm a bit more keen on that as a dependency.

But most embedding problems (including all problems specified using the preserve_neighbors function, which is the most commonly used recipe) have associated weighted graphs.

I'd be interested in seeing how these graphs perform compared to the ones we get from UMAP. Would this be the right way to retrieve the graphs for the object, or is distortions not the right field?

from scipy import sparse

weights = mde.distortions().cpu().numpy()
edges = mde.edges.cpu().numpy()

graph = sparse.coo_matrix((weights, (edges[:, 0], edges[:, 1])), shape=(mde.n_items, mde.n_items))

adamgayoso commented 2 years ago

I'm wondering if there might be a jax implementation as I'm a bit more keen on that as a dependency.

Probably for another discussion -- I like jax as much as anyone, but it's not nearly as easy to install as pytorch, especially on windows and m1 mac.

akshayka commented 2 years ago

In particular, I'm wondering if there might be a jax implementation as I'm a bit more keen on that as a dependency.

I don't have any plans to switch from PyTorch to JAX. I did evaluate JAX when I started the project, but it wasn't mature enough back then.

I'd be interested in seeing how these graphs perform compared to the ones we get from UMAP.

I'm not super clear on the semantics of the graphs obtained from UMAP. They might differ somewhat from the ones obtained from PyMDE.

Would this be the right way to retrieve the graphs for the object, or is distortions not the right field?

That's not quite right. Assuming that mde was constructed from preserve_neighbors, try this:


weights = mde.distortion_function.weights.cpu().numpy()
edges = mde.edges.cpu().numpy()
n_items = mde.n_items

graph = pymde.Graph.from_edges(edges, weights, n_items).adjacency_matrix

(API docs for Graph here: https://pymde.org/api/index.html#pymde.Graph. In the Graph class, distances/weights are used interchangeably.)

I'll just mention however that with PyMDE, the weights and edges don't fully determine the embedding. The weights are parameters to distortion functions, which convey the extent to which two items are similar or dissimilar. Roughly speaking positive weights mean items are similar and should be close together, and negative weights mean that they're dissimilar and shouldn't be close (but need not be far). More details here:https: //pymde.org/mde/index.html

scverse / scanpy

Implement `sc.tl.mde` #2154