
Support for wasserstein (earth mover's) distance #751


sgbaird commented 3 years ago

On a longer term front, pynndescent, which is going to be the engine for nearest neighbor search in umap v0.5 (and was spun out of the original umap code), will have a wasserstein/kantorovich distance in version 0.5. A sparse/online version of that would probably be enough to implement word-mover distance directly in the nearest neighbor search itself. That is still a little way off however.

Originally posted by @lmcinnes in https://github.com/lmcinnes/umap/issues/424#issuecomment-625469698

What is the status on this? I'm using vector-based (not text-based) earth mover's distance.

lmcinnes commented 3 years ago

It exists, and you can use it right now; it just isn't well documented or prominently placed at this time. Depending on how big your distributions are, there are a couple of ways of going about this. I'll outline how the most basic version works.

I will suppose that we have data as distributions over a finite set of vectors. For simplicity of display here, let's make the number of vectors 3 (it can definitely be larger!). So we might have three vectors in 5d space (numbers made up randomly):

vectors = [
    [0.1, -0.1, 7.9, 6.1, -2.4],
    [0.7, 2.1, 6.5, -1.3, 3.1],
    [-0.1, 5.4, -0.5, 1.1, 1.2],
]

and our data consists of probability distributions over those vectors:

distributions = [
    [0.1, 0.9, 0.0],
    [0.33, 0.33, 0.33],
    [0.2, 0.5, 0.3],
    [0.25, 0.7, 0.05],
    [0.6, 0.2, 0.2],
]

Then the thing you need is the cost (or distance) matrix between the vectors. You could do something like:

import sklearn.metrics

cost = sklearn.metrics.pairwise_distances(vectors)

And then you can hand the whole thing directly to UMAP as

import umap

mapper = umap.UMAP(metric="wasserstein", metric_kwds={"cost": cost}).fit(distributions)

and that should just work. Note, however, that it will tend to be fairly slow (especially if you have a lot of data), because it is actually solving the requisite optimal transport problems. Obviously this is not going to scale to large distributions over many vectors, as the cost matrix would simply get too large. There are options to deal with that, but they are a little more complicated out of the box, so I won't discuss them now.
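To see concretely what is being solved per pair, here is a sketch using the POT library (not a UMAP dependency, and not part of the original answer); ot.emd2 computes the same earth mover's distance for a single pair of distributions, reusing the cost matrix from above:

import numpy as np
import ot  # POT: pip install pot

# two of the distributions from above
a = np.array([0.1, 0.9, 0.0])
b = np.array([0.2, 0.5, 0.3])

# exact earth mover's distance under the vector cost matrix; UMAP's
# "wasserstein" metric solves one such problem for every pair it compares
emd = ot.emd2(a, b, np.asarray(cost, dtype=np.float64))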

Another option, if you are looking into these sorts of things, is the WassersteinVectorizer in the vectorizers library. It uses linear optimal transport theory to create a linear space in which distances correlate with the Wasserstein distances of the original data. It would look something like:

import vectorizers

linear_vectors = vectorizers.WassersteinVectorizer(metric="euclidean").fit_transform(distributions, vectors=vectors)
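The resulting linear_vectors live in an ordinary linear space, so (as a sketch, assuming the call above succeeds) you can embed them with a plain Euclidean UMAP:

import umap

mapper = umap.UMAP(metric="euclidean").fit(linear_vectors)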

sgbaird commented 2 years ago

https://github.com/lmcinnes/pynndescent/issues/136#issuecomment-994289807

I think dist_matrix can be useful in this context, but only for 1D feature vectors, i.e. it's not general to nD like the optimal transport libraries; it is, however, very fast (a 10k x 10k distance matrix in ~15 s on GPU). So far I've been using it to supply distance matrices to UMAP in precomputed mode, but I run into the limitation that transform (and inverse_transform) are unavailable. I've been meaning to try passing wasserstein_distance in directly (in reality, probably my_wasserstein_distance, to "absorb" the weights into the original features).
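Roughly what I have in mind is something like the sketch below; my_wasserstein_distance here is hypothetical and assumes uniform weights, in which case the 1D Wasserstein-1 distance between two equal-length samples reduces to the mean absolute difference of their sorted values (UMAP accepts any numba-compilable callable as a metric):

import numba
import numpy as np
import umap

@numba.njit()
def my_wasserstein_distance(x, y):
    # 1D Wasserstein-1 between two uniform-weight samples of equal length:
    # sort both samples and average the absolute differences
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

data = np.random.random((100, 16))  # each row treated as a 1D sample
mapper = umap.UMAP(metric=my_wasserstein_distance).fit(data)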

I'm not sure if a 1D solution is of any use in a word mover's distance context, so the point may be moot other than for specialized applications like my own (chemical composition materials informatics: mat_discover / chem_wasserstein).

lmcinnes commented 2 years ago

Actually, the 1D case just got added to pynndescent (lmcinnes/pynndescent#155), so it should also be available in UMAP with metric="wasserstein-1d" if you get the current version of pynndescent from GitHub (it isn't in a PyPI or conda-forge release yet). It should support both dense and sparse input. In case it is useful, that PR also included a circular Wasserstein metric for when your distribution is on a circle rather than being linear.
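Usage would look something like this (a sketch only, assuming a pynndescent install that includes that PR; the random data is purely illustrative):

import numpy as np
import umap

data = np.random.random((500, 32))  # each row treated as a histogram on a 1D grid
mapper = umap.UMAP(metric="wasserstein-1d").fit(data)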