Closed fbeilstein closed 1 year ago
Hi @fbeilstein,
Thanks for your interest in this library! The mismatch here comes from the fact that Gudhi is not computing the usual Wasserstein distance, but a non-standard variation (documented here) that is tailored for persistence diagrams:
Going forward, I suggest the following:
I remember a discussion with @tlacombe on the topic: he may already have some pointers and/or implementation available somewhere?
Best regards, Jean
Hello,
TL;DR: I completely agree with @jeanfeydy on this topic: this is not a bug and the results are completely normal, what gudhi
calls wasserstein
is not the same as what geomloss
calls Wasserstein (the seminal Wasserstein, tbh).
Details: somewhat unfortunately, the naming wasserstein
that is used in Topological Data Analysis (in particular in gudhi
) is overloading what people usually call Wasserstein in the optimal transport community.
As Jean said, gudhi.wasserstein
should be called (if we want to stick to OT formalism) gudhi.unbalanced_optimal_transport_with_Dirichlet_boundary_condition
, as it actually correspond to the model proposed by Figalli and Gigli in this paper, as detailed here. Due to the clear similarities between these two models, the TDA community sticks to calling the metrics between persistence diagrams Wasserstein (hence the name chosen in gudhi
), but I agree that this can be the source of some confusion.
Unfortunately, I do not have a KeOps/geomloss-friendly implementation of "Wasserstein-for-persistence-diagrams" available. I agree with Jean, implementing this should be doable and could be quite useful (I was considering working on that at some point and then, you know, time flies...). I however warn that adding an entropic term to (properly) regularize PD metrics is not as easy as I expected in the past as it may yield inhomogeneity issues, something I have been discussing there.
I am open to discussion if someone is willing to spend some time implementing gudhi.wasserstein
using KeOps
.
Hi @jeanfeydy!
Thank you so much for the detailed explanation! Now things are starting to make sense as I was already getting desperate trying to find the source of the problem. I'll review your thesis and, if I can wrap my head around it, try to implement some code.
Sincerely, Vitalii
Sorry to hijack the thread a little here but I didn't want to create a new one for such a similar question. I was wondering if you could shed light on the differences between GeomLoss' Wasserstein and Scipy's Wasserstein Distance.
Using the same settings above,
a = np.array([6, 4, 3, 0, 0, 1, 0, 8])
b = np.array([0, 8, 1, 0, 0, 1, 0, 2])
a_dist = a / np.sum(a)
b_dist = b / np.sum(b)
geomloss_wasserstein = SamplesLoss(
loss='sinkhorn',
debias=False,
p=1,
blur=1e-3,
scaling=0.999,
backend='auto'
)(torch.tensor(a_dist.reshape(1, -1)), torch.tensor(b_dist.reshape(1, -1)))
scipy_wasserstein = wasserstein_distance(a_dist, b_dist)
Gives me 0.5937, 0.0758 (respectively for GeomLoss, Scipy).
My use case is such that I want to use some of the losses defined in GeomLoss as a simple distance metric between probability distributions or vector representations for machine learning. Thanks in advance.
)(torch.tensor(a_dist.reshape(1, -1)), torch.tensor(b_dist.reshape(1, -1)))
Did you mean reshape(-1,1)
?
)(torch.tensor(a_dist.reshape(1, -1)), torch.tensor(b_dist.reshape(1, -1)))
Did you mean
reshape(-1,1)
?
I do this simply when I am testing with a single sample as the package expects (N,D) so I reshape it to (1,D).
Sorry to hijack the thread a little here but I didn't want to create a new one for such a similar question.
Next time, you still should create a new issue for a new question.
I do this simply when I am testing with a single sample as the package expects (N,D) so I reshape it to (1,D).
Then the comparison makes no sense, because scipy only supports D=1.
Sorry to hijack the thread a little here but I didn't want to create a new one for such a similar question.
Next time, you still should create a new issue for a new question.
I do this simply when I am testing with a single sample as the package expects (N,D) so I reshape it to (1,D).
Then the comparison makes no sense, because scipy only supports D=1.
Apologies, will do. I only reshape it for Geomloss, using Scipy's wasserstein I compare the 1D distributions as they are, i.e. Geomloss: [1, 8] v. [8] for Scipy.
Hi @pugantsov ,
Thanks to @mglisse : the problem was indeed related to the switch from .reshape(1,-1)
to .reshape(-1,1)
.
The code below works as expected:
import torch
import numpy as np
from geomloss import SamplesLoss
from scipy.stats import wasserstein_distance
a = np.array([6, 4, 3, 0, 0, 1, 0, 8])
b = np.array([0, 8, 1, 0, 0, 1, 0, 2])
a_dist = a / np.sum(a)
b_dist = b / np.sum(b)
geomloss_wasserstein = SamplesLoss(
loss='sinkhorn',
debias=False,
p=1,
blur=1e-4,
scaling=0.999,
backend='auto'
)(torch.tensor(a_dist.reshape(-1, 1)), torch.tensor(b_dist.reshape(-1, 1)))
scipy_wasserstein = wasserstein_distance(a_dist, b_dist)
print(geomloss_wasserstein)
print(scipy_wasserstein)
Which gives 0.0759
and 0.07575
.
To be clear on what caused the confusion, SamplesLoss interprets (N,D)
input arrays as collections of N
points in dimension D
. In this context,
SamplesLoss(...)(torch.tensor(a_dist.reshape(1, -1)), torch.tensor(b_dist.reshape(1, -1)))
is interpreted as a Wasserstein distance between single-point Diract distributions in dimension len(a) = 8
, which is clearly not what was expected by @pugantsov. To retrieve a comparison between 8 points in dimension 1, one should indeed call:
SamplesLoss(...)(torch.tensor(a_dist.reshape(-1, 1)), torch.tensor(b_dist.reshape(-1, 1)))
Best regards, Jean
When transfering from Gudhi (library for Topological Data Analysis) I've spotted significant difference in results
yields
0.8269999999999964
, whileyields
12.854110717773438
.I'm not sure if this is a bug or me doing something wrong, but I suppose it to be worth reporting.