microsoft / otdd

Optimal Transport Dataset Distance
MIT License

Distance between the same dataset > 0? #22

Closed prabhant closed 2 years ago

prabhant commented 2 years ago

Hi,

I don't understand why the distance between a dataset and itself is greater than 0. MWE:

import torch
import numpy as np
from torch.utils.data import TensorDataset
from otdd.pytorch.distance import DatasetDistance

import openml
d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(31)

def dataset_from_numpy(X, Y, classes=None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor), targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else list(range(len(np.unique(Y))))
    return ds

samples = 100
dim     = 6

x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)

# x2 = np.random.randn(samples, dim)
# y2 = np.random.randint(0, 2, size=(samples))

ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
dist = DatasetDistance(ds1,ds2, inner_ot_method = 'exact',
                       debiased_loss = True, entreg = 1e-1,
                       device='cuda')
print(dist.distance())

Output:
tensor(1.8688, device='cuda:0')

dmelis commented 2 years ago

Hi @prabhant. Can you run it again with the argument inner_ot_debiased=True? It should be 0 now. The explanation is that the label-to-label distances (the 'inner problem') also rely on entropy regularization, which introduces a bias term and can therefore lead to d(a, a) > 0. That flag controls whether the debiased version of OT is also used for this inner problem. I should probably set the default to True.
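
To illustrate the bias described above, here is a minimal pure-Python sketch. It is not OTDD's implementation: the function names (entropic_ot_cost, sinkhorn_divergence), the 1-D toy points, and the choice eps=0.1 are all assumptions made for illustration. It computes an entropy-regularized OT value with plain Sinkhorn scaling, then the debiased form S(a, b) = OT(a, b) - 0.5 * OT(a, a) - 0.5 * OT(b, b), which is zero by construction when both arguments are the same.

```python
import math


def squared_distance_cost(xs, ys):
    """Pairwise squared-distance cost matrix between two lists of 1-D points."""
    return [[(x - y) ** 2 for y in ys] for x in xs]


def entropic_ot_cost(a, xs, b, ys, eps=0.1, n_iters=500):
    """Entropy-regularized OT value <P, C> + eps * KL(P || a x b).

    Hypothetical helper for illustration: plain Sinkhorn scaling, suitable
    only for tiny problems with moderate costs relative to eps.
    """
    C = squared_distance_cost(xs, ys)
    K = [[math.exp(-c / eps) for c in row] for row in C]  # Gibbs kernel
    n, m = len(a), len(b)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            p = u[i] * K[i][j] * v[j]  # entry of the regularized plan
            total += p * C[i][j] + eps * p * math.log(p / (a[i] * b[j]))
    return total


def sinkhorn_divergence(a, xs, b, ys, eps=0.1):
    """Debiased value: OT(a, b) - 0.5 * OT(a, a) - 0.5 * OT(b, b)."""
    return (entropic_ot_cost(a, xs, b, ys, eps)
            - 0.5 * entropic_ot_cost(a, xs, a, xs, eps)
            - 0.5 * entropic_ot_cost(b, ys, b, ys, eps))


# Toy "dataset": three equally weighted points, compared with itself.
points = [0.0, 0.5, 1.0]
weights = [1.0 / 3.0] * 3

raw = entropic_ot_cost(weights, points, weights, points)
debiased = sinkhorn_divergence(weights, points, weights, points)
print("regularized OT(a, a):", raw)        # strictly positive: the bias
print("debiased S(a, a):    ", debiased)   # zero by construction
```

The regularized plan is blurred rather than the identity, so the raw value of a distribution against itself is positive; subtracting the two self-terms removes that offset. In OTDD, the reply above indicates that inner_ot_debiased=True applies the same kind of correction to the label-to-label (inner) problem.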

prabhant commented 2 years ago

Yes, I get zero now. Thanks for the explanation.