microsoft / otdd

Optimal Transport Dataset Distance
MIT License

Using OTDD on two different datasets with different sizes? #18

Closed prabhant closed 2 years ago

prabhant commented 2 years ago

Is it possible to use OTDD on two datasets with different numbers of labels, different sizes of X and Y, and different numbers of distinct features? If yes, what are the recommended settings for this problem?

ChenChengKuan commented 2 years ago

From what I know from the paper and my experience of using it, the answer is yes.

OTDD naturally handles datasets of different sizes. The label sets can be mismatched or completely disjoint, as mentioned in point (iii) of the paper's abstract. If the feature dimensions of datasets X and Y differ, you can replace the Euclidean distance in Eq. (5) with the Gromov-Wasserstein distance (see the discussion on page 8).

dmelis commented 2 years ago

@ChenChengKuan's right: the vanilla OTDD already deals with datasets with different numbers of classes and/or different numbers of samples. For datasets of different feature dimensions, there are two options:

  1. use a feature_cost that embeds the two original feature spaces into embedding spaces of the same dimensionality (see example in the README)
  2. use IncomparableDatasetDistance, which uses the Gromov-Wasserstein distance under the hood. This method is in beta mode, so if you encounter any bugs please let me know.

prabhant commented 2 years ago

@dmelis thanks for the help. This is the MWE for the error I'm getting when using IncomparableDatasetDistance with the CUDA backend (it's a PyTorch error). I ran it on Google Colab with a GPU runtime.

import torch
import openml
from otdd.pytorch.distance import IncomparableDatasetDistance
from otdd.pytorch.datasets import load_torchvision_data
import numpy as np
from torch.utils.data import TensorDataset

d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(1464)

def dataset_from_numpy(X, Y, classes = None):
    targets =  torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets =  targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds

x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)

ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
print('datasets created')

dist = IncomparableDatasetDistance(ds1, ds2,
                          debiased_loss = False,
                          device='cuda')

d = dist.distance(maxsamples = 10000)

This is the error I am getting from this implementation:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-9-618f8e73a768>](https://localhost:8080/#) in <module>()
     33                           device='cuda')
     34 
---> 35 d = dist.distance(maxsamples = 10000)

2 frames
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in batch_augmented_cost(Z1, Z2, W, Means, Covs, feature_cost, p, λ_x, λ_y)
   1321     ## NOTE: geomloss's cost_routines as defined above already divide by p. We do
   1322     ## so here too for consistency. But as a consequence, need to divide C2 by p too.
-> 1323     D = λ_x * C1  +  λ_y * (C2/p)
   1324 
   1325     return D

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

When I switch the distance computation to the CPU, I get an error from the OT backend (ot/backend.py) instead:

dist = IncomparableDatasetDistance(ds1, ds2,
                          debiased_loss = False,
                          device='cpu')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-10-c2ec3eaf1938>](https://localhost:8080/#) in <module>()
     33                           device='cpu')
     34 
---> 35 d = dist.distance(maxsamples = 10000)

2 frames
[/usr/local/lib/python3.7/dist-packages/ot/backend.py](https://localhost:8080/#) in get_backend(*args)
    159     # check all same type
    160     if not len(set(type(a) for a in args)) == 1:
--> 161         raise ValueError(str_type_error.format([type(a) for a in args]))
    162 
    163     if isinstance(args[0], np.ndarray):

ValueError: All array should be from the same type/backend. Current types are : [<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>]

prabhant commented 2 years ago

I also tried to compute the distance with the FeatureCost class, but I am running into many shape errors with 2D tabular datasets. Do you have any suggestions for the src_dim and tgt_dim parameters of FeatureCost for datasets with dimensions (1000, 20) and (748, 4)?

dmelis commented 2 years ago

Just pushed a quick patch that should fix the bug with IncomparableDatasetDistance. Can you try it again and let me know if it works for you?

As for the feature_cost approach, what kind of embedder are you using? Can you provide an MWE?

prabhant commented 2 years ago

Hi, thanks for the fix. The code now runs, but I am getting a warning:

/usr/local/lib/python3.7/dist-packages/ot/bregman.py:517: UserWarning: Sinkhorn did not converge. You might want to increase the number of iterations `numItermax` or the regularization parameter `reg`.
  warnings.warn("Sinkhorn did not converge. You might want to "
It.  |Err         
-------------------
    0|1.027331e-03|
   10|7.006171e-10|

Can you also give a short interpretation of the result? It looks like it's the error at different iterations, am I right? If so, what is the OTDD here?

prabhant commented 2 years ago

Regarding the feature cost approach, here is a small MWE. I'm using the code from the example and only changing the datasets and the embedding dimensions. (I am not very familiar with embeddings and encoders, so I think the error comes from wrong parameters supplied by me; your help here would be much appreciated too.)

import torch
import openml
from otdd.pytorch.distance import DatasetDistance, FeatureCost
from torchvision.models import resnet18
from otdd.pytorch.datasets import load_torchvision_data
import numpy as np
from torch.utils.data import TensorDataset

d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(1464)

def dataset_from_numpy(X, Y, classes = None):
    targets =  torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets =  targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds

x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)

ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
print('datasets created')

# Embed using a pretrained (+frozen) resnet
embedder = resnet18(pretrained=True).eval()
embedder.fc = torch.nn.Identity()
for p in embedder.parameters():
    p.requires_grad = False

# Here we use same embedder for both datasets
feature_cost = FeatureCost(src_embedding = embedder,
                           src_dim = (1000,21),
                           tgt_embedding = embedder,
                           tgt_dim = (748,5),
                           p = 2,
                           device='cuda')

dist = DatasetDistance(ds1, ds2,
                          inner_ot_method = 'exact',
                          debiased_loss = True,
                          feature_cost = feature_cost,
                          sqrt_method = 'spectral',
                          sqrt_niters=10,
                          precision='single',
                          p = 2, entreg = 1e-1,
                          device='cuda')

d = dist.distance(maxsamples = 10000)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in __call__(self, X1, X2)
   1213             try:
-> 1214                 X1 = self.src_emb(X1.view(-1,*self.src_dim).to(self.device)).reshape(B1, N1, -1)
   1215             except: # Memory error?

RuntimeError: shape '[-1, 1000, 21]' is invalid for input of size 14000

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
8 frames
RuntimeError: shape '[-1, 1000, 21]' is invalid for input of size 14000

During handling of the above exception, another exception occurred:

SystemExit                                Traceback (most recent call last)
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/wasserstein.py](https://localhost:8080/#) in pwdist_exact(X1, Y1, X2, Y2, symmetric, loss, cost_function, p, debias, entreg, device)
    340                   " 1. Too many samples with this label, causing memory issues" \
    341                   " 2. Datatype errors, e.g., if the two datasets have different type")
--> 342             sys.exit('Distance computation failed. Aborting.')
    343         if symmetric:
    344             D[j, i] = D[i, j]

SystemExit: Distance computation failed. Aborting.

What do you suggest the right dimensions for the datasets in FeatureCost should be?

ChenChengKuan commented 2 years ago

@prabhant I think src_dim and tgt_dim refer to the feature dimensions of a single data point. In the README example, the dimension of one CIFAR10 image is 3 x 28 x 28. The number of data points in CIFAR10 should not be included here.

A tabular dataset of shape 1000 x 21 has 1000 data points, each of dimension 21, so I think your src_dim should be 21. The same logic applies to your target data.
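
The shape error in the traceback can be reproduced with plain array arithmetic. The sketch below uses NumPy's reshape as a stand-in for the X1.view(-1, *self.src_dim) call shown in the traceback; it is an illustration, not otdd code, and try_reshape is a hypothetical helper. Note that 14000 = 700 x 20, which is consistent with a batch of 700 rows of 20 features each. That reading is an assumption (it presumes the target column is not part of X), but if it holds, the per-point dimension to try would be 20 rather than 21.

```python
import numpy as np

# Stand-in for the flattened batch from the traceback: 14000 values in total.
flat = np.zeros(14000, dtype=np.float32)

def try_reshape(x, dim):
    """Mimic x.view(-1, *dim): return the resulting shape, or None if invalid."""
    try:
        return x.reshape(-1, *dim).shape
    except ValueError:
        return None

print(try_reshape(flat, (1000, 21)))  # whole-dataset shape passed as src_dim
print(try_reshape(flat, (21,)))       # per-point dimension 21
print(try_reshape(flat, (20,)))       # per-point dimension 20
```

Only the last call succeeds, because the leading -1 can be inferred only when the total number of elements is divisible by the product of the remaining dimensions.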

dmelis commented 2 years ago

> What do you suggest the right dimensions for the datasets in FeatureCost should be?

@prabhant The torchvision models, such as the resnet18 used in the README example, are intended to be used on images. For tabular data, you'll need to do something different. You might want to look into dimensionality reduction techniques.
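
One possible reading of the dimensionality reduction suggestion is sketched below: project each tabular dataset onto its top-k principal components so that both end up with the same number of columns. This is a minimal NumPy illustration under stated assumptions, not a recommendation from the otdd authors. The helper pca_project, the random placeholder data, and the choice k = 3 are all hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto its top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)   # center each feature
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X1 = rng.normal(size=(1000, 20))   # placeholder for the first dataset's features
X2 = rng.normal(size=(748, 4))     # placeholder for the second dataset's features

k = 3                              # hypothetical choice; must be <= min(20, 4)
Z1 = pca_project(X1, k)
Z2 = pca_project(X2, k)
print(Z1.shape, Z2.shape)
```

The projected arrays could then be wrapped in datasets the same way as in the MWE above. One caveat: the principal axes are fitted separately for each dataset, so the resulting coordinates are not aligned with each other, and whether a Euclidean cost between them is meaningful depends on the data.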

> Can you also give a small interpretation of the result? looks like its the error on different iterations, am I right? if yes then what is the OTDD here?

The standard output you see there is produced by the Gromov-Wasserstein solver of POT. The error you get is pretty low, so the early termination is probably not a big issue here. You can also try different entropy regularization values; the method will likely converge with sufficiently large regularization.
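
To illustrate why a larger regularization value tends to help convergence, here is a plain NumPy Sinkhorn loop on a toy problem. It is a sketch for intuition only: it is not POT's solver or the Gromov-Wasserstein routine used by otdd, and the function name, the toy data, and the two regularization values are assumptions made for this example.

```python
import numpy as np

def sinkhorn_iterations(C, a, b, reg, max_iter=2000, tol=1e-9):
    """Run Sinkhorn scaling; return (iterations used, final marginal error)."""
    K = np.exp(-C / reg)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    err = np.inf
    for it in range(1, max_iter + 1):
        v = b / (K.T @ u)              # rescale to match column marginals
        u = a / (K @ v)                # rescale to match row marginals
        # Rows now match exactly; measure the remaining column violation.
        err = np.abs(v * (K.T @ u) - b).sum()
        if err < tol:
            return it, err
    return max_iter, err

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = rng.normal(size=(60, 2)) + 1.0
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
C /= C.max()                           # normalize costs to [0, 1]
a = np.full(50, 1 / 50)
b = np.full(60, 1 / 60)

results = {reg: sinkhorn_iterations(C, a, b, reg) for reg in (0.02, 0.5)}
for reg, (iters, err) in results.items():
    print(f"reg={reg}: iterations={iters}, marginal error={err:.2e}")
```

The larger value reaches the tolerance in fewer iterations. The trade-off is that stronger regularization gives a more diffuse transport plan, so the resulting distance is a coarser approximation of the unregularized one.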

prabhant commented 2 years ago

> The standard output you see there is produced by the Gromov-Wasserstein solver of POT. The error you get is pretty low, so the early termination is probably not a big issue here. You can also try different entropy regularization values; the method will likely converge with sufficiently large regularization.

A larger regularisation did solve the convergence issue. I get one more UserWarning during the run; maybe you can check whether it's relevant:

/usr/local/lib/python3.7/dist-packages/otdd/pytorch/sqrtm.py:54: UserWarning: torch.symeig is deprecated in favor of torch.linalg.eigh and will be removed in a future PyTorch release.
The default behavior has changed from using the upper triangular portion of the matrix by default to using the lower triangular portion.
L, _ = torch.symeig(A, upper=upper)
should be replaced with
L = torch.linalg.eigvalsh(A, UPLO='U' if upper else 'L')
and
L, V = torch.symeig(A, eigenvectors=True)
should be replaced with
L, V = torch.linalg.eigh(A, UPLO='U' if upper else 'L') (Triggered internally at  ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2499.)
  s, v = A.symeig(eigenvectors=True) # This is faster in GPU than CPU, fails gradcheck. See https://github.com/pytorch/pytorch/issues/30578
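
The warning concerns the symmetric eigendecomposition call in otdd/pytorch/sqrtm.py. As an illustration of the idea only, and assuming that call is part of a spectral matrix square root (which the file name and the sqrt_method='spectral' option suggest), here is a NumPy sketch. np.linalg.eigh plays the role that torch.linalg.eigh plays in the replacement suggested by the warning; the helper sqrtm_psd is hypothetical and not part of otdd.

```python
import numpy as np

def sqrtm_psd(A):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)           # eigenvalues ascending, orthonormal eigenvectors
    w = np.clip(w, 0.0, None)          # clamp tiny negative eigenvalues from round-off
    return (V * np.sqrt(w)) @ V.T      # V @ diag(sqrt(w)) @ V.T

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T                            # symmetric PSD test matrix
S = sqrtm_psd(A)
print(np.allclose(S @ S, A))
```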

I am still working on creating an embedder for the feature_cost approach.