Closed prabhant closed 2 years ago
From what I know from the paper and my experience of using it, the answer is yes.
OTDD naturally handles datasets of different sizes. The label sets can be mismatched or completely disjoint, as mentioned in the paper abstract (iii). If the feature dimension differs between datasets X and Y, you can replace the Euclidean distance in Eq. (5) with the Gromov-Wasserstein distance (see the discussion on page 8).
@ChenChengKuan's right: vanilla OTDD already handles datasets with different numbers of classes and/or different numbers of samples. For datasets with different feature dimensions, there are two options:
- Pass a feature_cost that embeds the two original feature spaces into embedding spaces of the same dimensionality (see the example in the README).
- Use IncomparableDatasetDistance, which uses the Gromov-Wasserstein distance under the hood. This method is in beta mode, so if you encounter any bugs please let me know.
@dmelis thanks for the help. This is the MWE for the error I'm getting when using IncomparableDatasetDistance with the CUDA backend (it's a PyTorch error). I ran it on Google Colab with a GPU backend.
import torch
import openml
from otdd.pytorch.distance import IncomparableDatasetDistance
from otdd.pytorch.datasets import load_torchvision_data
import numpy as np
from torch.utils.data import TensorDataset
d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(1464)
def dataset_from_numpy(X, Y, classes = None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds
x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)
ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
print('datasets created')
dist = IncomparableDatasetDistance(ds1, ds2,
                                   debiased_loss = False,
                                   device='cuda')
d = dist.distance(maxsamples = 10000)
This is the error I am getting from this implementation
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
[<ipython-input-9-618f8e73a768>](https://localhost:8080/#) in <module>()
33 device='cuda')
34
---> 35 d = dist.distance(maxsamples = 10000)
2 frames
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in batch_augmented_cost(Z1, Z2, W, Means, Covs, feature_cost, p, λ_x, λ_y)
1321 ## NOTE: geomloss's cost_routines as defined above already divide by p. We do
1322 ## so here too for consistency. But as a consequence, need to divide C2 by p too.
-> 1323 D = λ_x * C1 + λ_y * (C2/p)
1324
1325 return D
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
When I switch the distance computation to CPU, I get an error from POT instead:
dist = IncomparableDatasetDistance(ds1, ds2,
                                   debiased_loss = False,
                                   device='cpu')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
[<ipython-input-10-c2ec3eaf1938>](https://localhost:8080/#) in <module>()
33 device='cpu')
34
---> 35 d = dist.distance(maxsamples = 10000)
2 frames
[/usr/local/lib/python3.7/dist-packages/ot/backend.py](https://localhost:8080/#) in get_backend(*args)
159 # check all same type
160 if not len(set(type(a) for a in args)) == 1:
--> 161 raise ValueError(str_type_error.format([type(a) for a in args]))
162
163 if isinstance(args[0], np.ndarray):
ValueError: All array should be from the same type/backend. Current types are : [<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'numpy.ndarray'>, <class 'numpy.ndarray'>]
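Both tracebacks above point at mixed inputs: the first mixes cuda:0 and cpu tensors, the second mixes torch tensors with NumPy arrays. A minimal sketch of the kind of normalization that avoids both situations follows. The helper name `to_common` is hypothetical and is not part of otdd or POT; this only illustrates the idea and is not how the library handles it internally.

```python
import numpy as np
import torch


def to_common(*arrays, device="cpu", dtype=torch.float32):
    """Hypothetical helper: convert NumPy arrays or tensors to tensors
    that all share one device and one dtype."""
    return [torch.as_tensor(a).to(device=device, dtype=dtype) for a in arrays]


# Fall back to CPU when no GPU is available, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

C1 = torch.rand(5, 5)      # a tensor that was left on the CPU
C2 = np.random.rand(5, 5)  # a NumPy array mixed in with tensors

C1, C2 = to_common(C1, C2, device=device)

# Arithmetic now succeeds because both operands live on the same device
# and are the same type.
D = 0.5 * C1 + 0.5 * C2
```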
I also tried to compute the distance with the FeatureCost class, but I am running into many shape errors with 2D tabular datasets. Do you have any suggestions for the src_dim and tgt_dim parameters of FeatureCost for datasets with shapes (1000, 20) and (748, 4)?
Just pushed a quick patch that should fix the bug with IncomparableDatasetDistance. Can you try it again and let me know if it works for you?
As for the feature_cost approach, what kind of embedder are you using? Can you provide an MWE?
Hi, thanks for the fix. The code works now, but I am getting a warning:
/usr/local/lib/python3.7/dist-packages/ot/bregman.py:517: UserWarning: Sinkhorn did not converge. You might want to increase the number of iterations `numItermax` or the regularization parameter `reg`.
warnings.warn("Sinkhorn did not converge. You might want to "
It. |Err
-------------------
0|1.027331e-03|
10|7.006171e-10|
Can you also give a short interpretation of the result? It looks like it's the error at different iterations, am I right? If so, what is the OTDD value here?
Regarding the feature cost approach, here is a small MWE. I'm using the code from the example and only changing the datasets and the src embedding dimensions. I am not very familiar with embeddings and encoders, so I think the error comes from wrong parameters that I supplied. Your help here would be much appreciated too.
import torch
import openml
from otdd.pytorch.distance import DatasetDistance, FeatureCost
from torchvision.models import resnet18
from otdd.pytorch.datasets import load_torchvision_data
import numpy as np
from torch.utils.data import TensorDataset
d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(1464)
def dataset_from_numpy(X, Y, classes = None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds
x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)
ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
print('datasets created')
# Embed using a pretrained (+frozen) resnet
embedder = resnet18(pretrained=True).eval()
embedder.fc = torch.nn.Identity()
for p in embedder.parameters():
    p.requires_grad = False
# Here we use same embedder for both datasets
feature_cost = FeatureCost(src_embedding = embedder,
                           src_dim = (1000,21),
                           tgt_embedding = embedder,
                           tgt_dim = (748,5),
                           p = 2,
                           device='cuda')
dist = DatasetDistance(ds1, ds2,
                       inner_ot_method = 'exact',
                       debiased_loss = True,
                       feature_cost = feature_cost,
                       sqrt_method = 'spectral',
                       sqrt_niters=10,
                       precision='single',
                       p = 2, entreg = 1e-1,
                       device='cuda')
d = dist.distance(maxsamples = 10000)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in __call__(self, X1, X2)
1213 try:
-> 1214 X1 = self.src_emb(X1.view(-1,*self.src_dim).to(self.device)).reshape(B1, N1, -1)
1215 except: # Memory error?
RuntimeError: shape '[-1, 1000, 21]' is invalid for input of size 14000
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
8 frames
RuntimeError: shape '[-1, 1000, 21]' is invalid for input of size 14000
During handling of the above exception, another exception occurred:
SystemExit Traceback (most recent call last)
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/wasserstein.py](https://localhost:8080/#) in pwdist_exact(X1, Y1, X2, Y2, symmetric, loss, cost_function, p, debias, entreg, device)
340 " 1. Too many samples with this label, causing memory issues" \
341 " 2. Datatype errors, e.g., if the two datasets have different type")
--> 342 sys.exit('Distance computation failed. Aborting.')
343 if symmetric:
344 D[j, i] = D[i, j]
SystemExit: Distance computation failed. Aborting.
What do you suggest the right dimensions for the datasets in FeatureCost should be?
@prabhant I think src_dim and tgt_dim are the feature dimensions of a single data point. In the README example, the dimension of a CIFAR10 image is 3 x 28 x 28; the number of samples in CIFAR10 should not be included here.
Tabular data of shape 1000 x 21 has 1000 data points, each of dimension 21, so I think your src_dim should be 21. The same logic applies to your target data.
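The reshape in the traceback, `X1.view(-1, *self.src_dim)`, can be reproduced with plain PyTorch to see why the sample count must be left out. This is a sketch under assumptions: the batch shape (1, 700, 20) is a guess chosen only because it has the 14000 elements named in the error message. Note that 14000 is divisible by 20 but not by 21, which suggests 20 features per row, matching the (1000, 20) shape reported earlier in the thread. Because `src_dim` is unpacked with `*`, it is written here as a tuple.

```python
import torch

# Assumed batch: 700 rows with 20 features each (14000 values in total).
n_rows, n_features = 700, 20
batch = torch.zeros(1, n_rows, n_features)

# Per-sample feature shape only; the number of samples is absorbed by -1.
src_dim = (n_features,)
reshaped = batch.view(-1, *src_dim)
print(reshaped.shape)  # torch.Size([700, 20])

# Including the dataset size reproduces the failure from the traceback.
bad_dim = (1000, 21)
try:
    batch.view(-1, *bad_dim)
    failed = False
except RuntimeError as err:
    failed = True
    print(err)
```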
> What do you suggest the right dimensions for the datasets in FeatureCost should be?
@prabhant The torchvision models, such as the resnet18 used in the README example, are intended for images. For tabular data, you'll need to do something different. You might want to look into dimensionality reduction techniques.
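As one illustration of a dimensionality reduction technique, here is a minimal PCA sketch in NumPy that projects a 20-feature table down to 4 features so that both tables have the same width. Everything here is an assumption for illustration: random data stands in for the OpenML datasets, and `pca_project` is a hypothetical helper, not part of otdd. Whether PCA coordinates of one dataset are meaningfully comparable to the raw features of another is a separate question that this sketch does not answer.

```python
import numpy as np


def pca_project(X, k):
    """Hypothetical helper: project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)             # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are principal directions
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
X_src = rng.normal(size=(1000, 20))  # stand-in for the 20-feature dataset
X_tgt = rng.normal(size=(748, 4))    # stand-in for the 4-feature dataset

# Reduce the wider dataset to the width of the narrower one.
X_src_4d = pca_project(X_src, k=X_tgt.shape[1])
print(X_src_4d.shape, X_tgt.shape)
```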
> Can you also give a small interpretation of the result? looks like its the error on different iterations, am I right? if yes then what is the OTDD here?
The std output you see there is produced by the Gromov Wasserstein solver of POT. The error you get is pretty low, so the early termination is probably not a big issue here. You can also try different entropy regularization values. The method will likely converge with sufficiently large regularization.
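The effect of the regularization strength on convergence can be seen in a small self-contained Sinkhorn loop. This is a plain NumPy sketch on a random cost matrix with uniform marginals, not POT's solver and not the Gromov-Wasserstein routine used by IncomparableDatasetDistance; the function name, tolerance, and the two `reg` values are arbitrary choices for illustration.

```python
import numpy as np


def sinkhorn_iterations(C, reg, tol=1e-9, max_iter=1000):
    """Run Sinkhorn with uniform marginals; return (iterations used, final error)."""
    n, m = C.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-C / reg)  # Gibbs kernel; smaller reg makes it more peaked
    u = np.ones(n)
    v = np.ones(m)
    err = np.inf
    for it in range(1, max_iter + 1):
        u = a / (K @ v)
        v = b / (K.T @ u)
        # Row-marginal violation of the current transport plan.
        err = np.abs(u * (K @ v) - a).sum()
        if err < tol:
            return it, err
    return max_iter, err


rng = np.random.default_rng(0)
C = rng.random((50, 50))  # random costs in [0, 1)

iters_small, err_small = sinkhorn_iterations(C, reg=0.01)
iters_large, err_large = sinkhorn_iterations(C, reg=1.0)
print("reg=0.01:", iters_small, "iterations, error", err_small)
print("reg=1.0: ", iters_large, "iterations, error", err_large)
```

The trade-off is that a larger `reg` converges faster but yields a smoother, more blurred transport plan, so the resulting value is a looser approximation of the unregularized distance.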
> The std output you see there is produced by the Gromov Wasserstein solver of POT. The error you get is pretty low, so the early termination is probably not a big issue here. You can also try different entropy regularization values. The method will likely converge with sufficiently large regularization.
A larger regularization value did solve the convergence problem. I get one more UserWarning during the run; maybe you can check whether it is relevant:
/usr/local/lib/python3.7/dist-packages/otdd/pytorch/sqrtm.py:54: UserWarning: torch.symeig is deprecated in favor of torch.linalg.eigh and will be removed in a future PyTorch release.
The default behavior has changed from using the upper triangular portion of the matrix by default to using the lower triangular portion.
L, _ = torch.symeig(A, upper=upper)
should be replaced with
L = torch.linalg.eigvalsh(A, UPLO='U' if upper else 'L')
and
L, V = torch.symeig(A, eigenvectors=True)
should be replaced with
L, V = torch.linalg.eigh(A, UPLO='U' if upper else 'L') (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:2499.)
s, v = A.symeig(eigenvectors=True) # This is faster in GPU than CPU, fails gradcheck. See https://github.com/pytorch/pytorch/issues/30578
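The replacement that the warning recommends can be sketched as follows: a matrix square root of a symmetric positive semi-definite matrix computed with `torch.linalg.eigh`. The function name `sqrtm_psd` is hypothetical and this is not the code in otdd's sqrtm.py; it only shows the newer call in place of `symeig`.

```python
import torch


def sqrtm_psd(A):
    """Hypothetical helper: square root of a symmetric PSD matrix via eigendecomposition."""
    # eigh returns eigenvalues in ascending order and orthonormal eigenvectors.
    L, V = torch.linalg.eigh(A, UPLO="L")
    L = L.clamp(min=0)  # guard against tiny negative eigenvalues from round-off
    # V @ diag(sqrt(L)) @ V^T, using broadcasting instead of building diag().
    return (V * L.sqrt()) @ V.transpose(-1, -2)


torch.manual_seed(0)
B = torch.randn(4, 4, dtype=torch.float64)
A = B @ B.T  # symmetric PSD by construction

S = sqrtm_psd(A)
print(torch.allclose(S @ S, A, atol=1e-8))
```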
I am still working on creating an embedder for the feature_cost approach.
Is it possible to use OTDD on two datasets with different numbers of labels, different sizes of X and Y, and different numbers of distinct features? If so, what are the recommended settings for this problem?