microsoft / otdd

Optimal Transport Dataset Distance
MIT License
156 stars, 48 forks

Calculating dataset distances of CSV format datasets #5

Closed SHIELD-SKY closed 3 years ago

SHIELD-SKY commented 3 years ago

I see a function called `dataset_from_numpy`.

I want to read some data from CSV files and then calculate dataset distances:

```python
import torch
import numpy as np
from torch.utils.data import TensorDataset
from otdd.pytorch.distance import DatasetDistance

def dataset_from_numpy(X, Y, classes = None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds

x1 = np.array([[1,2,3,5,6,9],[4,5,6,2,4,5]])
y1 = np.array([0,1])

x2 = np.array([[2,2,3,5,3,7],[4,5,6,9,1,4]])
y2 = np.array([1,2])

ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
dist = DatasetDistance(ds1,ds2)
dist.distance()
```

and I run into this error:

```
Traceback (most recent call last):
  File "", line 1, in
  File "/home/xxx/otdd/otdd/pytorch/distance.py", line 595, in distance
    _ = self._get_label_distances()
  File "/home/xxx/otdd/otdd/pytorch/distance.py", line 439, in _get_label_distances
    Means, Covs = self._get_label_stats()
  File "/home/xxx/otdd/otdd/pytorch/distance.py", line 385, in _get_label_stats
    **shared_args)
  File "/home/xxx/otdd/otdd/pytorch/moments.py", line 321, in compute_label_stats
    M = torch.stack([μ.to(device) for i,μ in sorted(M.items()) if μ is not None], dim=0)
RuntimeError: stack expects a non-empty TensorList
```

Could you please help me solve it?

dmelis commented 3 years ago

Sorry for the delay in responding. The problem is that you only have one sample per class. In order to compute per-label means and covariances, the data needs to have at least two samples per class (or more, if `min_labelcount` > 2).
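A quick way to spot this before calling `DatasetDistance` is to count the samples per class in the label array. The sketch below is only an illustration: `classes_below_min` is a hypothetical helper, not part of otdd, and the default threshold of 2 is an assumption based on the explanation above.

```python
from collections import Counter

def classes_below_min(labels, min_count=2):
    """Return {label: count} for every class with fewer than min_count samples.

    Hypothetical helper for illustration; min_count=2 is an assumed threshold.
    """
    counts = Counter(int(y) for y in labels)
    return {label: n for label, n in sorted(counts.items()) if n < min_count}

# Labels from the question: one sample per class, so both classes are flagged.
print(classes_below_min([0, 1]))        # {0: 1, 1: 1}

# Two samples per class pass the check.
print(classes_below_min([0, 0, 1, 1]))  # {}
```

An empty result means every class meets the assumed threshold. The function accepts any iterable of integer labels, including a NumPy array such as `y1`.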

This works for me:

import torch
import numpy as np
from torch.utils.data import TensorDataset
from otdd.pytorch.distance import DatasetDistance

def dataset_from_numpy(X, Y, classes = None):
    targets =  torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets =  targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds

samples = 100
dim     = 6

x1 = np.random.randn(samples, dim)
y1 = np.random.randint(0, 2, size=(samples))

x2 = np.random.randn(samples, dim)
y2 = np.random.randint(0, 2, size=(samples))

ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
dist = DatasetDistance(ds1,ds2)
dist.distance()
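
For the CSV case in the original question, the features and labels first have to be read from the file. The sketch below shows one way to do that with the standard library. It assumes a layout that the thread does not specify: numeric feature columns, an integer class label in the last column, and a header row. `read_features_and_labels` is a hypothetical helper, not part of otdd.

```python
import csv
import io

def read_features_and_labels(handle, label_column=-1, has_header=True):
    """Parse CSV rows into (features, labels) lists.

    Assumes every column except label_column is numeric and that labels
    are integer class ids. Hypothetical helper for illustration.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row
    features, labels = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        values = list(row)
        label = values.pop(label_column)
        features.append([float(v) for v in values])
        labels.append(int(label))
    return features, labels

# In-memory CSV standing in for a real file opened with open(path, newline="").
example = io.StringIO("f1,f2,label\n1.0,2.0,0\n3.0,4.0,0\n5.0,6.0,1\n7.0,8.0,1\n")
X, Y = read_features_and_labels(example)

# The lists can then be converted for the helper used in this thread, e.g.:
#   ds = dataset_from_numpy(np.asarray(X), np.asarray(Y))
```

Note that `dataset_from_numpy` builds its default `classes` list as `range(len(np.unique(Y)))`, so labels read from a CSV only line up with it if they are coded 0 to k-1. Otherwise, pass `classes` explicitly or remap the labels first.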