Open xlcbingo1999 opened 1 year ago
error log:
"This is awkward. Distance computation failed. Geomloss is hard to debug" \
"But here's a few things that might be happening: "\
" 1. Too many samples with this label, causing memory issues" \
" 2. Datatype errors, e.g., if the two datasets have different type"
Distance computation failed. Aborting.
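The message above replaces the underlying exception, so the real cause is never shown. A minimal sketch of keeping the original traceback while debugging follows. The helper name `run_and_report` is hypothetical and is not part of OTDD or GeomLoss.

```python
import traceback


def run_and_report(fn, *args):
    """Call fn(*args) and return (result, None), or (None, traceback text) on failure."""
    try:
        return fn(*args), None
    except Exception:
        # format_exc() returns the full traceback of the exception being handled
        return None, traceback.format_exc()


# Usage: a failing call reports its real cause instead of a generic message
result, tb = run_and_report(lambda a, b: a / b, 1, 0)
if tb is not None:
    print(tb)
```

Printing the traceback before calling `sys.exit` would show whether the failure comes from memory, dtypes, or the cost argument.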
When the product of the sample counts of the two matrices used to compute the cost exceeds 5000**2, GeomLoss switches to backend = 'online'. Your code does not handle the cost format that the online backend needs, which causes the bug. I worked around it with the code below:
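if cost_function == 'euclidean':
    # small_*: callable that builds a dense cost matrix (small pairs)
    # big_*: KeOps formula string (pairs large enough for the online backend)
    if p == 1:
        small_cost_function = lambda x, y: geomloss.utils.distances(x, y)
        big_cost_function = "Norm2(X-Y)"
    elif p == 2:
        small_cost_function = lambda x, y: geomloss.utils.squared_distances(x, y)
        # NOTE: this formula halves the squared distance; small_cost_function does not
        big_cost_function = "(SqDist(X,Y) / IntCst(2))"
    else:
        raise ValueError('p must be 1 or 2')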
if loss == 'sinkhorn':
    small_distance = geomloss.SamplesLoss(
        loss=loss, p=p,
        cost=small_cost_function,
        debias=debias,
        blur=entreg**(1 / p),
    )
    big_distance = geomloss.SamplesLoss(
        loss=loss, p=p,
        cost=big_cost_function,
        debias=debias,
        blur=entreg**(1 / p),
    )
elif loss == 'wasserstein':
    def small_distance(Xa, Xb):
        C = small_cost_function(Xa, Xb).cpu()
        return torch.tensor(ot.emd2(ot.unif(Xa.shape[0]), ot.unif(Xb.shape[0]), C))
    # big_cost_function is a formula string and cannot be called, so the
    # POT path builds the dense cost matrix whatever the pair size.
    big_distance = small_distance
else:
    raise ValueError('Wrong loss')
logger.info('Computing label-to-label (exact) wasserstein distances...')
pbar = tqdm(pairs, leave=False)
pbar.set_description('Computing label-to-label distances')
D = torch.zeros((n1, n2), device=device, dtype=X1.dtype)
for i, j in pbar:
    try:
        temp_left = X1[Y1 == c1[i]].to(device)
        temp_right = X2[Y2 == c2[j]].to(device)
        # Use the formula-based cost only when the pair size exceeds 5000**2
        if temp_left.shape[0] * temp_right.shape[0] > 5000 ** 2:
            D[i, j] = big_distance(temp_left, temp_right).item()
        else:
            D[i, j] = small_distance(temp_left, temp_right).item()
    except Exception:
        print("This is awkward. Distance computation failed. Geomloss is hard to debug" \
              "But here's a few things that might be happening: "\
              " 1. Too many samples with this label, causing memory issues" \
              " 2. Datatype errors, e.g., if the two datasets have different type")
        sys.exit('Distance computation failed. Aborting.')
    if symmetric:
        D[j, i] = D[i, j]
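To see in advance which label pairs cross the size limit described above, the per-label sample counts can be checked before any distance is computed. This is a minimal sketch. The helper name `pairs_over_threshold` is hypothetical, and the default threshold is the 5000**2 value quoted in this issue.

```python
from collections import Counter
from itertools import product

ONLINE_THRESHOLD = 5000 ** 2  # size limit quoted in this issue


def pairs_over_threshold(labels_a, labels_b, threshold=ONLINE_THRESHOLD):
    """Return the (label_a, label_b) pairs whose sample-count product exceeds threshold."""
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    return [
        (la, lb)
        for la, lb in product(sorted(counts_a), sorted(counts_b))
        if counts_a[la] * counts_b[lb] > threshold
    ]


# Usage with a tiny threshold so the effect is visible on toy data
labels_a = [0, 0, 0, 1]
labels_b = [0, 0, 1, 1, 1, 1]
print(pairs_over_threshold(labels_a, labels_b, threshold=5))
```

With the real label tensors, passing `Y1.tolist()` and `Y2.tolist()` should work, assuming they are 1-D integer tensors.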
When I compare the distance between two subsets randomly sampled from the EMNIST dataset, using the 'exact' method and following the format of example.py, I always end up in the except branch of the function pwdist_exact. It is worth noting that the label distribution of the subset is not the same as that of the entire EMNIST dataset: in the subset, some labels have 0 instances.
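Since some labels have no samples in the subset, one thing worth checking is which classes are actually present before label-to-label distances are computed. This is a minimal sketch. The helper name `present_classes` is hypothetical, and whether empty classes are what triggers the failure is an assumption, not something confirmed here.

```python
from collections import Counter


def present_classes(labels, classes):
    """Split classes into those with at least one sample in labels and those with none."""
    counts = Counter(labels)  # a Counter returns 0 for labels it never saw
    present = [c for c in classes if counts[c] > 0]
    missing = [c for c in classes if counts[c] == 0]
    return present, missing


# Usage: a subset that only contains labels 0, 3 and 7 out of 10 classes
present, missing = present_classes([0, 0, 3, 7], range(10))
print("present:", present)
print("missing:", missing)
```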
The function pwdist_exact seems to return the correct result when I take evenly spaced samples. Here is the code. You can download my sub_train_datasets_config.json and test_dataset_config.json from Google Drive. Link: https://drive.google.com/drive/folders/1r_vyLJ-RmuuNZqneBP3meexrEZvgc_Ce?usp=sharing