lithiumice opened 1 year ago
Hi, have you solved this problem? I got the same error after the first several iterations of training.
```
ValueError: Expected parameter loc (Tensor of shape (128, 32)) of distribution
Normal(loc: torch.Size([128, 32]), scale: torch.Size([128, 32])) to satisfy the
constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward0>)
Epoch 0:   0%| | 106/145399 [00:05<2:08:50, 18.79it/s, v_num=6, train_loss=0.777]
```
I've identified and fixed a bug in the `geodesic_loss_R` class, part of the loss function used in VPoser. The issue is in the cosine value that the geodesic loss for rotation matrices passes to `torch.acos`: numerical round-off can push it slightly outside `[-1, 1]`, producing NaNs. The modified code in src/human_body_prior/tools/angle_continuous_repres.py is shown below:
```python
class geodesic_loss_R(nn.Module):
    def __init__(self, reduction='batchmean'):
        super(geodesic_loss_R, self).__init__()
        self.reduction = reduction
        self.eps = 1e-6

    # batch geodesic loss for rotation matrices
    def bgdR(self, m1, m2):
        m = torch.bmm(m1, m2.transpose(1, 2))  # batch*3*3
        cos = (m[:, 0, 0] + m[:, 1, 1] + m[:, 2, 2] - 1) / 2
        # the fix: clamp strictly inside [-1, 1] so acos never sees an
        # out-of-range value (which would return NaN) and its gradient
        # stays finite at the boundaries
        cos = torch.clamp(cos, -1 + self.eps, 1 - self.eps)
        return torch.acos(cos)

    def forward(self, ypred, ytrue):
        theta = self.bgdR(ypred, ytrue)
        if self.reduction == 'mean':
            return torch.mean(theta)
        if self.reduction == 'batchmean':
            return torch.mean(torch.sum(theta, dim=theta.shape[1:]))
        else:
            return theta
```
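To see why the clamp matters, here is a minimal sketch of the failure mode (NumPy is used only for illustration; the values are made up): floating-point round-off in the trace computation can push the cosine just past 1, where arccos returns NaN, and even at exactly ±1 the derivative of arccos blows up, so gradients become non-finite.

```python
import numpy as np

# Round-off can push the computed cosine just past 1.0.
cos_bad = 1.0 + 1e-7
print(np.arccos(cos_bad))  # nan: arccos is undefined outside [-1, 1]

# Clamping strictly inside [-1, 1], as in the fixed bgdR, avoids the NaN.
eps = 1e-6
cos_ok = np.clip(cos_bad, -1 + eps, 1 - eps)
print(np.arccos(cos_ok))   # a small, finite angle

# At exactly +/-1 arccos itself is defined, but its derivative
# d/dx arccos(x) = -1 / sqrt(1 - x**2) diverges, so backprop would
# produce inf/NaN gradients; the eps margin keeps it finite.
grad = -1.0 / np.sqrt(1.0 - cos_ok ** 2)
print(np.isfinite(grad))   # True
```

This is also why clamping to exactly `[-1, 1]` is not enough: the loss value would be fine, but the gradient at the boundary is still infinite.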
I tried to retrain VPoser on the AMASS dataset, which I downloaded from the official website, following the instructions in the README, but still got this weird error. After training for about 200 epochs, the call to `torch.distributions.normal.Normal` at line 56 of src/human_body_prior/models/vposer_model.py starts receiving NaN values. It seems to be caused by a data issue. I would appreciate it if anyone could explain why this happens, or offer any insight.
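Since the traceback above is raised by `Normal`'s argument validation rather than by the code that produced the NaNs, one way to localize the problem is to check the tensors right before the distribution is built. A minimal sketch (the tensor values here are made up; `validate_args=True` is passed explicitly, though it is the default in recent PyTorch):

```python
import torch

# Pretend this is the encoder output that becomes the Normal's loc.
loc = torch.tensor([[0.0, float("nan")]])
scale = torch.ones_like(loc)

# A cheap guard to find where NaNs first appear during training:
if torch.isnan(loc).any():
    print("NaN reached the distribution parameters")

# With validation on, Normal raises the same ValueError as in the
# traceback instead of silently propagating NaNs downstream.
try:
    torch.distributions.normal.Normal(loc, scale, validate_args=True)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

Placing a check like this at a few points in the forward pass (or enabling `torch.autograd.set_detect_anomaly(True)` during a short run) can narrow down whether the NaNs originate in the data, the encoder weights, or the loss.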