JISock closed 3 years ago
Thanks, Juil, for your interest and for reporting the issue. The bug was in the "rotation matrix to axis angle" conversion functions, which broke the gradient graph and produced NaN values. I have now replaced those functions (here), so training should run correctly. Please let me know if it still doesn't work.
Best, Omid
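For readers wondering how a rotation-matrix-to-axis-angle conversion can produce NaN gradients at all, here is a minimal pure-Python sketch of one common failure mode. It is not the GrabNet code, and the thread does not confirm that this was the exact mechanism. It assumes the angle is recovered as acos((trace(R) - 1) / 2), whose derivative is unbounded as the argument approaches 1 (near-identity rotations). The helper names and the `eps` guard are hypothetical.

```python
import math

def rotation_angle(R):
    """Rotation angle of a 3x3 rotation matrix given as nested lists."""
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    # Clamp against rounding error so acos stays inside its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)

def dacos(x):
    """Analytic derivative of acos(x); it diverges as |x| approaches 1."""
    denom = math.sqrt(1.0 - x * x)
    return -1.0 / denom if denom > 0.0 else float("-inf")

def dacos_clamped(x, eps=1e-6):
    """Same derivative, with x kept eps away from +/-1 (hypothetical guard)."""
    x = max(-1.0 + eps, min(1.0 - eps, x))
    return -1.0 / math.sqrt(1.0 - x * x)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_angle(identity))        # forward value is fine
print(dacos(1.0))                      # derivative is infinite at the identity
print(dacos(1.0) * 0.0)                # chain rule with a zero factor gives nan
print(dacos_clamped(1.0))              # guarded derivative stays finite
```

The forward pass gives a perfectly valid angle here, which is why this kind of problem tends to show up only in the backward pass.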
Thank you. The issue has been resolved.
Best, Juil
Hi Otaheri,
Thank you for the great dataset and the code. I have tried to train the model using the provided code and the dataset and have encountered the following error in the first epoch:
2020-11-27 16:14:05,346 - root - INFO - Dataset Train, Vald, Test size respectively: 0.32 M, 31.03 K, 65.32 K
2020-11-27 16:14:08,496 - root - INFO - Total Trainable Parameters for CoarseNet is 14.04 M.
2020-11-27 16:14:08,497 - root - INFO - Total Trainable Parameters for RefineNet is 3.26 M.
2020-11-27 16:14:08,510 - root - INFO - Started Training at 2020-11-27_16:14:08 for 500 epochs
2020-11-27 16:14:08,511 - root - INFO - --- starting Epoch # 001
2020-11-27 16:14:12,249 - root - INFO - [V00]_TR00_E000 - It 00000 - CoarseNet - train: [T:1.58e+01] - [loss_kl = 2.58e-02 | loss_edge = 1.62e-01 | loss_mesh_rec = 2.18e+00 | loss_dist_h = 7.86e+00 | loss_dist_o = 5.62e+00]
[W python_anomaly_mode.cpp:60] Warning: Error detected in DivBackward0. Traceback of forward call that caused the error:
File "/home/workspace/GrabNet/train.py", line 106, in
grabnet_trainer.fit()
File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 412, in fit
train_loss_dict_cnet, train_loss_dict_rnet = self.train()
File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 216, in train
drec_cnet = self.coarse_net(*dorig)
File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/workspace/GrabNet/grabnet/models/models.py", line 133, in forward
hand_parms = self.decode(z_s, bps_object)
File "/home/workspace/GrabNet/grabnet/models/models.py", line 117, in decode
results = parms_decode(pose, trans)
File "/home/workspace/GrabNet/grabnet/models/models.py", line 208, in parms_decode
pose_full = CRot2rotmat(pose)
File "/home/workspace/GrabNet/grabnet/tools/utils.py", line 87, in CRot2rotmat
b2 = F.normalize(reshaped_input[:, :, 1] - dot_prod * b1, dim=-1)
File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/nn/functional.py", line 3752, in normalize
return input / denom
(function print_stack)
Traceback (most recent call last):
File "/home/workspace/GrabNet/train.py", line 106, in
grabnet_trainer.fit()
File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 412, in fit
train_loss_dict_cnet, train_loss_dict_rnet = self.train()
File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 221, in train
loss_total_cnet.backward()
File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
Process finished with exit code 1
Could you shed some light on how to fix the above issue?
Thanks.
Regards, Juil
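The forward call flagged in the traceback above is a Gram-Schmidt step: the component of the second vector along `b1` is subtracted, and the remainder is normalized. The following is a standalone pure-Python sketch of that step, not the GrabNet implementation. It illustrates one way such a step can become ill-conditioned: when the two input vectors are nearly parallel, the remainder has near-zero length, and the division inside the normalization amplifies gradients by roughly 1/norm. Whether this was the actual trigger is not established by the thread; the maintainer's reply attributes the bug to the rotation matrix to axis angle conversion. The function names are hypothetical, and the `eps` floor mirrors the default `eps=1e-12` of `torch.nn.functional.normalize`.

```python
import math

def normalize(v, eps=1e-12):
    """Scale v to unit length; the norm is floored at eps to avoid division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def gram_schmidt_b2(a1, a2):
    """Return (b2, residual_norm): a2 with its b1 component removed, then normalized."""
    b1 = normalize(a1)
    dot_prod = sum(p * q for p, q in zip(b1, a2))
    residual = [q - dot_prod * p for p, q in zip(b1, a2)]
    residual_norm = math.sqrt(sum(x * x for x in residual))
    return normalize(residual), residual_norm

# Well-conditioned input: the residual has unit length.
print(gram_schmidt_b2([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]))

# Degenerate input: a2 is parallel to a1, so the residual norm is 0 and
# b2 is no longer a unit vector even though no exception is raised.
print(gram_schmidt_b2([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))
```

Logging the residual norm during training is a cheap way to check whether this degenerate case is ever reached before looking elsewhere.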