otaheri / GrabNet

GrabNet: A Generative model to generate realistic 3D hands grasping unseen objects (ECCV2020)
https://grab.is.tue.mpg.de

NaN gradient at the beginning of the training #2

Closed JISock closed 3 years ago

JISock commented 3 years ago

Hi Otaheri,

Thank you for the great dataset and the code. I have tried to train the model using the provided code and the dataset and have encountered the following error in the first epoch:

2020-11-27 16:14:05,346 - root - INFO - Dataset Train, Vald, Test size respectively: 0.32 M, 31.03 K, 65.32 K
2020-11-27 16:14:08,496 - root - INFO - Total Trainable Parameters for CoarseNet is 14.04 M.
2020-11-27 16:14:08,497 - root - INFO - Total Trainable Parameters for RefineNet is 3.26 M.
2020-11-27 16:14:08,510 - root - INFO - Started Training at 2020-11-27_16:14:08 for 500 epochs
2020-11-27 16:14:08,511 - root - INFO - --- starting Epoch # 001
2020-11-27 16:14:12,249 - root - INFO - [V00]_TR00_E000 - It 00000 - CoarseNet - train: [T:1.58e+01] - [loss_kl = 2.58e-02 | loss_edge = 1.62e-01 | loss_mesh_rec = 2.18e+00 | loss_dist_h = 7.86e+00 | loss_dist_o = 5.62e+00]
[W python_anomaly_mode.cpp:60] Warning: Error detected in DivBackward0. Traceback of forward call that caused the error:
  File "/home/workspace/GrabNet/train.py", line 106, in <module>
    grabnet_trainer.fit()
  File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 412, in fit
    train_loss_dict_cnet, train_loss_dict_rnet = self.train()
  File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 216, in train
    drec_cnet = self.coarse_net(*dorig)
  File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/workspace/GrabNet/grabnet/models/models.py", line 133, in forward
    hand_parms = self.decode(z_s, bps_object)
  File "/home/workspace/GrabNet/grabnet/models/models.py", line 117, in decode
    results = parms_decode(pose, trans)
  File "/home/workspace/GrabNet/grabnet/models/models.py", line 208, in parms_decode
    pose_full = CRot2rotmat(pose)
  File "/home/workspace/GrabNet/grabnet/tools/utils.py", line 87, in CRot2rotmat
    b2 = F.normalize(reshaped_input[:, :, 1] - dot_prod * b1, dim=-1)
  File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/nn/functional.py", line 3752, in normalize
    return input / denom
 (function print_stack)
Traceback (most recent call last):
  File "/home/workspace/GrabNet/train.py", line 106, in <module>
    grabnet_trainer.fit()
  File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 412, in fit
    train_loss_dict_cnet, train_loss_dict_rnet = self.train()
  File "/home/workspace/GrabNet/grabnet/train/trainer.py", line 221, in train
    loss_total_cnet.backward()
  File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/anaconda3/envs/grabnet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.

Process finished with exit code 1
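
For context, the `python_anomaly_mode.cpp` warning in the log is what PyTorch prints when autograd anomaly detection is enabled. The snippet below is a minimal, standalone sketch of that mechanism and is not taken from the GrabNet code: the toy expression `sqrt(x) * 0` and the helper name `first_nan_backward_error` are assumptions chosen only to trigger a NaN in the backward pass.

```python
import torch


def first_nan_backward_error():
    """Return the error message raised when backward produces NaN, or None."""
    x = torch.tensor(0.0, requires_grad=True)
    # The forward pass must run inside the context so that the forward
    # traceback of each operation is recorded.
    with torch.autograd.detect_anomaly():
        # Toy expression (assumption): the incoming gradient of sqrt is 0 and
        # its local derivative 1 / (2 * sqrt(0)) is infinite, so the backward
        # pass computes 0 / 0 = NaN.
        y = torch.sqrt(x) * 0.0
        try:
            y.backward()
        except RuntimeError as err:
            return str(err)
    return None


if __name__ == "__main__":
    print(first_nan_backward_error())
```

Anomaly detection slows training noticeably, so it is usually switched on only while hunting for the offending operation.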

Could you shed some light on how to fix the above issue?

thanks

Regards, Juil

otaheri commented 3 years ago

Thanks, Juil, for your interest and for reporting the issue. I found the bug in the "rotation matrix to axis angle" conversion functions, which break the gradient graph and produce NaN values. I have now replaced the functions (here), and it should work. Please let me know if it still doesn't.
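
As a rough illustration of how a rotation-matrix-to-axis-angle conversion can yield NaN gradients, here is a minimal sketch. It is not the code that was replaced in the repository: the helpers `rot_z`, `angle_naive`, `angle_clamped`, and `grad_wrt_angle` are hypothetical, and the clamp is just one common workaround, assumed here for demonstration. The derivative of `acos` is infinite at ±1, which is the value reached for an identity rotation, and an infinite gradient multiplied by zero further up the graph becomes NaN.

```python
import torch


def rot_z(t):
    """Rotation matrix about the z-axis for a scalar angle tensor t."""
    c, s = torch.cos(t), torch.sin(t)
    zero, one = torch.zeros_like(t), torch.ones_like(t)
    return torch.stack([
        torch.stack([c, -s, zero]),
        torch.stack([s, c, zero]),
        torch.stack([zero, zero, one]),
    ])


def angle_naive(R):
    """Rotation angle via acos((trace(R) - 1) / 2), with no safeguard."""
    cos = (torch.trace(R) - 1.0) / 2.0
    return torch.acos(cos)


def angle_clamped(R, eps=1e-6):
    """Same formula, but the acos argument is kept away from +/-1."""
    cos = (torch.trace(R) - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0 + eps, 1.0 - eps))


def grad_wrt_angle(fn, t0):
    """Gradient of fn(rot_z(t)) with respect to t, evaluated at t = t0."""
    t = torch.tensor(t0, requires_grad=True)
    fn(rot_z(t)).backward()
    return t.grad


if __name__ == "__main__":
    print("naive, identity rotation:  ", grad_wrt_angle(angle_naive, 0.0))
    print("clamped, identity rotation:", grad_wrt_angle(angle_clamped, 0.0))
    print("naive, 0.5 rad rotation:   ", grad_wrt_angle(angle_naive, 0.5))
```

Clamping trades exactness for stability: at the boundary the gradient is cut to zero rather than being the true derivative, which is typically acceptable for training but is a design choice rather than the only option.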

Best, Omid

JISock commented 3 years ago

Thank you. The issue has been resolved.

Best, Juil