namepllet / HandOccNet

Official PyTorch implementation of "HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network", CVPR 2022.

Model can't converge #9

Closed by luhavefun 2 years ago

luhavefun commented 2 years ago

Thanks for your great work. I ran into trouble training on HO3D-v2. I trained the model following the given steps, but it did not converge properly. Here are the logs from the last epoch:

08-09 04:21:59 Epoch 69/70 itr 4123/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 5.1809 loss_mano_joints: 5.4888 loss_mano_pose: 0.3954 loss_mano_shape: 0.2697 loss_joints_img: 3.1225
08-09 04:22:00 Epoch 69/70 itr 4124/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 11.5330 loss_mano_joints: 12.5328 loss_mano_pose: 0.6882 loss_mano_shape: 0.3608 loss_joints_img: 4.0549
08-09 04:22:00 Epoch 69/70 itr 4125/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 4.1236 loss_mano_joints: 4.4260 loss_mano_pose: 0.3778 loss_mano_shape: 0.5172 loss_joints_img: 3.0202
08-09 04:22:00 Epoch 69/70 itr 4126/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 23.6163 loss_mano_joints: 25.9149 loss_mano_pose: 0.5916 loss_mano_shape: 0.2481 loss_joints_img: 3.1539
08-09 04:22:00 Epoch 69/70 itr 4127/4128: lr: 1.17649e-05 speed: 0.19(0.19s r0.00)s/itr 0.22h/epoch loss_mano_verts: 10.1710 loss_mano_joints: 10.8513 loss_mano_pose: 0.5306 loss_mano_shape: 0.2882 loss_joints_img: 4.7335

namepllet commented 2 years ago

We trained our model with 4 GPUs.

So the effective batch size is 64 (16 per GPU * 4 GPUs).

I think you trained the model on a single GPU with the learning rate set to 1e-4.

If you'd like to train the model on a single GPU, please set the learning rate to 1e-4 * 1/4 (i.e. 2.5e-5).
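The adjustment above amounts to scaling the learning rate in proportion to the effective batch size. Here is a minimal sketch of that arithmetic, assuming a reference setup of 4 GPUs with 16 samples each at lr 1e-4, as described in this thread. The function name `scaled_lr` and its parameters are hypothetical and are not part of the HandOccNet code base.

```python
def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, num_gpus):
    """Scale a reference learning rate linearly with the effective batch size.

    Hypothetical helper for illustration only; it is not taken from HandOccNet.
    """
    if base_batch_size <= 0 or per_gpu_batch_size <= 0 or num_gpus <= 0:
        raise ValueError("batch sizes and GPU count must be positive")
    effective_batch_size = per_gpu_batch_size * num_gpus
    return base_lr * effective_batch_size / base_batch_size


# Reference setup from this thread: 16 per GPU * 4 GPUs = 64, lr = 1e-4.
reference_lr = scaled_lr(1e-4, 64, per_gpu_batch_size=16, num_gpus=4)

# Single-GPU setup: effective batch size 16, so the lr becomes 1e-4 * 1/4.
single_gpu_lr = scaled_lr(1e-4, 64, per_gpu_batch_size=16, num_gpus=1)

print(reference_lr)   # 0.0001
print(single_gpu_lr)  # 2.5e-05
```

The resulting value would then replace the learning rate in the training configuration. Where exactly that setting lives depends on the repository's config file and is not shown here.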

luhavefun commented 2 years ago

Thank you for your reply! After changing the lr, training has been stable so far (20 epochs).