Closed · windness97 closed this issue 4 years ago
Hi!
Hmm.. This is very weird, because I haven't met any NaN issue during training in lots of experiments...
Could you check that you are loading the correct meshes in Human36M/Human36M.py and MSCOCO/MSCOCO.py? Maybe you can use the vis_mesh and save_obj functions in utils/vis.py.
Also, could you train again without loss_mesh_normal and loss_mesh_edge?
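For example, you can dump whatever vertices/faces the loader ends up with to an .obj and open it in MeshLab or Blender (a minimal sketch below; save_obj in utils/vis.py serves the same purpose, this just avoids depending on its exact arguments):

def dump_obj(vertices, faces, path='check_mesh.obj'):
    # vertices: (N, 3) float array; faces: (M, 3) zero-based vertex indices
    with open(path, 'w') as fp:
        for v in vertices:
            fp.write('v %f %f %f\n' % (v[0], v[1], v[2]))
        for f in faces:
            # .obj faces are 1-indexed
            fp.write('f %d %d %d\n' % (f[0] + 1, f[1] + 1, f[2] + 1))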
I haven't tried batch size 48 (your GPUs must have a lot of memory to handle 48), but I tried 8 and 16, with 2~4 GPUs.
@mks0601 Hi, thank you for your prompt reply! I've just disabled mesh_fit, mesh_normal and mesh_edge, since they all become nan at some point in my training, and I'm trying to visualize the mesh models. It might take a while before I have further progress. Thanks again!
I think loss_mesh_fit is necessary for the mesh reconstruction. Please let me know about any progress!
@mks0601 Hi! I think I've found out where the problem is.
I've already successfully executed demo/demo.py (using the pre-trained snapshot you provide, snapshot_8.pth.tar), and the output rendered_mesh_lixel.jpg looks fine, so it's probably not an SMPL model problem.
I tried disabling loss['mesh_fit'], loss['mesh_normal'] and loss['mesh_edge'], and used the resulting snapshot to test in demo.py (visualizing with vis_keypoints() from common/utils/vis.py). The mesh is a mess but the keypoints seem fine, so the keypoint regression has no problem.
Then I tried to debug the training process to figure out what makes the mesh losses nan (loss['mesh_fit'], loss['mesh_normal'], loss['mesh_edge']), and it turns out to be targets['fit_mesh_img'] (in main/model.py: forward()). targets['fit_mesh_img'] randomly contains some nan values (usually only one vertex coordinate becomes nan) at some point during training (it happens only a few times per epoch).
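A check like the following inside main/model.py's forward() catches the bad batch as soon as it arrives (just a sketch of the idea; logging the actual img paths needs them added to meta_info, which is what I did in the debug script described below):

import torch

def check_fit_mesh_target(targets):
    # flag any sample in the batch whose mesh target contains nan
    nan_mask = torch.isnan(targets['fit_mesh_img'])   # (batch_size, vertex_num, 3)
    if nan_mask.any():
        bad_idx = nan_mask.view(nan_mask.shape[0], -1).any(dim=1).nonzero().flatten()
        print('nan in fit_mesh_img for batch indices:', bad_idx.tolist())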
This error happens randomly and with a small probability, so I wondered whether it is related to some specific imgs; I recorded some imgs from Human3.6M that trigger the error:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
and I wrote a script to reproduce the error: debug_h36m_nan.txt
What the script does:
1. SubHuman36M extends data.Human36M.Human36M.Human36M. It only returns the designated samples (the 2 imgs above), so I slightly overwrite __init__() and load_data(). I also overwrite __getitem__(), setting the parameter exclude_flip of augmentation() to True (because this way the nan error always occurs) and modifying the return value of __getitem__() to include the img path for logging. There are no modifications besides these (a rough sketch of the idea follows the run command below).
2. Use SubHuman36M to create a dataloader that does the same operations as Human36M but only on the designated imgs, simply iterate it to get the processed data, and check whether targets['fit_mesh_img'] contains nan values.
3. Just put it in the main dir and run:
python debug_h36m_nan.py
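Roughly, the subclass idea looks like this (just a sketch; the attached debug_h36m_nan.txt has the real code, and details such as the 'img_path' key, the constructor arguments and the (inputs, targets, meta_info) return format are my reading of the repo's Human36M loader):

import torch
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from data.Human36M.Human36M import Human36M

TARGET_IMGS = [
    '../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg',
    '../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg',
]

class SubHuman36M(Human36M):
    def load_data(self):
        # keep only the two samples that trigger the nan
        datalist = super().load_data()
        return [d for d in datalist if d['img_path'] in TARGET_IMGS]

if __name__ == '__main__':
    dataset = SubHuman36M(transforms.ToTensor(), 'train')
    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for test_no in range(5):
        print('----- test no.%d -----' % test_no)
        for inputs, targets, meta_info in loader:
            if torch.isnan(targets['fit_mesh_img']).any():
                print('nan occurs in this batch')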
In my environment, the nan error ALWAYS happens on the 2 designated imgs (note that I've modified __getitem__() and forced the augmentation not to flip):
debug_h36m_nan.py
creating index...
0%| | 0/1559752 [00:00<?, ?it/s]index created!
Get bounding box and root from groundtruth
100%|██████████| 1559752/1559752 [00:02<00:00, 659397.26it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
start test:
----- test no.0 -----
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:12: RuntimeWarning: divide by zero encountered in true_divide
x = cam_coord[:,0] / cam_coord[:,2] * f[0] + c[0]
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:13: RuntimeWarning: divide by zero encountered in true_divide
y = cam_coord[:,1] / cam_coord[:,2] * f[1] + c[1]
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.1 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.2 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.3 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.4 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
Process finished with exit code 0
You can see that before the error there is a divide-by-zero warning from common/utils/transforms.py. Following this clue, I found that the nan value comes from Human36M.py: get_smpl_coord(), which returns a smpl_mesh_coord containing a 0 value on the z-axis; the divide-by-zero is then triggered in common/utils/transforms.py. I have no better idea how to deal with this, so I simply add a small float value to the denominator:
def cam2pixel(cam_coord, f, c):
    # workaround: if this is a mesh (>6000 vertices) and any z-coordinate is exactly 0,
    # add a small offset to the denominator to avoid divide-by-zero
    if cam_coord.shape[0] > 6000 and len(np.where(cam_coord[:, 2] == 0)[0]) > 0:
        x = cam_coord[:, 0] / (cam_coord[:, 2] + 0.001) * f[0] + c[0]
        y = cam_coord[:, 1] / (cam_coord[:, 2] + 0.001) * f[1] + c[1]
        z = cam_coord[:, 2]
    else:
        x = cam_coord[:, 0] / cam_coord[:, 2] * f[0] + c[0]
        y = cam_coord[:, 1] / cam_coord[:, 2] * f[1] + c[1]
        z = cam_coord[:, 2]
    return np.stack((x, y, z), 1)
and the error seems to be solved.
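(For reference, a slightly cleaner variant of the same workaround that only nudges the z values that are exactly zero; just a sketch, and I haven't measured any accuracy impact:)

import numpy as np

def cam2pixel_safe(cam_coord, f, c, eps=1e-4):
    z = cam_coord[:, 2]
    z_safe = np.where(z == 0, eps, z)   # replace exact zeros only
    x = cam_coord[:, 0] / z_safe * f[0] + c[0]
    y = cam_coord[:, 1] / z_safe * f[1] + c[1]
    return np.stack((x, y, z), 1)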
I don't know whether this will cause any accuracy loss or other problems, or why get_smpl_coord() returns a smpl_mesh_coord with a 0 value in the first place (maybe it's a bug that only occurs with specific environment settings?). I also don't know whether anything other than the divide-by-zero can cause the nan error.
I've just started training with this simple modification to see if any other problem shows up.
Any suggestion?
Hi
I got this result.
creating index...
index created!
Get bounding box and root from groundtruth
100%|██████████████████████████████████████████████████| 1559752/1559752 [00:06<00:00, 247654.87it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
start test:
----- test no.0 -----
----- test no.1 -----
----- test no.2 -----
----- test no.3 -----
----- test no.4 -----
----- test no.5 -----
----- test no.6 -----
----- test no.7 -----
----- test no.8 -----
----- test no.9 -----
Basically, I didn't get any NaN error. Could you check which cam2pixel call gives the error, and whether some coordinates contain a zero element?
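Something like this (just a quick sketch) would also tell you which call site passes a zero z-coordinate into cam2pixel:

import traceback
import numpy as np

def cam2pixel_debug(cam_coord, f, c):
    zero_rows = np.where(cam_coord[:, 2] == 0)[0]
    if len(zero_rows) > 0:
        print('zero z-coordinate at rows:', zero_rows)
        traceback.print_stack()   # shows which dataset/function called cam2pixel
    x = cam_coord[:, 0] / cam_coord[:, 2] * f[0] + c[0]
    y = cam_coord[:, 1] / cam_coord[:, 2] * f[1] + c[1]
    z = cam_coord[:, 2]
    return np.stack((x, y, z), 1)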
@mks0601 Hi! Sure.
I've debugged the script only on s_06_act_14_subact_02_ca_04_000856.jpg, and the error goes like this.
In main/debug_h36m_nan.py: SubHuman36M.__getitem__(), lines 201-206 (I've modified this file, so the line numbers may not match the original):
# smpl coordinates
smpl_mesh_cam, smpl_joint_cam, smpl_pose, smpl_shape = self.get_smpl_coord(smpl_param, cam_param, do_flip, img_shape)
smpl_coord_cam = np.concatenate((smpl_mesh_cam, smpl_joint_cam))
focal, princpt = cam_param['focal'], cam_param['princpt']
smpl_coord_img = cam2pixel(smpl_coord_cam, focal, princpt)
On line 202, the returned smpl_mesh_cam contains a 0 value, and so smpl_coord_cam contains a 0 value.
On line 206, smpl_coord_cam is passed into cam2pixel as cam_coord; it contains a 0 value on the z-axis, so the divide-by-zero occurs and smpl_coord_img ends up with -inf values.
Vertex no.4794 is the only vertex with 0 on the z-axis (in both smpl_mesh_cam and smpl_coord_cam); after line 206, smpl_coord_img has -inf on the x-axis and y-axis of vertex no.4794.
Then in main/debug_h36m_nan.py: SubHuman36M.__getitem__(), lines 208-215:
# affine transform x,y coordinates, root-relative depth
smpl_coord_img_xy1 = np.concatenate((smpl_coord_img[:, :2], np.ones_like(smpl_coord_img[:, :1])), 1)
smpl_coord_img[:, :2] = np.dot(img2bb_trans, smpl_coord_img_xy1.transpose(1, 0)).transpose(1, 0)[:, :2]
smpl_coord_img[:, 2] = smpl_coord_img[:, 2] - smpl_coord_cam[self.vertex_num + self.root_joint_idx][2]
# coordinates voxelize
smpl_coord_img[:, 0] = smpl_coord_img[:, 0] / cfg.input_img_shape[1] * cfg.output_hm_shape[2]
smpl_coord_img[:, 1] = smpl_coord_img[:, 1] / cfg.input_img_shape[0] * cfg.output_hm_shape[1]
smpl_coord_img[:, 2] = (smpl_coord_img[:, 2] / (cfg.bbox_3d_size * 1000 / 2) + 1) / 2. * \
cfg.output_hm_shape[0] # change cfg.bbox_3d_size from meter to milimeter
After line 209, smpl_coord_img_xy1 contains -inf values too.
After line 210, smpl_coord_img contains -inf on the x-axis and nan on the y-axis (see the small numpy illustration after the next snippet).
Then after line 226, smpl_mesh_img contains -inf and nan, and it becomes the final output (targets['fit_mesh_img']):
# split mesh and joint coordinates
smpl_mesh_img = smpl_coord_img[:self.vertex_num];
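The reason x stays -inf while y becomes nan is the mixed signs in the affine matrix rows: inf + (-inf) is nan. A tiny numpy illustration with made-up matrix values:

import numpy as np

img2bb_trans = np.array([[ 1.0,  0.2, 3.0],   # x row: both infinite terms are -inf -> -inf
                         [-0.2,  1.0, 5.0]])  # y row: +inf + (-inf) -> nan
xy1 = np.array([-np.inf, -np.inf, 1.0])
print(np.dot(img2bb_trans, xy1))              # [-inf  nan]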
If I'm the only one who has this problem, it might be something to do with my environment settings.
I'm using Python 3.7.7, torch==1.4.0, numpy==1.19.1.
Here's the detailed package list, returned by pip list:
Package Version
--------------- -------------------
certifi 2020.6.20
chumpy 0.69
cycler 0.10.0
Cython 0.29.21
decorator 4.4.2
freetype-py 2.2.0
future 0.18.2
imageio 2.9.0
kiwisolver 1.2.0
matplotlib 3.3.1
networkx 2.4
numpy 1.19.1
opencv-python 4.4.0.42
Pillow 7.2.0
pip 20.2.2
pycocotools 2.0.1
pyglet 1.5.7
PyOpenGL 3.1.0
pyparsing 2.4.7
pyrender 0.1.43
python-dateutil 2.8.1
scipy 1.5.2
setuptools 49.6.0.post20200814
six 1.15.0
torch 1.4.0
torchgeometry 0.1.2
torchvision 0.5.0
tqdm 4.48.2
transforms3d 0.3.1
trimesh 3.8.1
wheel 0.34.2
Here's the good news.
I modified cam2pixel() to avoid the divide-by-zero problem, and the training process has been fine so far.
Here are the keypoint and mesh results after 2 epochs.
Could you please show me your python environment details? That will help. ^ ^
Did you modify the get_smpl_coord function of Human36M.py, for example making the coordinates root-relative? Could you check yours against mine line by line? 'cam' means camera-centered coordinates, and a 0 z-axis coordinate means zero distance from the camera along the z-axis, which is nonsense. Could you visualize smpl_coord_img on the image in Human36M.py using the vis_mesh function?
Hmmmm...
I'm sure that I haven't changed anything in get_smpl_coord() of Human36M.py.
I failed to visualize smpl_coord_img on the input img; I still can't fully understand these coordinate transforms (my best guess is the untested sketch below). I'll let you know if I have further progress. ^ ^
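My rough reading of the transforms quoted above: smpl_coord_img lives in output heatmap space, so something like this should scale x/y back to the network input patch before drawing (untested sketch; I draw the points with cv2 directly rather than guessing vis_mesh's exact arguments):

import cv2
import numpy as np

def draw_mesh_xy(img_patch, smpl_coord_img, input_img_shape, output_hm_shape):
    # undo the voxelize step: heatmap coordinates -> input-patch pixel coordinates
    vis = img_patch.copy()
    xs = smpl_coord_img[:, 0] / output_hm_shape[2] * input_img_shape[1]
    ys = smpl_coord_img[:, 1] / output_hm_shape[1] * input_img_shape[0]
    for x, y in zip(xs, ys):
        if np.isfinite(x) and np.isfinite(y):
            cv2.circle(vis, (int(x), int(y)), 1, (0, 255, 0), -1)
    return vis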
Besides, do you think that it could be due to using the 3DMPPE version of Human3.6M dataset?
The data from 3DMPPE is exactly the same as that of I2L-MeshNet; I just added SMPL parameters. Ah, when did you download the H36M data? I changed the extrinsic camera parameters and the corresponding functions on Jun 8 this year. I think that can make the coordinates zero, because the translation vector changed. If you downloaded the data before Jun 8, could you download the camera parameters again and check the error?
@mks0601 Hi! Wow, that makes sense! I downloaded the H36M data at least half a year ago. I'll re-download the annotations and check whether that solves the problem. ^ ^
Awesome!
Problem solved! Thank you! ^ ^
@mks0601 Hi, thank you for your great work. I had a problem while training the model in the first 'lixel' stage. I care more about the Human Pose and Mesh Estimation performance, and I've downloaded the Human3.6M (from your other project, 3DMPPE; it's the same data, right?), MSCOCO 2017 and 3DPW datasets with the links you provided and made them meet the requirements mentioned in README.md. I haven't downloaded the MuCo dataset, so I modified main/config.py like this:
Besides that, I modified the train_batch_size from 16 to 48. Then I tried to run main/train.py with no further modification of the config:
python train.py --gpu 0-2 --stage lixel
It runs on 3 TITAN RTX GPUs and everything looks fine at first, but nan loss occurs in epoch 0. (I modified the train_batch_size from 16 to 48, so the total number of iterations looks quite small.) You can see that loss_mesh_fit, loss_mesh_normal and loss_mesh_edge become nan first, and after that all losses become nan. I debugged the program and found that the weights of the relevant layers all become nan when this happens, so it is probably caused by the nan losses above, which then spread to the parameters of all layers through backpropagation.
I've tried several times, with different GPU numbers (0, 1, 2) and different batch sizes (8, 16, 32, 48), and it always happens at some point in the first epoch (0/13). I thought it might be due to some specific imgs, so I recorded, over several runs, the batch of imgs that seems to trigger the nan loss (simply logging their paths), but the batches don't seem to intersect. Here are 8 imgs from one attempt, recorded when the nan loss occurred, with 1 GPU and train_batch_size=8.
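For reference, roughly the kind of check I ran while debugging (a quick sketch of my own, not repo code): after each iteration, scan the loss dict and the network weights for nan.

import torch

def report_nan(model, loss_dict):
    bad_losses = [k for k, v in loss_dict.items() if torch.isnan(v).any()]
    if bad_losses:
        print('nan losses this iteration:', bad_losses)
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            print('nan weights in layer:', name)
            break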
I'm new to pytorch and HPE, and I'd appreciate your suggestion.