Closed · windness97 closed this issue 4 years ago
Hi!
Hmm.. This is very weird, because I haven't met any NaN issue during training in lots of experiments...
Could you check that you are loading the correct meshes in Human36M/Human36M.py and MSCOCO/MSCOCO.py? Maybe you can use the vis_mesh and save_obj functions in utils/vis.py.
Also, could you train again without loss_mesh_normal and loss_mesh_edge?
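For example, you can dump whatever vertices/faces the loader ends up with to an .obj and open it in MeshLab or Blender (a minimal sketch below; save_obj in utils/vis.py serves the same purpose, this just avoids depending on its exact arguments):

def dump_obj(vertices, faces, path='check_mesh.obj'):
    # vertices: (N, 3) float array; faces: (M, 3) zero-based vertex indices
    with open(path, 'w') as fp:
        for v in vertices:
            fp.write('v %f %f %f\n' % (v[0], v[1], v[2]))
        for f in faces:
            # .obj faces are 1-indexed
            fp.write('f %d %d %d\n' % (f[0] + 1, f[1] + 1, f[2] + 1))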
I haven't tried batch size 48 (your GPUs must have a lot of memory to handle 48), but I tried 8 and 16, with 2~4 GPUs.
@mks0601 Hi, thank you for your prompt reply! I've just disabled mesh_fit, mesh_normal and mesh_edge, since they all become nan at some point in my training, and I'm trying to visualize the mesh models. It might take a while before I have further progress. Thanks again!
I think loss_mesh_fit is necessary for the mesh reconstruction. Please let me know about any progress!
@mks0601 Hi! I think I've found out where the problem is.
I've already successfully executed demo/demo.py (using the pre-trained snapshot you provide, snapshot_8.pth.tar), and the output rendered_mesh_lixel.jpg looks fine, so it's probably not an SMPL model problem.
I tried disabling loss['mesh_fit'], loss['mesh_normal'] and loss['mesh_edge'], and used the resulting snapshot to test in demo.py (visualizing with vis_keypoints() from common/utils/vis.py). The mesh is a mess but the keypoints seem fine, so the keypoint regression has no problem.
Then I tried to debug the training process to figure out what makes the mesh losses nan (loss['mesh_fit'], loss['mesh_normal'], loss['mesh_edge']), and it turns out to be targets['fit_mesh_img'] (in main/model.py: forward()). targets['fit_mesh_img'] randomly contains some nan values (usually only one vertex coordinate becomes nan) at some point during training (it happens only a few times per epoch).
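A check like the following inside main/model.py's forward() catches the bad batch as soon as it arrives (just a sketch of the idea; logging the actual img paths needs them added to meta_info, which is what I did in the debug script described below):

import torch

def check_fit_mesh_target(targets):
    # flag any sample in the batch whose mesh target contains nan
    nan_mask = torch.isnan(targets['fit_mesh_img'])   # (batch_size, vertex_num, 3)
    if nan_mask.any():
        bad_idx = nan_mask.view(nan_mask.shape[0], -1).any(dim=1).nonzero().flatten()
        print('nan in fit_mesh_img for batch indices:', bad_idx.tolist())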
This error happens randomly and with a small probability, so I wondered whether it is related to some specific imgs; I recorded some imgs from Human3.6M that trigger the error:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
and I wrote a script to reproduce the error: debug_h36m_nan.txt
What the script does:
1. SubHuman36M extends data.Human36M.Human36M.Human36M. It only returns the designated samples (the 2 imgs above), so I slightly overwrite __init__() and load_data(). I also overwrite __getitem__(), setting the parameter exclude_flip of augmentation() to True (because this way the nan error always occurs) and modifying the return value of __getitem__() to include the img path for logging. There are no modifications besides these (a rough sketch of the idea follows the run command below).
2. Use SubHuman36M to create a dataloader that does the same operations as Human36M but only on the designated imgs, simply iterate it to get the processed data, and check whether targets['fit_mesh_img'] contains nan values.
3. Just put it in the main dir and run:
python debug_h36m_nan.py
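Roughly, the subclass idea looks like this (just a sketch; the attached debug_h36m_nan.txt has the real code, and details such as the 'img_path' key, the constructor arguments and the (inputs, targets, meta_info) return format are my reading of the repo's Human36M loader):

import torch
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from data.Human36M.Human36M import Human36M

TARGET_IMGS = [
    '../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg',
    '../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg',
]

class SubHuman36M(Human36M):
    def load_data(self):
        # keep only the two samples that trigger the nan
        datalist = super().load_data()
        return [d for d in datalist if d['img_path'] in TARGET_IMGS]

if __name__ == '__main__':
    dataset = SubHuman36M(transforms.ToTensor(), 'train')
    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for test_no in range(5):
        print('----- test no.%d -----' % test_no)
        for inputs, targets, meta_info in loader:
            if torch.isnan(targets['fit_mesh_img']).any():
                print('nan occurs in this batch')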
In my environment, the nan error ALWAYS happens on the 2 designated imgs (note that I've modified __getitem__() and forced the augmentation not to flip):
debug_h36m_nan.py
creating index...
0%| | 0/1559752 [00:00<?, ?it/s]index created!
Get bounding box and root from groundtruth
100%|██████████| 1559752/1559752 [00:02<00:00, 659397.26it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
start test:
----- test no.0 -----
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:12: RuntimeWarning: divide by zero encountered in true_divide
x = cam_coord[:,0] / cam_coord[:,2] * f[0] + c[0]
/home/windness/windness/proj/HPE/I2L-MeshNet_RELEASE/main/../common/utils/transforms.py:13: RuntimeWarning: divide by zero encountered in true_divide
y = cam_coord[:,1] / cam_coord[:,2] * f[1] + c[1]
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.1 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.2 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.3 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
----- test no.4 -----
nan occurs: ['../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg']
nan occurs: ['../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg']
Process finished with exit code 0
You can see that before the error there is a divide-by-zero warning from common/utils/transforms.py. Following this clue, I found that the nan value comes from Human36M.py: get_smpl_coord(), which returns a smpl_mesh_coord containing a 0 value on the z-axis; the divide-by-zero is then triggered in common/utils/transforms.py. I have no better idea how to deal with this, so I simply add a small float value to the denominator:
def cam2pixel(cam_coord, f, c):
    # workaround: if this is a mesh (>6000 vertices) and any z-coordinate is exactly 0,
    # add a small offset to the denominator to avoid divide-by-zero
    if cam_coord.shape[0] > 6000 and len(np.where(cam_coord[:, 2] == 0)[0]) > 0:
        x = cam_coord[:, 0] / (cam_coord[:, 2] + 0.001) * f[0] + c[0]
        y = cam_coord[:, 1] / (cam_coord[:, 2] + 0.001) * f[1] + c[1]
        z = cam_coord[:, 2]
    else:
        x = cam_coord[:, 0] / cam_coord[:, 2] * f[0] + c[0]
        y = cam_coord[:, 1] / cam_coord[:, 2] * f[1] + c[1]
        z = cam_coord[:, 2]
    return np.stack((x, y, z), 1)
and the error seems to be solved.
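(For reference, a slightly cleaner variant of the same workaround that only nudges the z values that are exactly zero; just a sketch, and I haven't measured any accuracy impact:)

import numpy as np

def cam2pixel_safe(cam_coord, f, c, eps=1e-4):
    z = cam_coord[:, 2]
    z_safe = np.where(z == 0, eps, z)   # replace exact zeros only
    x = cam_coord[:, 0] / z_safe * f[0] + c[0]
    y = cam_coord[:, 1] / z_safe * f[1] + c[1]
    return np.stack((x, y, z), 1)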
I don't know whether this will cause any accuracy loss or other problems, or why get_smpl_coord() returns a smpl_mesh_coord with a 0 value in the first place (maybe it's a bug that only occurs with specific environment settings?). I also don't know whether anything other than the divide-by-zero can cause the nan error.
I've just started training with this simple modification to see if any other problem shows up.
Any suggestion?
Hi
I got this result.
creating index...
index created!
Get bounding box and root from groundtruth
100%|██████████████████████████████████████████████████| 1559752/1559752 [00:06<00:00, 247654.87it/s]
only test 2 imgs:
../data/Human36M/images/s_01_act_14_subact_02_ca_03/s_01_act_14_subact_02_ca_03_002786.jpg
../data/Human36M/images/s_06_act_14_subact_02_ca_04/s_06_act_14_subact_02_ca_04_000856.jpg
start test:
----- test no.0 -----
----- test no.1 -----
----- test no.2 -----
----- test no.3 -----
----- test no.4 -----
----- test no.5 -----
----- test no.6 -----
----- test no.7 -----
----- test no.8 -----
----- test no.9 -----
Basically, I didn't get any NaN error. Could you check which cam2pixel call gives the error, and whether some coordinates contain a zero element?
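Something like this (just a quick sketch) would also tell you which call site passes a zero z-coordinate into cam2pixel:

import traceback
import numpy as np

def cam2pixel_debug(cam_coord, f, c):
    zero_rows = np.where(cam_coord[:, 2] == 0)[0]
    if len(zero_rows) > 0:
        print('zero z-coordinate at rows:', zero_rows)
        traceback.print_stack()   # shows which dataset/function called cam2pixel
    x = cam_coord[:, 0] / cam_coord[:, 2] * f[0] + c[0]
    y = cam_coord[:, 1] / cam_coord[:, 2] * f[1] + c[1]
    z = cam_coord[:, 2]
    return np.stack((x, y, z), 1)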
@mks0601 Hi! Sure.
I've debugged the script only on s_06_act_14_subact_02_ca_04_000856.jpg, and the error goes like this.
In main/debug_h36m_nan.py: SubHuman36M.__getitem__(), lines 201-206 (I've modified this file, so the line numbers may not match the original):
# smpl coordinates
smpl_mesh_cam, smpl_joint_cam, smpl_pose, smpl_shape = self.get_smpl_coord(smpl_param, cam_param, do_flip, img_shape)
smpl_coord_cam = np.concatenate((smpl_mesh_cam, smpl_joint_cam))
focal, princpt = cam_param['focal'], cam_param['princpt']
smpl_coord_img = cam2pixel(smpl_coord_cam, focal, princpt)
On line 202, the returned smpl_mesh_cam contains a 0 value, and so smpl_coord_cam contains a 0 value.
On line 206, smpl_coord_cam is passed into cam2pixel as cam_coord; it contains a 0 value on the z-axis, so the divide-by-zero occurs and smpl_coord_img ends up with -inf values.
Vertex no.4794 is the only vertex with 0 on the z-axis (in both smpl_mesh_cam and smpl_coord_cam); after line 206, smpl_coord_img has -inf on the x-axis and y-axis of vertex no.4794.
Then in main/debug_h36m_nan.py: SubHuman36M.__getitem__(), lines 208-215:
# affine transform x,y coordinates, root-relative depth
smpl_coord_img_xy1 = np.concatenate((smpl_coord_img[:, :2], np.ones_like(smpl_coord_img[:, :1])), 1)
smpl_coord_img[:, :2] = np.dot(img2bb_trans, smpl_coord_img_xy1.transpose(1, 0)).transpose(1, 0)[:, :2]
smpl_coord_img[:, 2] = smpl_coord_img[:, 2] - smpl_coord_cam[self.vertex_num + self.root_joint_idx][2]
# coordinates voxelize
smpl_coord_img[:, 0] = smpl_coord_img[:, 0] / cfg.input_img_shape[1] * cfg.output_hm_shape[2]
smpl_coord_img[:, 1] = smpl_coord_img[:, 1] / cfg.input_img_shape[0] * cfg.output_hm_shape[1]
smpl_coord_img[:, 2] = (smpl_coord_img[:, 2] / (cfg.bbox_3d_size * 1000 / 2) + 1) / 2. * \
cfg.output_hm_shape[0] # change cfg.bbox_3d_size from meter to milimeter
After line 209, smpl_coord_img_xy1 contains -inf values too.
After line 210, smpl_coord_img contains -inf on the x-axis and nan on the y-axis (see the small numpy illustration after the next snippet).
Then after line 226, smpl_mesh_img contains -inf and nan, and it becomes the final output (targets['fit_mesh_img']):
# split mesh and joint coordinates
smpl_mesh_img = smpl_coord_img[:self.vertex_num];
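The reason x stays -inf while y becomes nan is the mixed signs in the affine matrix rows: inf + (-inf) is nan. A tiny numpy illustration with made-up matrix values:

import numpy as np

img2bb_trans = np.array([[ 1.0,  0.2, 3.0],   # x row: both infinite terms are -inf -> -inf
                         [-0.2,  1.0, 5.0]])  # y row: +inf + (-inf) -> nan
xy1 = np.array([-np.inf, -np.inf, 1.0])
print(np.dot(img2bb_trans, xy1))              # [-inf  nan]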
If I'm the only one who has this problem, it might be something to do with my environment settings.
I'm using Python 3.7.7, torch==1.4.0, numpy==1.19.1.
Here's the detailed package list, returned by pip list:
Package Version
--------------- -------------------
certifi 2020.6.20
chumpy 0.69
cycler 0.10.0
Cython 0.29.21
decorator 4.4.2
freetype-py 2.2.0
future 0.18.2
imageio 2.9.0
kiwisolver 1.2.0
matplotlib 3.3.1
networkx 2.4
numpy 1.19.1
opencv-python 4.4.0.42
Pillow 7.2.0
pip 20.2.2
pycocotools 2.0.1
pyglet 1.5.7
PyOpenGL 3.1.0
pyparsing 2.4.7
pyrender 0.1.43
python-dateutil 2.8.1
scipy 1.5.2
setuptools 49.6.0.post20200814
six 1.15.0
torch 1.4.0
torchgeometry 0.1.2
torchvision 0.5.0
tqdm 4.48.2
transforms3d 0.3.1
trimesh 3.8.1
wheel 0.34.2
Here's the good news.
I modified cam2pixel() to avoid the divide-by-zero problem, and the training process has been fine so far.
Here are the keypoint and mesh results after 2 epochs.
Could you please show me your python environment details? That will help. ^ ^
Did you modify the get_smpl_coord function of Human36M.py, for example making the coordinates root-relative? Could you check yours against mine line by line? 'cam' means camera-centered coordinates, and a 0 z-axis coordinate means zero distance from the camera along the z-axis, which is nonsense. Could you visualize smpl_coord_img on the image in Human36M.py using the vis_mesh function?
Hmmmm...
I'm sure that I haven't changed anything in get_smpl_coord() of Human36M.py.
I failed to visualize smpl_coord_img on the input img; I still can't fully understand these coordinate transforms (my best guess is the untested sketch below). I'll let you know if I have further progress. ^ ^
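My rough reading of the transforms quoted above: smpl_coord_img lives in output heatmap space, so something like this should scale x/y back to the network input patch before drawing (untested sketch; I draw the points with cv2 directly rather than guessing vis_mesh's exact arguments):

import cv2
import numpy as np

def draw_mesh_xy(img_patch, smpl_coord_img, input_img_shape, output_hm_shape):
    # undo the voxelize step: heatmap coordinates -> input-patch pixel coordinates
    vis = img_patch.copy()
    xs = smpl_coord_img[:, 0] / output_hm_shape[2] * input_img_shape[1]
    ys = smpl_coord_img[:, 1] / output_hm_shape[1] * input_img_shape[0]
    for x, y in zip(xs, ys):
        if np.isfinite(x) and np.isfinite(y):
            cv2.circle(vis, (int(x), int(y)), 1, (0, 255, 0), -1)
    return vis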
Besides, do you think that it could be due to using the 3DMPPE version of Human3.6M dataset?
The data from 3DMPPE is exactly the same as that of I2L-MeshNet; I just added SMPL parameters. Ah, when did you download the H36M data? I changed the extrinsic camera parameters and the corresponding functions on Jun 8 this year. I think that can make the coordinates zero, because the translation vector changed. If you downloaded the data before Jun 8, could you download the camera parameters again and check the error?
@mks0601 Hi! Wow, that makes sense! I downloaded the H36M data at least half a year ago. I'll re-download the annotations and check whether that solves the problem. ^ ^
Awesome!
Problem solved! Thank you! ^ ^
@mks0601 Hi, thank you for your great work. I had a problem while training the model in the first 'lixel' stage. I care more about the Human Pose and Mesh Estimation performance, and I've downloaded the Human3.6M (from your other project, 3DMPPE; it's the same data, right?), MSCOCO 2017 and 3DPW datasets with the links you provided and made them meet the requirements mentioned in README.md. I haven't downloaded the MuCo dataset, so I modified main/config.py like this:
Besides that, I modified the train_batch_size from 16 to 48. Then I tried to run main/train.py with no further modification of the config:
python train.py --gpu 0-2 --stage lixel
It runs on 3 TITAN RTX GPUs and everything looks fine at first, but nan loss occurs in epoch 0. (I modified the train_batch_size from 16 to 48, so the total number of iterations looks quite small.) You can see that loss_mesh_fit, loss_mesh_normal and loss_mesh_edge become nan first, and after that all losses become nan. I debugged the program and found that the weights of the relevant layers all become nan when this happens, so it is probably caused by the nan losses above, which then spread to the parameters of all layers through backpropagation.
I've tried several times, with different GPU numbers (0, 1, 2) and different batch sizes (8, 16, 32, 48), and it always happens at some point in the first epoch (0/13). I thought it might be due to some specific imgs, so I recorded, over several runs, the batch of imgs that seems to trigger the nan loss (simply logging their paths), but the batches don't seem to intersect. Here are 8 imgs from one attempt, recorded when the nan loss occurred, with 1 GPU and train_batch_size=8.
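For reference, roughly the kind of check I ran while debugging (a quick sketch of my own, not repo code): after each iteration, scan the loss dict and the network weights for nan.

import torch

def report_nan(model, loss_dict):
    bad_losses = [k for k, v in loss_dict.items() if torch.isnan(v).any()]
    if bad_losses:
        print('nan losses this iteration:', bad_losses)
    for name, param in model.named_parameters():
        if torch.isnan(param).any():
            print('nan weights in layer:', name)
            break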
I'm new to pytorch and HPE, and I'd appreciate your suggestion.