princeton-vl / pose-hg-train

Training and experimentation code used for "Stacked Hourglass Networks for Human Pose Estimation"

bad argument #6 to 'sub' #10

Closed: wydges closed this issue 8 years ago

wydges commented 8 years ago

Hi, anewell! After the latest commits, I have run into this error while training from scratch, at different epochs (4, 5, 12):

==> Starting epoch: 12/100
torch/install/bin/luajit: /opt/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 3 callback] pose-hg-train/src/util/img.lua:115: bad argument #6 to 'sub' (out of range at torch/pkg/torch/generic/Tensor.c:330)

Any ideas about what causes it and how it could be fixed? Thank you in advance.
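The message says that an index range passed to 'sub' fell outside the tensor's bounds. The sketch below shows one common way to avoid that class of error when cropping near an image border: intersect the requested window with the valid area and zero-pad the rest. It is written in Python on plain nested lists, not in the repository's Lua, and `clamped_crop` is a hypothetical helper, not the code in img.lua.

```python
def clamped_crop(image, top, left, size):
    """Crop a size x size window from a 2D list, zero-padding outside the image.

    Hypothetical helper for illustration; not the repository's crop function.
    """
    height, width = len(image), len(image[0])
    out = [[0] * size for _ in range(size)]

    # Intersect the requested window with the valid image area, so that
    # no index outside [0, height) x [0, width) is ever read.
    r0, r1 = max(top, 0), min(top + size, height)
    c0, c1 = max(left, 0), min(left + size, width)

    # If the window lies entirely outside the image, the ranges are empty
    # and the all-zero output is returned unchanged.
    for r in range(r0, r1):
        for c in range(c0, c1):
            out[r - top][c - left] = image[r][c]
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(clamped_crop(image, -1, -1, 2))  # window hangs over the top-left corner
print(clamped_crop(image, 2, 2, 2))    # window hangs over the bottom-right corner
```

Without the `max`/`min` clamping, a window that extends past the border would request rows or columns that do not exist, which is the kind of out-of-range request the error above reports.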

yxchng commented 8 years ago

I am having the same problem. Have you managed to figure it out?

wydges commented 8 years ago

@yxchng Not yet. I have this problem with both the 4-stack and 8-stack models. I have cuDNN 5.1 and CUDA 7.5, all running on Ubuntu 16.04.

anewell commented 8 years ago

As a temporary solution, try using the crop2 function instead (edit the call to crop in pose.lua). I'll try to get a proper fix up soon.

yxchng commented 8 years ago

@anewell Hi, crop2 is failing at epoch 37 with the same error.

anewell commented 8 years ago

Sorry you ran into that. I've looked into it further and made some modifications that should take care of the problem once and for all. I just pushed the update; let me know if it still has issues. I've also added a safeguard so that, on the off chance there is still a bug, it won't crash the whole training run.
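The thread does not say how that safeguard is implemented. As a rough sketch of the general idea, assuming the goal is to skip a sample whose loading fails instead of aborting the run, a guarded loader could look like the following. It is written in Python for illustration (in Lua the analogous mechanism is `pcall`), and `load_batch_safely` and `flaky_loader` are hypothetical names, not functions from the repository.

```python
def load_batch_safely(load_sample, indices):
    """Load each sample, recording failures instead of letting them propagate.

    Hypothetical helper for illustration; not the repository's safeguard.
    """
    samples, skipped = [], []
    for idx in indices:
        try:
            samples.append(load_sample(idx))
        except Exception as err:
            # Keep the index and message so the failure can be inspected later.
            skipped.append((idx, str(err)))
    return samples, skipped


def flaky_loader(idx):
    # Stand-in for a loader that fails on one particular sample.
    if idx == 2:
        raise IndexError("out of range")
    return idx * 10


samples, skipped = load_batch_safely(flaky_loader, [1, 2, 3])
print(samples)
print(skipped)
```

A guard like this trades a hard crash for a silently smaller batch, so logging the skipped indices matters: it is the only sign that the underlying bug is still there.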

yxchng commented 8 years ago

@anewell Just wondering, did you try running the code on your computer? My training actually gets stuck without any error (as if frozen). I'm not sure whether it is a problem with my server, though.

wydges commented 8 years ago

@anewell, thank you! For now everything seems to be running okay. Also, I didn't run into problems with the crop2 function, as @yxchng did.