Open max-reuter-2 opened 6 years ago
hm, that's odd, can you remove tee and check the output?
What do you mean by tee?
If what you mean is to change this line in train_cifar.sh:
th train.lua | tee $save/log.txt
to this:
th train.lua
then it is still stalling.
Hm, I'd assume that would be threads then, but these issues should have been fixed years ago. Can you update threads and torchnet?
I updated threads and torchnet, but I'm still getting the issue.
@soumith maybe you've seen issues like that with latest lua torch?
lua-torch hasn't updated its packages since July 2017: https://github.com/torch/distro/commits/master
I'm not sure what changed.
I'm running this command:
model=wide-resnet widen_factor=4 depth=40 dropout=0.3 ./scripts/debug_cifar.sh
Most of the time (80%+), the program will reach the point where it prints this:
Network has 40 convolutions
Will save at logs/wide-resnet_1639021580
tput: No value for $TERM and no -T specified
...then it will do nothing. The other 20% of the time, it will begin training and printing out each epoch and its progress.
After a bit of debugging, I found that the stall occurs at engine:train in train.lua.
How can I fix this?
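Not a fix, but since the stall is intermittent, it may help to measure how often it happens before and after each change (removing tee, updating threads and torchnet). Below is a minimal sketch, assuming bash and `timeout` from GNU coreutils are available. The helper names `lines_within` and `count_stalls` are hypothetical, and a run is treated as stalled when it prints no more than a given number of lines within the time limit. If the program buffers its output when piped, the line count may undercount, so treat the result as a rough indicator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the lines a command prints before a time limit.
# Assumes `timeout` from GNU coreutils is installed.
lines_within() {
  local limit="$1"; shift
  # timeout stops the command (and its process group) once the limit passes;
  # wc -l then reports how many lines were printed up to that point.
  timeout "$limit" "$@" 2>&1 | wc -l | tr -d '[:space:]'
}

# Hypothetical helper: repeat a command and count the runs that print no more
# than `header` lines within the limit, which is treated here as a stall.
count_stalls() {
  local runs="$1" limit="$2" header="$3"; shift 3
  local stalled=0 i n
  for ((i = 0; i < runs; i++)); do
    n="$(lines_within "$limit" "$@")"
    if [ "$n" -le "$header" ]; then
      stalled=$((stalled + 1))
    fi
  done
  echo "$stalled/$runs runs stalled"
}
```

For the command above, a call might look like `count_stalls 10 120 3 env model=wide-resnet widen_factor=4 depth=40 dropout=0.3 ./scripts/debug_cifar.sh`. The 3 matches the three lines printed before the stall. The 120-second limit and the 10 repetitions are arbitrary guesses, so adjust them to how long a healthy run takes to print its first progress line.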