szagoruyko / wide-residual-networks

3.8% and 18.3% on CIFAR-10 and CIFAR-100
http://arxiv.org/abs/1605.07146
BSD 2-Clause "Simplified" License

Training usually doesn't start #44

Open max-reuter-2 opened 6 years ago

max-reuter-2 commented 6 years ago

I'm running this command:

    model=wide-resnet widen_factor=4 depth=40 dropout=0.3 ./scripts/debug_cifar.sh

Most of the time (80%+), the program will reach the point where it prints this:

    Network has 40 convolutions
    Will save at logs/wide-resnet_1639021580
    tput: No value for $TERM and no -T specified

...then it will do nothing. The other 20% of the time, it will begin training and printing out each epoch and its progress.

After a bit of debugging, I found that the stall occurs at engine:train in train.lua.

How can I fix this?

szagoruyko commented 6 years ago

hm, that's odd, can you remove tee and check the output?

max-reuter-2 commented 6 years ago

What do you mean by tee?

szagoruyko commented 6 years ago

https://github.com/szagoruyko/wide-residual-networks/blob/master/scripts/train_cifar.sh#L15
https://en.wikipedia.org/wiki/Tee_(command)
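
For context, tee copies whatever it reads on stdin to stdout and also to a file, so a pipeline ending in tee shows the output on the terminal while saving it to a log. A minimal sketch of that behavior, where the temporary file and the printed lines are placeholders rather than anything from the repo:

```shell
# Placeholder log path, not the one used by the repo's scripts.
log="$(mktemp)"

# tee copies its stdin to stdout and also writes it to the file.
out="$(printf 'epoch 1\nepoch 2\n' | tee "$log")"

echo "$out"   # what would appear on the terminal
cat "$log"    # the same lines, as saved in the log file
```

Dropping the `| tee ...` part leaves the command writing straight to the terminal, which should help rule out the pipe as the cause of a stall.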

max-reuter-2 commented 6 years ago

If what you mean is to change this line in train_cifar.sh:

    th train.lua | tee $save/log.txt

to this:

    th train.lua

then it is still stalling.

szagoruyko commented 6 years ago

hm, I'd assume that would be threads then, but these issues should have been fixed years ago. can you update threads and torchnet?

max-reuter-2 commented 6 years ago

I updated threads and torchnet, but I'm still getting the issue.

szagoruyko commented 6 years ago

@soumith maybe you've seen issues like that with latest lua torch?

soumith commented 6 years ago

lua-torch hasn't updated its packages since July 2017: https://github.com/torch/distro/commits/master

I'm not sure what changed.