jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch
MIT License

Stuck at 1000th iteration and no checkpoint written #112

Open mrkiran16 opened 8 years ago

mrkiran16 commented 8 years ago

Hello,

I am trying to train torch-rnn (LSTM) on a 3 GB plain-text file of Wikipedia articles, with 3 layers and rnn_size = 512.

I started the training process, but it seems to be stuck at the 1000th iteration. htop shows the process running at 95% CPU usage, and no checkpoints have been written yet.

[screenshot of the training console output]

I am running on a server with 64 GB of RAM and an NVIDIA Titan X GPU.

Any idea about what the possible reason could be?

Thank you!

jcjohnson commented 8 years ago

By default it checks validation loss and saves a checkpoint every 1000 iterations; that is what is causing the hang here.

If your validation set is large then computing validation loss could take a while; saving a checkpoint might also be slow if the model is big and you are saving to some kind of network-attached storage.
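The interval itself is configurable when you launch training; something like the following raises it (the dataset paths, the cv/wiki prefix, and the value 5000 are just placeholders for your own setup):

    # Validate and write a checkpoint every 5000 iterations instead of the default 1000
    th train.lua -input_h5 data/wiki.h5 -input_json data/wiki.json \
      -model_type lstm -num_layers 3 -rnn_size 512 \
      -checkpoint_name cv/wiki -checkpoint_every 5000

Shrinking the validation split when preprocessing also makes each pause shorter, since less data has to be scored at every checkpoint.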

mrkiran16 commented 8 years ago

Yes, that was the problem. Thanks a lot @jcjohnson!

lukemunn commented 7 years ago

Actually, I'm having the same problem after switching to the OpenCL variant of Torch (https://github.com/hughperkins/distro-cl).

Torch hangs at the 1000th iteration and then eventually stops with an error saying a number cannot be NaN or infinity.

It's not related to dataset size or slow file writes: this is a small dataset (a 1.6 MB txt file) and I'm writing to my local hard drive.

This training was working perfectly on the 'vanilla' Torch. Any suggestions appreciated! :-)

Epoch 1.04 / 50, i = 1000 / 1292750, loss = 1.244825
val_loss = inf
/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Inf
stack traceback:
        [C]: in function 'encode'
        ./util/utils.lua:50: in function 'write_json'
        train.lua:233: in main chunk
        [C]: in function 'dofile'
        ...e/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
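From what I can tell, the crash itself is just where the bad value surfaces: write_json serializes the training history with lua-cjson, which refuses to encode NaN or Inf, so the infinite val_loss kills the run at exactly the checkpoint step. A rough reproduction (assuming lua-cjson, which the traceback points at; the stats table and the stringify workaround below are only for illustration, not a fix for the inf itself):

    -- Reproduce the serialization error with an inf value, as in the trace above
    local cjson = require 'cjson'

    local stats = { i = 1000, loss = 1.244825, val_loss = 1/0 }  -- 1/0 is inf in Lua
    local ok, err = pcall(cjson.encode, stats)
    print(ok, err)  -- false   Cannot serialise number: must not be NaN or Inf

    -- Illustrative workaround: stringify non-finite numbers before encoding.
    -- This only avoids the crash; it doesn't explain why the OpenCL backend
    -- produces an inf validation loss in the first place.
    for k, v in pairs(stats) do
      if type(v) == 'number' and (v ~= v or v == math.huge or v == -math.huge) then
        stats[k] = tostring(v)
      end
    end
    print(cjson.encode(stats))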

qwertystop commented 7 years ago

I am also using the OpenCL Torch that @lukemunn linked, but (at least so far) I have not gotten that error: after hanging for a while at 1000, it proceeded as expected. Currently it's hanging at 2000. I can provide my dataset if that would be helpful, but it's rather a lot larger than his (1.8 GB).