Open · mrkiran16 opened this issue 8 years ago
By default it checks validation loss and saves a checkpoint every 1000 iterations; that is what is causing the hang here.
If your validation set is large then computing validation loss could take a while; saving a checkpoint might also be slow if the model is big and you are saving to some kind of network-attached storage.
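The periodic-evaluation pattern described above can be sketched as follows. This is a hypothetical Python illustration, not torch-rnn's actual code (which is Lua); the function names `train_step`, `eval_loss`, and `checkpoint_every` are stand-ins for illustration only.

```python
def train_step(i):
    # Stand-in for one optimization step; returns a fake loss.
    return 1.0 / i

def eval_loss():
    # Stand-in for a full pass over the validation set -- the slow
    # part when the validation set is large.
    return 0.5

def train(num_iterations, checkpoint_every=1000):
    """Run training, evaluating and checkpointing every N iterations."""
    checkpoints = []
    for i in range(1, num_iterations + 1):
        loss = train_step(i)
        if i % checkpoint_every == 0:
            # Training appears to "hang" here: validation (and, in the
            # real code, serializing the model to disk) happens inline,
            # so iteration i stalls until both finish.
            checkpoints.append((i, eval_loss()))
    return checkpoints
```

For example, `train(3000)` would pause for evaluation at iterations 1000, 2000, and 3000, which matches the stalls reported in this thread.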
Yes, that was the problem. Thanks a lot @jcjohnson!
Actually I'm having the same problem after switching to the OpenCL variant of Torch (https://github.com/hughperkins/distro-cl).
Torch hangs at the 1000th iteration, and then eventually stops with an error saying that a number must not be NaN or Inf.
It's not related to size or writing files because this is a small dataset (1.6MB txt file) and I'm just using my local hard drive.
This training was working perfectly on the 'vanilla' Torch. Any suggestions appreciated! :-)
Epoch 1.04 / 50, i = 1000 / 1292750, loss = 1.244825 val_loss = inf
/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Inf
stack traceback:
  [C]: in function 'encode'
  ./util/utils.lua:50: in function 'write_json'
  train.lua:233: in main chunk
  [C]: in function 'dofile'
  ...e/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
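The crash in that traceback comes from JSON serialization: strict JSON has no representation for NaN or Inf, so once `val_loss` becomes `inf`, `write_json` refuses to encode it. A minimal Python illustration of the same failure mode (torch-rnn's real code is Lua using cjson; this sketch, including the `sanitize` helper, is purely illustrative):

```python
import json
import math

def write_history(history):
    """Encode a list of losses to JSON, rejecting NaN/Inf the way
    strict encoders (such as the Lua cjson in the traceback) do."""
    # allow_nan=False makes Python's encoder raise ValueError on
    # NaN/Inf instead of emitting non-standard 'Infinity' tokens.
    return json.dumps(history, allow_nan=False)

def sanitize(history):
    # One possible workaround: replace non-finite losses with None
    # (JSON null) before encoding.
    return [x if math.isfinite(x) else None for x in history]

try:
    write_history([1.244825, float("inf")])
except ValueError:
    pass  # strict encoder rejects inf, as in the reported crash

print(write_history(sanitize([1.244825, float("inf")])))
```

So the underlying question in this report is why `val_loss` overflowed to `inf` on the OpenCL build in the first place; the JSON error is just the symptom.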
I am also using the OpenCL Torch that @lukemunn linked, but (at least so far) I have not gotten that error: after hanging for a while at iteration 1000, it proceeded as expected. Currently it's hanging at 2000. I can provide my dataset if that would be helpful, but it's rather a lot larger than his (1.8 GB).
Hello,
I am trying to train torch-rnn (LSTM) on a 3 GB plain-text file of Wikipedia articles, with 3 layers and rnn_size = 512.
I started the training process, but it seems to be stuck at the 1000th iteration. htop shows the process running at 95% CPU usage, and no checkpoints have been written yet.
I am running on a server with 64 GB of RAM and an NVIDIA Titan X GPU.
Any idea about what the possible reason could be?
Thank you!