karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

error in sample.lua: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109) #28

Open enicon opened 9 years ago

enicon commented 9 years ago

Hello, thanks for sharing this! I was trying to run some experiments, but I keep getting this error from sample.lua, and I would appreciate a hint about how to fix it:

[root@sushi char-rnn-master]# th sample.lua -gpuid -1 cv/lm_lstm_epoch9.57_nan.t7
creating an LSTM...
seeding with
/root/torch/install/bin/luajit: bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at /root/torch/pkg/torch/lib/TH/generic/THTensorRandom.c:109)
stack traceback:
        [C]: at 0x7f8149182b20
        [C]: in function 'multinomial'
        sample.lua:102: in main chunk
        [C]: in function 'dofile'
        /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
        [C]: at 0x00405800
[root@sushi char-rnn-master]#

I suspect that something is wrong with the training. I'm using all defaults, and the data .txt file is about 550KB. I'm passing -gpuid -1 for both training and sampling (no GPU).

Thanks!

(I know I should probably not be running as root...)

Taschi120 commented 9 years ago

Same problem over here; I can provide the snapshot if it helps at all.

Edit: Also, my training and validation losses all come up as NaN during training. Could that have something to do with the bug?

karpathy commented 9 years ago

It seems your dataset is very small. Can you try a smaller batch size, e.g. batch_size 10 or 20, and maybe also seq_length 50?
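
(For reference, assuming the standard train.lua flags, that suggestion corresponds to something like the command below; the data directory here is just a placeholder for your own dataset folder.)

```sh
th train.lua -data_dir data/your_dataset -batch_size 10 -seq_length 50 -gpuid -1
```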

Taschi120 commented 9 years ago

I have a 4.9MB input file and tried to run it with many different parameters - including the ones you just suggested. The error persists.

The same thing happens when I try to train based on the tinyshakespeare dataset with default parameters and -gpuid -1.

tjrileywisc commented 9 years ago

I'm seeing this too when testing the tinyshakespeare dataset with defaults. Training was broken at first; then, somehow, on another run the losses started updating (no NaNs), but sampling still wasn't working. I tried restarting training from scratch just to see if I could get sampling working, but now training is broken again.

If it helps in any way, I'm on the most recent version of Torch, Lua is 5.1.4, and the OS is CentOS 6.2.

hughperkins commented 9 years ago

The error means your data are NaN'ed. Two possible causes: the weights became NaN during training, or the cv snapshot file was somehow corrupted.
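
(A quick way to confirm that hypothesis is to check the network output for NaNs just before the multinomial call in sample.lua. This is only a sketch: the variable name `prediction` is an assumption about the local code, not a guaranteed name in the repository.)

```lua
-- hypothetical diagnostic: NaN ~= NaN, so ne(self) flags exactly the NaN entries
local nan_count = prediction:ne(prediction):sum()
if nan_count > 0 then
    print(string.format('prediction contains %d NaN entries', nan_count))
end
```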

tjrileywisc commented 9 years ago

Might have been the weights getting NaN'd, since I saw this happen during training on the same machine. I switched to an Ubuntu container (instead of CentOS) and the training and sampling worked there on several datasets without any further problems.

afoland commented 8 years ago

This is almost certainly due to: https://github.com/torch/torch7/issues/453

I found it crashed when the probability for one output went to 1.0000. (This explains the observed temperature dependence.) I verified that none of the inputs were NaN'ed or negative.

In sample.lua I simply subtracted 0.001 from any probability greater than 0.999 before passing it to multinomial, and that cured the crashes.

I have not tried the recommended step of changing the multinomial call to use double precision.
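
(A sketch of that workaround, expressed as a clamp of near-1.0 probabilities rather than a literal subtraction of 0.001. The variable name `probs` and the sampling line are assumptions about sample.lua, not the exact repository code.)

```lua
-- hypothetical workaround sketch: keep every probability strictly below 1.0
-- so float round-off can't push the distribution out of multinomial's valid range
local mask = probs:gt(0.999)
if mask:sum() > 0 then
    probs:maskedFill(mask, 0.999)  -- cap near-1.0 entries
    probs:div(probs:sum())         -- renormalize so the probabilities sum to 1
end
prev_char = torch.multinomial(probs:float(), 1):resize(1):float()

-- the alternative recommended in torch/torch7#453 is to sample in double precision:
-- prev_char = torch.multinomial(probs:double(), 1):resize(1):float()
```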

soumith commented 8 years ago

I'll try to get this resolved in the torch repo in the next week.

nkoumchatzky commented 8 years ago

@afoland @enicon Would you have a self-contained example I could use to test whether the proposed solution to torch/torch7#453 works in your case? Thanks!

afoland commented 8 years ago

Not very self-contained, I'm afraid: I'm using eidnes' word-rnn (heavily derived from char-rnn), run on one of my own datasets.

I'm a complete newcomer to GitHub; if there's a more sensible way to contact you than posting here, I'm happy to exchange more info and see if we can hash out a way to test this.