karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Unable to sample checkpoint trained with OpenCL #52

Open csssuf opened 9 years ago

csssuf commented 9 years ago

It seems to be impossible to sample from a checkpoint that was trained with OpenCL, since sample.lua assumes the checkpoint was trained on either the CPU or CUDA. Explicitly setting gpuid when sampling an OpenCL-trained checkpoint falls back to CPU mode, because I do not have the CUDA packages installed (I am using an AMD card).
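To illustrate the fallback described here, the following is a minimal sketch of how a two-backend selection could end up on the CPU. It is not the code in sample.lua: the function `choose_backend`, its parameters, and the backend names are all hypothetical and exist only for illustration.

```python
def choose_backend(gpuid, opencl, installed):
    """Hypothetical backend selection (illustrative only, not sample.lua's code).

    gpuid:     device index, or a negative value to request the CPU
    opencl:    True if the user explicitly asked for OpenCL
    installed: set of available GPU backends, e.g. {"cuda"} or {"opencl"}
    """
    if gpuid < 0:
        return "cpu"
    wanted = "opencl" if opencl else "cuda"
    if wanted in installed:
        return wanted
    # The requested GPU backend is missing, so fall back to the CPU
    return "cpu"


# With no way to ask for OpenCL, an OpenCL-only machine ends up on the CPU
print(choose_backend(gpuid=0, opencl=False, installed={"opencl"}))
# With an explicit OpenCL switch, the same machine can use its GPU
print(choose_backend(gpuid=0, opencl=True, installed={"opencl"}))
```

Under these assumptions, the first call selects the CPU and the second selects OpenCL, which is the gap this issue is about.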

hughperkins commented 9 years ago

Will take a look...

hughperkins commented 9 years ago

Added an -opencl 1 option to sample.lua in https://github.com/hughperkins/char-rnn/commit/728d8cbcc04f0e1fa99d6b885faf9500b6905426 . Until it is merged, you can get it from https://github.com/hughperkins/char-rnn . Note that I'm currently getting NaNs out, but that might just be because I haven't trained for very long? (Edit: when I say NaNs, I mean I get an error about multinomial summing to <= 0, but that's because it sums to NaN.)
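As a rough illustration of why a NaN sum can surface as a "sums to <= 0" error, here is a small sketch. It assumes the sampler rejects any input whose sum is not strictly positive, which is what the error message suggests; `check_multinomial_input` is a hypothetical helper, not Torch's actual check.

```python
import math


def check_multinomial_input(probs):
    """Hypothetical precondition check (not Torch's real implementation)."""
    total = sum(probs)
    # Every ordered comparison with NaN is False, so "total > 0" fails
    # for a NaN sum just as it does for a zero or negative sum.
    if not total > 0:
        raise ValueError(f"probabilities sum to <= 0 (sum = {total})")
    return total


healthy = [0.25, 0.5, 0.25]
broken = [0.25, float("nan"), 0.25]  # one NaN poisons the whole sum

print(check_multinomial_input(healthy))
print(math.isnan(sum(broken)))
try:
    check_multinomial_input(broken)
except ValueError as err:
    print("rejected:", err)
```

If the check works this way, the message says "<= 0" even though the real cause is a NaN earlier in the computation.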

csssuf commented 9 years ago

I'm seeing that as well. Thanks for the fix!

hughperkins commented 9 years ago

Ok. I will dig a bit...

hughperkins commented 9 years ago

(By the way, do you get NaNs for train_loss during training, or only during sampling?)

csssuf commented 9 years ago

I also get NaNs for train_loss during training.

hughperkins commented 9 years ago

Ah, OK. That's different from what I see. But I do have access to an AMD card, which gives NaNs. Anyway, I will dig a bit...

hughperkins commented 9 years ago

Seems there are two issues:

- save/load doesn't currently work on cltorch => I will fix this now
- NaNs on AMD during training => I will look at this after fixing save/load

hughperkins commented 9 years ago

save/load now works => sample works ok now

(The NaNs during training on AMD are a different issue, which I still need to address.)

hughperkins commented 9 years ago

(Note: you have to update to the latest version of cltorch, i.e. commit 48ca96fac or above. I guess you can just type something like luarocks install cltorch to upgrade it?)

csssuf commented 9 years ago

Seems to be working for me, and I'm no longer seeing the NaNs during training, either. Thanks again!

hughperkins commented 9 years ago

> I'm no longer seeing the NaNs during training, either.

Oh! Interesting! :-)

hughperkins commented 9 years ago

Hmmm, right :-) No more NaNs on the AMD device here either :-)

hughperkins commented 9 years ago

Can we leave this open for now, in case other people encounter the same issue?

hughperkins commented 9 years ago

Thanks :-)

hughperkins commented 8 years ago

I think this can be closed now, since the change has been merged for a while now.