wrapperband opened this issue 9 years ago
Are you sure you're not somehow generating negative indexes, e.g. by storing a number larger than 127 in a signed byte?
I'm pretty sure I wouldn't even know how to do that? (store a number over 127 in a signed byte)
I tested this sort of problem with char-rnn by removing text from a large training file that failed, until the training no longer failed, which let me find the offending text. I then deleted all the French, because the failing text had worked previously.
One point worth noting is that it had previously worked with the same text in a smaller training file, which is why I thought it was a format error, i.e. sometimes two characters are interpreted as one larger / illegal number...
I'm pretty disabled, so I have to limit what I can do, and it is disappointing to be diverted when I was already struggling to optimise the size and length settings of char-rnn for my use case. That is a big job in itself, and one that could be "self aligned" to the data.
One of the major things apparent was the way char-rnn was learning the format of my training data / use case. In my case, despite the largest memory, the most layers and big data, and unlike the folk-format example with "abc", the current char-rnn version seemed to rapidly "unlearn" the long-term structure. In the end it could not learn anything longer than a line, i.e. I was doing bug fixing / testing, not ML.
Whilst I don't mind testing, since I seem to be suffering from a number of "known issues", I still have all my data ready to retry on future versions or fixes. I am a bit limited in what I can do physically.
You probably want to sprinkle print statements liberally in the code, to find out more about what is happening. For example, what is the vocabulary size? ... oh, after checking, it seems it's printed already: https://github.com/karpathy/char-rnn/blob/master/train.lua#L114 What vocabulary size are you seeing?
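As a concrete example of that kind of debugging print, here is a hedged sketch that could be pasted into util/OneHot.lua's updateOutput, just before the :index() call that appears in the traceback later in this thread; the local name 'input' is an assumption about the variable used there:

```lua
-- Hedged sketch: a debug check pasted inside util/OneHot.lua's updateOutput,
-- before the :index() call that shows up in the crash traceback.
-- 'input' is the standard nn-module argument name; treat it as an assumption.
print('OneHot input min/max:', input:min(), input:max())
assert(input:min() >= 1,
       'index below 1 reached OneHot -- looks like a wrapped / negative label')
```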
I can look into it (char-rnn) more after the holidays, but for example (thanks for the info re: print):
input text 2.9 MB, data 2.9 MB, vocab 2.4 kB
input text 6 MB, data 6 MB, vocab 2.4 kB
Basically, if the vocabulary size is more than 255, then the input values are probably wrapping around. If they're being stored in a signed byte, they will wrap around if the vocabulary size is more than 127. The file sizes could probably be used to derive the vocabulary size, but it would be better to find out what the program prints when you run it.
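To illustrate the wrap-around, here is a minimal sketch in plain Lua; the helper is hypothetical and just simulates a two's-complement signed-byte store:

```lua
-- Minimal sketch: how a label above 127 would wrap around if it were
-- stored in a signed 8-bit integer.
local function as_signed_byte(v)
  local b = v % 256                    -- keep only the low 8 bits
  return (b < 128) and b or (b - 256)  -- reinterpret as two's-complement
end

print(as_signed_byte(114))  -- 114: a vocab of 114 fits comfortably
print(as_signed_byte(200))  -- -56: wraps to a negative index
print(as_signed_byte(300))  --  44: wraps and silently points at the wrong symbol
```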
Cheers for that. I've moved to the new rig, with the R9 280, which has been flaky. I'll copy over a checkpoint from the old PC, and run it.
Char-rnn does echo the vocab size to the screen ...
Vocabulary = 114
for the 2.9 MB input data, with 700 nodes and 4 layers.
It crashed with: "The character vocabulary in this dataset and the one in the saved checkpoint are not the same. checkpoint_vocab_size = 113"
The new PC / AMD drivers and char-rnn were more flaky; I couldn't start a 700-node, 4-layer network on the new PC with the R9 280, or smaller ones actually. I ran various input text files through with init_from and never saw that error before. It was my last show-stopper that I thought some of the issues might fix.
i.e. I need to do more work on that PC and reinstall Linux / AMD drivers from scratch. I was going to do that again once a couple of the "issues" with char-rnn, especially memory, are sorted. I am interested to try beam search with my "dodgy" checkpoints.
A vocabulary of 114 should be fine. As long as it doesn't go above 127, you're almost certainly OK from a vocab-size point of view. Actually, even up to 255 might be fine, but someone would need to check more carefully, i.e. is the label stored in a signed or unsigned byte?
I'll copy over a checkpoint from the old PC, and run it.
Note that I'd be careful about copying checkpoints from one machine to another, and especially between different GPUs and different CPU architectures, since the endianness might be different. I'm not sure which situations will definitely cause problems, but it might be easier just to recreate new checkpoints, so as not to have to think about this.
The character vocabulary in this dataset and the one in the saved checkpoint are not the same.
Recreating the checkpoint would also solve this issue.
The new PC / AMD drivers and char-rnn were more flaky, i couldn't start a 700 4 layer on the new pc with the R9 280, or smaller ones actually.
Hmmmm, error messages?
I've been getting the error below with one particular (1.2MB) text file.
I have traced it back (by halving the file) to some French text that was in the file (i.e. it works as soon as I delete the French text). I'm using UTF-8, which I thought might be the problem, but shouldn't that allocate a number for each character, and so work best for languages with other characters?...
stack traceback:
  [C]: in function 'index'
  ./util/OneHot.lua:18: in function 'func'
  /home/myhome/torch/install/share/lua/5.1/nngraph/gmodule.lua:289: in function 'neteval'
  /home/myhome/torch/install/share/lua/5.1/nngraph/gmodule.lua:324: in function 'forward'
  train.lua:262: in function 'opfunc'
  /home/myhome/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
  train.lua:306: in main chunk
  [C]: in function 'dofile'
  ...myhome/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
  [C]: at 0x00405ea0
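For what it's worth on the UTF-8 point: Lua string operations work on bytes, so if the loader splits the text byte by byte, each accented French character contributes two separate symbols to the vocabulary, which can both inflate the vocab size and make it differ between datasets. A minimal sketch:

```lua
-- Minimal sketch: Lua strings are byte sequences, so a byte-wise splitter
-- sees one accented UTF-8 character as two separate symbols.
local s = "é"                -- U+00E9, two bytes in UTF-8
print(#s)                    -- 2
for c in s:gmatch('.') do    -- '.' matches single bytes in Lua patterns
  print(string.byte(c))      -- 195, then 169
end
```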