karpathy / char-rnn

Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch

Word Level Encodings #16

Open · YafahEdelman opened this issue 9 years ago

YafahEdelman commented 9 years ago

Would there be an easy way to add word-level encodings? One way I could think of to get something similar is to pre-compress individual words, but I'd be interested to know if there is a better / easier way to do this.

karpathy commented 9 years ago

This wouldn't be too difficult since most of the code doesn't know anything about characters, only about indexes. You'd have to modify the loader class to create word dictionaries instead, and that's about it.
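
Roughly something like this in the loader (an untested sketch; the function and table names are just illustrative). Everything downstream only ever sees the integer indices:

local function build_word_vocab(text)
    -- split on whitespace instead of iterating over characters
    local word_to_idx, idx_to_word, indices = {}, {}, {}
    for word in text:gmatch('%S+') do
        if not word_to_idx[word] then
            idx_to_word[#idx_to_word + 1] = word
            word_to_idx[word] = #idx_to_word
        end
        indices[#indices + 1] = word_to_idx[word]
    end
    -- the rest of the pipeline only needs these indices
    return torch.LongTensor(indices), word_to_idx, idx_to_word
end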

YafahEdelman commented 9 years ago

I've opened a pull request with a modified version that has the word-level encodings I wrote.

AjayTalati commented 9 years ago

@JacobEdelman - maybe you could test your code on the Penn Treebank dataset?

That would be interesting, and a nice way to check it works.

YafahEdelman commented 9 years ago

(Also responded in the issue thread.) Attempts to run on large datasets regrettably fail right now due to RAM usage. One dataset of only a few megabytes failed, and Lua returned an error saying it had tried to allocate 1365 gigabytes of RAM.

cbarokk commented 9 years ago

I have also transformed it into a "word-rnn", and I also see an explosion in RAM consumption (so it crashes). My vocab_size is 2.2k. I'm still trying to get my head around the code, and one thing I noticed is that LSTM.lstm seems to set the size of the input layer equal to vocab_size. In Andrej's case this is just 65, while in mine it's 2.2k. Perhaps this could explain it. I don't understand why the input layer size is proportional to vocab_size. Does anyone know?

YafahEdelman commented 9 years ago

I think that's because each neuron in the input layer corresponds to one vocabulary term. Normally these are just characters, but for words it can explode. I've gotten it to run for a vocabulary of size 7725 by setting -seq_length 5; after an initial spike to almost all my RAM, it settled down to around half. Perhaps we could avoid this spike? I'm not sure why a few thousand input neurons would take up so much RAM, even if copied a few times. Besides finding a way around the spike, the correct way to fix this might actually be to encode each word as a vector representation using some encoder, but I'm not exactly sure how to do this (especially in Lua, a language I'm not familiar with).

karpathy commented 9 years ago

This makes sense; it might be necessary to add an Embedding layer just before the RNN to embed the words into a lower-dimensional space. Since the LSTM operates linearly over its input x, linearly embedding the 1-of-K words into some embedding dimension D first corresponds to the original case you tried to get working here, except that the matrix is factored through a rank-D bottleneck. If that makes sense.

Also, it is common to keep track of the frequency of all words and discard words that appear, say, fewer than 5 times in the dataset.

Here's the embedding layer (you'd replace the OneHot layer with it): https://github.com/wojciechz/learning_to_execute/blob/master/layers/Embedding.lua. This, coupled with discarding infrequent words, should help. Here's a snippet from some code I wrote a while ago; it needs to go into the training script:

if one_hot_input then
    print('using one-hot for input...')
    protos.embed = OneHot(vocab_size):cuda()
else
    print('using an embedding transform for input...')
    protos.embed = Embedding(vocab_size, opt.input_size)
end

cbarokk commented 9 years ago

Obviously. I forgot about the OneHot encoding. What we need is word embeddings. Have a look here, here, or here.

cbarokk commented 9 years ago

If I understand correctly, the Embedding layer proposed by Andrej will learn to represent words during training, right? So at the end of training, if we feed words forward through the net and inspect the output of the Embedding layer, we will get D-dimensional vectors similar to those obtained with word2vec, for example.

AjayTalati commented 9 years ago

What about using nn.LookupTable? It seems to work for me.

It's the standard thing used in Torch for word-level language modelling; see the tutorial in the dp package, here.

YafahEdelman commented 9 years ago

So, I've tried using the Embedding module for word encodings but it keeps throwing an error when it first does the forward calculations for the criterion.

/usr/local/bin/luajit: bad argument #2 to '?' (out of range at /tmp/luarocks_torch-scm-1-6481/torch7/generic/Tensor.c:853)
stack traceback:
    [C]: at 0x7faa9dc18b50
    [C]: in function '__index'
    /usr/local/share/lua/5.1/nn/ClassNLLCriterion.lua:52: in function 'forward'
    train.lua:200: in function 'opfunc'
    /usr/local/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
    train.lua:251: in main chunk
    [C]: in function 'dofile'
    /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:131: in main chunk
    [C]: at 0x00406260

With my rudimentary knowledge of Lua, I'm having trouble figuring out what causes the error. I've updated my repository if you want to take a look at it.

YafahEdelman commented 9 years ago

I just noticed that the main repository was updated and its structure now conflicts with mine. I'll work on updating my fork and see what that does.

YafahEdelman commented 9 years ago

I updated it but it still has the same error.

YafahEdelman commented 9 years ago

I think the error is that the data is encoded by the network but never decoded, so the values the criterion tests against don't match up with the encoded data the network outputs. I'm not sure how to go about fixing this.

cbarokk commented 9 years ago

Jacob, I have done as Ajay suggests and used an nn.LookupTable. Now the code runs:

opt.i_size = vocab_size -- should be a parameter
if (opt.input_repr == 'one_hot_input') then
    print('using one-hot for input...')
    protos.embed = OneHot(vocab_size)
else
    opt.i_size = 50 
    print('using an embedding transform of size', opt.i_size)
    protos.embed = nn.LookupTable(vocab_size, opt.i_size)
end
print('creating an LSTM with ' .. opt.num_layers .. ' layers')
protos.rnn = LSTM.lstm(opt.i_size, opt.rnn_size, opt.num_layers, opt.dropout)

Also, I think your problem comes from WordSplitLMMinibatchLoader. In WordSplitLMMinibatchLoader.text_to_tensor, when you put your data into the data tensor, don't use a torch.ByteTensor: a byte can only hold values up to 255, while your vocab_size is way bigger. Putting larger ints into a byte buffer will overflow and give you some 0s, which I suspect is the source of your problem. Just inspect data after filling it and you'll see.
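
For example, something like this in text_to_tensor (untested; "words" and "word_to_idx" stand in for whatever your loader calls them):

-- use a tensor type that can hold indices larger than 255
local data = torch.LongTensor(#words)   -- not torch.ByteTensor(#words)
for i, word in ipairs(words) do
    data[i] = word_to_idx[word]         -- word indices can run into the thousands
end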

AjayTalati commented 9 years ago

I guess that now, if you wanted to, you could test this repo on the Penn Treebank dataset. The first experiment (the small model) in that repo uses no dropout regularization, so to reproduce it you would have to set opt.dropout = 0. It should run in about 45 minutes.

I'd be very interested to see whether you get the same results.

YafahEdelman commented 9 years ago

By default, dropout is also 0 here. I've finally fixed it: @AjayTalati was right, and I needed to set the embedding layer's output size to rnn_size. I'll try running it on the Penn Treebank dataset, though my computer is fairly slow at running the network since it lacks a GPU; I can at least see whether I get the same general results as a sanity check. Also, as of now char-rnn doesn't support explicitly given train, validation, and test datasets; it just takes in split ratios, but that shouldn't have any measurable effect on the output.

AjayTalati commented 9 years ago

It's possible to use nn.LookupTable to take in the indices of the last n words of the data and return their embeddings. There's a little description in the docs (the 2D tensor case). The only changes to the code you'd need to make are to load in the indices of the last n words, resize/flatten the output of the nn.LookupTable from batches of matrices of size n x opt.rnn_size into vectors, and then resize each vector to size opt.rnn_size with a linear layer (see the sketch at the end of this comment).

This gives an LSTM word-level model where the input is now a linear transform of the embeddings of the previous n words, instead of just the previous word. IIRC it requires n times as much memory, so the number of timesteps you can unroll has to be divided by n. I've seen some improvements in perplexity scores by doing this. It's basically a trade-off: sometimes it works, other times it's just overkill.
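
A rough sketch of the wiring I mean (untested; vocab_size, opt.rnn_size, and protos.embed are as in the snippets above, and n is just an example value):

local n = 3                                            -- how many previous words to feed in
local embed = nn.Sequential()
embed:add(nn.LookupTable(vocab_size, opt.rnn_size))    -- input: batch_size x n word indices
embed:add(nn.Reshape(n * opt.rnn_size))                -- flatten the n embeddings per example
embed:add(nn.Linear(n * opt.rnn_size, opt.rnn_size))   -- back down to the LSTM's input size
protos.embed = embed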

YafahEdelman commented 9 years ago

That sounds like a cool idea to add to the project; feel free to make a pull request to my repo or to the main one adding it. The word-level encodings I have set up work now. I've run it on the Penn Treebank data, but I notice a few problems. One is that the repo for the project you referenced only says the network is "small" and that it was run for 45 minutes on an unnamed computer. Another problem is that char-rnn doesn't calculate perplexity. Because of this I'm not sure how to compare the performance of my model and the one from the paper in an analytical way.
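
If the loss char-rnn reports is the average per-word cross-entropy in nats (it trains with ClassNLLCriterion on log-probabilities), then perplexity should just be its exponential, so a rough conversion for comparison would be something like:

-- sketch: turn the average per-word negative log-likelihood (in nats) into a perplexity
local function perplexity(avg_nll)
    return math.exp(avg_nll)
end
print(perplexity(4.6))   -- e.g. a loss of ~4.6 nats/word corresponds to a perplexity of ~100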

namp commented 9 years ago

I've made a fork of Graydyn's rnn (https://github.com/Graydyn/char-rnn) which parses the input at word level, and I made modifications to handle UTF-8 encoded input. You can find it here: https://github.com/namp/char-rnn

YafahEdelman commented 9 years ago

Now that I look at it, word2vec looks like an interesting possibility for encoding words as vectors.

Edit: Changed the link to a torch version. I'll look into using this.

cbarokk commented 9 years ago

I think using nn.LookupTable instead of the embedding layer would give you some equivalent to word2vec. Ajay?

AjayTalati commented 9 years ago

Hi Cyril,

sorry, I haven't tried word2vec, so I can't say how close it is to using nn.LookupTable. It's easy to use nn.LookupTable, and it's basically the same thing as the embedding layer.

@JacobEdelman - I'm not sure what you mean by encoding words to vectors? Encoding is a different thing from an embedding or a representation.

Anyhow, word2vec has good proven performance, and it's definitely an interesting idea to use it as preprocessing / a representation before the data goes into nn.LookupTable. That is, if I had the time I would try using it with nn.LookupTable, replacing the embedding in this repo.

Try asking Andrej? Maybe he's tried it, or knows someone who has?

Cheers, Aj

cbarokk commented 9 years ago

This post from NVIDIA mentions word2vec and then goes on to use a LookupTable to obtain word embeddings. What I will try when I have time is to train char-rnn on words with a LookupTable, and then probe the LookupTable for the word embeddings that have been learned. We can use cosine similarity to see which words get representations close to each other.
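
Something like this should do for the probing (untested sketch; it assumes protos.embed is the trained nn.LookupTable and that word_to_idx / idx_to_word come from the loader):

local W = protos.embed.weight                      -- vocab_size x embedding_dim matrix

local function cosine(a, b)
    return torch.dot(a, b) / (a:norm() * b:norm() + 1e-8)
end

-- print the k words whose learned embeddings are closest to the query word
local function nearest(word, k)
    local q = W[word_to_idx[word]]
    local scored = {}
    for i = 1, W:size(1) do
        scored[#scored + 1] = {idx_to_word[i], cosine(W[i], q)}
    end
    table.sort(scored, function(x, y) return x[2] > y[2] end)
    for i = 1, k do print(scored[i][1], scored[i][2]) end
end

nearest('monday', 5)   -- hopefully the other weekdays show up near the top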

panchishin commented 8 years ago

Making a character-level RNN able to transition toward a word-level RNN shares a problem space with compression, and it shares the same solution.

Say we have a corpus that contains only characters matching the regex /^[A-Za-z ,."]$/ and we train a character-level RNN on it, but we notice that it is English and the word "the " is seen very often (about 10% of the words in 'Alice in Wonderland' are 'the'). So we change the regex to /^([A-Za-z ,."]|the )$/ and effectively make a new "character" for 'the '. We are able to blend the dynamic nature of a character-level RNN with higher-level word matching and effectively compress our text dramatically. The question is "is it worth the trade-off for the word 'the '?", or more generally "is it worth the trade-off for the word X?" From compression and information theory we can borrow answers to these questions.

The character level is an n-gram of 1. Each character has a frequency of occurrence. The character sequence 'the ' is an n-gram of 4 and also has a frequency of occurrence. Adding 'the ' will compress the text by about 9% (we remove 'the ' and replace it with a single character) and adds 1 new n-gram, taking us from 56 (in our example there are 26+26+1+1+1+1 = 56 to begin with) to 57 n-grams, or approximately a 2% increase in representation. That is a good trade-off.

Suppose we see the word "robust" and want to know whether it is worth adding. If in our corpus the word "robust" constitutes 0.1% of the text, and adding it would mean going from 57 n-grams to 58 (approximately a 2% increase in representation), we can have a sense that the trade-off is probably poor.

We can generalize this by creating a single-pass n-gram lookup that tracks frequencies over our training corpus. An n-gram's frequency times its length minus 1 in characters is its compression value. Add the highest-compression-value n-grams to your starting vocabulary of A-Z, a-z, etc., one at a time, so long as the compression is worth the increase in vocabulary.
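
In code the scoring part is tiny. A sketch for a single n-gram length (untested; corpus_text is just whatever string you loaded, and you would repeat this for a few lengths and keep the best candidates):

-- count every n-gram of length n in one pass and score it by frequency * (length - 1),
-- i.e. roughly how many characters merging it into one "character" would save
local function best_ngrams(corpus_text, n, top_k)
    local counts = {}
    for i = 1, #corpus_text - n + 1 do
        local gram = corpus_text:sub(i, i + n - 1)
        counts[gram] = (counts[gram] or 0) + 1
    end
    local scored = {}
    for gram, c in pairs(counts) do
        scored[#scored + 1] = {gram = gram, value = c * (n - 1)}
    end
    table.sort(scored, function(a, b) return a.value > b.value end)
    for i = 1, math.min(top_k, #scored) do
        print(string.format('%q', scored[i].gram), scored[i].value)
    end
end

best_ngrams(corpus_text, 4, 10)   -- 'the ' should land near the top for English text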

You can use this method to test whether it is even worth having some of your starting vocabulary. For example, let's say your starting vocabulary is /^([a-z .,"]|<shift>)*$/, where <shift> is a marker meaning that the next letter is upper-cased. This vocabulary is only 26+1+1+1+1+1 = 31 n-grams, not 56. That is a serious reduction. You can then answer the question: is having the upper-case letters "X", "Y", or "Z" worth it? Probably the n-gram 'the ' is much more valuable to store than 'Z', but you'd have to check that against your corpus.

If you have enough RAM and depth to hold 100 n-grams, or 1000, you can set that as an upper limit. But from a cost trade-off perspective, it would not make sense to add an n-gram that is not worth its own value even if you have the space to hold it, because it will use more CPU and memory than it is worth.

The same trick of switching 'Z' for 'z' can be applied to all the special characters by changing them (depending on frequency) to their Unicode or URL escape sequences if their single-character representation is not "worth it". If you start by encoding all characters other than a-zA-Z as escape sequences and compress as above, you'll end up giving frequent symbols such as "?" and "." their own single characters again, leave the rest escaped, and have phrases like 'the ' compressed and represented as well.

kaihuchen commented 7 years ago

@panchishin @karpathy @JacobEdelman @AjayTalati @cbarokk @namp

We can look at the idea of expanding the "character set" (e.g., the example of treating the string "the" as part of the character set, as mentioned above) not just from the angle of information compression, but also as a better way to learn a good language model.

For example, once we have decided that the string 'Brooklyn Nets' occurs frequently enough to be worthy of being treated as a 'character' (more on this later), we will be able to discover meaningful probability distributions at a higher level (say, involving "Brooklyn Nets" and "Knicks", treated as atomic) which are otherwise somewhat obscured at the character level. With this we also get a way to learn to segment the text strings so that we can feed the reformatted strings into something like Word2vec to discover good word embeddings.

Why not just go directly for word-level encoding, as the first post suggests? Because: 1) the solution is not very general, e.g. you can't easily do that with languages such as Chinese which offer no explicit word boundaries; and 2) you lose the flexibility that comes from character-level encoding, so you can't easily account for new words or complex morphology. On the other hand, the "expanding character set" approach offers a smooth path from character-level to word-level encoding.

It should be clear by now that the term "character set" used above shouldn't be taken literally as being about characters. It is more a concept of allowing the dynamic addition of qualified composite entities (i.e., text strings) to the "character" set used for learning the language model. Considering that languages such as Chinese have a huge character set, no natural word boundaries (like the space character in English), and always require a separate, non-trivial segmentation process in order to work with Word2vec, this approach gives us a more general and flexible mechanism for learning the language model, one that can be better integrated with the likes of Word2vec for learning word embeddings.