sherjilozair / char-rnn-tensorflow

Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow
MIT License

Why is `embeddings` in `model.py` random, instead of one-hot? #4

Closed rhaps0dy closed 8 years ago

rhaps0dy commented 8 years ago

Hello,

If I understand correctly, the three-dimensional `inputs` tensor is built by looking up the n-th row of `embeddings` for each number in the two-dimensional `self.input_data` tensor. The rows of `embeddings` have the same size as the RNN's internal layers. This seems to be how the different characters are fed into the network.

The TensorFlow variable `embeddings` has nothing assigned to it explicitly, so it is drawn from a uniform distribution each time train.py is run. Why is that? I would have expected `embeddings` to be a matrix of one-hot row vectors encoding the different characters, which is then mapped to the internal layer by a weight matrix, as in https://gist.github.com/karpathy/d4dee566867f8291f086 .
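
For concreteness, here is roughly what I mean, as a minimal sketch (not the actual model.py code, written against a recent TensorFlow API, with made-up shapes):

```python
import tensorflow as tf

vocab_size, rnn_size = 5, 4  # illustrative sizes only

# input_data holds integer character ids, shape [batch_size, seq_length] = [2, 3]
input_data = tf.constant([[0, 3, 1],
                          [4, 2, 2]])

# The embedding matrix is just a trainable variable; with no explicit
# initializer it starts out random, which is what I am observing.
embedding = tf.Variable(tf.random.uniform([vocab_size, rnn_size], -1.0, 1.0))

# embedding_lookup picks row input_data[i, j] for every position, producing
# the three-dimensional tensor of shape [batch_size, seq_length, rnn_size]
# that is fed to the RNN.
inputs = tf.nn.embedding_lookup(embedding, input_data)
print(inputs.shape)  # (2, 3, 4)
```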

Also, printing embeddings at the end of every run, I notice that its value changes every time.

I would be very grateful if someone would explain to me what is going on here.

Yours truly, rhaps0dy

sherjilozair commented 8 years ago

Hi @rhaps0dy,

Both approaches are equally correct, and different implementations of the same model.

  1. @karpathy's approach is to have a fixed embedding which is multiplied by a weight matrix, before being used by subsequent layers. Since the embedding is one-hot, the matrix multiplication is essentially an indexing operation.
  2. In my approach, as well as in the approach taken in TensorFlow's PTB model (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py), we allocate an embedding matrix and then perform `embedding_lookup`, which simply indexes the embedding matrix with integer indices. This is algorithmically the same as multiplying by a one-hot vector, just more efficient.
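
To make the equivalence concrete, here is a small sketch (not the actual model code, written against a recent TensorFlow API; the shapes are arbitrary):

```python
import tensorflow as tf

vocab_size, rnn_size = 5, 4
ids = tf.constant([0, 3, 1, 4])                        # integer character ids
embedding = tf.random.uniform([vocab_size, rnn_size])  # learned in practice

# Approach 1: one-hot rows multiplied by the matrix (conceptually what the gist does).
one_hot = tf.one_hot(ids, depth=vocab_size)            # [4, vocab_size]
via_matmul = tf.matmul(one_hot, embedding)             # [4, rnn_size]

# Approach 2: direct indexing, which skips the multiply entirely.
via_lookup = tf.nn.embedding_lookup(embedding, ids)    # [4, rnn_size]

print(bool(tf.reduce_all(tf.equal(via_matmul, via_lookup))))  # True
```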

The values of `embeddings` change because we are learning the embeddings. The values of the weight matrix in the code you linked will also change during training. Neither implementation uses pre-trained embeddings, but you could do that with some small changes.

rhaps0dy commented 8 years ago

Hello @sherjilozair,

Ah, this makes a lot of sense. Also, you only have to learn `n_chars * rnn_size` parameters for the embeddings, instead of `n_chars * rnn_size + rnn_size` for the weights and biases in the one-hot approach, so there are fewer parameters to learn. (Did I understand that correctly?)
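
Just to check my arithmetic, a back-of-the-envelope count with example sizes (I am assuming a 65-character vocabulary and rnn_size = 128 here, purely for illustration):

```python
# Rough parameter counts under the assumed sizes above.
n_chars, rnn_size = 65, 128

embedding_params = n_chars * rnn_size            # the learned embedding matrix
one_hot_params = n_chars * rnn_size + rnn_size   # weight matrix + bias vector

print(embedding_params)  # 8320
print(one_hot_params)    # 8448
```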

Many thanks!

sherjilozair commented 8 years ago

In principle, yes, although the biases can always be removed, or the embedding matrix-multiply layer can be merged with the RNN's input layer, as is done in the gist you linked.