keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

is the Sequence to Sequence learning right? #395

Closed EderSantana closed 8 years ago

EderSantana commented 8 years ago

Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no way to pass a mask to the objective function. Won't this bias the cost function?

fchollet commented 8 years ago

We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values?

You could use a mask to hide your padded values from the network. Then you can discard the masked values in your sequence output. Currently masking is only supported via an initial Embedding layer, though. See: http://keras.io/layers/recurrent/
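
For reference, a minimal sketch of masking via an initial Embedding layer, written against the current Keras API rather than the 2015-era one used elsewhere in this thread; all the sizes are made up:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, embed_dim, hidden_dim = 10000, 64, 128   # hypothetical sizes

model = Sequential()
# mask_zero=True reserves index 0 for padding and emits a mask that downstream
# layers use to skip the padded timesteps.
model.add(Embedding(vocab_size, embed_dim, mask_zero=True))
model.add(LSTM(hidden_dim))                          # padded steps do not update the final state
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')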

cc13ny commented 8 years ago

I'm a little new to recurrent networks. When Eder talked about the sequence to sequence map, it reminded me of the char-level LSTM (http://karpathy.github.io/2015/05/21/rnn-effectiveness/). In this case, even if we can discard the masked values in the sequence output, the padding values still have an effect on the parameters of the model itself. So is it enough to just discard the masked values? Again, as Eder has asked, won't this bias the cost function?

ghost commented 8 years ago

Maybe issue #382 is of interest to you.

simonhughes22 commented 8 years ago

This worked for me: pad the inputs and the outputs, add special sequence-start and sequence-stop symbols to book-end each sequence, and then use the following model structure:

from keras.models import Sequential
from keras.layers.core import Dense, Activation, RepeatVector, TimeDistributedDense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import JZS1  # Keras 0.x-era layer paths

embedding_size = 64
hidden_size = 512

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size))
model.add(JZS1(embedding_size, hidden_size)) # try using a GRU instead, for fun
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(MAX_LEN))
model.add(JZS1(hidden_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, max_features, activation="softmax"))

model.compile(loss='mse', optimizer='adam')

If you have a sequence stop symbol, it should learn when to stop outputting non-zero values, and will output zeros thereafter. May not be ideal, but works within the current framework. I also tried replicating the output to the maxlen width (including the stop symbol) during training, and then just took the first valid sequence at test time.
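
For reference, a minimal sketch of the book-ending idea: reserve ids for the start and stop symbols, wrap each sequence, then zero-pad. The ids and MAX_LEN here are made up; pad_sequences is the Keras utility for the padding step.

from keras.preprocessing.sequence import pad_sequences

START, STOP, MAX_LEN = 1, 2, 10          # 0 is reserved for padding

def book_end(seq):
    return [START] + seq + [STOP]

raw = [[5, 7, 9], [4, 8, 6, 3, 11]]      # toy sequences of word ids
x = pad_sequences([book_end(s) for s in raw], maxlen=MAX_LEN)
print(x.shape)                           # (2, 10)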

simonhughes22 commented 8 years ago

btw, JZS1 is an RNN like a GRU or an LSTM, each of which could be used here instead in both the encoder and decoder.

EderSantana commented 8 years ago

Sounds like a good idea, but note that you are forcing your model to learn something else instead of your original problem. I think I know how to solve this problem with a masking layer after each regular layer, and by allowing the loss to be a custom function. That way, instead of averaging the cost with .mean(), we divide it by the number of non-zero elements in each time series. This indiscriminate averaging is where a lot of the bias comes from as well.

@fchollet is there any chance that we can get "loss" to check if its input is callable? I really didn't want to have to add new stuff to my code repo every time I needed something custom made. I can write a PR for that if there is a general interest. Let me know.

EderSantana commented 8 years ago

PR #446 is relevant here. Now all that we need is that the cost functions ask for that mask in their calculations.

gautamb85 commented 8 years ago

@simonhughes22 Hello, I have been working with your code snippet (from a previous discussion, with example/toy data). While the code works, I am not sure it is doing what it is supposed to. I am new to Keras and deep learning, so please bear with me.

As far as I understand, the idea is to have an 'encoder' process sequence X, and after the last time step of the sequence is processed, a 'decoder' starts to predict the new sequence Y. In the code you provided, how does the model know that sequence X is complete and now it should start predicting Y?

simonhughes22 commented 8 years ago

@gautamb85 the inputs are padded, the encoder RNN simply goes along the entire input, updating its hidden state accordingly until it hits the end of the array, and then outputs a vector. That is then fed into the decoder. Btw, they just added masking to the loss function (see the recent commit history), so I'd make sure you are masking the loss function (you'll have to dig through the code or documentation to figure that out).

gautamb85 commented 8 years ago

@simonhughes22 Thanks for the reply. I had a couple more questions.

  1. In your code there is a dense relu layer between the hidden layers of the encoder and decoder. Is this needed? (don't they just connect encoder hidden layer to decoder)
  2. What exactly is the RepeatVector part of the code doing?
  3. In the Sequence to Sequence paper, they make explicit use of an 'end of sequence' symbol. The way I understand it (as you mentioned), if one is included, the RNN should learn to stop predicting non-zero values. Does there need to be an explicit provision in the code that (during training) looks for this end-of-sequence symbol and tells the system that this is the hidden state to feed to the decoder, or is the model just naturally set up to do this? I am confused because I don't understand why the model will not make a prediction for sequence Y for EACH input of sequence X, rather than wait until the whole of X has been processed?

I would greatly appreciate clarification regarding these points.

simonhughes22 commented 8 years ago

  1. The RepeatVector is repeating the final output vector from the encoding layer as a constant input to each timestep of the decoder. This is how the example works for the image captioning, so I copied the code for this.
  2. Just make sure your training data has the end-of-sentence symbol as an additional word stuck on the end of each sentence. You should also use mask_zero=True in the embedding layer and cost function. The idea is that it processes the whole of x to produce a single vector representation of the sentence, and then uses that to generate an output sequence. That's the sort of model you want when the sequences (x and y) are of different lengths, such as a translation model. If you instead want to build a word tagger, such as a POS tagger where there is a one-to-one mapping of input to output, you can use a simpler model of an embedding layer + an RNN of some kind (return_sequences=True) + a TimeDistributedDense layer. That's all you would need for that. Your ys would be 3D: rows, columns, timesteps. The columns would be a probability distribution over the classes, or a binary encoding if you can have more than one output label per word.
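
For reference, a sketch of that simpler tagging model written against today's Keras API (TimeDistributed(Dense(...)) is the modern spelling of TimeDistributedDense); all the sizes here are made up:

from keras.models import Sequential
from keras.layers import Embedding, GRU, TimeDistributed, Dense

vocab_size, embed_dim, hidden_dim, n_tags = 20000, 64, 128, 45    # hypothetical sizes

model = Sequential()
model.add(Embedding(vocab_size, embed_dim, mask_zero=True))       # integer word ids in, vectors out
model.add(GRU(hidden_dim, return_sequences=True))                 # one hidden state per input timestep
model.add(TimeDistributed(Dense(n_tags, activation='softmax')))   # per-word distribution over tags
model.compile(loss='categorical_crossentropy', optimizer='adam')
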
EderSantana commented 8 years ago

@simonhughes22 @gautamb85 Did you guys try out the cost function masking proposed in #451?

wxs commented 8 years ago

I just mentioned this over in #451, but can't you use the sample_weight parameter to fit() and pass in 0 weight to the meaningless outputs?
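
For reference, a toy sketch of that route, assuming a Keras version (e.g. 1.x/2.x) whose compile() accepts sample_weight_mode='temporal' so that fit() takes one weight per timestep; the sizes and data are made up.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab, n_classes, max_len = 50, 10, 8                      # hypothetical sizes
model = Sequential()
model.add(Embedding(vocab, 16, mask_zero=True))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')               # one weight per (sample, timestep)

x = np.random.randint(1, vocab, size=(4, max_len))
x[:, :3] = 0                                               # pretend the first 3 steps are padding
y = np.eye(n_classes)[np.random.randint(0, n_classes, size=(4, max_len))]
w = (x != 0).astype('float32')                             # zero weight on the padded steps
model.fit(x, y, sample_weight=w, epochs=1, verbose=0)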

benjaminklein commented 8 years ago

Hey,

I'm new to keras and I have a simple question:

Why do you use the mse objective in model.compile(loss='mse', optimizer='adam')? Wouldn't it be more appropriate to use categorical_crossentropy, since you are using softmax in model.add(TimeDistributedDense(hidden_size, max_features, activation="softmax"))?

simonhughes22 commented 8 years ago

@benjaminklein for a while there was an issue in theano + keras with the binary cross entropy and this kind of model, so I used MSE instead. Now I use binary cross entropy as I have a multi-class multi-label classification problem (each word can have 0 to many classes). If you have a more vanilla multi-class problem then yes, categorical_crossentropy should work better. However, I haven't tested that on TimeDistributedDense output, so you'd need to verify it expects a single category per output token, not across all tokens, as the output is 3D not 2D, so you have len(tokens) categorical cross entropy calculations. Binary cross entropy just works per label; it doesn't compute a distribution over labels, so for that the output shape doesn't matter.

benjaminklein commented 8 years ago

@simonhughes22 Thank you! Also, could we change your code to use Graph instead of Sequential, so that we have two inputs: one for the first sequence and another for the second sequence?

The motivation is that in many papers about encoder-decoders, the decoding phase uses both the last hidden layer from the first sequence and the previous word of the second sequence. In your code we are only using the last hidden layer from the previous sequence.

gautamb85 commented 8 years ago

@EderSantana @simonhughes22 Have not yet had a chance to try it. Will let you know as soon as I do. @fchollet I have been reading some keras posts about 'stateful' RNNs. If I understand correctly, the hidden state of the recurrent layer is reset for every time_step of a sequence. (this appears to be the case based on outputs_info for various classes in recurrent.py)

  1. Would this be an issue in the sequence-to-sequence paradigm? How does the resetting of the state affect the summary (final hidden state) of sequence 1 ?
  2. In my case data is being provided in chronological order i.e. a grapheme sequence, so I would guess that the previous state would matter?
  3. If I were writing my own LSTM class, what would be the easiest way to make it 'stateful'?
simonhughes22 commented 8 years ago

@gautamb85 I thought the issue is that it is reset for every row of input, meaning you can't test it easily by feeding its output predictions as inputs (you can, but you are re-feeding the entire input sequence plus the latest prediction for every subsequent time step, which is a little slower than if stateful). If the RNN reset itself between timesteps it wouldn't work; AFAIK it maintains state across a row (sequence of timesteps) but resets itself once the row is processed. You can make it take its output as input as I described, it's just slower than models that allow you to remember the state following a prediction.

Note for the example above, I am reading in one sequence, converting to a hidden state, and then predicting a whole second sequence, so you don't have the issue I mention here. However, you may get better results by training a model to predict the next word instead of the next sentence, and feeding each predicted word in as input to predict the next word to generate a sentence.

simonhughes22 commented 8 years ago

@benjaminklein as it's predicting a full sequence as output, it is remembering the previous word as it predicts the output sequence via retaining its hidden state across the input sequence. What is repeated is the encoded representation of the entire previous sentence, but for each word it is predicting, it is also feeding in the hidden state from the previous word as that's how RNNs work. What you are describing is a slightly different type of model where you are predicting the next word and not the next sentence, and then adding that to the existing input and making a new prediction. You can do this with Keras too very easily, just remove the last 3 layers from the model above and train it to predict the next word (or character). You'll have to write a bit of code to feed the output back in as input though.
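
For reference, a sketch of that feed-the-output-back-in loop at generation time. It assumes a hypothetical next_word_model that maps a padded id sequence of shape (1, max_len) to a probability distribution over the vocabulary, plus a stop id as discussed above; this is plain greedy decoding, not something Keras provides itself.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate(next_word_model, seed_ids, max_len, stop_id):
    seq = list(seed_ids)
    while len(seq) < max_len:
        x = pad_sequences([seq], maxlen=max_len)           # zero-pad the growing prefix
        probs = next_word_model.predict(x, verbose=0)[0]   # distribution over the vocabulary
        nxt = int(np.argmax(probs))                        # greedy choice of the next word
        if nxt == stop_id:
            break
        seq.append(nxt)
    return seq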

gautamb85 commented 8 years ago

@simonhughes22 Thank you for the clarification. That was my impression. (How could it even work if it was being reset between time steps?) I am still a little confused about testing it in generative mode.

You touched on exactly what I am trying to do, and I am hoping you could help me on some of my concepts.

Taking the machine translation problem in Sutskever's paper as an example, first an English sentence (sequence 1) is converted to a hidden state, then the 'decoder' starts to predict each word of the French sentence (sequence 2). Thus the translated sentence is generated word by word. Is this correct? In this case, the 'decoder' is essentially a conditional language model (word level), conditioned on the English sentence, i.e. sequence 1. Thus the target for a given timestep is the input for the next one. For training such a model, after reading in sequence 1, the decoder is provided with the 'true' French word at each timestep (not the prediction). At test time (as you mentioned), in order to feed the prediction back in to predict the next French word, the entire sequence (English + French?) would need to be read, in order to predict the next word.

If all that is correct, would I need a graph structure? A pseudo code / flowchart description of the network architecture would be much appreciated.

Thanks in advance!

simonhughes22 commented 8 years ago

@gautamb85 no, you can use the model I listed above with the English sentence as input and the entire French sentence as output. The RNN model will maintain state across each timestep as it predicts the output sentence; no extra work is required on your part. You will however need to one-hot encode and zero-pad the output sequence (the French sentence) and have it do a softmax over all possible words for the output at each time step. The ys then are 3D: each row is a matrix of height = number of French words and width = number of time steps.
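
For reference, a small numpy sketch of building those 3D targets, using the (samples, timesteps, vocabulary) ordering that the thread settles on further down; the ids and sizes are made up.

import numpy as np

def to_3d_targets(padded_ids, vocab_size):
    # (n_samples, max_len) array of word ids, 0 = pad  ->  (n_samples, max_len, vocab_size) one-hot
    n_samples, max_len = padded_ids.shape
    y = np.zeros((n_samples, max_len, vocab_size), dtype='float32')
    for i in range(n_samples):
        for t in range(max_len):
            y[i, t, padded_ids[i, t]] = 1.0
    return y

french = np.array([[4, 9, 2, 0, 0],      # two toy output sentences, zero-padded to length 5
                   [7, 3, 8, 6, 2]])
print(to_3d_targets(french, vocab_size=10).shape)   # (2, 5, 10)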

When I mentioned feeding the output as input, that's only if you want to train a language model to predict a word at a time, and then use that to generate text as a generative model. However, as in the example above, you can have it generate whole sequences for you from each input sequence.

gautamb85 commented 8 years ago

@simonhughes22 I have a model training (graphemes to phonemes). I am writing a beam search to see if it's actually learning something useful. In Sutskever's paper, the decoder is described as a conditional language model (conditioned on the previous sentence). In the model you proposed, what is the input to the decoder RNN at each time-step? If I wanted to design a model where, for every time step of the decoder, the input is the target of the previous timestep (generating the output sequence word by word), what modifications to the model would I need to make?

simonhughes22 commented 8 years ago

@gautamb85 I think you may be misunderstanding how that model is built. Each time step is a word (although it could be a character, a phoneme or whatever). However, each row is an entire sequence, zero-padded to the left to the length of the longest sequence, and also each output row is a sequence, zero padded.

If you want a more traditional RNN like model, look at the Passage repo, however, that has less functionality and can't be used for tagging models (unless you predict word by word as described). When I say tagging models, I mean a one to one mapping of word or character to a tag.

benjaminklein commented 8 years ago

@simonhughes22 When running your code I'm getting: /usr/local/lib/python2.7/dist-packages/theano/gof/cmodule.py:293: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility rval = import(module_name, {}, {}, [module_name])

Any ideas?

Thank you!

simonhughes22 commented 8 years ago

It's a warning.... I've had that before, but I haven't noticed any issues. Theano is very hard to debug, so I am not able to figure out if that is serious, but every model that gave me that warning was still able to learn effectively on the data. Maybe @fchollet can shed more light on it.

gautamb85 commented 8 years ago

@simonhughes22 Firstly, thank you again for all your help so far. I was confused about the model as I thought that training was being done by teacher forcing, i.e. the actual targets were being fed to the decoder at each time step. I guess the model could be trained that way, but it would not generalize as well.

I am trying to replicate the approach (standard sequence to sequence task) from here : http://arxiv.org/abs/1506.00196

I took your model and replaced the last layer:

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size, mask_zero=True))
model.add(GRU(embedding_size, hidden_size))
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
model.add(GRU(hidden_size, hidden_size, return_sequences=True))
model.add(Dense(hidden_size, phn_output))
model.add(Activation('softmax'))

This gives the prediction at each time step.
I was able to train the model. Though, after a certain number of training iterations, the loss started oscillating (not sure how to interpret / prevent that). For the first few iterations even the accuracy and loss on the validation set behaved properly. I think it may have to do with the size of my hidden layers, etc. I also only took 30000 sequences for training.

I am now trying to figure out how to do a beam search to find the best phonetic transcription of a test grapheme sequence.

I am a little confused about how to go about setting this up. Any advice would be much appreciated.

simonhughes22 commented 8 years ago

@gautamb85 if you want to combine this approach with a re-ranking solution (for instance using beam search), I would have the model just output the probabilities over the phonemes for every time step. You don't need to feed the predictions in at each time step; for each input, the model will output a probability distribution over every phoneme for every time step. You can then use that table of probabilities to do something like beam search (although I'd recommend also trying a dynamic programming approach - see how CRF models work, you can fit this model into that framework). However, as the model is keeping track of its previous prediction across each time step, this should not be necessary. Note that for each input, the model will iterate over each timestep and output a prediction, giving you a list of predictions per timestep without you needing to feed them back in. However, doing so may or may not improve matters.
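
For reference, a toy sketch of a beam search over that table of per-timestep probabilities (the shape model.predict gives you for one input). Because these are already per-step marginals, the beams are scored independently at each step, so this reduces to a top-k per timestep; a full beam search would re-feed each prefix through the decoder.

import numpy as np

def beam_search(prob_table, beam_width=3):
    # prob_table: (timesteps, vocab) probabilities for one input sequence
    beams = [([], 0.0)]                                   # (ids so far, log-probability)
    for step_probs in prob_table:
        candidates = []
        for seq, score in beams:
            for idx in np.argsort(step_probs)[-beam_width:]:
                candidates.append((seq + [int(idx)],
                                   score + np.log(step_probs[idx] + 1e-12)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

probs = np.random.dirichlet(np.ones(6), size=4)           # toy: 4 timesteps, 6 phonemes
print(beam_search(probs)[0])                              # best-scoring phoneme sequence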

In terms of the oscillating errors, that normally means that the learning rate is too high and the model is having trouble converging. I'd advise using something like adam or adagrad to optimize as these approaches are very good at setting and adjusting the learning rate for you, and I've never had an unstable model with these approaches. That said, I've hit points where the test set performance oscillates, and that often means the accuracy may not improve further. However, at this point, the training accuracy is normally still improving, but the model is starting to overfit.

The oscillating may be due to dropout also. I'd advise not using dropout until you've got a model that does very well on the training data, and then experimenting with it to reduce over-fitting. My dataset is pretty noisy, which I think is regularizing the model to some degree, so dropout did seem to hurt more than help, but normally it is advantageous. But first you want to get your model to overfit very well on the training data. That confirms the model can learn effectively on the data. Then start to address over-fitting.

cjmcmurtrie commented 8 years ago

Hi Simon,

You seem to have had success with a sequence to sequence NLP task. I'm struggling and wondered if you could help.

I have a data set of sentences, which are sequences of Socher's pre-trained word vectors. These sequences map 1-to-1 to another series of vectors for each sentence that I have calculated to describe some aspects of the sentences. These vectors are such that their values add to one. For instance, an example mapping would be:

 wordvectors["headache"] (50 dim vector) -->> 20 dim vector, [0.0, ..., 0.0, 0.25, 0.25, 0.25, 0.25, 0.0... 0,0]

I am training on a small subset of my data (10,000 points) to get things working, but I cannot overfit even with this small data. In fact, I cannot even get accuracy above 0.2. For instance, running a training epoch on a version of the model you posted above (with the embedding layer removed, as my inputs are already vector embeddings),

embedding_size = 50
hidden_size = 512
output_size = 20
maxlen = 60

model = Sequential()
model.add(JZS1(embedding_size, hidden_size)) # try using a GRU instead, for fun
model.add(Dense(hidden_size, hidden_size))
model.add(Activation('relu'))
model.add(RepeatVector(maxlen))
model.add(JZS1(hidden_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, output_size, activation="softmax"))

model.compile(loss='mse', optimizer='adam')

I have padded my inputs and outputs with zeros to fit the maxlen dimension, but I have not added these "stop symbols" you wrote about. What kind of vector would I be adding as a stop symbol? Are they essential to train correctly?

Training with the model above does yield a set of probabilities for each word, so that they sum to one, but the Keras accuracy measure will not go past 0.15. I don't think it is able to fit the data I have.

What do you think is my bottleneck for training towards these vectors? Any ideas?

Thanks!

simonhughes22 commented 8 years ago

@cjmcmurtrie Sorry for the late reply, but I've gotten a lot of requests on this posting, and I've also been out of the country. Note that while I've got it to learn something, I haven't tried using it to solve any problems, and I had to run it for a long time to get it to learn something useful. So it may not be the best approach for your particular problem (which I don't know enough about to suggest alternatives, although you could try the skip-thought sentence vectors from a pretty recent paper https://github.com/ryankiros/skip-thoughts).

Regarding the above - the stop symbol in my approach was just a special word, that would then get mapped to its own embedding via the embedding layer, and the network would learn that this meant stop outputting symbols. Sending in a zero-length vector might be sufficient for this, or some randomly initialized vector. Why are you using Socher's vectors? Those are likely fine-tuned for sentiment analysis (or are you using the GloVe vectors?). You are probably going to get the best performance by using a graph model and combining fixed vectors from the GloVe or word2vec vectors (the latter I've had some success with using RNNs in Keras), with an embedding layer that's able to learn its own vectors. I haven't had a chance to try that yet, but in theory that should work best from what I've read. In Keras that would be achieved by having one input layer read hard-coded fixed pre-trained vectors, and merging that (concat) with an embedding layer where it can learn its own vectors.

Also, if you are learning a 1-1 mapping, then I would use a different model. The model above is meant for cases where you have different-length inputs and outputs. If you have a tagging problem, where the length of inputs matches the outputs, you can use a much simpler model:

model = Sequential()
model.add(JZS1(embedding_size, hidden_size, return_sequences=True))
model.add(TimeDistributedDense(hidden_size, output_size, activation="sigmoid"))
model.compile(loss='mse', optimizer='adam')

You might have to check some of the sizes are correct, I just wrote that. If you are mapping to a second sequence of vectors that should work fine. Note that the output is 3D (#rows, vector_len, max sequence length), and you'll want to think about an appropriate loss function. If it's mapping to another set of vectors, mse should be fine. However, if you have some target symbols, you would be better mapping to those and using a softmax with categorical cross entropy. This model is much simpler and should work better (and I have gotten a similar model to do quite well on a supervised task).

Get that working then experiment with the Y-shaped model where you merge pre-trained fixed vectors with ones the model can learn from scratch.

NickShahML commented 8 years ago

Firstly, this thread has been a major help for me. Thank you everyone! Shout out to @simonhughes22

One thing I have read extensively about (for at least NLP), is that for your input, you want to reverse the order of your sequence. You can read more here: http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

For example, if you are coding in "the dog barked loudly" as an input vector, it is better to write in "loudly barked dog the". This may help learning. I think the idea is that the next sentence that you predict has more to do with the END of the input sentence rather than the BEGINNING of the input sentence.

Forgive me as I am a beginner, but I almost always use 200 to 300 dimensions for Word2Vec. Fifty or 20 dimensions for a word seems incredibly low to me. Thoughts on this? I am aware of the curse of dimensionality.

My goal is to predict the next sentence given the first sentence:

Input: Give First Sentence as Sequence --> Output: Yield next sentence as sequence

Here are a few more questions:

Question 1: We do not need to worry about the output mask at the compile step, correct? Following this thread: https://github.com/fchollet/keras/pull/451, it seems that fchollet made the output layer automatically masked (if we initially mask the embedding layer). Therefore, if we pad the y output with zeros, we should be good, correct?

Question 2: I really struggle with formatting the y_train data. Right now, I cluster my words, and assign unique IDs to each word within each cluster. Therefore, each word has two numbers associated with it. (I clustered the words by applying Word2Vec plus k-means -- more info here: https://redd.it/3psqil.)

I know it's been mentioned that y_train is formatted in 3 dimensions, but how is this possible?

Currently this is what I input as my 2D X_train into the embedding layer in Keras. I'm hoping to do something similar for the y_train as well:

Dimension 1: number of samples; Dimension 2: number of timesteps

12 3 4 5 6 3 0 0 0 0 0
6 5 4 23 3 5 1 4 2 0 0

Dimension 3: one-hot encode each of the integers? I just thought there would be a more efficient way to do this. Or another thought is that you guys are using the third dimension to vectorize each word? This would explain why you're not doing 200 or 300 word dimensions but rather 20 or 50? But then the RNN would have to predict vectors? This is what confuses me.

Question 3: If we are indeed masking the zeros, then why do we need to tell the sequence where it stops and starts? I don't mind appending a "sentence start" and a "sentence end" number to each sequence. I just don't understand why we should do this if we are indeed masking?

For reference, this is my current model:

max_features = (len(chars)) #this number (always an integer) is either a cluster number or a word id number, both of them have a maximum of ~400

model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=512, input_length=maxlen, mask_zero=True))
model.add(LSTM(hidden_variables)) 
model.add(Dropout(dropout))
model.add(Dense(hidden_variables))
model.add(Activation('relu'))
model.add(RepeatVector(MAX_LEN))
for z in range(0,number_of_decoding_layers):
    model.add(LSTM(hidden_variables, return_sequences=True))
    model.add(Dropout(dropout))
model.add(TimeDistributedDense(max_features, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam')
simonhughes22 commented 8 years ago

@LeavesBreathe I've actually found smaller vectors to work better (64 and 128 - note the binary sizes, I think this helps performance by assisting theano with copying chunks of data to and from GPU), likely as that's fewer parameters to learn, and too many can make learning hard for a neural network. Cross validation would help you determine a good size, try varying in magnitude (32,64,128,256) rather than more linear scales, as the relationship between these items and performance tend to be more of an exponential than a linear nature.

Reversing the inputs is something that may help. However, be careful which end you zero pad if you do this. Another common strategy is to mirror the inputs, so you duplicate the input in reverse too, allowing the network to get the best of both worlds. There's a name for that approach, but it escapes me right now. Again, be wary of the zero padding; you'd want it on the outside of the mirrored input, not the middle, I believe, otherwise the RNN hidden state may get reset once it hits the zeros, depending on what it learns to do in this situation. That may not happen, you'd have to see.

Qu 1. - I haven't played with this, but AFAIK if you zero-pad your ys you should be good. You seem to be writing a lot of custom code. Keras has utilities for zero padding, as well as for determining IDs for words, so I'd rely on those rather than hand-rolling it all, as they've likely been well tested by either unit tests or the community, and are known to work with the package.

Qu 2. The number and format of the Y's for a number of these networks is, I think, one of the biggest sources of confusion when doing deep learning with one of these libraries, although I think that's more to do with the complexity of the problem; Keras makes it about as easy as it could be. The output is 3d - (number of samples, size of each vector, length of the sequence). This may seem confusing, but think about it like this. Using the TimeDistributedDense above, we are making a prediction for each time step. For each time step you are predicting a word, so that's a one-hot vector of length == |Vocab|. But we have 'max sequence length' number of time steps. So the dimensionality has to be the number of rows, the number of words \ labels to choose from at each time step, and finally the number of time steps (max sequence length). Anything less would not be possible as for each row you are predicting a word for each time step (which is the length of the output sentence in this case). HTH. Put another way, say you have a vocab of 10k words and the max sentence length is 100. For each row, you need to make 100 predictions (some zero if less than max sequence length), for each prediction you need it to choose from 10k words. So for each row you'd have a matrix (2D) of size 10k X 100, so it's 3D.

Qu 3. - You don't absolutely need the start and stop symbols. However, when you are doing sequential prediction, each prediction for each timestep is conditioned on the previous inputs and output. However, when you are at the start of the sequence, you have no previous inputs to condition on. The distribution of the start symbols is normally not random, however; certain words or labels tend to be more likely at the start of a sentence, such as the word 'The' or 'However', and subsequently so do their labels, such as POS tags if that's what you are trying to predict. It's unlikely a sentence will start with a noun, such as 'antelope'. The same is also true for the last word in a sentence or label in a sequence. Adding the artificial start and stop symbols allows the system to learn 'rules' (patterns may be a better term) that govern how the data is distributed at the start and end of sequences. This is late for me, so I hope I am making sense. The other reason is that with this sort of sequential prediction, as you have a fixed-size sentence with zero-padding, you need to help the model know when to start outputting zeros. Without outputting the stop character I was seeing it have problems learning this; it would just keep repeating the same word sequence, or always output zeros. LSTMs and variants are stateful, and so this sort of signal can help them learn to switch states.

Looking at your model structure, I'd recommend keeping it simple and having one LSTM in the decoding layer. Although I've heard of people having success with stacking RNNs, I haven't personally found that to do better; it usually under- or overfits for me when I try that. Technically you also don't need the Activation layer, as you can specify the activation function in the dense layer, but whatever is easier to read and understand.

I tend to find better performance from the simpler GRU and JZS1 RNNs than LSTMs. These models are a little simpler than the LSTM.

NickShahML commented 8 years ago

First of all, a huge thanks for all the details. This response had a lot of useful insights.

I've actually found smaller vectors to work better (64 and 128 - note the binary sizes, I think this helps performance by assisting theano with copying chunks of data to and from GPU)

That's a useful tidbit.

Reversing the inputs is something that may help. However, be careful which end you zero pad if you do this

Which side would you normally pad the zeros on? From my experience, I also pad the zeros on the right side (as does Keras's pad_sequences function). When you reverse the input, are you suggesting you pad on the left side? What would be the advantage of that if the network expects to always receive input in reverse order?

Let me clarify that when I say reverse the input, you reverse all of the inputs you give it. You never give the network the input in original order. Therefore, the network always expects to receive the input in reverse order. I kind of think of it as you reading a book in reverse and learning to read in reverse. You're always fed books that have words in reverse order.

Interesting that there's a mirroring strategy, but I feel that it would confuse the network. I'll go digging for some papers on it and understand it better.

Regarding Qu1: 

I'm actually not writing any custom code for Keras. I do my own significant pre-processing to cluster the words and assign each word a unique integer ID within the cluster. This allows me to get an 80k vocab down to ~400 different integers, where it takes two integers to represent each word (a cluster ID and a word ID). I chain the cluster ID and word ID together. So to represent 15 words, it takes a string of 30 numbers.

But I do heed your suggestion! Keras code is usually much more durable than custom code. I go with Keras code when possible.

Regarding Qu2: 

This was my biggest question, so I appreciate all of your details.

for each prediction you need it to choose from 10k words. So for each row you'd have a matrix (2D) of size 10k X 100, so it's 3D.

Good to know it has to be 3D; in retrospect, I feel a little foolish for asking that. So to clarify, let's suppose you have a vocab of 10k words. Are you saying that you must one-hot this 10k vocab? That seems super inefficient to me.

I guess what I'm proposing is that instead of one-hotting, you give an integer, and then you apply some sort of embedding (much like you did in your embedding layer for x_train). I'm cool with doing one-hot, but it just seems inefficient computationally and RAM-wise.

The output is 3d - (number of samples, size of each vector, length of the sequence).

Not to be a detail jerk, but from Keras's docs, I've seen the usual order to be:

(x: number of samples, y: number of timesteps, z: vector size of softmax output)

So in our case:

(x: number of samples, y: max number of words per sentence, z: one-hot of the 10k words)

Regarding Qu 3:

Adding the artificial start and stop symbols allows the system to learn 'rules' (patterns may be a better term) that govern how the data is distributed at the start and end of sequences.

This makes a lot more sense. It gives the network a clear understanding of when there is a start and a stop. I will definitely be adding these in.

This is late for me, so I hope I am making sense.

You're making a lot more sense than textbooks.

Although I've heard of people having success with stacking RNNs, I haven't personally found that to do better; it usually under- or overfits for me when I try that.

I'll definitely start with just one RNN for my decoding layer. One thing I've learned from my experience is that stacking RNNs does perform significantly better, but only when you have big data. Many of my experiments have shown that to me. Many Google papers tend to use these big nets as well.

I plan on training on at least 30 million samples (sentences). If I did it on 3 million or less, I could see a single layer performing better. If you're doing better with just one LSTM, that's a red flag to me that you don't have enough data IMHO.

I do have one more question from what you've written:

Qu 4: Why are you vectorizing words for your y_train if you're one hotting them in the end?

I apologize if this is an obvious answer, but as I understand it, you're one-hotting each of your words. If you have a 10k vocab, why don't you assign an integer per word, and then one hot each of the words respectively? How are you incorporating these 32 or 64 length word vectors?

I'm sorry if I really missed the boat on this one. I use word vectors to cluster words into groups as I mentioned above. But in the end, I assign each word an integer (word id). From that integer, I can then one hot. Wouldn't you all be doing the same? I'm talking strictly about the y_train data.

Apologies for the essay response, but this is such a good conversation. If you're interested in Skype chatting anytime, my Skype name is the same as my username here. Hopefully I can help you at least a little bit in return for the help you have given me.

simonhughes22 commented 8 years ago

I think you'd want it left padded, which is what I thought Keras does, but I don't have time to check. The important point is to pad from the same side regardless of whether the input is reversed or not. I'd make sure it can work well without reversing before trying to reverse, with these tools you want to do the simplest thing possible to get it to start learning before you start getting creative. That way you have a baseline you can compare against to ensure you are improving things iteratively.

The reason I think you want to left pad is how this works. You have an encoding layer and a decoding layer. The encoding layer processes the input left to right to produce a vector representation of the entire sentence. That vector is then replicated (just due to how Keras is built) so that that input is repeated for every predicted output label. Then the decoder (you have an encoded input sequence as a vector) runs an RNN, taking this repeated encoding and its internal state (which has a feedback loop), and produces output. So if your input is right padded, the rightmost tokens are zeros, which normally causes the network to reset its state, so your encoding is not great. I could be wrong on this, but that is my understanding. It's been a few months so I could be missing something. Let me know if keras right pads as I am not overriding this.
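
For reference, a small sketch of the padding side and the reverse-then-pad point (pad_sequences pads on the left by default, which the thread confirms further down):

from keras.preprocessing.sequence import pad_sequences

seqs = [[3, 7, 5], [9, 4, 6, 2]]                          # toy id sequences

left_padded = pad_sequences(seqs, maxlen=6)               # default padding='pre': zeros on the left
# If you reverse the inputs, reverse first and then pad, so the zeros still sit
# on the left rather than being the last thing the encoder sees.
reversed_left_padded = pad_sequences([s[::-1] for s in seqs], maxlen=6)
print(left_padded)
print(reversed_left_padded)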

Why the clustering? Use the pre-trained word2vec embeddings as input rather than clusters. That way you can have a 300 element encoding without the need for a one hot representation. That's how most people handle this problem these days, they use a language model to learn word embeddings. You don't need to even do that yourself unless you have very domain specific data, you can use the ones google pre-trained on a massive corpus. That actually worked better for me than learning my own embeddings, although the optimal strategy is a combination of pre-trained and tuning using supervised embeddings.

Having it predict an output embedding is better than the one hot strategy. I didn't mention that as it further complicates matters. But having a dense layer in the output is actually the equivalent to doing that. The word2vec embeddings and similar are taken from a neural network like model that is trained on a one-hot encoding, the embeddings are the weights from each of the inputs to the hidden layer, as I understand it. There are many variants on this, but that's the basic idea. You can emulate that using a dense vector before you one hot the output. That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.

Regarding the ordering- I think keras' changed somewhat from when I last used this (before you specified named dimensions). Make sure it matches whatever the docs say and you'll be good.

That's cool about stacking. I've gotten really good performance from small data on this, which is very unusual. The more model params, the more powerful the model, and so by adding more layers you are giving more degrees of freedom. If you have big data then you can take advantage of a much larger model. I'd start small and simple though, get reasonable performance and then start making it more complex. Unfortunately I don't have the option of more data (which almost always helps); labelling my dataset is a very expensive operation and depends on a team of psychologists.

I think I discussed the last question above. To be honest, for performance I'd have it predict the pre-trained word2vec vectors as outputs. Again, in my domain, with smallish data, my vocab size is actually small enough that doing a one-hot is not an issue for me. The best thing from the literature is probably a hierarchical softmax, but you'd have to hand-roll that.

simonhughes22 commented 8 years ago

Gensim's word2vec runs on theano now I think (at least I've been getting theano errors from it, so I am assuming so). You may be able to take their hierarchical softmax and plug it into Keras. If you get that working, please submit a pull request and give back to the community.

NickShahML commented 8 years ago

I'd make sure it can work well without reversing before trying to reverse, with these tools you want to do the simplest thing possible to get it to start learning before you start getting creative.

Words of wisdom. I'll do regular input first.

Let me know if keras right pads as I am not overriding this.

Keras does right pad. I can understand, as it's been a while for you. I've run it several times, and it always pads on the right. To clarify, it places the zeros on the right side. Again, appreciate the detailed explanation.

Why the clustering? Use the pre-trained word2vec embeddings as input rather than clusters.

In the end, I think you're right. The reason I did the clustering is that my words are very specific to my domain. Pretrained vectors aren't very good at high-level biology and astronomy terms (my interest).

By clustering, I essentially train the net to know that certain terms are related. The word id within each cluster is ordered by word frequency. So the most frequent words in that specific cluster are used more often. Thus, if the net has to guess, it will guess a word id of 0, or 1, and it will choose a word that is more used.

When strictly talking about x_train input, I do not do any one-hotting. I simply submit each word with its cluster ID first, followed by its word ID. Thus my 2D x input looks like this:

[34, 2, 45, 0, 23, 1, 34, 4, 45, 3]

These ten numbers represent 5 words. In this way, the maximum integer used is ~400 (400 clusters with 400 words per cluster at most). Therefore, when I do a softmax, I only have to do it over 400 options. This was the whole motivation behind clustering words. This 2D x input is fed into the embedding layer.

The meaning of the 2nd and 3rd dimensions only matter if you are using certain loss functions (like categorical cross-entropy) that do a soft-max like operation over all classes,

Yes, I plan on using categorical cross-entropy which is why I asked. Good to know!

Unfortunately I don't have the option of more data (which almost always helps); labelling my dataset is a very expensive operation and depends on a team of psychologists.

Ahh I see. To add to the stacking idea, have you tried using a series of dense layers afterwards? They sometimes improve performance without the need for more data.

You've probably already considered this, but have you tried looking into unsupervised learning so that you can acquire far larger data sets? No idea what you're doing, but from my experience: always go with unsupervised if you can get at least 100x more data.

Having it predict an output embedding is better than the one hot strategy. I didn't mention that as it further complicates matters. But having a dense layer in the output is actually the equivalent to doing that.

That's what I figured. The line you said about the dense layer is really good to know.

That is inefficient but you'll have a hard time doing something more efficient such as a hierarchical softmax (see word2vec) in keras without a lot of custom code.

I'll stick with the word clustering strategy for now so that the softmax is down to 400 choices. I read some threads on hierarchical softmax in Keras, but it seems a bit painful to implement.

If you get that working, please submit a pull request and give back to the community.

I definitely want to be as helpful as I can to you guys. It will take me at least a week or two to fully implement these ideas. If I find anything interesting, I'll report my findings on this thread so that hopefully, you guys can benefit from it. I'll also be experimenting with using different learning rates and adding dense layers as well at the end of the decoder. If I do any keras modifying, I'll be sure to submit a pull request!

simonhughes22 commented 8 years ago

Instead of the clustering I'd try training word2vec or GloVe (from Stanford) on your biology \ astronomy dataset, if it's large enough. It doesn't work well on small data; on my small PhD dataset it didn't perform well. But for work I have a much larger dataset and the vectors learned there were very good, and that's very domain specific; I wasn't able to use the pre-trained ones either for that. Then you can use those vectors, if they're any good, as your target outputs. You can ask word2vec for the top 10 matching terms; running that for some important keywords in your domain will tell you if it's worked well. Make sure you train more than one iteration though, that's the default and that sucked for me, but at 10-20 iterations, I was getting good results.

I corrected my comment about the meaning of the 2D and 3D dimensions; it does matter in this case, I believe, as we're using an RNN to generate the output sequentially, so the dimension that refers to time is important. Apologies.

In terms of unsupervised learning, yeah, I think it's very useful. Using the pre-trained Google word2vec vectors has helped somewhat, and that is unsupervised, at least in the manner I'm using it (it wasn't trained for the purposes I am using it for). It's supervised in the sense that it's training a language model, but if you are using it for some other task, then I'd argue in that context it's not. That's the easiest thing I can do right now in this regard, and the top 10 similar terms when I query that model are very good even for my specific domain (science essays). You can think of it as a form of soft clustering, where each element in the 300 dimension vector is a cluster or topic in your domain, and the value is a score of how much that word represents that topic. It's also important in word2vec to extract common phrases and treat those as words; I've found that helps tremendously. I've used a variant of association rules (applying the downward closure principle) to build a common phrase extractor to detect commonly occurring phrases in my domain.
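
For reference, a sketch of the phrase extraction plus domain-specific word2vec idea using gensim (one library that implements both); the corpus here is a made-up toy, and parameter names vary across gensim versions (older releases use size/iter instead of vector_size/epochs).

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

sentences = [["neutron", "star", "rotation", "pulsar"],   # toy tokenised corpus
             ["pulsar", "high", "density", "neutron", "star"]]

phrases = Phrases(sentences, min_count=1, threshold=1)    # learn frequent collocations
phrased = [phrases[s] for s in sentences]                 # may join tokens, e.g. neutron_star

w2v = Word2Vec(phrased, vector_size=100, window=5, min_count=1, epochs=20)
print(w2v.wv.most_similar("pulsar", topn=5))              # sanity-check the learned vectors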

NickShahML commented 8 years ago

Instead of the clustering I'd try training word2vec or GloVe (from Stanford) on your biology \ astronomy dataset, if it's large enough.

Yes, I completely agree with you that the word vectors do indeed beat the clustering. I think the main reason that I didn't do the word vectors initially is that I simply don't know how to incorporate them into the input for my x_train.

At this point, I just feel so bad because you have already helped me a ton, and I haven't given much to you. So if you're busy, no need to reply to what I'm saying below:

I guess what confused me about each word is that there are 300 numbers (for a 300 length vector), so how do you input 300 numbers on a 2d input for your x_train? I assumed you were using an embedding layer. But even if you do use an embedding layer, the vectors are not integers. Keras requests integers to be used in the embedding layer: http://keras.io/layers/embeddings/#embedding.

The clustering I was doing was nice because it gave me two integers per word, so I could embed them really easily (and understand what I'm doing! But the clustering strategy feels so amateur.)

Don't get me wrong, I know how to use word2vec and glove to generate word vectors for each word. I just didn't know how to format the inputs into the x_train or for that matter, the y_train either. Let me try to do some reading to figure it out.

But for work I have a much larger dataset and the vectors learned there were very good,

Yes, I have experienced the same when I use word2vec on my training data. When I ask for the top 10 matching terms, they are incredibly close. For example:

["neutron_star", "pulsar", "neutrons", "high_density", "high_rotation", "aftermath"]

Make sure you train more than one iteration though, that's the default and that sucked for me, but at 10-20 iterations, I was getting good results.

You are referring to the Word2vec training, correct? Not the actual seq-to-seq model? If so, I didn't even think of this, and I should do it regardless!

You can think of it as a form of soft clustering, where each element in the 300 dimension vector is a cluster or topic in your domain, and the value is a score of how much that word represents that topic.

Yes, this is what lead me to the idea of doing a k-means cluster in the first place. But like I said, inputting direct word vectors into the model does beat the clustering idea.

It's also important in word2vec to extract common phrases and treat those as words; I've found that helps tremendously.

Yes, I do bi-gram, tri-gram, and quad-gram, and stop there. I figure penta-gram is just overkill, but heck I might try it sometime. But I have had major success with using the phrase extractor as well. Always good to hear that someone else is doing the same thing as you.

As a side note, something that helped cut down on extraneous words was pyenchant https://pythonhosted.org/pyenchant/api/enchant.html. Basic idea is to make your input text a list of words, and fix spelling errors (or recorrect words that shouldn't belong). First tokenize all the words of your input into a list using nltk. I don't know if this will be of any use to you, but if can help you, I'll feel better!

Here's my code. I apologize as it's really messy right now:

import time

import enchant
import nltk

words_not_vectorized = set()
all_words_untouched = set(tokenized_words_untouched)

print 'applying frequency distribution to original text'
word_freq = nltk.FreqDist(tokenized_words_untouched) 

for eachword7 in all_words_untouched:
    if word_freq[eachword7] < 2: #we choose 2 because a word is rarely misspelled the same way twice
        words_not_vectorized.add(eachword7)

words_that_are_common = all_words_untouched - words_not_vectorized

print 'creating personal spelling dictionary'
with open ('listofspelledwords.txt','w+') as listofspelledwords:
    for eachword12 in words_that_are_common: #add to dictionary for spelling corrector
        listofspelledwords.write(eachword12+'\n')

del words_that_are_common

d = enchant.DictWithPWL("en_US","listofspelledwords.txt")
spelled_tokenized_words_untouched =[]

number_of_corrected_spelling_errors = 0 
start_time = time.time()
print 'correcting spelling errors -- this will take a while'
for eachword13 in tokenized_words_untouched:
    if d.check(eachword13): #word spelled correctly
        spelled_tokenized_words_untouched.append(eachword13)
    else:  #word not spelled correctly
        try:
            spelled_tokenized_words_untouched.append(d.suggest(eachword13)[0])
            # print 'changed '+eachword13+' to '+(d.suggest(eachword13)[0])
            number_of_corrected_spelling_errors = number_of_corrected_spelling_errors +1
        except IndexError:
            spelled_tokenized_words_untouched.append(eachword13)
print 'the time for spell checking is below: '
print("--- %s seconds ---" % (time.time() - start_time))
print 'number of corrected words from spell check is: '+str(number_of_corrected_spelling_errors)
del number_of_corrected_spelling_errors

With the words that are left over, I find their respective hypernyms and replace them with those hypernyms. Note that I'm doing all of this before I apply word2vec. This helps tremendously with word2vec results because it increases the frequency of the words:

'''-------------------------------HYPERNYM CLASSIFICATION SCHEME-------------------------------------'''

total_words = set(spelled_tokenized_words_untouched)

number_of_unclassified_words = 0
number_of_hypernyms_found = 0

# my_regex = r"\b" + re.escape(eachword4) + r"\b"
# text = re.sub(my_regex, hypernym_of_word, text)

hypernyms_replaced_one_text =spelled_tokenized_words_untouched

start_time = time.time()
print 'you are now finding and replacing uncommon words with hypernyms'
for eachword4 in words_not_vectorized:
    use_hypernym = 0
    use_synonym = 0
    try: 
        synonym_set_of_word = (Word(eachword4)).synsets[0]
        hypernym_set_of_word = synonym_set_of_word.hypernyms()[0]
        hypernym_of_word = hypernym_set_of_word.name().partition('.')[0]
        for n, eachword10 in enumerate(spelled_tokenized_words_untouched):
            if eachword10 == eachword4:
                hypernyms_replaced_one_text[n] = hypernym_of_word
        number_of_hypernyms_found = number_of_hypernyms_found + 1
    except IndexError:
        number_of_unclassified_words = number_of_unclassified_words+1
print 'you completed the hypernym process in time below'
print("--- %s seconds ---" % (time.time() - start_time))

print 'below is the number of words originally not vectorized:'
print len(words_not_vectorized)
print 'below is the number of words with no hypernyms found'
print number_of_unclassified_words
print 'Below is the number of different words within original text'
print len(total_words)
print 'total number words in the original text below'
print len(spelled_tokenized_words_untouched)
del spelled_tokenized_words_untouched

print 'total number words in the hypernymed text below'
print len(hypernyms_replaced_one_text)
# print 'total number of words in hypernym filtered list'
# print len(replaced_uncommon_with_hypernyms_one_list)
print 'total number of hypernyms found and replaced:'
print number_of_hypernyms_found
print 'the number of different words BEFORE ANY PREPROCESSING WAS DONE including punctuation is '+str(len(all_words_untouched))
simonhughes22 commented 8 years ago

I'll keep this short. To use pre-trained word embeddings, just lop off the embedding layer. The input to the LSTM is 3D if I recall correctly. Different example (and not seq to seq), but here's a conv net I have used that takes the pre-trained embeddings:

print('Build model...')
# input: 2D tensor of integer indices of characters (eg. 1-57).
# input tensor has shape (samples, maxlen)
nb_feature_maps = 32
n_ngram = 5 # 5 is good (0.7338 on Causer) - 64 sized embedding, 32 feature maps, relu in conv layer
embedding_size = emb_shape[0]

model = Sequential()
model.add(Convolution2D(nb_feature_maps, 1, n_ngram, embedding_size))
model.add(Activation("relu"))
model.add(MaxPooling2D(poolsize=(maxlen - n_ngram + 1, 1)))
model.add(Flatten())
model.add(Dense(nb_feature_maps, 1))
model.add(Activation("sigmoid"))

# NOTE: add in repeat layer and decoder here

You can adapt that to be the decoder part, or you can keep the RNN structure too; as I mentioned, remove the embedding layer and load in your own embeddings in place of it.
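
For reference, a sketch of the lop-off-the-embedding-layer route in today's Keras API: build the (samples, timesteps, embedding_dim) float tensor yourself from a pre-trained lookup and feed it straight to the encoder. The lookup, sizes, and random vectors here are all made up.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

emb_dim, max_len, vocab_out = 300, 20, 400                   # hypothetical sizes
w2v = {"neutron": np.random.randn(emb_dim),                  # stand-in for word2vec/GloVe vectors
       "star": np.random.randn(emb_dim)}

def encode(words):
    vecs = [w2v.get(w, np.zeros(emb_dim)) for w in words][:max_len]
    vecs = [np.zeros(emb_dim)] * (max_len - len(vecs)) + vecs    # left pad with zero vectors
    return np.array(vecs, dtype='float32')

x = np.stack([encode(["neutron", "star"])])                  # shape (1, max_len, emb_dim)

model = Sequential()
model.add(LSTM(128, input_shape=(max_len, emb_dim)))         # encoder reads the vectors directly
model.add(RepeatVector(max_len))
model.add(LSTM(128, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_out, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.predict(x, verbose=0).shape)                     # (1, max_len, vocab_out)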

Another option is doing convolutions over embeddings (allowing bi-gram, tri-gram, etc to be extracted as convolutions).

The iterations comments was for Word2Vec, correct.

NickShahML commented 8 years ago

Awesome, thanks again @simonhughes22, I really appreciate your help, and I'll definitely do some digging around and experimentation! Removing the embedding layer was the part I was missing. It will take me about 3 weeks to really test the ideas you suggested. If I find something interesting or helpful, I'll post it back here. Thanks again man!

simonhughes22 commented 8 years ago

One other thing, you could train a character RNN. That way you have only a small number of inputs and outputs. Like this: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py. Google "character RNN" for more papers. LeCun's team did some work on this with conv nets if I recall, and Karpathy (I think) with RNNs.

NickShahML commented 8 years ago

Thanks for the pointer, but that's actually where I started! I started predicting chars after inputting 30 chars. I didn't try seq to seq on it. But I do appreciate the suggestion! I think I'm gonna stick with words and try to make as much progress as I can. Currently building the seq to seq model. Hopefully training will start tomorrow.

Just as a side note, the char prediction works well when you throw about 300MB to 3GB of text at it, and ramp your LSTM layers up to 8 or 10. If you give it 100 chars, and ask it to predict the next one, it will predict phrases of sentences pretty well. It takes forever to train though, so I suggest changing the learning rate on adam to 0.02 and decreasing it when the loss goes crazy.

NickShahML commented 8 years ago

Just as an update for anyone reading this thread -- I retested the padding_sequences, and it pads on the left side. So I was completely wrong, and Simon's memory is better than mine!

simonhughes22 commented 8 years ago

:). @LeavesBreathe I only remember that because it's important in how the LSTM \ RNN works. As you run the input left to right, it outputs a vector based on that whole sequence, but that vector is a reflection of its internal state. However, that's more sensitive to the more recent values (which is why you often want to output sequences instead, as those are not). Once it hits the zeros it learns to reset its state, so if the padding were at the end, it wouldn't work very well, as those are the last inputs processed, wiping its state. Which is why I said it's very important to consider, especially when reversing the inputs. If you want to reverse the inputs, do so, but ensure that the padding is still on the left hand side. It 'may' not learn to reset its state, but once it hits a zero it just needs to learn to predict more of them, so it doesn't need to remember anything else about the rest of the sentence at that point, so it will in practice use that to trigger the forget gate. Hope that makes sense.

NickShahML commented 8 years ago

If you want to reverse the inputs, do so, but ensure that the padding is still on the left hand side.

Thank you, this makes much more sense. I didn't realize it reads them left to right, and that if the zeros were on the right side, it would wipe the state. Makes complete sense.

Sometime, I want to buy you a latte (grande)

simonhughes22 commented 8 years ago

@LeavesBreathe I realize you could be anywhere in the world, so this is a long shot. But if you happen to be anywhere near the Chicago area, or San Jose (I visit there a few times a year), I'd take you up on that.

NickShahML commented 8 years ago

hahaha I'm actually in Cincinnati (looks like 4.5hr drive for me), so this could be a possibility =p If you want to add me on skype, my username is "leavesbreathe"

gkeskin07 commented 8 years ago

Hi guys,

I went through the whole conversation but I still have a simple question about the Embedding Layer.

Assume I have a vocabulary of size nvocab, and all my sequences have length nseq. Assume I have only one sample for simplicity. I want to find an embedding for each word with nembed dimensions.

1) Should the input to the embedding layer, xtrain, be a sequence of integers or one-hot encodings? Currently, my xtrain has a length of nseq, with each element an integer that can take a value up to nvocab.

2) If xtrain is a list of integers (rather than one-hot encodings), what does the weight matrix in the embedding layer look like? Let us look at one word in the sequence. What I understand so far is that the embedding layer internally converts this word (a single integer) to a one hot encoded vector of size nvocab (let's call this vector Vonehot). Then, the network learns an embedding weight matrix, Wemb, of size [ nembed x nvocab], and computes an embedding vector Vembed = Wemb x Vonehot. Vembed is a vector of length nembed. Is that true?

Thanks a lot.

simonhughes22 commented 8 years ago

@kg07 1) If you use an embedding layer, feed it a list of integers. Without that you'd need a one-hot encoding. 2) This is correct. An embedding layer is just a linear NNet layer, mapping a one-hot encoding to a hidden layer. The 'embedding' is simply the weights associated with each input node. As there is a separate input node for every word, you get a separate embedding for each word. This is represented by a matrix of size (nembed * nvocab), although the 2 dimensions may be switched depending on implementation details. At least that's my understanding. Of course the literature never explains it that simply, which is a shame, but that's what I've inferred after much digging. I'd love for someone to correct me if that is wrong.
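
For reference, a tiny numpy check of that description (using the nvocab x nembed orientation): the embedding lookup is just row selection from the weight matrix, which is identical to multiplying a one-hot encoding by the same matrix.

import numpy as np

nvocab, nembed, nseq = 8, 4, 5
Wemb = np.random.randn(nvocab, nembed)          # one learned row of weights per word
xtrain = np.array([3, 1, 4, 1, 5])              # a sequence of nseq integer word ids

lookup = Wemb[xtrain]                           # what the embedding layer does: (nseq, nembed)
onehot = np.eye(nvocab)[xtrain]                 # (nseq, nvocab)
assert np.allclose(lookup, onehot @ Wemb)       # same result as one-hot times weight matrix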

NickShahML commented 8 years ago

@kg07

Give the embedding layer your integers -- the embedding layer only accepts integers -- and your input should be 2D: (nb_samples, sequence_length)

sorry just saw simon's comments -- nvm mine