is the Sequence to Sequence learning right?

EderSantana commented 9 years ago

Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no choice to pass a mask to the objective function. Won't this bias the cost function?

NickShahML commented 9 years ago

Hey guys, I just want to draw some attention to thread https://github.com/fchollet/keras/issues/957. The bottom line is that our output's are not masked, meaning the cost function is biased.

I think this explains why the results from linear regression tended to be "stretched". That is, it slowly drifted from word to word. I'm gonna talk to Eder, but I think to me, his explanations were quite clear.

simonhughes22 commented 9 years ago

@LeavesBreathe that sounds like a really interesting idea. When I want to test something that takes a long time, i run it on a small subset of data (if it's going to run a long time). If that works well or better than some other approach I am comparing it with on that same small subset then I try it on the full dataset or a larger subset. I'd advise that here. Smaller data should be easier for it to learn from too, although what it learns won't generalize nearly as well. Hope that helps.

EderSantana commented 9 years ago

RNN encoder-decoders do take really long to converge to good solutions... everybody seems to be reporting that. Another thing in the final layer decoder you may want an RNN with a readout (the final generated word is sent back to the inner RNN). I wrote a GRU with readout here: https://github.com/EderSantana/seya/blob/master/examples/imdb_readout.py#L53-L67

NickShahML commented 9 years ago

First of all, I can't believe seya now. Its so exciting to me that you have stateful GRU's and bidirectional rnn's. This is just amazing.

I need to readup a little bit on readouts to understand exactly what the advantage of this is, but it sounds very exciting. Thanks alot man.

@simonhughes22 I agree with you. Start small and grow bigger when testing anything major. Though I must say that the Sutskever RNN has much more promise.

EderSantana commented 9 years ago

c'mon you didn't know about Seya :D ??? That is the place where I'm cooking things before I push it to Keras. Some advanced examples that could crowd this repo are also there like Spatial Transformer Networks and DRAW. If more people use them and do suggestions, we could move them up here to main Keras.

To understand what I mean by readout see this figure by Cho et. al. screen shot 2015-11-06 at 9 56 39 am

see the difference between encoder and decoder??? the generated symbol is sent back to the RNN in the decoder.

NickShahML commented 9 years ago

ahhhh I gotcha. So basically if I understand it correctly, lets say your decoding layer produces sentences. For y1 it produces: "the"

For y2 it produces: "cow"

since it is a readable GRU, for y3, it sees that you have written "the" and "cow" so it is more likely to pick "jumped" as y3?

Of course, y1, y2, and y3, are a distribution of percentages (assuming your using a softmax). But it would see these percentages. I usually apply a temperature after the percentages are produced (so I'm not always picking the highest percentage choice).

Anyways, I look forward to the tutorial you mentioned in the other thread. There's just so much to try now!

gautamb85 commented 9 years ago

I was able to have a chat with K. Cho.. I believe that the readout is used at test time, the model is trained by 'teacher forcing' i.e. The decoder is fed with the true label (from previous time step as input) which is replaced by the readout at test time. Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state. For example, if you wanted to use the model to score a pair of sequences (and not generate the target sequence)

NickShahML commented 9 years ago

Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state.

I'm trying to generate text, which is why this is so critical to me. I believe @simonhughes22 is as well.

EderSantana commented 9 years ago

@gautamb85 I thought I was cheating when I teacher forced during training xD good to know!!! But note that not all the information needed is present in the hidden state, specially if you are using a deep readout. Several works report feeding back the readout.

gautamb85 commented 9 years ago

Heh. We have had this debate before (on this thread somewhere I think). The way I think about it is - so the input to your decoder RNN (if you dont feed in the prediction) is going to be the summary vector produced by the encoder. At every timestep.

Now, I know for a fact that this does work (atleast on simple tasks like numbers-> number strings). However, intuitively if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful. I am using the approach as generative model to score a pair of audio sequences. So I cant be confident about the text generation problem. @simonhughes22 can the sequence-to-sequence model you proposed in keras be used to score two sequences? I dont think so because there is no way to input the second sequence. I guess it would need a graph() model. It was actually easier to just write my own code (not easy, but easier ;) )

Sent from my iPhone

On Nov 6, 2015, at 10:20 AM, LeavesBreathe notifications@github.com wrote:

Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state.

I'm trying to generate text, which is why this is so critical to me. I believe @simonhughes22 is as well.

— Reply to this email directly or view it on GitHub.

gautamb85 commented 9 years ago

@EderSantana nope! You're good :) Are you doing this with keras? It would need a graph() model ? I wrote my own but it i only have sgd and momentum going and it would be nice to have more fancy optimization Thanks for the tip on the readout

simonhughes22 commented 9 years ago

@gautamb85 - use a y shaped graphical model, or concatenate the two sequences. I suspect that'll be really tough to learn. although hopefully the changes @EderSantana is suggesting will work better. I don't have time to check (busy with work, phd, and kaggle:) ).

EderSantana commented 9 years ago

Yes I did use a graph inputing the "input" and the "teacher" (input delayed one step) than I pass both to the decoder GRU with merge_mode=concat.

simonhughes22 commented 9 years ago

@EderSantana seya looks awesome. Please add to keras!

gautamb85 commented 9 years ago

Off topic. But is anyone coming for NIPS? i say coming since I live in Montreal :)

Sent from my iPhone

On Nov 6, 2015, at 10:53 AM, Simon Hughes notifications@github.com wrote:

@EderSantana seya looks awesome. Please add to keras!

— Reply to this email directly or view it on GitHub.

NickShahML commented 9 years ago

However, intuitively if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful.

This is exactly what I thought when @EderSantana explained the GRU(readout). Knowing where it it is in the readout I think would be incredibly powerful. I'll try to integrate this readout GRU, and report back if I get better val loss with the text generation. It will take me sometime to integrate in correctly.

I'll also graph my results publically if it helps anyone. I've started graphing here: https://plot.ly/~oxygen123/folder/home

wxs commented 9 years ago

@gautamb85 I've been sitting out for much of this chat, but I'll be at NIPS, we should organize a Keras meetup! We're doing a lot of sequence to sequence stuff as well, would be good to compare notes.

sergeyf commented 9 years ago

I will also be at NIPS!

On Fri, Nov 6, 2015 at 8:05 AM, Gautam Bhattacharya < notifications@github.com> wrote:

Off topic. But is anyone coming for NIPS? i say coming since I live in Montreal :)

Sent from my iPhone

On Nov 6, 2015, at 10:53 AM, Simon Hughes notifications@github.com wrote:

@EderSantana seya looks awesome. Please add to keras!

— Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/fchollet/keras/issues/395#issuecomment-154449245.

gautamb85 commented 9 years ago

@wxs It would really be good to compare notes! I know that another keras contributor lives in Montreal, I will try to get in touch with him. Maybe lets do a post on the google-group for a meetup? Ps. Get ready for some serious cold :)

mphielipp commented 9 years ago

I will be in NIPS as well. I will interested in the meetup.

Best regards, Mariano

On Fri, Nov 6, 2015 at 10:46 AM, Gautam Bhattacharya < notifications@github.com> wrote:

@wxs https://github.com/wxs It would really be good to compare notes! I know that another keras contributor lives in Montreal, I will try to get in touch with him. Maybe lets do a post on the google-group for a meetup? Ps. Get ready for some serious cold :)

— Reply to this email directly or view it on GitHub https://github.com/fchollet/keras/issues/395#issuecomment-154483117.

Best Regards, Mariano

gautamb85 commented 9 years ago

@LeavesBreathe You might want to check out this ipython notebook https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb

He has setup an encoder-decoder and one with an attention mechanism. Note that he is doing it in the same way (not feeding in the prediction), but the attention model can kinda compensate for this. There is also a toy-problem (text prediction :)) setup. You would have to learn lasagne (also a really good package), and use more theano.

EderSantana commented 9 years ago

wow, that NIPS meetup will be fun. I'll go home in Dec. but you guys write blogs or something to let us know what happened.

lukedeo commented 9 years ago

I'll be at NIPS -- I'd love a meetup.

NickShahML commented 9 years ago

@LeavesBreathe You might want to check out this ipython notebook https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb

This is a really good find. You're right that I would need to learn lasagne, but it may be worth it if it allows more capabilities.

I think for right now, I'm gonna wait for @EderSantana 's tutorial and go from there. Keras has been a huge help to me, and I hope I can start to contribute to it. In the meantime, I'm gonna try to start implementing the readout GRU.

wxs commented 9 years ago

Moving NIPS discussion to #962!

farizrahman4u commented 9 years ago

I have done a seq2seq implementation, based on http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.

https://github.com/farizrahman4u/seq2seq

It has stateful LSTM encoder and decoder, hidden state transfer from encoder to decoder, feedback decoder (output at step t is input at step t+1), depth and all the fancy stuff.

EderSantana commented 9 years ago

@farizrahman4u I saw your code it looks pretty interesting. @LeavesBreathe I think you want to check that out.

One little thing, I saw that you use two LSTMs, an encoder and another one for decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite? Did it provide you better results than using a single RNN? Tkx for sharing.

NickShahML commented 9 years ago

wow @farizrahman4u , I'm overwhelmed. This Keras community is ridiculous. A huge, huge thanks. I have a few questions if you have time:

Can you mask values? I know there is a input_length and output_length. But if we add a masking layer before the seq2seq layer, can we mask all zeros? This would be for both input and output. Output being really important (so cost function is not biased)
For beginners, can you give some understanding do what depth is and why you recommend 5?

If we want to add more GRU's or LSTM's to the decoding layer, would we make a network like this?

seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension,hidden_dim=hidden_variables_encoding,
             output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)

model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (x_maxlen, word2vec_dimension)
model.add(M)
model.add(seq2seq)
model.add(GRU(hidden_variables_decoding, return_sequences = True))
model.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax')
model.compile(loss='categorical_crossentropy', optimizer='adam')

NickShahML commented 9 years ago

One little thing, I saw that you use two LSTMs, an encoder and another one for decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite?

@EderSantana, from my understanding of Google's Seq to Seq paper, they got the best results using four LSTM's to encode and four LSTMs to decode giving them a total of 8 LSTM's.

This is why I asked the question above. Ideally, you would want to use much data, along with multiple RNN's. The idea is that each one captures more salient features within your data.

Interestingly, one thing that improves my results from previous prediction experiments is to use not just one type of rnn. Instead, for my decoder layer, using something like:

LSTM, JZS1, GRU (in that order)

gives me better results. I always lead with a LSTM because I feel it captures the most amount of features. Anyways, my two cents for what its worth. LIke I said earlier, I'll be publically graphing all my experiments (and labelling them as best as I can) with @EderSantana and @farizrahman4u mods on Keras. Tonight, I'm gonna try leading with a bidirectional LSTM and see what happens.

farizrahman4u commented 9 years ago

Hi @LeavesBreathe. You have to mask only after the seq2seq2, not before.

1)Your training data should look like : ["How are you <EOL> <EOL> <EOL> <EOL> <EOL>", "I am fine <EOL> <EOL> <EOL> <EOL> <EOL>"]

If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.

2)Depth is the number of LSTM layers in the encoder and decoder. If depth is 2, you have 2 LSTMs for encoding and 2 LSTMs for decoding. Thanks for pointing out that the optimal value is 4 and not 5. You can also have different depths for the encoder and decoder. e.g, depth = [4,5]. Which means 4 LSTMs for encoding and 5 LSTMs for decoding.
3) Yes, you could do that, but use stateful GRU from seya. (Or you could use the depth argument). Also instead of using the ready made Seq2seq model, try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models (for e.g, multiple dense layers in between encoder and decoder, one encoder - many decoders models etc.). Its fun!

farizrahman4u commented 9 years ago

@EderSantana In Cho et al, the output from encoder is fed to the decoder at every time step. And readout is also present. But in seq2seq, the output from the encoder is fed to the decoder in the first time step only. Also, the hidden state is copied from encoder to decoder. So my model is more similar to the seq2seq.

Update: I have added a new decoder: LSTMDecoder2, it is similar to Cho et al, the output from encoder is fed to the decoder at every time step, along with output of previous time step. You may or may not enable hidden state copying when using this decoder. (Should work better when not enabled in case of conversational model, as it could remember not only what was said by human in the previous time steps, but also what it said in the previous time steps. ).
One advantage of having 2 LSTMs is that the encoder learns a vector representation for sentences. So if you are doing a language translation task, you can reuse the encoder for multiple languages. You dont have to train the encoder from scratch for each language pair. Here is some pseudo-code:

EnglishToFrench = Seq2seq()
EnglishToFrench.compile()
EnglishToFrench.train()

encoder_data = EnglishToFrench.encoder.get_weights()

EnglishToSpanish=Seq2Seq()
EnglishToSpanish.encoder.set_weights(encoder_data)
EnglishToSpanish.compile()
EnglishToSpanish.train()

You can also train multiple language pairs simultaneously (Encode in English, decode in other languages):

EnglishEncoder = LSTMEncoder()
FrenchDecoder = LSTMDecoder()
SpanishDecoder = LSTMDecoder()
GermanDecoder = LSTMDecoder()
dense = Dense()

EnglishEncoder.decoders = [FrenchDecoder, SpanishDecoder, GermanDecoder] #Multiple decoders. Wow!

model = Graph()
model.add_input(EnglishEncoder, "english")
model.add_node(dense,"dense", input="english")
model.add_output(FrenchDecoder, "french", input="dense")
model.add_output(SpanisDecoder, "spanish", input="dense")
model.add_output(GermanDecoder, "german", input="dense")
model.compile()
model.train()

Apart from any advantage in accuracy/speed, multiple RNNs have the advantage of modularity and reusability.

NickShahML commented 9 years ago

If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.

Huh. I always thought you wanted to mask your input, but what you're saying makes sense. I guess the question is then, is how do you mask the output given the seq2seq model? Would it be something like this?

seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension,hidden_dim=hidden_variables_encoding,
              output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)

model = Sequential()
model.add(seq2seq)
model.add(Masking(mask_value=0))
model.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax')
model.compile(loss='categorical_crossentropy', optimizer='adam')

try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models

Yes! I definitely want to do that. From last night, I actually got better results with using strictly bidirectional lstms from seya.

Once I have things working, I'll make a BidirectionalLSTMEncoder and Decoder. I'll either submit a pull request to you or @EderSantana. It will take me at least a few weeks to get there though. I have alot of matrix setup and testing I need to do.

I have added a new decoder: LSTMDecoder2, it is similar to Cho et al, the output from encoder is fed to the decoder at every time step, along with output of previous time step

This is really cool. I can't wait to try all of this out. I also saw you updated the conversational.py -- Thanks!

Thanks for pointing out that the optimal value is 4 and not 5.

Its not necessarily that the optimal value is 4. I actually went to med school for a while and studied neurology heavily. You'll find that the brain appropriates different amounts of neurons for different tasks (along with different types of neurons). And different types of conversations require different amounts of neurons.

The bottom line is that, for different areas of conversations (or topics), there is a sweet spot of the number of neurons you want, and the brain automatically optimizes to that level. Talking about the weather takes less neurons and watts than compared to theoretical physics discussions. This is also why people who speak 4 or 5 languages have a more neurons appropriated for language processing and creation.

Right now, we manually do that optimizing by trying different amounts of layers/hidden states. So in summary, its not that 4 is the most optimal for seq to seq. It is for that dataset, with the task of translating English to French, 4 happens to be the most optimal. If you tried translating English to Chinese, I bet the optimal level would be 6 or 7.

Didn't mean to ramble, but all I'm trying to say is: experiment with different levels. Sometimes, I get better results with lower amount of hidden states, but more levels. Sorry if I'm beating a dead horse.

farizrahman4u commented 9 years ago

@LeavesBreathe Thanks for clearing up the layer depth thing. Regarding masking:

Masking is done for values that do not contain information
An embedding layer outputs zeroes for words it have not seen or invalid words/numbers (The embedding layer has no 'information regarding that word). Such words could occur anywhere in a sentence.
EOL is different. It contains information. It represents the end of line. So it should be considered as a normal word, and should have a non zero word embedding.
Placing the EOLs in the right place is the first thing that your seq2seq model learns. It is quite fun to see the model accidentally place some EOLs in the middle of sentences occasionally during the early stage of training and gradually shift the EOLs towards the end of the sentences as training progresses. This effect is more prominent if your vocab size is relatively small.
As EOLs are real words, contain information and have a non zero embedding, they should not be masked.
You can mask the zeroes that come from invalid/out of vocab words, before feeding them to seqtoseq. I said do not mask the input in my previous comment because you were trying to mask the EOLs.
While training, if your y_train has zeroes (invalid words in output sentences), no matter what masking you do in between layers, your model will still output zeroes(invalid words), because you are training it to do so.
Note that if you place a TimeDistributedDense on top of seq2seq, its the TimeDistributedDense that gives out the output word embeddings, not seq2seq. In that case the output of seq2seq layer will be some hidden internal representation. In such internal representation, zero vector does not mean "invalid word". Its an internal representation of your output, so zeroes could mean anything, and encode information.. so no masking in between seq2seq and another layer.
Summing it up: mask input, do not mask output. Remove invalid words from training data if possible, or just leave them there and hope for the best.

EderSantana commented 9 years ago

one note, if anybody ever have to mask outputs, use sample_weight not masking to choose which values affect the cost function. In practice it doesn't matter what the network output after EOL so we don't actually need to make it learn to output zeros. But, if you are not familiar with sample_weight do as @farizrahman4u says and just use the model as he suggested.

NickShahML commented 9 years ago

@farizrahman4u Thanks for such a detailed description in regards to masking. I completely understand everything you're saying in regards to input. Fortunately, I do not have any 'out-of-vocab' words, so I won't be masking any input.

Thank you also for clarifying that EOL is a word. And is the first part to be predicted correctly. Makes complete sense.

I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?

I feel like its additional thing your network has to learn. If you're predicting sentences, your network has to learn there is a length of 50 for each sentence predicted. Supposed the predicted sentence has 30 words. Then the network has to learn to predict EOL's for the rest of the 20 timesteps.

I'm wondering if this comes at a cost to the network. It might not, and I might just be worrying about nothing.

At this point, I feel bad asking more questions, so if you don't have time, don't bother addressing below:

There is another important aspect that I'm still cloudy on: transferring hidden states

In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder..

Suppose for the decoder layer, we stack 4 LSTMs (depth = [4,4]). Does the hidden state transfer from layer to layer?

If so, what is the difference between transferring the hidden states, and just regularly stacking 4 LSTM Keras layers? Why is transferring the hidden state advantageous?

Thanks alot again.

tttwwy commented 9 years ago

Is there any attention based model case with keras? Thank you very much

farizrahman4u commented 9 years ago

@LeavesBreathe

I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?

Not much.

Then the network has to learn to predict EOL's for the rest of the 20 timesteps.

This is not as tough as it sounds. Your network DOES NOT learn like this:

After 1 EOL, I should output  EOL
After 2 EOLs, I should output  EOL
After 3 EOLs, I should output  EOL
......
After 19 EOLs I should output EOL

Instead, it just learns:

If my previous output is EOL:
      output EOL
else if I do not have anything more to say:
       output EOL

This rule is very simple to learn when compared to the complex stuff your seq2seq model learns, like translation, conversation etc. So dont worry.

NickShahML commented 9 years ago

Thanks @farizrahman4u for the clarification. The if statement makes much more sense. I'll get working on this and report back here if I find anything interesting that may help you guys.

gautamb85 commented 9 years ago

I was having a some trouble with masking my data a while back, and I was hoping someone could clarify a few things for me, before I try a large experiment.

I am working with real-valued speech data (mfcc). so my input data is a 3d tensor (n_batch, time_step, feat_dimension)
Utterances are of variable length so, i pad them with zeros to max_length

Q1. As my data is a 3D tensor, I pad the zeros at the BOTTOM of each individual feature matrix. Is this ok? (its a matrix of zeros)

I know the padding function in keras pads to the right, but in my case it has to be either above or below.

Otherwise I have a masking layer after my input layer with mask value set to (0.)

Thanks

farizrahman4u commented 9 years ago

@LeavesBreathe

In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.

I have added a new function to StatefulRNN, called broadcast_state. So you can send your hidden state from any StatefulRNN to another. For e.g, Here is a Seq2seq model with depth 2. The hidden state of encoder1 is propagated throughout the model.

encoder1 = LSTMEncoder(.........return_sequences=True)
encoder2 = LSTMEncoder(..........)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder3 = LSTMEncoder(.....return_sequences=True)
encoder4 = LSTMEncoder(.....return_sequences=True)

#Connect hidden layers

encoder1.broadcast_state(encoder2)
encoder2.broadcast_state(decoder)
decoder.broadcast_state(encoder3)
encoder3.broadcast_state(encoder4)

#Build model

seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(encoder2)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder3)
seq2seq.add(encoder4)

I will be updating Seq2seq and Conversational shortly, stay tuned!

NickShahML commented 9 years ago

Wow Fariz...seriously man you're awesome. This broadcast_state will be incredibly useful, and I definitely hope the main Keras gets it eventually.

I am a little confused as to the seq2seq model you built in your snippet of code.

You go from encode-> dense -> decode.

How are you going from the 2D output of the dense layer to the 3d input required by the decoder? Doesn't the LSTM/GRU of the decoder require a 3d input? In the past, we used a RepeatVector for it, but it is better to transfer hidden states instead.

Also: Small technicality. For your encoder1 and decoder , you should have return_sequences = True correct?

In the weeks to come, I really hope to give back to you (and everyone else) by testing a ton of seq2seq models. Hopefully I can give you some insight as I imagine you are doing seq to seq as well.

@gautamb85, I'm not familiar with mfcc, but couldn't you simply do a variation of what @farizrahman4u suggested to me? Couldn't you, instead of masking, place some sort of 'SILENT' token where the zeros are? In this way, you don't have to worry about affecting the seq2seq model or its output data.

gautamb85 commented 9 years ago

@LeavesBreathe hmm. worth a shot, but its trickier. I can't add a symbol, it has to be a floating point number (I hate real-valued data, why didn't I do NLP. lol). I'll keep you posted if something works.

@farizrahman4u First off.. stellar job on the seq2seq model!! :) Q. Can I use the model to SCORE a pair of sequences? In Cho's original paper, they use the model to re-score english-french translation pairs. So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x) Doing something like this would not give me the correct data likelihood will it? (as I want the log likelihood and not the negative log likelihood, I would multiply by -1)

objective_score = -1*model.evaluate(X_test, Y_test, batch_size=32)

NickShahML commented 9 years ago

I can't add a symbol, it has to be a floating point number

It doesn't necessarily have to be a symbol, it can be a real number, just keep it consistent. Suppose you use "3" as your silent token. Assuming your doing speech to text, your model will learn that 3 is associated with the silent token output.

As an aside, I would choose a real number that is far away from your other numbers that represent your data. That way, it is treated a completely separate entity from the rest of your data. As an fyi, this is how the brain does it in your Auditory Cortex. It determines silence as a certain value, and you consciously recognize that value as silence. This is why conditions like chronic tinnitus can't be cured: the brain never hears the "silence" value and continues to output the "ringing-in-my-ears" value.

farizrahman4u commented 9 years ago

@LeavesBreathe

How are you going from the 2D output of the dense layer to the 3d input required by the decoder?

The decoder's input is 2D (Even though it inherits from LSTM). The output from each time step then becomes the input for the next time step.

Also: Small technicality. For your encoder1 and decoder , you should have return_sequences = True correct?

A decoder always has return_sequence = True by default. And yes, return_sequence = True for encoder1

farizrahman4u commented 9 years ago

@LeavesBreathe I have added a new class called DeepLSTM. It has built in hidden state propagation.

Example:

deep = DeepLSTM(input_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)

Notice the inner_return_sequences argument is False, which means the inner LSTMs will behave like Cho's enocder-decoder..(RepeatVector in between non-sequence-returning RNNs)

On a side note, you can also use broadcast_state to send hidden state from one LSTM to multiple LSTMs ..(simply pass them as a list)

gautamb85 commented 9 years ago

@farizrahman4u If you get a chance to answer my question, I would be most grateful :)

(pasted from above) Q. Can I use the model to SCORE a pair of sequences? In Cho's original paper, they use the model to re-score english-french translation pairs. So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x) Doing something like this would not give me the correct data likelihood will it? (as I want the log likelihood and not the negative log likelihood, I would multiply by -1)

NickShahML commented 9 years ago

@gautamb85, don't mean to post over yours, but I do want to get back to Fariz.

@farizrahman4u ...I'm just running out of ways to complement and thank.

The output from each time step then becomes the input for the next time step.

Clever. Really clever.

Its great you can list LSTMs with broadcast_state -- This is gold.

So is the idea with DeepLSTM is that it saves you lines of code right? You could technically do the deep LSTM with the code snippet you gave above with broadcast_state correct?

To incorporate the DeepLSTM in the seq2seq model, would it be something like this?

encoder1 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder2 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)

#Connect hidden layers

encoder1.broadcast_state(decoder)
decoder.broadcast_state(encoder2)

#Build model

seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder3)
seq2seq.add(Dropout(dropout))
seq2seq.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')

simonhughes22 commented 9 years ago

@LeavesBreathe @EderSantana @farizrahman4u @gautamb85 this is really interesting. I've had some work to do so missed the party somewhat. Can everyone post the model and library that they end up with as being the optimal approach for their dataset when all's said and done, and hopefully issue some pull requests to keras so we can get all this in one place? Great work guys

NickShahML commented 9 years ago

Can everyone post the model that they end up with as being the optimal approach when all's said and done

Glad you're back. I will post my best models along with the training graph as they come in! All my graphs are here: https://plot.ly/~oxygen123/folder/home

farizrahman4u commented 9 years ago

@LeavesBreathe Yes, saving lines of code is the idea.And it does all the RepeatVector stuff automatically. Also, for encoder1 return_sequences=False. Depth of encoder2 should be 3, so that for decoding you have 1 decoder + 3 encoders = 4 layers deep. @gautamb85 I saw your comment just now. I will come up with a detailed answer + code in a few hours.

keras-team / keras

is the Sequence to Sequence learning right? #395