Hey guys, I just want to draw some attention to the thread https://github.com/fchollet/keras/issues/957. The bottom line is that our outputs are not masked, meaning the cost function is biased.
I think this explains why the results from linear regression tended to be "stretched": they slowly drifted from word to word. I'm gonna talk to Eder, but his explanations were quite clear to me.
@LeavesBreathe that sounds like a really interesting idea. When I want to test something that takes a long time, I run it on a small subset of data. If that works well, or better than some other approach I am comparing it with on that same small subset, then I try it on the full dataset or a larger subset. I'd advise that here. Smaller data should be easier for it to learn from too, although what it learns won't generalize nearly as well. Hope that helps.
RNN encoder-decoders do take really long to converge to good solutions... everybody seems to be reporting that. Another thing: in the final decoder layer you may want an RNN with a readout (the last generated word is fed back into the RNN). I wrote a GRU with readout here: https://github.com/EderSantana/seya/blob/master/examples/imdb_readout.py#L53-L67
First of all, I can't believe I'm only now finding out about Seya. It's so exciting to me that you have stateful GRUs and bidirectional RNNs. This is just amazing.
I need to read up a little on readouts to understand exactly what the advantage is, but it sounds very exciting. Thanks a lot, man.
@simonhughes22 I agree with you. Start small and grow bigger when testing anything major. Though I must say that the Sutskever RNN has much more promise.
c'mon, you didn't know about Seya :D ??? That is where I cook things up before I push them to Keras. Some advanced examples that could crowd this repo are also there, like Spatial Transformer Networks and DRAW. If more people use them and make suggestions, we could move them up here to main Keras.
To understand what I mean by readout, see this figure by Cho et al.
See the difference between the encoder and the decoder??? The generated symbol is sent back to the RNN in the decoder.
ahhhh, I gotcha. So basically, if I understand it correctly, let's say your decoding layer produces sentences. For y1 it produces: "the"
For y2 it produces: "cow"
Since it is a readout GRU, for y3 it sees that you have written "the" and "cow", so it is more likely to pick "jumped" as y3?
Of course, y1, y2, and y3 are each a distribution of probabilities (assuming you're using a softmax), but it would see those probabilities. I usually apply a temperature after the probabilities are produced (so I'm not always picking the highest-probability choice).
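For reference, here is a small numpy sketch of the temperature trick mentioned above (a hypothetical snippet, not the exact code used in these experiments):

import numpy as np

def sample_with_temperature(probs, temperature=0.8):
    # Reweight a softmax distribution by temperature and sample from it,
    # instead of always taking the argmax.
    probs = np.asarray(probs, dtype='float64')
    logits = np.log(probs + 1e-12) / temperature
    scaled = np.exp(logits) / np.sum(np.exp(logits))
    return np.argmax(np.random.multinomial(1, scaled))

# e.g. next_word_index = sample_with_temperature(softmax_output_for_y3, temperature=0.5)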
Anyways, I look forward to the tutorial you mentioned in the other thread. There's just so much to try now!
I was able to have a chat with K. Cho. I believe that the readout is used at test time; the model is trained by 'teacher forcing', i.e. the decoder is fed the true label (from the previous time step) as input, which is replaced by the readout at test time. Note that feeding the readout to the next step is not explicitly needed, as the information should be contained in the hidden state, for example if you wanted to use the model to score a pair of sequences (and not generate the target sequence).
Note that feeding the readout to the next step is not explicitly needed as the information should be contained in the hidden state.
I'm trying to generate text, which is why this is so critical to me. I believe @simonhughes22 is as well.
@gautamb85 I thought I was cheating when I teacher-forced during training xD, good to know!!! But note that not all the information needed is present in the hidden state, especially if you are using a deep readout. Several works report feeding back the readout.
Heh. We have had this debate before (on this thread somewhere, I think). The way I think about it is: if you don't feed in the prediction, the input to your decoder RNN is going to be the summary vector produced by the encoder, at every timestep.
Now, I know for a fact that this does work (at least on simple tasks like numbers -> number strings). However, intuitively, if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that makes any sense). I would think this model would be more powerful. I am using the approach as a generative model to score a pair of audio sequences, so I can't be confident about the text generation problem. @simonhughes22 can the sequence-to-sequence model you proposed in Keras be used to score two sequences? I don't think so, because there is no way to input the second sequence. I guess it would need a Graph() model. It was actually easier to just write my own code (not easy, but easier ;) )
@EderSantana nope! You're good :) Are you doing this with Keras? Would it need a Graph() model? I wrote my own, but I only have SGD and momentum going and it would be nice to have fancier optimizers. Thanks for the tip on the readout.
@gautamb85 - use a Y-shaped graph model, or concatenate the two sequences. I suspect that'll be really tough to learn, although hopefully the changes @EderSantana is suggesting will work better. I don't have time to check (busy with work, PhD, and Kaggle :) ).
Yes, I did use a Graph, inputting the "input" and the "teacher" (the input delayed one step); then I pass both to the decoder GRU with merge_mode=concat.
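For illustration, a minimal numpy sketch of what "delayed one step" means for the teacher signal (array names and shapes here are hypothetical, not the actual code):

import numpy as np

# Hypothetical batch of sequences: (samples, timesteps, features)
seq = np.random.randn(32, 10, 128).astype('float32')

# "Teacher" signal: the same sequences delayed by one step, with zeros
# (a start-of-sequence placeholder) in the first position.
teacher = np.zeros_like(seq)
teacher[:, 1:, :] = seq[:, :-1, :]

# seq and teacher can then be fed as two Graph inputs and concatenated
# (merge_mode=concat) into the decoder GRU, as described above.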
@EderSantana seya looks awesome. Please add to keras!
Off topic, but is anyone coming to NIPS? I say coming since I live in Montreal :)
However, intuitively if the previous prediction is fed in, the decoder has a better idea of 'where it is' (if that made any sense). I would think this model would be more powerful.
This is exactly what I thought when @EderSantana explained the readout GRU. Knowing where it is, I think, would be incredibly powerful. I'll try to integrate this readout GRU and report back if I get better val loss with the text generation. It will take me some time to integrate it correctly.
I'll also graph my results publicly if it helps anyone. I've started graphing here: https://plot.ly/~oxygen123/folder/home
@gautamb85 I've been sitting out for much of this chat, but I'll be at NIPS, we should organize a Keras meetup! We're doing a lot of sequence to sequence stuff as well, would be good to compare notes.
I will also be at NIPS!
@wxs It would really be good to compare notes! I know that another Keras contributor lives in Montreal; I will try to get in touch with him. Maybe let's do a post on the Google group for a meetup? PS: get ready for some serious cold :)
I will be at NIPS as well. I would be interested in the meetup.
Best regards, Mariano
@LeavesBreathe You might want to check out this ipython notebook https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb
He has set up an encoder-decoder and one with an attention mechanism. Note that he is doing it the same way (not feeding in the prediction), but the attention model can kind of compensate for this. There is also a toy problem (text prediction :)) set up. You would have to learn Lasagne (also a really good package) and use more Theano.
wow, that NIPS meetup will be fun. I'll go home in December, but you guys should write blog posts or something to let us know what happened.
I'll be at NIPS -- I'd love a meetup.
@LeavesBreathe You might want to check out this ipython notebook https://github.com/DTU-deeplearning/day3-RNN/blob/master/RNN.ipynb
This is a really good find. You're right that I would need to learn lasagne, but it may be worth it if it allows more capabilities.
I think for right now, I'm gonna wait for @EderSantana 's tutorial and go from there. Keras has been a huge help to me, and I hope I can start to contribute to it. In the meantime, I'm gonna try to start implementing the readout GRU.
Moving NIPS discussion to #962!
I have done a seq2seq implementation, based on http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
https://github.com/farizrahman4u/seq2seq
It has a stateful LSTM encoder and decoder, hidden state transfer from encoder to decoder, a feedback decoder (the output at step t is the input at step t+1), depth, and all the fancy stuff.
@farizrahman4u I saw your code; it looks pretty interesting. @LeavesBreathe I think you want to check that out.
One little thing: I saw that you use two LSTMs, one as encoder and another as decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite? Did it give you better results than using a single RNN? Thanks for sharing.
wow @farizrahman4u, I'm overwhelmed. This Keras community is ridiculous. A huge, huge thanks. I have a few questions if you have time:
I see the model takes input_length and output_length. But if we add a masking layer before the seq2seq layer, can we mask all zeros? This would be for both input and output, the output being really important (so the cost function is not biased).
If we want to add more GRUs or LSTMs to the decoding layer, would we make a network like this?
seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension,hidden_dim=hidden_variables_encoding,
output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)
model = Sequential()
M = Masking(mask_value=0)
M._input_shape = (x_maxlen, word2vec_dimension)
model.add(M)
model.add(seq2seq)
model.add(GRU(hidden_variables_decoding, return_sequences = True))
model.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
One little thing, I saw that you use two LSTMs, an encoder and another one for decoder. Isn't that approach more similar to Socher's encoder-decoder than the model presented in the paper you cite?
@EderSantana, from my understanding of Google's seq to seq paper, they got the best results using four LSTMs to encode and four LSTMs to decode, giving them a total of 8 LSTMs.
This is why I asked the question above. Ideally, you would want to use a lot of data along with multiple RNNs. The idea is that each one captures more salient features within your data.
Interestingly, one thing that improved my results in previous prediction experiments is to use not just one type of RNN. Instead, for my decoder layer, I use something like:
LSTM, JZS1, GRU (in that order)
gives me better results. I always lead with an LSTM because I feel it captures the most features. Anyways, my two cents for what it's worth. Like I said earlier, I'll be publicly graphing all my experiments (and labelling them as best I can) with @EderSantana's and @farizrahman4u's mods to Keras. Tonight, I'm gonna try leading with a bidirectional LSTM and see what happens.
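For concreteness, a mixed decoder stack like the one described might look roughly like this (a hypothetical sketch in the Keras style of that era; constructor arguments varied between 0.x releases, and JZS1 was later removed from Keras; all sizes are made up):

from keras.models import Sequential
from keras.layers.recurrent import LSTM, JZS1, GRU
from keras.layers.core import TimeDistributedDense

# Hypothetical sizes: 50 timesteps, 300-dim inputs, 256 hidden units, 1000-word vocab
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(50, 300)))  # lead with an LSTM
model.add(JZS1(256, return_sequences=True))
model.add(GRU(256, return_sequences=True))
model.add(TimeDistributedDense(1000, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')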
Hi @LeavesBreathe. You have to mask only after the seq2seq, not before. Consider sequences padded like:
["How are you <EOL> <EOL> <EOL> <EOL> <EOL>", "I am fine <EOL> <EOL> <EOL> <EOL> <EOL>"]
If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.
Once you are comfortable with the Seq2seq model, try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models (e.g. multiple dense layers in between encoder and decoder, one-encoder-many-decoders models, etc.). It's fun!
@EderSantana In Cho et al, the output from the encoder is fed to the decoder at every time step, and a readout is also present. But in seq2seq, the output from the encoder is fed to the decoder at the first time step only. Also, the hidden state is copied from encoder to decoder. So my model is more similar to the seq2seq paper.
I have added a new decoder: LSTMDecoder2. It is similar to Cho et al: the output from the encoder is fed to the decoder at every time step, along with the output of the previous time step. You may or may not enable hidden state copying when using this decoder. (It should work better when not enabled in the case of a conversational model, as the decoder could remember not only what was said by the human in the previous time steps, but also what it said in the previous time steps.)
You can also reuse a trained encoder for a new language pair:
EnglishToFrench = Seq2seq()
EnglishToFrench.compile()
EnglishToFrench.train()
encoder_data = EnglishToFrench.encoder.get_weights()
EnglishToSpanish = Seq2seq()
EnglishToSpanish.encoder.set_weights(encoder_data)
EnglishToSpanish.compile()
EnglishToSpanish.train()
You can also train multiple language pairs simultaneously (Encode in English, decode in other languages):
EnglishEncoder = LSTMEncoder()
FrenchDecoder = LSTMDecoder()
SpanishDecoder = LSTMDecoder()
GermanDecoder = LSTMDecoder()
dense = Dense()
EnglishEncoder.decoders = [FrenchDecoder, SpanishDecoder, GermanDecoder] #Multiple decoders. Wow!
model = Graph()
model.add_input(EnglishEncoder, "english")
model.add_node(dense,"dense", input="english")
model.add_output(FrenchDecoder, "french", input="dense")
model.add_output(SpanishDecoder, "spanish", input="dense")
model.add_output(GermanDecoder, "german", input="dense")
model.compile()
model.train()
If you mask the EOLs in the input, the decoder will find it difficult to terminate its outputs with EOLs. Instead, it will fill up its output length with garbage words.
Huh, I always thought you wanted to mask your input, but what you're saying makes sense. I guess the question then is: how do you mask the output, given the seq2seq model? Would it be something like this?
seq2seq = Seq2seq(input_length=x_maxlen, input_dim=word2vec_dimension,hidden_dim=hidden_variables_encoding,
output_dim=hidden_variables_decoding, output_length=y_maxlen, batch_size=10, depth=5)
model = Sequential()
model.add(seq2seq)
model.add(Masking(mask_value=0))
model.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
try playing around with the LSTMEncoder and LSTMDecoder layers and make your own custom Seq2seq models
Yes! I definitely want to do that. Last night, I actually got better results using strictly bidirectional LSTMs from Seya.
Once I have things working, I'll make a BidirectionalLSTMEncoder and Decoder. I'll submit a pull request to either you or @EderSantana. It will take me at least a few weeks to get there, though. I have a lot of matrix setup and testing I need to do.
I have added a new decoder: LSTMDecoder2, it is similar to Cho et al, the output from encoder is fed to the decoder at every time step, along with output of previous time step
This is really cool. I can't wait to try all of this out. I also saw you updated the conversational.py -- Thanks!
Thanks for pointing out that the optimal value is 4 and not 5.
It's not necessarily that the optimal value is 4. I actually went to med school for a while and studied neurology heavily. You'll find that the brain allocates different amounts of neurons for different tasks (along with different types of neurons), and different types of conversations require different amounts of neurons.
The bottom line is that, for different areas of conversation (or topics), there is a sweet spot in the number of neurons you want, and the brain automatically optimizes to that level. Talking about the weather takes fewer neurons and watts than theoretical physics discussions. This is also why people who speak 4 or 5 languages have more neurons allocated to language processing and production.
Right now, we do that optimizing manually by trying different numbers of layers/hidden states. So in summary, it's not that 4 is the most optimal for seq to seq; it's that for that dataset, with the task of translating English to French, 4 happens to be optimal. If you tried translating English to Chinese, I bet the optimal depth would be 6 or 7.
Didn't mean to ramble, but all I'm trying to say is: experiment with different depths. Sometimes I get better results with fewer hidden states but more layers. Sorry if I'm beating a dead horse.
@LeavesBreathe Thanks for clearing up the layer depth thing. Regarding masking:
One note: if anybody ever has to mask outputs, use sample_weight, not masking, to choose which values affect the cost function. In practice it doesn't matter what the network outputs after EOL, so we don't actually need to make it learn to output zeros. But if you are not familiar with sample_weight, do as @farizrahman4u says and just use the model as he suggested.
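A minimal sketch of the sample_weight approach (this uses the sample_weight_mode='temporal' option from later Keras releases; the 0.x API discussed in this thread exposed per-timestep weights slightly differently, and model/X/Y stand in for the model and padded data built earlier):

import numpy as np

# Y: (samples, timesteps, vocab) one-hot targets, zero-padded after the real words
weights = (Y.sum(axis=-1) > 0).astype('float32')   # 1.0 for real steps, 0.0 for padded steps

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
model.fit(X, Y, sample_weight=weights)              # padded steps no longer affect the cost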
@farizrahman4u Thanks for such a detailed description in regards to masking. I completely understand everything you're saying in regards to input. Fortunately, I do not have any 'out-of-vocab' words, so I won't be masking any input.
Thank you also for clarifying that EOL is a word and is the first thing to be predicted correctly. Makes complete sense.
I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?
I feel like it's an additional thing your network has to learn. If you're predicting sentences, your network has to learn that there is a length of 50 for each predicted sentence. Suppose the predicted sentence has 30 words; then the network has to learn to predict EOLs for the remaining 20 timesteps.
I'm wondering if this comes at a cost to the network. It might not, and I might just be worrying about nothing.
At this point, I feel bad asking more questions, so if you don't have time, don't bother addressing below:
There is another important aspect that I'm still cloudy on: transferring hidden states.
In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.
Suppose for the decoder layer we stack 4 LSTMs (depth = [4,4]). Does the hidden state transfer from layer to layer?
If so, what is the difference between transferring the hidden states, and just regularly stacking 4 LSTM Keras layers? Why is transferring the hidden state advantageous?
Thanks a lot again.
Is there any attention-based model example with Keras? Thank you very much.
@LeavesBreathe
I guess my main question is this: Is there a disadvantage to having your neural net predict repeated EOLs?
Not much.
Then the network has to learn to predict EOL's for the rest of the 20 timesteps.
This is not as tough as it sounds. Your network DOES NOT learn like this:
After 1 EOL, I should output EOL
After 2 EOLs, I should output EOL
After 3 EOLs, I should output EOL
......
After 19 EOLs I should output EOL
Instead, it just learns:
If my previous output is EOL:
output EOL
else if I do not have anything more to say:
output EOL
This rule is very simple to learn compared to the complex stuff your seq2seq model learns, like translation, conversation, etc. So don't worry.
Thanks @farizrahman4u for the clarification. The if statement makes much more sense. I'll get working on this and report back here if I find anything interesting that may help you guys.
I was having some trouble with masking my data a while back, and I was hoping someone could clarify a few things for me before I try a large experiment.
Q1. As my data is a 3D tensor, I pad the zeros at the BOTTOM of each individual feature matrix. Is this OK? (It's a matrix of zeros.)
I know the padding function in keras pads to the right, but in my case it has to be either above or below.
Otherwise, I have a masking layer after my input layer with the mask value set to 0.
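For concreteness, bottom-padding each (timesteps, features) matrix with zero rows might look like this (a hypothetical numpy sketch, not the actual preprocessing code):

import numpy as np

def pad_bottom(feats, max_len):
    # Zero rows appended below the real frames, up to a fixed length
    padded = np.zeros((max_len, feats.shape[1]), dtype=feats.dtype)
    padded[:feats.shape[0], :] = feats
    return padded

batch = np.stack([pad_bottom(np.random.randn(t, 13), 100) for t in (60, 80, 100)])
# batch.shape == (3, 100, 13); a Masking(mask_value=0.) layer after the input
# should then skip the all-zero rows.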
Thanks
@LeavesBreathe
In your seq2seq model, you transfer the hidden state from the encoder to the decoder, which makes complete sense. It eliminates the need for the RepeatVector. But beyond that, many people talk about transferring the hidden state from layer to layer within the decoder and encoder.
I have added a new function to StatefulRNN, called broadcast_state. So you can send your hidden state from any StatefulRNN to another. For example, here is a Seq2seq model with depth 2. The hidden state of encoder1 is propagated throughout the model.
encoder1 = LSTMEncoder(.........return_sequences=True)
encoder2 = LSTMEncoder(..........)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder3 = LSTMEncoder(.....return_sequences=True)
encoder4 = LSTMEncoder(.....return_sequences=True)
#Connect hidden layers
encoder1.broadcast_state(encoder2)
encoder2.broadcast_state(decoder)
decoder.broadcast_state(encoder3)
encoder3.broadcast_state(encoder4)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(encoder2)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder3)
seq2seq.add(encoder4)
I will be updating Seq2seq and Conversational shortly, stay tuned!
Wow Fariz... seriously man, you're awesome. This broadcast_state will be incredibly useful, and I definitely hope main Keras gets it eventually.
I am a little confused as to the seq2seq model you built in your snippet of code. You go from encode -> dense -> decode. How are you going from the 2D output of the dense layer to the 3D input required by the decoder? I thought you would need a RepeatVector for it, but it is better to transfer hidden states instead.
Also, a small technicality: for your encoder1 and decoder, you should have return_sequences = True, correct?
In the weeks to come, I really hope to give back to you (and everyone else) by testing a ton of seq2seq models. Hopefully I can give you some insight as I imagine you are doing seq to seq as well.
@gautamb85, I'm not familiar with mfcc, but couldn't you simply do a variation of what @farizrahman4u suggested to me? Couldn't you, instead of masking, place some sort of 'SILENT' token where the zeros are? In this way, you don't have to worry about affecting the seq2seq model or its output data.
@LeavesBreathe hmm, worth a shot, but it's trickier. I can't add a symbol, it has to be a floating point number (I hate real-valued data, why didn't I do NLP, lol). I'll keep you posted if something works.
@farizrahman4u First off... stellar job on the seq2seq model!! :) Q: Can I use the model to SCORE a pair of sequences? In Cho's original paper, they use the model to re-score English-French translation pairs. So, say I have a trained model and pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1, i.e. p(y|x)? Doing something like this would not give me the correct data likelihood, would it? (As I want the log likelihood and not the negative log likelihood, I would multiply by -1.)
objective_score = -1*model.evaluate(X_test, Y_test, batch_size=32)
I can't add a symbol, it has to be a floating point number
It doesn't necessarily have to be a symbol; it can be a real number, just keep it consistent. Suppose you use "3" as your silent token. Assuming you're doing speech to text, your model will learn that 3 is associated with the silent token output.
As an aside, I would choose a real number that is far away from the other numbers that represent your data. That way, it is treated as a completely separate entity from the rest of your data. As an FYI, this is how the brain does it in your auditory cortex: it assigns silence a certain value, and you consciously recognize that value as silence. This is why conditions like chronic tinnitus can't be cured: the brain never hears the "silence" value and continues to output the "ringing-in-my-ears" value.
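For what it's worth, a tiny numpy sketch of the "silent token" idea for real-valued features (the constant and shapes are hypothetical):

import numpy as np

SILENT = 100.0                          # a value far outside the normal feature range
feats = np.random.randn(60, 13)         # 60 real frames of 13-dim features
padded = np.full((100, 13), SILENT)     # a 100-frame canvas filled with the silence value
padded[:feats.shape[0], :] = feats      # real frames on top, "silence" below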
@LeavesBreathe
How are you going from the 2D output of the dense layer to the 3d input required by the decoder?
The decoder's input is 2D (Even though it inherits from LSTM). The output from each time step then becomes the input for the next time step.
Also: Small technicality. For your encoder1 and decoder , you should have return_sequences = True correct?
A decoder always has return_sequences = True by default. And yes, return_sequences = True for encoder1.
@LeavesBreathe I have added a new class called DeepLSTM. It has built in hidden state propagation.
Example:
deep = DeepLSTM(input_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
Notice the inner_return_sequences argument is False, which means the inner LSTMs will behave like Cho's encoder-decoder (RepeatVector in between non-sequence-returning RNNs).
On a side note, you can also use broadcast_state to send hidden state from one LSTM to multiple LSTMs (simply pass them as a list).
@farizrahman4u If you get a chance to answer my question, I would be most grateful :)
(pasted from above) Q. Can I use the model to SCORE a pair of sequences? In Cho's original paper, they use the model to re-score english-french translation pairs. So, say I have a trained model, and I have pairs of sequences to test - can I evaluate the conditional likelihood of seq-2 given seq-1 i.e. p(y|x) Doing something like this would not give me the correct data likelihood will it? (as I want the log likelihood and not the negative log likelihood, I would multiply by -1)
@gautamb85, don't mean to post over yours, but I do want to get back to Fariz.
@farizrahman4u ...I'm just running out of ways to compliment and thank you.
The output from each time step then becomes the input for the next time step.
Clever. Really clever.
It's great that you can list LSTMs with broadcast_state -- this is gold.
So the idea with DeepLSTM is that it saves you lines of code, right? You could technically build the deep LSTM with the code snippet you gave above using broadcast_state, correct?
To incorporate the DeepLSTM in the seq2seq model, would it be something like this?
encoder1 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
dense = Dense(.......)
decoder = LSTMDecoder2(.........)
encoder2 = DeepLSTM(input_dim=100, hidden_dim=100, output_dim=100,depth=4, return_sequences=True, inner_return_sequences=False, remember_state=False, batch_size=32)
#Connect hidden layers
encoder1.broadcast_state(decoder)
decoder.broadcast_state(encoder2)
#Build model
seq2seq = Sequential()
seq2seq.add(encoder1)
seq2seq.add(dense)
seq2seq.add(decoder)
seq2seq.add(encoder2)
seq2seq.add(Dropout(dropout))
seq2seq.add(TimeDistributedDense(y_matrix_axis, activation = 'softmax'))
seq2seq.compile(loss='categorical_crossentropy', optimizer='adam')
@LeavesBreathe @EderSantana @farizrahman4u @gautamb85 this is really interesting. I've had some work to do so missed the party somewhat. Can everyone post the model and library that they end up with as being the optimal approach for their dataset when all's said and done, and hopefully issue some pull requests to keras so we can get all this in one place? Great work guys
Can everyone post the model that they end up with as being the optimal approach when all's said and done
Glad you're back. I will post my best models along with the training graph as they come in! All my graphs are here: https://plot.ly/~oxygen123/folder/home
@LeavesBreathe Yes, saving lines of code is the idea. And it does all the RepeatVector stuff automatically. Also, for encoder1, return_sequences=False. The depth of encoder2 should be 3, so that for decoding you have 1 decoder + 3 encoders = 4 layers deep. @gautamb85 I saw your comment just now. I will come up with a detailed answer + code in a few hours.
Assume we are trying to learn a sequence-to-sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no option to pass a mask to the objective function. Won't this bias the cost function?
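For illustration, a small sketch of the setup being described (the vocabulary indices are made up; pad_sequences is the standard Keras preprocessing helper):

from keras.preprocessing.sequence import pad_sequences

X = pad_sequences([[3, 7, 2], [5, 1]], maxlen=5)      # inputs zero-padded to a fixed length
Y = pad_sequences([[4, 9, 2, 2], [8, 6]], maxlen=5)   # desired outputs zero-padded the same way
# Without per-timestep weighting or masking, the padded positions of Y enter the
# objective just like real targets, which is the bias being asked about.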