@farizrahman4u You Sir, are a deep learning angel! :) I did have a question related to a previous post. I think I read/saw that you are using 2 LSTMs in the encoder. Is the idea behind this a hierarchical RNN? I saw a paper on this recently: the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM. So if the lower RNN is fed words, its last time step would represent a sentence. Consequently, the last time step of the upper LSTM would encode the whole document.
Of course, as you mentioned, many seq2seq architectures can be experimented with.
PS. It's really late and I might be imagining all this. If that is the case, please ignore.
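(To make the hierarchical picture concrete, here is a minimal numpy sketch; rnn_step and all dimensions are made up for illustration, not the actual Keras layers:)

import numpy as np

def rnn_step(h, x, W, U):
    # one recurrent step: new hidden state from previous state and current input
    return np.tanh(x @ W + h @ U)

dim = 16
W_lo, U_lo = np.random.randn(dim, dim), np.random.randn(dim, dim)
W_hi, U_hi = np.random.randn(dim, dim), np.random.randn(dim, dim)

# document = list of sentences, sentence = array of word vectors
document = [np.random.randn(np.random.randint(3, 8), dim) for _ in range(5)]

h_doc = np.zeros(dim)
for sentence in document:
    h_sent = np.zeros(dim)
    for word in sentence:                      # lower RNN runs over words
        h_sent = rnn_step(h_sent, word, W_lo, U_lo)
    # last lower state = sentence encoding; feed it to the upper RNN
    h_doc = rnn_step(h_doc, h_sent, W_hi, U_hi)
# h_doc now encodes the whole document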
@farizrahman4u I don't know if this helps, but I can confirm (for sure) that at least in Cho et al.'s approach the decoder is trained by teacher forcing (though I guess you don't have to). That is, during training the decoder is fed the TRUE label of the previous time step, whereas at test time the true label is replaced by the prediction. If we want to score a pair of sequences, we still have the true labels at test time, so those can be used. I am not sure if you need the readout, or if it helps in this case, since you have the true label.
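(A minimal sketch of the two regimes, with a made-up decoder_step standing in for one decoder time step:)

START_TOKEN = 0  # made-up start-of-sequence id

def decode(decoder_step, h0, y_true, teacher_forcing):
    """decoder_step(prev_token, state) -> (predicted_token, new_state)."""
    h, prev, outputs = h0, START_TOKEN, []
    for t in range(len(y_true)):
        pred, h = decoder_step(prev, h)
        outputs.append(pred)
        # teacher forcing (training): feed the TRUE label from step t;
        # free-running (test/generation): feed the model's own prediction
        prev = y_true[t] if teacher_forcing else pred
    return outputs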
@farizrahman4u You Sir, are a deep learning angel! :)
Amen
That is, during training the decoder is fed the TRUE label of the previous time step, whereas at test time the true label is replaced by the prediction.
Maybe I'm missing something, but isn't this what we already do? For example, right now I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (the label). During testing, I test how close its prediction is to the actual labelled sentence.
I guess what I'm asking is: how else would you do this? Do you have to do teacher forcing?
Is the idea behind this a hierarchical RNN? I saw a paper on this recently: the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM.
I believe the whole idea is this:
You start off with a basic encoder LSTM and decoder LSTM:
words --> Embedding --> Encoder LSTM --> Dense --> Decoder LSTM --> TimeDistributedDense --> Softmax
However, to make this neural net capture more salient features, we add another encoding level after the decoder:
words --> Embedding --> Encoder1 LSTM --> Dense --> Decoder LSTM --> Encoder2 LSTM --> TimeDistributedDense --> Softmax
Lastly, we want to ensure that our encoder1 and encoder2 are big enough to capture all levels of abstraction, so we add multiple layers of LSTMs within each:
words --> Embedding --> Encoder1 LSTM (4 LSTMs) --> Dense --> Decoder LSTM (1 LSTM) --> Encoder2 LSTM (3 LSTMs) --> TimeDistributedDense --> Softmax
As a side note, there are Dense + RepeatVector layers in between each of the LSTMs within the encoder1 and encoder2 levels.
All hidden states are transferred (propagated) from each LSTM to the next. To do this, you need to use Fariz's built-in broadcast_state, so you can't do this with Keras alone. I did not include broadcast_state below because it would take up too much space. All laid out, it looks like this:
model = Sequential()
# Encoder 1 layer: 4 stacked LSTMs, with Dense + RepeatVector in between for shape compatibility
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
# Decoder layer
model.add(LSTMDecoder2(.........))
# Encoder 2 layer -- notice the change from x_sent_len to y_sent_len
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=True))
# Softmax over the output vocabulary at every time step
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
Anyone feel free to correct me if I'm wrong. I know this might be overkill, but I figure being as clear as possible is best so we're all on the same page.
@LeavesBreathe Maybe I'm missing something, but isn't this what we already do? For example, right now I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (the label). During testing, I test how close its prediction is to the actual labelled sentence. I guess what I'm asking is: how else would you do this? Do you have to do teacher forcing?
The idea is that an RNN is being used as a generative model of your data. So I want to find the log likelihood (negative cross entropy) of seq-2 given seq-1, i.e. p(seq-2 | seq-1). So it's like I have the correct label (say it was produced by some other system) and I want to score how good it is.
Yes, you would teacher force, like a language model. Say you have a sequence and a probability model P(seq) (given by the RNN). If you wanted to find how 'likely' the sequence is given the model, you would feed in the current time step and ask it to predict the next one. You wouldn't (perhaps additionally) feed the prediction to the next step, because you might end up finding the likelihood of a different sequence. (I'm not sure about that last sentence, lol.)
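(A minimal scoring sketch along those lines, assuming a hypothetical step_probs(prev_token, state) that returns the model's next-token distribution and new state:)

import numpy as np

START_TOKEN = 0  # made-up start-of-sequence id

def score_sequence(step_probs, h0, seq):
    """Log-likelihood of seq under teacher forcing: sum_t log p(y_t | y_<t)."""
    h, prev, loglik = h0, START_TOKEN, 0.0
    for y_t in seq:
        probs, h = step_probs(prev, h)  # model's distribution over the next token
        loglik += np.log(probs[y_t])    # credit the TRUE next token
        prev = y_t                      # teacher forcing: feed the true label forward
    return loglik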
I need to look at the code again, but this model seems a little strange (not to mention, I might be wrong). So the decoder produces an output at every time step, which is being fed to a new encoder. This guy will/should encode the sequence of predictions produced by the decoder into a single vector (if it works as a standard encoder). But a standard encoder produces a single output, so connecting it to a time-distributed layer without a RepeatVector should break your code (which I assume it doesn't, so something else must be going on).
If you are connecting an encoder to a timedistributed dense layer, it is not really an 'encoder' as it must be producing an output at every time-step to feed to the dense layer.
model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=True))
model.add(Dense(hidden_variables_encoding))
I am not sure if this is an 'encoder' per se, as the last return_sequences=True. I think the idea is to get a single vector encoding of the sequence.
Hey @gautamb85, I think this is a good discussion, though it would require a lot of typing to communicate back and forth. Do you want to Skype chat for a bit if you're free? I think talking this out would be easier, and we can post back here when we have a conclusion. My username is leavesbreathe.
@LeavesBreathe The final layer of encoder1 should have return_sequences=False
@LeavesBreathe The final layer of encoder1 should have return_sequences=False
My error. You want a final vector at the end of the encoder, so setting return_sequences=False makes complete sense.
Fariz, if you want to join the Skype chat, feel free to!
@gautamb85 I haven't tried Cho's rescoring thing. Can you please explain it to me: what is your input, what is the output, what are you trying to optimize, etc.?
@LeavesBreathe You two Skype and later comment your conclusions here for us. I have some work pending (#964, #928, and documentation of #893), so pretty busy :)
Sounds good -- I'll Skype with @gautamb85 (if he's cool with that), and we'll report back with what we agree is the best model. I'll test that model tonight if I can get everything set up.
@farizrahman4u Let's say I have an English-French translation pair. You would encode the English sentence (input 1). For the decoder, in Cho's paper (appendix), the GRU non-linearity gets 3 inputs.
Does a graph() model make sense for what I described? I'll post some pseudo code in a little while.
@LeavesBreathe Sure, Skype sounds good. Today is kinda busy, but I might be available later at night (if you are a late sleeper). Tomorrow evening should be cool. I'm in Montreal; hoping you're in an easy-to-coordinate time zone :)
I'm in Cincinnati, so we have the same time zone. I have some things I gotta do tonight, but I'm free now until 5, or 9pm to 11pm tonight. Whatever works best for you. Just add me on Skype and we can figure out a time. Sometimes I step away from my desk for a while, but if we schedule a time, I'll be sure to be online then.
@gautamb85
Providing an extra input (so that the context vector is fed explicitly to each time-step) might prove a little problematic as it might need a new layer to be written
This is already done using the LSTMDecoder2 layer in seq2seq. The output from the encoder is repeatedly input to the decoder at every time step.
@farizrahman4u Oh sweet! I didn't look at decoder2 yet. So I would do a graph() model sorta like this:
input-1 -> encoder
input-2 -> LSTMDecoder2
Does a graph() architecture make sense? I would need it to feed in the second sequence, no? I'll try it out and report back, but yes, it should solve the problem.
Q. I noticed that you have a dense layer between the stacked LSTMs in the encoder, and also between the encoder and decoder. Why do you do it this way? Is it shape related?
Yes, the Dense in between encoder and decoder is to make the shape compatible. There is no Dense in between the LSTM stack layers btw (it's only there in @LeavesBreathe's comment). There are RepeatVectors in between the stack layers, again for shape compatibility.
@gautamb85 This could be done fairly easily: all you have to do is add an additional input to LSTMDecoder2, so that at a given time step it will have 3 inputs: the context vector from the encoder, the current french word, and its own output from the previous time step.
_Note: The French word embedding size should be the same as the dimension of the context vector from the encoder._
Now let's see some pseudo code:
french = Sequential()
french.add(Embedding(...))
english = Sequential()  # this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., return_sequences=False))
# Cheat Keras: make it think the context vector is a single (length-1) input
english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')
# Decoder
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)
# optionally
english.broadcast_state(decoder)
model.compile(...)
I will post code for LSTMDecoder3 shortly!
Done!
import theano
import theano.tensor as T

from seq2seq.lstm_decoder import LSTMDecoder2

class LSTMDecoder3(LSTMDecoder2):
    def _step(self, si, sf, sc, so,
              x_tm1,
              h_tm1, c_tm1, v,
              u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):
        # Inputs = output from previous time step (x_tm1), vector from encoder (v),
        # and the current french word's gate projections (si, sf, sc, so)
        xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
        xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
        xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
        xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o
        i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
        f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
        c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
        o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
        h_t = o_t * self.activation(c_t)
        x_t = T.dot(h_t, w_x) + b_x  # readout, fed back as the next step's x_tm1
        return x_t, h_t, c_t

    def get_output(self, train=False):
        ip = self.get_input(train)
        v = ip[0]   # English context vector from encoder
        S = ip[1:]  # French sentence
        # Precompute the french words' contributions to each gate
        si = T.dot(S, self.S_i)
        sf = T.dot(S, self.S_f)
        sc = T.dot(S, self.S_c)
        so = T.dot(S, self.S_o)
        [outputs, hidden_states, cell_states], updates = theano.scan(
            self._step,
            sequences=[si, sf, sc, so],
            outputs_info=[v, self.h, self.c],  # x_tm1 is initialized with the context vector
            non_sequences=[v, self.U_i, self.U_f, self.U_o, self.U_c,
                           self.W_i, self.W_f, self.W_c, self.W_o,
                           self.W_x, self.V_i, self.V_f, self.V_c,
                           self.V_o, self.b_i, self.b_f, self.b_c,
                           self.b_o, self.b_x],
            truncate_gradient=self.truncate_gradient)
        if self.state_input is None and self.remember_state:
            self.updates = ((self.h, hidden_states[-1]), (self.c, cell_states[-1]))
        for o in self.state_outputs:
            o.updates = ((o.h, hidden_states[-1]), (o.c, cell_states[-1]))
        return outputs

    def set_params(self):
        super(LSTMDecoder3, self).set_params()
        dim = self.input_dim
        hdim = self.hidden_dim
        # Extra weight matrices for the french-sentence input at each gate
        self.S_i = self.init((dim, hdim))
        self.S_f = self.init((dim, hdim))
        self.S_c = self.init((dim, hdim))
        self.S_o = self.init((dim, hdim))
        self.params += [self.S_i, self.S_c, self.S_f, self.S_o]

    def build(self):
        self.set_params()
        self._build()
Might have typos/indentation issues because I am typing this on my phone and can't test it right now.
I will test it out.
A couple of questions:
english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)
Q. So you are concatenating the encoder output and the embedding for the french word into a single vector. So decoder3 takes this guy (the concatenated vector) as the new additional input? It is also getting the encoding explicitly at every time step (as was the case for LSTMDecoder2)?
Q. I don't get why you need to concatenate them.
Can't it be:
french = Sequential()
french.add(Embedding(...))
english = Sequential()  # this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., output_dim=xdim, return_sequences=False))
(maybe a reshape is needed over here)
Q. Can't I get the output from DeepLSTM, like context = output from DeepLSTM, and then do model.add(context) instead of add(merge)? Unless it's not easy to get the layer output.
decoder = LSTMDecoder3()
model = Sequential()
model.add(context)
model.add(LSTMDecoder3(....))
PS. You typed this on your phone? Mind = Blown :)
@gautamb85
PS. You typed this on your phone?
Most of it is copy-pasted from LSTMDecoder2.
So you are concatenating the encoder output and the embedding for the french word into a single vector.
No. I am concatenating the encoder output and the French SENTENCE (NOT WORD) into a single matrix (not vector).
If the French sentence was [word1, word2, word3, word4], after merging it with the context vector it would look like: [context, word1, word2, word3, word4]
So decoder3 takes this guy (the concatenated vector) as the new additional input?
No. This guy is THE input, not an additional one. Technically, the number of inputs for LSTMDecoder2 and LSTMDecoder3 is the same (which is 2). But logically, LSTMDecoder3 has one extra input (the merged input could be seen as 2 inputs).
It is also getting the encoding explicitly at every time step (as was the case for LSTMDecoder2)?
YES
Q. I don't get why you need to concatenate them
- We are packing the inputs for the decoder into a single tensor. The decoder then separates them out.
- If your sentence pair is ["how are you", "comment allez-vous"], your merged guy will look like this: [f("how are you"), e("comment"), e("allez"), e("vous")]. Here f("how are you") is the context vector that you get from the encoder.
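(A tiny numpy sketch of that merge, with made-up sizes; the context vector acts as an extra leading "time step" concatenated with the embedded sentence:)

import numpy as np

dim = 128                          # embedding size == context vector size
context = np.random.randn(1, dim)  # f("how are you"), a length-1 "time step"
words = np.random.randn(3, dim)    # e("comment"), e("allez"), e("vous")

# merged decoder input: [context, word1, word2, word3] along the time axis
merged = np.concatenate([context, words], axis=0)
print(merged.shape)                # (4, 128)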
Let's analyze the input to the decoder and its output for 4 time steps:
Time1: x1 = LSTM(context, word1, context)
Time2: x2 = LSTM(context, word2, x1)
Time3: x3 = LSTM(context, word3, x2)
Time4: x4 = LSTM(context, word4, x3)
Hope this helps. In your code, you are not using your french model at all!! Am I missing something?
My bad with the disconnected french model. (Also typed on my phone, lol.)
I see. Yeah, that's why you concatenate them, as the context is getting replaced after the first time step.
I thought it was like this: you concatenate so that the context is there at every step. Is that correct?
Thanks again. I will update you when I get something going.
@gautamb85 I think we are talking about slightly different models. Can you give me an example of your x_train, y_train etc?
@farizrahman4u I will have to get back to you on that in a little while.
From your code for LSTMDecoder3:
xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
where v is the context vector from the encoder, v_i is the weight matrix (for the input gate only), and x_tm1 is the prediction from the previous time step.
Could I replace the prediction x_tm1 with the actual french word? Or alternatively, additionally add a term T.dot(x_t, W), where x_t is the current french word. Which I guess is the reason to concatenate the context and the french sentence.
But I think what you are suggesting with the architecture (the concatenation) should achieve the same thing.
I need more clarity on what we are trying to do here. Let's start with what your training data will look like. I will then pick the best model for you.
I eventually want to use the model with speech. So my training data would be pairs of recordings that are padded to the same length: a 3D matrix of shape (N_samp, maxlen, feat_dim) representing a mini-batch of N_samp examples (corresponding to seq-1), and a similar matrix corresponding to seq-2.
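(With made-up sizes, that's just:)

import numpy as np

N_samp, maxlen, feat_dim = 32, 100, 40       # made-up sizes
X1 = np.zeros((N_samp, maxlen, feat_dim))    # padded seq-1 recordings
X2 = np.zeros((N_samp, maxlen, feat_dim))    # padded seq-2 recordings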
Now this model can be trained by teacher forcing, i.e. instead of feeding the prediction (as you do in LSTMDecoder2, I believe), you can feed in the true label from the previous time step.
This is how Cho et al. trained their models, both for generating translations and for scoring pairs of translations. In the generation case, you need to replace the true label with the prediction, as you don't have the true label. However, in the scoring case, you can use the model as is (if it is trained by teacher forcing), since we have access to both seq-1 and seq-2.
This paper is pretty interesting: http://arxiv.org/pdf/1511.01432.pdf
I gotta read it some more to fully understand it, but it looks like there's even more we can implement.
Fariz, I'm having some difficulty getting your classes to work. I'm gonna try a few variations first, but if I can't get any of them working, I'll report back here tomorrow afternoon or so.
@farizrahman4u Thank you for your decoder code. Is your decoder an attention model?
@tttwwy No. It's just a stateful LSTM with readout and hidden state broadcasting.
@LeavesBreathe Please open an issue in seq2seq for any problem you are facing. Try recloning; I just made an update.
@farizrahman4u I will be sure to open an issue on your seq2seq. Give me at least a full day, as I'm testing a lot of variations/debugging before I come to you with the final problem.
@tttwwy I agree with you that attention is very important. I'm working on some code that I hope to implement in two to three weeks to address this issue. Maybe add it on to Fariz's seq2seq model so we all have one working model.
@gautamb85 did you want to still chat tonight? I think it would be good to share each other's ideas. Doesn't have to be long. I can't PM you over GitHub, so if you can just add me on Skype, we can figure out a time. I'm free all day today.
Hey, sorry for the late reply. Are you free at 10-10:30 Eastern time?
@gautamb85 @LeavesBreathe @farizrahman4u Have you seen this: http://www.tensorflow.org/tutorials/seq2seq/index.md? Google just open sourced a deep learning toolkit with a graphical interface. It includes a sequence-to-sequence model. I am unreasonably excited. It has a Python interface and an attentional model, something I've really wanted and needed for my research.
@simonhughes22 Since you are interested in models besides Keras, have you seen blocks-examples? They have a machine translation model with attention working out of the box for en-cs. https://github.com/mila-udem/blocks-examples
Slightly off topic: @fchollet tweeted that Keras will seamlessly support both Theano and TensorFlow. Does this mean that Keras models could run on Android? Because TensorFlow has an Android example. In the meantime, is there any way to get a Keras model working on Android as of now? Has anyone tried it (like turning off all the C++ stuff and running Theano in pure Python mode)?
@gautamb85, sorry, but we had a power outage yesterday -- internet is still out, but we can hopefully chat tonight. Do you mind adding me on Skype (username is leavesbreathe) so that we don't need to take up space on this thread to schedule talking?
Hey guys, so I pretty much spent the entire day reading up on TensorFlow. I think the bottom line is that they have more capabilities (attention mechanism), but it is much messier than Keras. So basically, I've decided to try out TensorFlow, but I still want to use Keras (as I like Keras's community and logic flow more).
With all of that being said, I think it would be interesting to compare results using Keras versus TensorFlow. I hope to have TF up in the next week or so to see what type of results I'm getting.
@EderSantana @simonhughes22 @melonista @LeavesBreathe There is an attention-based NMT model which may be of some help to you: https://github.com/kyunghyuncho/dl4mt-material/tree/master/session3
@LeavesBreathe TensorFlow code is messy compared to Keras. It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features. We even have a working implementation of a Neural Turing Machine! (#990). We are the first to open source it.
It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features.
I totally agree with you. Keras is much cleaner and easier to contribute to (I don't even think TF is allowing PRs).
However, I want to at least try it for a little bit, because there may be a few things I learn from TF that we can implement in Keras. For example, they give an attention mechanism here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/seq2seq.py#L453-520
Btw, I was looking at the Neural Turing Machine that Eder wrote earlier and it looks so cool.
I agree with @farizrahman4u !!! Keras is much easier to program with, because we are focused on offering higher-level APIs.
To be more precise, I think we are the first to open source an NTM with RNN controllers. Others had implementations with feedforward controllers (which are less powerful).
For example, here is a simple LSTM classifying MNIST (running row by row) in TensorFlow: https://github.com/EderSantana/TwistedFate/blob/master/mnist_lstm.py It is fast to start running, but we have to hard-code all dimensions (fixed batch size, fixed sequence length, etc.).
@LeavesBreathe Sorry for not getting back to you. I don't have a Skype account; I will set one up and add you over the weekend.
@farizrahman4u I had a question about your code, specifically relating to prediction feedback:
def _step(self, si, sf, sc, so,
          x_tm1,
          h_tm1, c_tm1, v,
          u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):
    # Inputs = output from previous time step, vector from encoder, french sentence
    xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i
    xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
    xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
    xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o
    i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
    f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
    c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
    o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
    h_t = o_t * self.activation(c_t)
    x_t = T.dot(h_t, w_x) + b_x
    return x_t, h_t, c_t
Q. In that code snippet, x_t is the prediction (which is getting fed back via scan), and it is initialized as v (the context produced by the encoder). Correct?
Q. I am confused because, if this were regression, then x_t would represent the actual prediction of the model. However, for classification, this x_t would get fed to a softmax function, and then we would sample/argmax to get the actual prediction.
Is it equivalent to feed back just x_t (without doing softmax etc.), and does it work the same way at test time? I mean, at test time the x_t (before softmax) is fed back as the 'prediction', but the actual (visible) prediction is made by feeding these hidden outputs to a softmax layer after the Theano scan is done.
Q. I assume the get_output function (which returns the outputs) feeds them to a dense (softmax) layer that makes the prediction?
Ps. I know a lot of that may not sound clear, and I am happy to clarify.
@gautamb85 No problem, no rush -- I don't know if it's necessary that we talk, but if you want to chat, I think it would be good! Add me whenever you want!
@gautamb85 x_tm1 is the output from the previous time step (with initial value v); it need not be the actual prediction of the model at that time step (which is y_tm1, because it is difficult to access). Still, x_tm1 is a good representation of y_tm1. The above layer is a general layer, like the default LSTM in Keras. Whether it should do regression or classification is up to you: you simply stack activation layers over it. That being said, try doing a sigmoid/tanh over x_t and see if you find anything interesting.
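(For classification, that stacking might look like the following sketch, mirroring the earlier models in this thread; vocab_size is a placeholder:)

# ... encoder layers as before ...
model.add(LSTMDecoder3(....))  # raw x_t readouts at every time step
# project each readout onto the vocabulary and softmax it per time step;
# the visible prediction is then an argmax/sample from this distribution
model.add(TimeDistributedDense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')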
Hey guys, I'm back from exploring TensorFlow, and I'm fired up to keep working on @farizrahman4u's seq2seq. I'm having a few issues, Fariz, which I will post to your seq2seq channel. However, here are a few takeaways I got from TensorFlow:
@LeavesBreathe I am sort of offline right now. But as my seq2seq repo has gotten more attention than I anticipated, I will be spending more time on it once I am done with my upcoming exams :)
Best of luck with the exams -- I must say that your seq2seq repo has gotten much attention from my Skype contacts -- they keep asking me about it and are constantly comparing it to TensorFlow's seq2seq. I think you and TensorFlow have the best working seq2seq models right now.
@LeavesBreathe
I think you and TensorFlow have the best working seq2seq models right now.
That's a huge compliment. Thanks!
Regarding the attention mechanism, I will be converting the following project to Keras: https://github.com/npow/RNN-EM The API will be similar to that of an LSTM, so just replacing all LSTMs with the RNN_EM class would give you a seq2seq model with an attention mechanism.
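(Purely hypothetical sketch of what that drop-in swap might look like, assuming RNN_EM really does end up with LSTM-like constructor arguments:)

# hypothetical: assumes RNN_EM exposes the same constructor arguments as LSTM
model = Sequential()
model.add(RNN_EM(hidden_variables_encoding,
                 input_shape=(x_maxlen, word2vec_dimension),
                 return_sequences=False))  # drop-in replacement for the encoder LSTM
model.add(RepeatVector(y_sent_len))
model.add(RNN_EM(hidden_variables_encoding, return_sequences=True))
model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))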
That's gonna be killer. Beyond that, the only major feature I see that TensorFlow has is a sampled softmax, but I'm trying to work on a hierarchical softmax right now. It will definitely take me a while, as it has already been attempted in Keras.
Assume we are trying to learn a sequence-to-sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no way to pass a mask to the objective function. Won't this bias the cost function?
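(One workaround is to weight each time step's loss by a 0/1 mask derived from the padding; a minimal numpy sketch of the idea, not the actual Keras objective API:)

import numpy as np

def masked_categorical_crossentropy(y_true, y_pred):
    """Mean cross-entropy over non-padded time steps only.

    y_true, y_pred: (batch, time, vocab); padded steps are all-zero rows in y_true.
    """
    mask = (y_true.sum(axis=-1) > 0).astype('float32')   # 1 for real steps, 0 for padding
    ce = -(y_true * np.log(y_pred + 1e-8)).sum(axis=-1)  # per-timestep cross-entropy
    return (ce * mask).sum() / mask.sum()                # average over real steps only

# toy check: one sequence of 3 real steps + 1 padded step, vocab of 2
y_true = np.array([[[1, 0], [0, 1], [1, 0], [0, 0]]], dtype='float32')
y_pred = np.full((1, 4, 2), 0.5, dtype='float32')
print(masked_categorical_crossentropy(y_true, y_pred))   # ~log(2)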