keras-team / keras

Deep Learning for humans
http://keras.io/

is the Sequence to Sequence learning right? #395

Closed EderSantana closed 9 years ago

EderSantana commented 9 years ago

Assume we are trying to learn a sequence to sequence map. For this we can use Recurrent and TimeDistributedDense layers. Now assume that the sequences have different lengths. We should pad both input and desired sequences with zeros, right? But how will the objective function handle the padded values? There is no option to pass a mask to the objective function. Won't this bias the cost function?
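
One way this is commonly handled is to give the padded time steps zero weight in the loss. A minimal sketch, assuming a Keras version that supports sample_weight_mode='temporal'; the layer sizes and data below are illustrative, not from this issue:

import numpy as np
from keras.models import Sequential
from keras.layers.core import RepeatVector, TimeDistributedDense
from keras.layers.recurrent import LSTM

#Hedged sketch: per-timestep sample weights zero out the loss at padded positions
in_len, out_len, feat, vocab, hidden = 20, 15, 50, 1000, 128

model = Sequential()
model.add(LSTM(hidden, input_shape=(in_len, feat), return_sequences=False))  #encoder
model.add(RepeatVector(out_len))
model.add(LSTM(hidden, return_sequences=True))                               #decoder
model.add(TimeDistributedDense(vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')

X = np.zeros((8, in_len, feat))    #zero-padded inputs
Y = np.zeros((8, out_len, vocab))  #zero-padded one-hot targets (all-zero rows = padding)
mask = (Y.sum(axis=-1) != 0).astype('float32')  #(samples, out_len): 1 for real steps, 0 for padding
model.fit(X, Y, sample_weight=mask)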

gautamb85 commented 8 years ago

@farizrahman4u You Sir, are a deep learning angel! :) I did have a question related to a previous post. I think I read/saw that you are using 2 LSTMs in the encoder. Is the idea behind this a hierarchical RNN? I saw a paper on this recently: the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM. So if the lower RNN is fed words, its last time step would represent a sentence. Consequently, the last time step of the upper LSTM would encode the whole document.

Of course, as you mentioned, many seq2seq architectures can be experimented with.

PS. It's really late and I might be imagining all this. If that is the case, please ignore.

gautamb85 commented 8 years ago

@farizrahman4u I don't know if this helps, but I can confirm (for sure) that, at least in Cho et al.'s approach, the decoder is trained by teacher forcing (though I guess you don't have to). That is, during training the decoder is fed the TRUE label of the previous time-step, whereas at test time the true label is replaced by the prediction. If we want to score a pair of sequences, we still have the true labels at test time, so those can be used. I am not sure if you need the readout or if it helps in this case, since you have the true label.
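
To make the teacher forcing distinction concrete, here is a minimal sketch in plain Python; decoder_step is a stand-in for one decoder time step (previous token in, prediction out), not a real Keras or seq2seq function:

#Hedged sketch of teacher forcing (training/scoring) vs. free running (generation)
def decode(decoder_step, start_token, true_targets=None, length=None):
    outputs, prev = [], start_token
    steps = len(true_targets) if true_targets is not None else length
    for t in range(steps):
        pred = decoder_step(prev)
        outputs.append(pred)
        if true_targets is not None:
            prev = true_targets[t]  #training/scoring: feed the TRUE label from the previous step
        else:
            prev = pred             #generation: feed back the model's own prediction
    return outputs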

NickShahML commented 8 years ago

@farizrahman4u You Sir, are a deep learning angel ! :)

Amen

That is, during training the decoder is fed the TRUE label of the previous time-step, whereas at test time the true label is replaced by the prediction.

Maybe I'm missing something but isn't this what we already do? For example, right now, I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (label). During testing, I test how close its prediction is to the actual sentence that is labelled.

I guess what I'm asking is: how else would you do this? You have to do teacher forcing?

Is the idea behind this a hierarchical RNN? I saw a paper on this recently: the upper LSTM takes as its input (at each of its time steps) the encoding produced by the lower LSTM.

I believe the whole idea is this:

You start off with a basic encoder LSTM and decoder LSTM:

words --> Embedding --> Encoder LSTM --> Dense --> Decoder LSTM --> TimeDistributedDense --> Softmax

However, to make this neural net capture more salient features, we add another encoding level after the decoder:

words --> Embedding --> Encoder1 LSTM --> Dense --> Decoder LSTM --> Encoder2 LSTM --> TimeDistributedDense --> Softmax

Lastly, we want to ensure that our encoder1 and our encoder2 are big enough to capture all levels of abstraction, so we add multiple layers of LSTMs within encoder1 and encoder2:

words --> Embedding --> Encoder1 LSTM (4 LSTMs) --> Dense --> Decoder LSTM (1 LSTM) --> Encoder2 LSTM (3 LSTMs) --> TimeDistributedDense --> Softmax

As a side note, there are Dense + RepeatVector layers in between each of the LSTMs within the encoder1 and encoder2 levels.

All hidden states are transferred (propagated) from each previous LSTM to the next LSTM. To do this, you need to use Fariz's broadcast_state that he has built in, so you can't do this with Keras alone. I did not include broadcast_state below because it would take up too much space. All laid out, it looks like this:

model = Sequential()

#Encoder 1 layer

model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))

#Decoder Layer
model.add(LSTMDecoder2(.........))

#Encoder 2 Layer -- notice the change from x_sent_len to y_sent_len

model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(y_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=True))

#softmax

model.add(TimeDistributedDense(y_matrix_axis, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

Anyone feel free to correct me if I'm wrong. I know this might be overkill, but I figure being as clear as possible is best so we're all on the same page.

gautamb85 commented 8 years ago

@LeavesBreathe Maybe I'm missing something but isn't this what we already do? For example, right now, I give it 2 sentences and ask it to predict the next one. During training, I give it the next one it should predict (label). During testing, I test how close its prediction is to the actual sentence that is labelled. I guess what I'm asking is: how else would you do this? You have to do teacher forcing?

The idea is that an RNN is being used as a generative model of your data. So I want to find the log-likelihood (negative cross-entropy) of seq-2 given seq-1, i.e. p(seq-2 | seq-1). So it's like I have the correct label (say it was produced by some other system) and I want to score how good it is.

Yes, you would teacher force, like a language model. Say you have a sequence and a probability model P(seq) (given by the RNN). If you wanted to find how 'likely' the sequence is under the model, you would feed in the current time-step and ask it to predict the next one. You wouldn't (perhaps additionally) feed the prediction to the next step, because you might end up finding the likelihood of a different sequence. (I'm not sure about that last sentence, lol.)
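
A minimal sketch of that scoring idea in plain numpy; probs is assumed to come from a teacher-forced forward pass of the model (this is not a Keras call):

import numpy as np

#Hedged sketch: score p(seq-2 | seq-1) under a teacher-forced model.
#probs[t] is the softmax over the vocabulary at step t (computed from the TRUE
#previous tokens and the seq-1 encoding); targets[t] is the true token id at step t.
def sequence_log_likelihood(probs, targets, eps=1e-8):
    probs, targets = np.asarray(probs), np.asarray(targets)
    step_ll = np.log(probs[np.arange(len(targets)), targets] + eps)
    return step_ll.sum()  #higher means the pair is scored as more likely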

I need to look at the code again, but this model seems a little strange. Not to mention, I might be wrong. So the decoder produces an output at every time-step, which is being fed to a new encoder. This guy will/should encode the sequence of predictions produced by the decoder into a single vector (if it works as a standard encoder). Though a standard encoder produces a single output, so connecting it to a time-distributed layer without a RepeatVector should break your code (which I assume it doesn't, so something else must be going on).

If you are connecting an encoder to a timedistributed dense layer, it is not really an 'encoder' as it must be producing an output at every time-step to feed to the dense layer.

gautamb85 commented 8 years ago

Encoder 1 layer

model.add(LSTM(hidden_variables_encoding, input_shape=(x_maxlen, word2vec_dimension), return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=False))
model.add(Dense(hidden_variables_encoding))
model.add(RepeatVector(x_sent_len))
model.add(LSTM(hidden_variables_encoding, return_sequences=True))
model.add(Dense(hidden_variables_encoding))

I am not sure if this is an 'encoder' per se, as the last return_sequences=True. I think the idea is to get a single-vector encoding of the sequence.

NickShahML commented 8 years ago

Hey @gautamb85 I think this is a good discussion, though it would require a lot of typing to communicate back and forth. Do you want to Skype chat for a bit if you're free? I think talking this out would be easier, and we can post back here when we have a conclusion. My username is leavesbreathe.

farizrahman4u commented 8 years ago

@LeavesBreathe The final layer of encoder1 should have return_sequences=False

NickShahML commented 8 years ago

@LeavesBreathe The final layer of encoder1 should have return_sequences=False

My error. You want a final vector in the end of the encoder, so setting return_sequences=False makes complete sense.

Fariz, if you want to join the skype chat, feel free to!

farizrahman4u commented 8 years ago

@gautamb85 I haven't tried Cho's rescoring thing. Can you please explain it to me... like what is your input, what is the output, what are you trying to optimize, etc.?

farizrahman4u commented 8 years ago

@LeavesBreathe You two skype and later comment your conclusions here for us. I have some work pending, #964, #928, and documentation of #893. So pretty busy:)

NickShahML commented 8 years ago

Sounds good -- I'll skype with @gautamb85 (if he's cool with that), and we'll get back as to what we agree is the most optimal model. I'll test that model tonight if I can get everything setup.

gautamb85 commented 8 years ago

@farizrahman4u Let's say I have an English-French translation pair. So you would encode the English sentence (input 1). For the decoder, in Cho's paper (appendix), the GRU non-linearity gets 3 inputs.

Does a graph() model make sense for what I described? I'll post some pseudo code in a little while.

gautamb85 commented 8 years ago

@LeavesBreathe Sure, Skype sounds good. Today is kinda busy, but I might be available later at night (if you are a late sleeper). Tomorrow evening should be cool. I'm in Montreal, hoping you're in an easy-to-coordinate time zone :)

NickShahML commented 8 years ago

I'm in Cincinnati, so we have the same time zone. I have some things I gotta do tonight, but I'm free now until 5, or 9pm to 11pm tonight. Whatever works best for you. Just add me on Skype and we can figure out a time. Sometimes I step away from my desk for a while, but if we schedule a time, I'll be sure to be online then.

farizrahman4u commented 8 years ago

@gautamb85

Providing an extra input (so that the context vector is fed explicitly to each time-step) might prove a little problematic as it might need a new layer to be written

This is already done using the LSTMDecoder2 layer in seq2seq. The output from the encoder is fed to the decoder at every time step.

gautamb85 commented 8 years ago

@farizrahman4u Oh sweet! I didn't look at decoder2 yet. So I would do a graph() model sorta like this: input-1 -> encoder, input-2 -> LSTMDecoder2.

Does a graph() architecture make sense? I would need it to feed in the second sequence, no? I'll try it out and report back, but yes, it should solve the problem.

Q. I noticed that you have a Dense layer between stacked LSTMs in the encoder, and also between the encoder and decoder. Why do you do it this way? Is it shape related?

farizrahman4u commented 8 years ago

Yes, the Dense in between the encoder and decoder is to make the shapes compatible. There is no Dense in between the LSTM stack layers, btw (it's only there in @LeavesBreathe's comment). There are RepeatVectors in between the stack layers, again for shape compatibility.
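
A minimal sketch of that shape bookkeeping; sizes are made up, and the module paths assume the old keras.layers.core / keras.layers.recurrent layout used at the time of this thread:

from keras.models import Sequential
from keras.layers.core import Dense, RepeatVector
from keras.layers.recurrent import LSTM

#Hedged sketch: an LSTM with return_sequences=False emits (batch, hidden), but the
#next LSTM needs a time axis again, hence the RepeatVector (and Dense as a size adapter).
model = Sequential()
model.add(LSTM(128, input_shape=(20, 50), return_sequences=False))  #-> (batch, 128)
model.add(Dense(128))                                               #-> (batch, 128)
model.add(RepeatVector(20))                                         #-> (batch, 20, 128)
model.add(LSTM(128, return_sequences=True))                         #-> (batch, 20, 128)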

farizrahman4u commented 8 years ago

@gautamb85 This could be done fairly easily... all you have to do is add an additional input to LSTMDecoder2, so that at a given time step it will have 3 inputs:

_Note: the French word embedding size should be the same as the dimension of the context vector from the encoder._

Now let's see some pseudo code:


french = Sequential()
french.add(Embedding(...))

english = Sequential()  #this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., return_sequences=False))

#Cheat Keras: make it think the context is a single input
english.add(Reshape(1, french_embedding_size))
#Concatenate along the time axis: [context, word1, ..., wordN]
merge = Merge([english, french], mode='concat')

#Decoder
decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)

#optionally
english.broadcast_state(decoder)

model.compile(...)

I will post code for LSTMDecoder3 shortly!

farizrahman4u commented 8 years ago

Done!

import theano
import theano.tensor as T

from seq2seq.lstm_decoder import LSTMDecoder2

class LSTMDecoder3(LSTMDecoder2):

    def _step(self, si, sf, sc, so,
              x_tm1,
              h_tm1, c_tm1, v,
              u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):

        #Inputs = output from previous time step, vector from encoder, french sentence
        xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i 
        xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
        xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
        xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o

        i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
        f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
        c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
        o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
        h_t = o_t * self.activation(c_t)

        x_t = T.dot(h_t, w_x) + b_x
        return x_t, h_t, c_t

    def get_output(self, train=False):
        ip = self.get_input(train)
        v = ip[0]#English context vector from encoder
        S = ip[1:]#French Sentence
        si = T.dot(S, self.S_i)
        sf = T.dot(S, self.S_f)
        sc = T.dot(S, self.S_c)
        so = T.dot(S, self.S_o) 
        [outputs,hidden_states, cell_states], updates = theano.scan(
            self._step,
            sequences=[si, sf, sc, so],
            outputs_info=[v, self.h, self.c],
            non_sequences=[v, self.U_i, self.U_f, self.U_o, self.U_c,
                          self.W_i, self.W_f, self.W_c, self.W_o,
                          self.W_x, self.V_i, self.V_f, self.V_c,
                          self.V_o, self.b_i, self.b_f, self.b_c, 
                          self.b_o, self.b_x],
            truncate_gradient=self.truncate_gradient)
        if self.state_input is None and self.remember_state:
            self.updates = ((self.h, hidden_states[-1]),(self.c, cell_states[-1]))
        for o in self.state_outputs:
            o.updates=((o.h, hidden_states[-1]),(o.c, cell_states[-1]))           
        return outputs

    def set_params(self):
        super(LSTMDecoder3, self).set_params()
        dim = self.input_dim
        hdim = self.hidden_dim
        self.S_i = self.init((dim, hdim))
        self.S_f = self.init((dim, hdim))
        self.S_c = self.init((dim, hdim))
        self.S_o = self.init((dim, hdim))
        self.params += [self.S_i,self.S_c, self.S_f, self.S_o]

    def build(self):
        self.set_params()
        self._build()

Might have typos/indentation issues because I am typing this on my phone and can not test it right now.

gautamb85 commented 8 years ago

I will test it out.

A couple of questions:

Cheat keras..make it think its a single input

english.add(Reshape(1, french_embedding_size))
merge = Merge([english, french], mode='concat')

Decoder

decoder = LSTMDecoder3(....)
model = Sequential()
model.add(merge)
model.add(decoder)

Q. So you are concatenating the encoder output and the embedding for the French word into a single vector. So decoder3 takes this guy (the concatenated vector) as the new additional input? It is also getting the encoding explicitly at every time-step (as was the case for LSTMDecoder2)?

Q. I don't get why you need to concatenate them.

Can't it be -

french = Sequential()
french.add(Embedding(...))

english = Sequential()  #this is your encoder
english.add(Embedding(...))
english.add(DeepLSTM(..., output_dim=xdim, return_sequences=False))

(maybe a Reshape is needed over here) Q. Can't I get the output from DeepLSTM, like: context = output from DeepLSTM, and then do model.add(context) instead of model.add(merge)? Unless it's not easy to get the layer output.

Decoder

decoder = LSTMDecoder3(....)
model = Sequential()
model.add(context)
model.add(decoder)

PS. You typed this on your phone? Mind = Blown :)

farizrahman4u commented 8 years ago

@gautamb85

PS. You typed this on your phone?

Most of it is copy-pasted from LSTMDecoder2.

. So you are concatenating the encoder output and the embedding for the french word into a single vector.

No. I am concatenating the encoder output and the French SENTENCE (NOT WORD) into a single matrix (not a vector).

If the French sentence was [word1, word2, word3, word4], after merging it with the context vector it would look like : [context, word1, word2, word3, word4]
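
In terms of shapes, a plain numpy illustration (not the actual Merge call; sizes are made up):

import numpy as np

#Hedged shape illustration: the context vector gets a time axis of length 1 and is
#prepended to the embedded French sentence along the time axis.
batch, timesteps, dim = 32, 4, 128
context = np.zeros((batch, 1, dim))           #f("how are you") from the encoder, reshaped
sentence = np.zeros((batch, timesteps, dim))  #e(word1) ... e(word4)
merged = np.concatenate([context, sentence], axis=1)
print(merged.shape)  #(32, 5, 128) -> [context, word1, word2, word3, word4]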

So decoder3 takes this guy (concatenated vector) as the new additional input ?

No. This guy is THE input, not additional. Technically, the number of inputs for LSTMDecoder2 and LSTMDecoder3 is the same (which is 2). But logically, LSTMDecoder3 has one extra input (the merged input could be seen as 2 inputs).

It is also getting the encoding explicitly at every time-step (as was the case for lstmdecoder2)?

YES

Q. I don't get why you need to concatenate them

  • We are packing the inputs for the decoder into a single tensor. The decoder then separates them out.
  • If your sentence pair is ["how are you", "comment allez-vous"], your merged guy will look like this: [ f("how are you"), e("comment"), e("allez"), e("vous")]. Here f("how are you") is your context vector that you get from the encoder.

Let's analyze the input to the decoder and its output for 4 time steps:

Time1: x1 = LSTM(context, word1, context)
Time2: x2 = LSTM(context, word2, x1)
Time3: x3 = LSTM(context, word3, x2)
Time4: x4 = LSTM(context, word4, x3)

Hope this helps. In your code, you are not using your french model at all!! Am I missing something?

gautamb85 commented 8 years ago

My bad with the disconnected french model. (Also typed on my phone. Lol)

I see. Yeah, that's why you concatenate them, as the context is getting replaced after the first time step.

I thought it was like this :

You concatenate so that the context is there at every step. Is that correct?

Thanks again. I will update you when I get something going.


farizrahman4u commented 8 years ago

@gautamb85 I think we are talking about slightly different models. Can you give me an example of your x_train, y_train etc?

gautamb85 commented 8 years ago

@farizrahman4u I will have to get back to you on that in a little while.

From your code of LSTMdecoder3:
xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i

where v is the context vector from the encoder, v_i is the corresponding weight matrix (for the input gate only), and x_tm1 is the prediction from the previous time-step.

Could I replace the prediction x_tm1 with the actual French word, or alternatively add a term T.dot(x_t, W) where x_t is the current French word? Which I guess is the reason to concatenate the context and the French sentence.

But I think what you are suggesting with the architecture (the concatenation) should achieve the same thing.

farizrahman4u commented 8 years ago

I need more clarity on what we are trying to do here. Let's start with what your training data will look like. I will then pick the best model for you.

gautamb85 commented 8 years ago

I eventually want to use the model with speech. So my training data would be pairs of recordings that are padded to the same length. So it would be a 3D matrix (N_samp, maxlen, feat_dim) representing a mini-batch of N_samp examples (corresponding to seq-1), and a similar matrix corresponding to seq-2.

Now this model can be trained by teacher forcing, i.e. instead of feeding the prediction (as you do in LSTMDecoder2, I believe) you can feed in the true label from the previous time-step.

This is how Cho et al. trained their models, both for generating translations and for scoring pairs of translations. In the generation case, you need to replace the true label with the prediction, as you don't have the true label. However, in the scoring case, you can use the model as is (if it is trained by teacher forcing), since we have access to both seq-1 and seq-2.

NickShahML commented 8 years ago

This paper is pretty interesting: http://arxiv.org/pdf/1511.01432.pdf

I gotta read it some more to fully understand it, but looks like there's even more we can implement.

Fariz, I'm having some difficulty with getting your classes to work. I'm gonna try a few variations first, but if I can't get any of them working, I'll report back here tomorrow afternoon or so.

tttwwy commented 8 years ago

@farizrahman4u Thank you for your decoder code. Is your decoder code an attention model?

farizrahman4u commented 8 years ago

@tttwwy No. It's just a stateful LSTM with readout and hidden state broadcasting.

farizrahman4u commented 8 years ago

@LeavesBreathe Please open an issue in seq2seq for any problem you are facing. Try recloning. I just made an update.

NickShahML commented 8 years ago

@farizrahman4u I will be sure to open an issue on your seq2seq. Give me at least a full day as I'm testing a lot of variations/debugging before I come to you with the final problem.

@tttwwy I agree with you that attention is very important. I'm working on some code that I hope to implement in two to three weeks to address this issue. Maybe add on to Fariz's seq to seq model so we all have one working model.

@gautamb85 did you want to still chat tonight? I think it would be good to share each other's ideas with each other. Doesn't have to be long. I can't pm you over github, so if you can just add me on skype and we can figure out a time. I'm free all day today.

gautamb85 commented 8 years ago

Hey. Sorry for the late reply. Are you free at 10-10:30 Eastern time?


simonhughes22 commented 8 years ago

@gautamb85 @LeavesBreathe @farizrahman4u have you seen this: http://www.tensorflow.org/tutorials/seq2seq/index.md? Google just open sourced a deep learning toolkit with a graphical interface. Includes a sequence to sequence model. I am unreasonably excited. It has a Python interface and an attentional model, something I've really wanted and needed for my research.

EderSantana commented 8 years ago

@simonhughes22 since you are interested in models besides Keras, have you seen blocks-examples? They have a machine translation model with attention working out of the box for en-cs. https://github.com/mila-udem/blocks-examples

farizrahman4u commented 8 years ago

Slightly off topic: @fchollet tweeted that Keras will seamlessly support both Theano and TensorFlow. Does this mean that Keras models could run on Android? Because TensorFlow has an Android example. In the meantime, is there any way to get a Keras model to work on Android as of now? Has anyone tried it (like turning off all the C++ stuff and running Theano in pure Python mode)?

NickShahML commented 8 years ago

@gautamb85 , sorry but we had a power outage yesterday -- internet is still out but we can hopefully chat tonight. Do you mind adding me on Skype (username is leavesbreathe) so that we don't need to take up space on this thread to schedule talking?

NickShahML commented 8 years ago

Hey guys, so I pretty much spent the entire day reading up on TensorFlow. I think the bottom line is that they have more capabilities (attention mechanism), however it is much messier than Keras. So basically, I've decided to try out TensorFlow, but I still want to use Keras (as I like Keras's community and logic flow more).

With all of that being said, I think it would be interesting to compare results using Keras versus TensorFlow. I hope to have TF up in the next week or so to see what type of results I'm getting.

tttwwy commented 8 years ago

@EderSantana @simonhughes22 @melonista @LeavesBreathe there is an attention-based NMT model which may be of some help to you: https://github.com/kyunghyuncho/dl4mt-material/tree/master/session3

farizrahman4u commented 8 years ago

@LeavesBreathe TensorFlow code is messy when compared to Keras. It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features. We even have a working implementation of a Neural Turing Machine! (#990) We are the first to open source it.

NickShahML commented 8 years ago

It is easier to contribute to Keras, which means in the long run Keras will be filled with a lot of features.

I totally agree with you. Keras is much cleaner and easier to contribute to (I don't even think TF is allowing PRs).

However, I want at least try it for a little bit because there may be a few things I learn from TF that we can implement in Keras. For example, they give an attention model mechanism here:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/seq2seq.py#L453-520

Btw, I was looking at the Neural Turing Machine that Eder wrote earlier, and it looks so cool.

EderSantana commented 8 years ago

I agree with @farizrahman4u !!! Keras is much easier to program with, because we are focused on offering higher-level APIs.

To be more precise, I think we are the first to open source an NTM with RNN controllers. Others had implementations with feedforward controllers (which are less powerful).

For example, here is a simple LSTM classifying MNIST (running row by row) in TensorFlow: https://github.com/EderSantana/TwistedFate/blob/master/mnist_lstm.py It is fast to start running, but we have to hard-code all dimensions (fixed batch size, fixed sequence length, etc.).

gautamb85 commented 8 years ago

@LeavesBreathe Sorry for not getting back to you. I don't have a Skype account; I will set it up and add you over the weekend.

@farizrahman4u I had a question about your code. Specifically relating to prediction feedback:

def _step(self, si, sf, sc, so,
          x_tm1,
          h_tm1, c_tm1, v,
          u_i, u_f, u_o, u_c, w_i, w_f, w_c, w_o, w_x, v_i, v_f, v_c, v_o, b_i, b_f, b_c, b_o, b_x):

    #Inputs = output from previous time step, vector from encoder, french sentence
    xi_t = T.dot(x_tm1, w_i) + T.dot(v, v_i) + si + b_i 
    xf_t = T.dot(x_tm1, w_f) + T.dot(v, v_f) + sf + b_f
    xc_t = T.dot(x_tm1, w_c) + T.dot(v, v_c) + sc + b_c
    xo_t = T.dot(x_tm1, w_o) + T.dot(v, v_o) + so + b_o

    i_t = self.inner_activation(xi_t + T.dot(h_tm1, u_i))
    f_t = self.inner_activation(xf_t + T.dot(h_tm1, u_f))
    c_t = f_t * c_tm1 + i_t * self.activation(xc_t + T.dot(h_tm1, u_c))
    o_t = self.inner_activation(xo_t + T.dot(h_tm1, u_o))
    h_t = o_t * self.activation(c_t)

    x_t = T.dot(h_t, w_x) + b_x
    return x_t, h_t, c_t

Q. In that code snippet, x_t is the prediction (which is getting fed back via scan) and it is initialized as v (the context produced by the encoder), correct?

Q. I am confused because, if this was regression, then x_t represents the actual prediction of the model. However for classification, this x_t would get fed to a softmax function, and then we would sample/argmax to get the actual prediction.

Is it equivalent to feed back just x_t (without doing softmax etc.), and does it work the same way at test time? I mean, at test time the x_t (before softmax) is fed back as the 'prediction'; however, the actual (visible) prediction is made by feeding these hidden predictions to a softmax layer after the Theano scan is done.

Q. I assume the get_output function (that returns the outputs) feeds them to a dense (softmax) layer that makes the prediction?

Ps. I know a lot of that may not sound clear, and I am happy to clarify.

NickShahML commented 8 years ago

@gautamb85 no problem, no rush -- I don't know if it's necessary that we talk, but if you want to chat, I think it would be good! Add me whenever you want!

farizrahman4u commented 8 years ago

@gautamb85 x_tm1 is the output from the previous timestep (with initial value v); it need not be the actual prediction of the model at that timestep (which is y_tm1, because it is difficult to access). Still, x_tm1 is a good representation of y_tm1. The above layer is a general layer, like the default LSTM in Keras. Whether it should do regression or classification is up to you; you simply stack activation layers over it. That being said, try doing a sigmoid/tanh over x_t and see if you find anything interesting.
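
A minimal sketch of that last suggestion, assuming the LSTMDecoder3._step posted above; the helper name is made up for illustration:

import theano.tensor as T

#Hedged sketch: squash the readout with tanh before it is fed back at the next step.
#h_t is the current hidden state, w_x/b_x the readout weights/bias, as in _step above.
def squashed_readout(h_t, w_x, b_x):
    #replaces the unsquashed readout x_t = T.dot(h_t, w_x) + b_x
    return T.tanh(T.dot(h_t, w_x) + b_x)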

NickShahML commented 8 years ago

Hey Guys, I'm back from exploring TensorFlow, and I'm fired up to keep working on @farizrahman4u's seq2seq. I'm having a few issues, Fariz, which I will post to your seq2seq repo. However, here are a few takeaways I got from TensorFlow:

  1. In each sentence, they append a sentence start token (GO) and a sentence end token, and then they pad with a separate symbol (PAD). Thus for the output, the PAD is on the right. However, for the input, they reverse the entire sentence, meaning the PAD is on the left. I thought this was interesting.
  2. They have two separate types of LSTM cells: a basic and a more advanced version. In their translation example, they used the basic one as it is much faster. Just something to take note of.
  3. The biggest feature they have over Fariz's seq2seq is the attention mechanism, which helps especially for long input sentences. Once I get Fariz's seq2seq working, I plan on working hard to add this in.
  4. Another feature they have is a sampled softmax. This is nice because it allows you to do a softmax over only a fraction of your words, and you get pretty good results. This allows you to do a softmax over 200,000 words on a GPU. However, I think the hierarchical softmax (another area I want to investigate) works as well, so I don't think this is a killer feature.
  5. For efficiency, they use bucketing, which is sorting sentences by length and placing them in separate buckets (see the sketch after this list). In the end, this results in a ~2x speedup in training.
  6. One thing that drove me insane is that it appears that the inputs to their seq2seq models are 2D tensors, including for the non-embedding models. I am still confused about this.
  7. If anyone's interested in using TensorFlow, prepare for a huge mess. It has a lot of good tools, and you can easily tell which device to handle which op, but to replicate what we do here, it takes at least 3 separate classes, just for the framework.
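
A minimal sketch of the bucketing idea from point 5; bucket boundaries and data are made up for illustration:

#Hedged sketch of length bucketing: group sentence pairs so each batch is only padded
#up to its own bucket's maximum lengths, instead of the global maximum.
buckets = [(10, 15), (20, 25), (40, 50)]  #(max source length, max target length)
pairs = [(["how", "are", "you"], ["comment", "allez", "vous"])]  #toy data

def pick_bucket(src, tgt):
    for i, (s_max, t_max) in enumerate(buckets):
        if len(src) <= s_max and len(tgt) <= t_max:
            return i
    return None  #too long for every bucket: skip or truncate

bucketed = {i: [] for i in range(len(buckets))}
for src, tgt in pairs:
    b = pick_bucket(src, tgt)
    if b is not None:
        bucketed[b].append((src, tgt))  #pad within each bucket only up to its own lengths
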
farizrahman4u commented 8 years ago

@LeavesBreathe I am sort of offline right now. But as my seq2seq repo has gotten more attention than I anticipated, I will be spending more time on it once I am done with my upcoming exams :)

NickShahML commented 8 years ago

Best of luck with the exams -- I must say that your seq2seq repo has gotten much attention from my skype contacts -- they keep asking me about it and are constantly comparing it to tensorflow's seq2seq. I think you and Tensorflow have the best working seq2seq models right now.

farizrahman4u commented 8 years ago

@LeavesBreathe

. I think you and Tensorflow have the best working seq2seq models right now.

That's a huge compliment. Thanks!

farizrahman4u commented 8 years ago

Regarding the attention mechanism, I will be converting the following project to Keras: https://github.com/npow/RNN-EM The API will be similar to that of an LSTM, so just replacing all LSTMs with the RNN_EM class would give you a seq2seq model with an attention mechanism.

NickShahML commented 8 years ago

That's gonna be killer. Beyond that, the only major feature I see that TensorFlow has is a sampled softmax, but I'm trying to work on a hierarchical softmax right now. It will definitely take me a while as it has been attempted already in Keras.