sgrvinod / a-PyTorch-Tutorial-to-Image-Captioning

Show, Attend, and Tell | a PyTorch Tutorial to Image Captioning

Prediction linear layer's input is just the hidden state, but in the original paper it is combined as L(word_embed + W·h + U·c) #20

Open joelxiangnanchen opened 5 years ago

joelxiangnanchen commented 5 years ago

Hi, thanks for your great tutorial, with its nice guide and code. After reading the decoder's code, I found that you use only the LSTM's hidden state to compute the next word's probabilities, here: `preds = self.fc(self.dropout(h))`. In the original paper, they combine it with the previous word's embedding and a transformed context vector, L(E·y_{t-1} + W·h_t + U·z_t), and their Theano implementation is as follows:

logit = get_layer('ff')[1](tparams, proj_h, options, prefix='ff_logit_lstm', activ='linear')
if options['prev2out']:
    logit += emb  # add the previous word's embedding
if options['ctx2out']:
    # add a linearly-transformed context vector
    logit += get_layer('ff')[1](tparams, ctxs, options, prefix='ff_logit_ctx', activ='linear')
logit = tanh(logit)

They controlled whether to use those two extra vectors. So, did you compare the results with and without this extra information? Thanks very much!

fawazsammani commented 5 years ago

Hi. I asked myself the same question at the beginning. But then I realized that the context vector is fed into the input of the LSTM rather than being combined with the hidden state. I guess it yields similar results.

sgrvinod commented 5 years ago

@Hayao41 Thanks, I see. It seems I missed equation (7) in the paper. Nope, haven't tried it. Do let me know if you do.

For others reading this issue, @Hayao41 is pointing out that the paper's authors apply an additional transform to h_t, i.e. a "hidden-to-output function" as proposed in this other paper, and combine it with the (transformed) context vector and the previous word's embedding. This combination is the input to the word-predicting linear layer (as opposed to just h_t, as I have done). It probably does squeeze out some more performance.

@fawazsammani As I see now, the context vector and word embedding are being fed both to the LSTM and to the final linear transformation that predicts the word-scores, whereas I'm feeding them only into the LSTM.
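For concreteness, here is a rough sketch of what that deep-output head could look like in PyTorch (the class and layer names below are illustrative, not part of the current code):

import torch
import torch.nn as nn

class DeepOutputHead(nn.Module):
    """Combines h_t, the previous word's embedding, and the (gated) context
    vector before predicting word scores, as in eq. (7) of the paper."""
    def __init__(self, embed_dim, decoder_dim, encoder_dim, vocab_size, dropout=0.5):
        super().__init__()
        self.fc_h = nn.Linear(decoder_dim, embed_dim)   # transform of the hidden state
        self.fc_z = nn.Linear(encoder_dim, embed_dim)   # transform of the context vector
        self.fc_out = nn.Linear(embed_dim, vocab_size)  # final projection to vocabulary scores
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, h, prev_word_embedding, attention_weighted_encoding):
        # E*y_{t-1} + W*h_t + U*z_t, squashed with tanh, then projected to the vocabulary
        out = prev_word_embedding + self.fc_h(h) + self.fc_z(attention_weighted_encoding)
        return self.fc_out(self.dropout(torch.tanh(out)))

In the decoder's timestep loop, `preds = self.fc(self.dropout(h))` would then become something like `preds = deep_output(h, embeddings[:batch_size_t, t, :], attention_weighted_encoding)`.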

joelxiangnanchen commented 5 years ago

@fawazsammani Hi, in the original paper the authors feed the linearly-transformed current hidden state, the previous word's embedding, and the current context vector into the prediction layer, as in the equation above. The LSTM's input is the last time step's hidden state, the previous word's embedding, and the current context vector; that's the difference.

@sgrvinod Thanks for the discussion, I will try comparing the two and will post my results later. I also have another question, about the coefficient Beta. In my opinion, Beta is a scalar that controls when to attend to the image, but in your code Beta is a vector (batch_size x encoder_dim), which means every element of the context vector gets its own gate coefficient. In the original paper, however, Beta is a single scalar per sample, i.e. Beta * context_vector with Beta of shape (batch_size x 1). The code in the original implementation is as follows:

if options['selector']:
    sel_ = tensor.nnet.sigmoid(tensor.dot(h_, tparams[_p(prefix, 'W_sel')])+tparams[_p(prefix,'b_sel')])
    sel_ = sel_.reshape([sel_.shape[0]])
    ctx_ = sel_[:,None] * ctx_

Here, `sel_` is Beta. Is `sel_`'s shape (batch_size x 1) or (batch_size x encoder_dim)? I'm not familiar with Theano, so I don't understand what `sel_`'s shape is. Thanks for the discussion.
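For clarity, this is the difference I mean, sketched in PyTorch (layer names are illustrative):

import torch
import torch.nn as nn

batch_size, decoder_dim, encoder_dim = 4, 512, 2048
h = torch.randn(batch_size, decoder_dim)    # decoder hidden state
ctx = torch.randn(batch_size, encoder_dim)  # attention-weighted encoding

# Vector gate, as in this repo's f_beta: one coefficient per encoder channel
f_beta_vector = nn.Linear(decoder_dim, encoder_dim)
gate = torch.sigmoid(f_beta_vector(h))      # shape: (batch_size, encoder_dim)
gated_ctx = gate * ctx

# Scalar gate, as I read the paper's selector: one coefficient per sample
f_beta_scalar = nn.Linear(decoder_dim, 1)
beta = torch.sigmoid(f_beta_scalar(h))      # shape: (batch_size, 1)
gated_ctx = beta * ctx                      # broadcasts over encoder_dim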