shuohangwang / SeqMatchSeq


Details of the Match-LSTM layer #6

Closed Yevgnen closed 6 years ago

Yevgnen commented 6 years ago

Hi, I'm new to machine reading and in recent days I have read your two awesome papers:

  1. Learning Natural Language Inference with LSTM
  2. Machine Comprehension Using Match-LSTM and Answer Pointer

But I'm a bit confused by the match-LSTM layer. (Following the seq2seq analogy, I'll temporarily call the second LSTM a 'decoder' below.)

Situation 1: In paper 1, there seems to be no LSTM preprocessing layer like the one in paper 2. The premise is encoded first and the hypothesis second. Thus, when dealing with the k-th word of the hypothesis, we have h_k^t, but for any T > k, h_T^t has not been computed yet, since these are the hidden states of a recurrent network (the 'decoder'). The computation follows (h_k^t, h_{k-1}^m) -> \alpha_k -> m_k -> (h_{k+1}^t, h_k^m) -> ..., so we need to compute h_{k+1}^t before we can take the next recurrent decoding step.

Situation 2: As opposed to Situation 1, in paper 2 the passage and the question are first run through an LSTM preprocessing layer, so we already have all the (encoded) hidden states of the passage (H^p) and the question (H^q). The computation follows (h_i^p, h_{i-1}^r) -> \alpha_i -> z_i -> (h_{i+1}^p, h_i^r) -> ... Here, we don't need to compute h_{i+1}^p, since it is already contained in H^p after preprocessing.
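To make sure I'm reading Situation 2 correctly, here is a rough sketch of what I think a single match-LSTM step looks like. This is only my own illustration in PyTorch-style pseudocode; the hidden size and the parameter names (`w_q`, `w_p`, `w_r`, `w`, `match_cell`) are made up and it is not the actual Torch code in this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = 150  # hypothetical hidden size

# Parameters of the attention and of the match-LSTM (names are illustrative).
w_q = nn.Linear(hidden, hidden, bias=False)   # applied to the question states H^q
w_p = nn.Linear(hidden, hidden, bias=False)   # applied to the current passage state h_i^p
w_r = nn.Linear(hidden, hidden, bias=True)    # applied to the previous match state h_{i-1}^r
w   = nn.Linear(hidden, 1, bias=True)         # attention scorer
match_cell = nn.LSTMCell(2 * hidden, hidden)  # the match-LSTM itself

def match_step(H_q, h_p_i, h_r_prev, c_r_prev):
    """One step i of Situation 2: (h_i^p, h_{i-1}^r) -> alpha_i -> z_i -> h_i^r.
    H_q: (Q, hidden) question states from the preprocessing LSTM (already computed),
    h_p_i: (hidden,) passage state at position i (also already computed)."""
    # Attention over the pre-encoded question states.
    G_i = torch.tanh(w_q(H_q) + (w_p(h_p_i) + w_r(h_r_prev)))   # (Q, hidden)
    alpha_i = F.softmax(w(G_i).squeeze(-1), dim=0)              # (Q,)
    # Current passage state concatenated with the attention-weighted question summary.
    z_i = torch.cat([h_p_i, alpha_i @ H_q], dim=-1)             # (2*hidden,)
    # Recurrent update of the match state h^r.
    h_r_i, c_r_i = match_cell(z_i.unsqueeze(0),
                              (h_r_prev.unsqueeze(0), c_r_prev.unsqueeze(0)))
    return h_r_i.squeeze(0), c_r_i.squeeze(0)
```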

Here are my questions:

  1. Am I right? Match-LSTM was proposed in paper 1, where the hypothesis is processed in a decoder fashion, but it is used in paper 2, where one does not need to compute the subsequent h_{i+1}^p thanks to the preprocessing. So is the match-LSTM layer more of an attention mechanism than a decoding (recurrent) scheme? Though the original paper says, "This LSTM models the matching between the premise and the hypothesis."

  2. If so, in Situation 1, is h_{k+1}^t computed in a plain RNN/LSTM/GRU recurrent style, namely h_{k+1}^t = RNN/LSTM/GRU(x_{k+1}^t, h_k^t)?

Thanks!

shuohangwang commented 6 years ago

Thank you for your interest in our work!

In paper 1, there is also an LSTM preprocessing layer. Actually, the match-LSTM layer is the same in both papers. Could you point to the equation numbers in either paper that are unclear?

Best, Shuohang

Yevgnen commented 6 years ago

Hi, thanks for your reply. (Attachments: img_0026, img_0027.)

As stated in section 2.1 of paper 1, x_t denotes the embedded vectors and h_t denotes the hidden state of the hypothesis, which is computed in a seq2seq decoder fashion. I don't see where the word vectors of the premise and hypothesis in paper 1 are preprocessed by an RNN/LSTM, as in eq. 1 of paper 2.

More specifically, what confuses me is eq. 6 in paper 1 (call it (1)) and eq. 2 in paper 2 (call it (2)). The h_k^t in (1) is computed in a recurrent fashion along with the computation of the attention, while the h_i^p in (2) is computed in the preprocessing step, so we just need to feed it in when computing the attention. Isn't that right?

shuohangwang commented 6 years ago

Hi,

Thank you for the further questions!

For paper 1: actually, h^s and h^t are the outputs of LSTMs over the embeddings x^s and x^t respectively. It is h^m that works in a "seq2seq decoder fashion". There are also 3 LSTMs in total in paper 1.

h^m (eqns. 6 & 8) in paper 1 and h^r (eqns. 2 & 4) in paper 2 are computed in a recurrent fashion along with the computation of the attention.
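To make this concrete, here is a rough PyTorch-style sketch of how the three LSTMs in paper 1 fit together. The embedding/hidden sizes and the parameter names (`enc_premise`, `enc_hypothesis`, `match_cell`, `w_s`, `w_t`, `w_m`, `w_e`) are made up for illustration; this is not the actual Torch implementation in this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 150  # hypothetical hidden size

# Two preprocessing LSTMs (one over the premise, one over the hypothesis) ...
enc_premise = nn.LSTM(input_size=300, hidden_size=d, batch_first=True)
enc_hypothesis = nn.LSTM(input_size=300, hidden_size=d, batch_first=True)
# ... and the match-LSTM itself: the third LSTM.
match_cell = nn.LSTMCell(2 * d, d)
w_s, w_t, w_m = (nn.Linear(d, d, bias=False) for _ in range(3))
w_e = nn.Linear(d, 1, bias=False)

def match_lstm(x_s, x_t):
    """x_s: (1, S, 300) premise embeddings, x_t: (1, T, 300) hypothesis embeddings."""
    H_s, _ = enc_premise(x_s)        # h^s: (1, S, d), computed up front
    H_t, _ = enc_hypothesis(x_t)     # h^t: (1, T, d), also computed up front
    H_s, H_t = H_s.squeeze(0), H_t.squeeze(0)

    h_m = torch.zeros(1, d)          # initial match state h_0^m
    c_m = torch.zeros(1, d)
    for k in range(H_t.size(0)):     # attention is computed inside this recurrence
        scores = w_e(torch.tanh(w_s(H_s) + w_t(H_t[k]) + w_m(h_m.squeeze(0))))
        alpha_k = F.softmax(scores.squeeze(-1), dim=0)        # attention over the premise
        a_k = alpha_k @ H_s                                   # attended premise summary
        m_k = torch.cat([a_k, H_t[k]], dim=-1).unsqueeze(0)   # input to the match-LSTM
        h_m, c_m = match_cell(m_k, (h_m, c_m))                # h_k^m
    return h_m                       # final match state, used for the prediction
```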

Best, Shuohang

Yevgnen commented 6 years ago

Thanks for your reply again.

For paper 1: actually, h^s and h^t are the outputs of LSTMs over the embeddings x^s and x^t respectively.

😁 That idea is a bit new to me, since in machine translation the attention is computed along with the embedding -> encode/decode phase, while in machine reading the passage and the question are first encoded, and then one builds the attention in a recurrent fashion on top of these encoded vectors. Am I missing something here?

shuohangwang commented 6 years ago

Yes, you're right!

Best, Shuohang

Yevgnen commented 6 years ago

Match-LSTM is really a good name for this. Thank you!