Closed by Yevgnen 7 years ago
Thank you for your interest in our work!
In paper 1, there's also an LSTM preprocessing layer. Actually, the match-LSTM layer is the same in both papers. Could you specify the equation numbers that are unclear in either paper?
Best, Shuohang
Hi, thanks for your reply.
As stated in section 2.1 of paper 1, x^t denotes the embedded vectors and h^t denotes the hidden states of the hypothesis, computed in a seq2seq decoder fashion. I don't see that the word vectors of the premise and the hypothesis in paper 1 are preprocessed by an RNN/LSTM as in eq. 1 of paper 2.
More specifically, what confuses me is eq. 6 in paper 1 versus eq. 2 in paper 2. The h_k^t in the former seems to be computed in a recurrent fashion along with the computation of attention, while the h_i^p in the latter is computed during preprocessing, so we just need to feed it in when computing attention. Isn't that right?
Hi,
Thank you for the further questions!
For paper 1: actually, h^s and h^t are the outputs of LSTMs over the embeddings x^s and x^t respectively. It is h^m that is computed in a "seq2seq decoder fashion". There are also 3 LSTMs in total in paper 1.
h^m (eqns. 6 & 8) in paper 1 and h^r (eqns. 2 & 4) in paper 2 are computed in a recurrent fashion along with the computation of attention.
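In rough pseudocode, the structure of paper 1 looks like this (just a sketch of the idea, not our actual code; `lstm_premise`, `lstm_hypothesis`, `attend`, and `match_cell` are placeholder helpers):

```python
import numpy as np

def paper1_sketch(x_s, x_t, lstm_premise, lstm_hypothesis, attend, match_cell):
    """Rough sketch of the three LSTMs in paper 1 (placeholder helpers).

    x_s, x_t        : embedded premise / hypothesis words, shape (len, d_emb)
    lstm_premise,
    lstm_hypothesis : ordinary LSTMs mapping embeddings to hidden states
    attend          : attention over the premise states (eqn. 6 in paper 1)
    match_cell      : one step of the match-LSTM (eqn. 8 in paper 1)
    """
    # LSTMs 1 and 2: h^s and h^t are plain LSTM outputs over the embeddings,
    # so both are fully available before the match-LSTM starts.
    h_s = lstm_premise(x_s)        # shape (len_premise, d)
    h_t = lstm_hypothesis(x_t)     # shape (len_hypothesis, d)

    # LSTM 3: the match-LSTM. The attention weights at step k depend on
    # h^m_{k-1}, so alpha_k and h^m_k have to be computed together, step by step.
    h_m_prev = np.zeros(h_s.shape[1])
    h_m = []
    for k in range(len(h_t)):
        alpha_k = attend(h_s, h_t[k], h_m_prev)   # weights over premise positions
        a_k = alpha_k @ h_s                       # attention-weighted premise vector
        m_k = np.concatenate([a_k, h_t[k]])       # [a_k; h^t_k]
        h_m_prev = match_cell(m_k, h_m_prev)      # recurrent matching step
        h_m.append(h_m_prev)
    return h_m
```

Paper 2 uses the same loop to produce h^r, with the preprocessed H^q and H^p playing the roles of h^s and h^t.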
Best, Shuohang
Thanks for your reply again.
"For paper 1: actually, h^s and h^t are the outputs of LSTMs over the embeddings x^s and x^t respectively."
😁 That idea is a bit new to me, since in machine translation the attention is computed along with the embedding -> encode/decode phase, while in machine reading the passage and question are first encoded and then the attention is built in a recurrent fashion on top of these encoded vectors. Am I missing something here?
Yes, you're right!
Best, Shuohang
Match-LSTM is really a good name for this. Thank you!
Hi, I'm new to machine reading and have read your two awesome papers in recent days, but I'm a bit confused by the match-LSTM layer. (Below, following the seq2seq nature, I'll temporarily call the second LSTM a "decoder".)

Situation 1: In paper 1, there's no "LSTM preprocessing layer" as mentioned in paper 2, and the premise is encoded first, the hypothesis second. Thus when dealing with the k-th word of the hypothesis, we have h_k^t, while for any T > k, h_T^t is not computed yet, since these are the hidden states of a recurrent neural network (the decoder). The computation follows (h_k^t, h_{k-1}^m) -> \alpha_k -> m_k -> (h_{k+1}^t, h_k^m) -> ..., so we need to compute h_{k+1}^t before we step into the next decoding recurrent step.

Situation 2: As opposed to Situation 1, in paper 2 the passage and the question are preprocessed by an "LSTM preprocessing layer", so we have all the (encoded) hidden states of the passage (H^p) and the question (H^q). The computation follows (h_i^p, h_{i-1}^r) -> \alpha_i -> z_i -> (h_{i+1}^p, h_i^r) -> ...; here we don't need to compute h_{i+1}^p, since it's already in H^p, which was computed during preprocessing.

Here are my questions:

1. Am I right? Match-LSTM was proposed in paper 1, where it works in a "decode" fashion when processing the hypothesis, yet it is used in paper 2, where one does not need to compute the subsequent h_{i+1}^p after preprocessing. So is the match-LSTM layer more an attention mechanism than a decoding (recurrent) scheme? Though the original paper says, "This LSTM models the matching between the premise and the hypothesis."
2. If so, in Situation 1, is h_{k+1}^t computed in a plain RNN/LSTM/GRU recurrent style, namely h_{k+1}^t = RNN/LSTM/GRU(x_{k+1}^t, h_k^t)?

Thanks!
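P.S. To make the two situations concrete, here is how I currently picture them in pseudocode (just a sketch of my understanding, not code from either paper; `lstm_step`, `attend`, and `match_step` are made-up placeholders and all gating details are omitted):

```python
import numpy as np

def situation_1(x_t, h_s, lstm_step, attend, match_step, d):
    """Paper 1 as I read it: h^t is produced step by step, like a decoder."""
    h_t_k = np.zeros(d)                        # decoder state h^t
    h_m_k = np.zeros(d)                        # match-LSTM state h^m
    for k in range(len(x_t)):
        h_t_k = lstm_step(x_t[k], h_t_k)       # h^t_k computed inside the loop
        alpha_k = attend(h_s, h_t_k, h_m_k)    # attention over the premise states
        m_k = np.concatenate([alpha_k @ h_s, h_t_k])
        h_m_k = match_step(m_k, h_m_k)         # h^m_k
    return h_m_k

def situation_2(h_p, h_q, attend, match_step, d):
    """Paper 2: H^p and H^q are precomputed, so the loop only runs the match-LSTM."""
    h_r_i = np.zeros(d)                        # match-LSTM state h^r
    for i in range(len(h_p)):
        alpha_i = attend(h_q, h_p[i], h_r_i)   # attention over the question states
        z_i = np.concatenate([alpha_i @ h_q, h_p[i]])
        h_r_i = match_step(z_i, h_r_i)         # h^r_i
    return h_r_i
```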