richliao / textClassifier

Text classifier for Hierarchical Attention Networks for Document Classification
Apache License 2.0

Consistency with the article (HATT) #24

Open miclatar opened 6 years ago

miclatar commented 6 years ago

I've seen some discussion about it, but I'm afraid I still don't get it:

In the original paper, the tanh activation is applied to an MLP layer which accepts only the BiLSTM vector as input (eq. 5).

Assuming self.W is the context vector in our case, tanh is instead applied to the product of the BiLSTM vector and the context vector (the Dense layer has no activation of its own).

What is the explanation for this? Thanks!

richliao commented 6 years ago

Every sequence of LSTM outputs is 2D, and the context vector is 1D, so their product is 1D. The context vector is trained to assign weights over the 2D sequence, so you can think of the result as a weighted vector that, ideally, gives more weight to the important tokens.
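
For concreteness, here is a minimal NumPy sketch of that shape logic (the sizes are illustrative, not taken from the repo):

```python
import numpy as np

timesteps, features = 15, 200                 # illustrative sizes only
h = np.random.randn(timesteps, features)      # 2D: one GRU output per token
u = np.random.randn(features)                 # 1D: trained context vector

scores = h @ u                                # 1D: one score per token
weights = np.exp(scores) / np.exp(scores).sum()
weighted_sum = (h * weights[:, None]).sum(axis=0)   # attention-weighted vector
print(weights.shape, weighted_sum.shape)      # (15,) (200,)
```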

miclatar commented 6 years ago

Hi, thanks for your answer. However, I'm afraid I already understand this concept; my issue is with the tanh activation. In the paper, it is applied to the dense layer's output before the multiplication with the context vector. In your implementation, it is applied to the dot product of these vectors.

According to the code, we actually stack two linear operations on the output of the GRU layer - first the Dense layer, then the dot product with self.W, with no non-linearity in between. Theoretically, this collapses into a single linear layer (as explained here).
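
For what it's worth, a small NumPy check of that collapse argument (shapes are made up; W_dense/b_dense stand in for the Dense layer, u for self.W):

```python
import numpy as np

h = np.random.randn(15, 100)       # GRU outputs: timesteps x units
W_dense = np.random.randn(100, 200)
b_dense = np.random.randn(200)
u = np.random.randn(200)           # context vector (self.W in AttLayer)

two_steps = (h @ W_dense + b_dense) @ u       # Dense (no activation), then dot
one_step = h @ (W_dense @ u) + b_dense @ u    # the same map as a single linear layer
print(np.allclose(two_steps, one_step))       # True
```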

Again, maybe I'm missing something; I'd be glad for an explanation :)

richliao commented 6 years ago

Which equation are you referring to? The tanh activation in my code corresponds to equations (5) and (8); h_it is the GRU output.

miclatar commented 6 years ago

I'll try to be as rigorous as possible:

(194) l_lstm_sent = Bidirectional(GRU(100, return_sequences=True))(review_encoder)
(195) l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent)
(196) l_att_sent = AttLayer()(l_dense_sent)

These are lines 194-196 in the code, referring to the upper hierarchy layer.

(5) u_it = tanh(W_w * h_it + b_w)
(6) a_it = exp(u_it^T * u_w) / sum_t exp(u_it^T * u_w)

And these are equations 5 and 6 from the paper. The case is the same for lines 187-189 in the code and equations 8-10; however, I'll demonstrate only with these.
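
To make the intended order concrete, here is a rough NumPy sketch of equations (5)-(6) exactly as written in the paper (W_w, b_w, u_w follow the paper's names; the sizes are made up):

```python
import numpy as np

h = np.random.randn(15, 100)       # h_it: one GRU output per timestep
W_w = np.random.randn(100, 200)
b_w = np.random.randn(200)
u_w = np.random.randn(200)         # word-level context vector

u_it = np.tanh(h @ W_w + b_w)      # eq (5): tanh comes *before* the dot with u_w
scores = u_it @ u_w                # eq (6): argument of exp
a_it = np.exp(scores) / np.exp(scores).sum()
```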

As you've said, h_it is the GRU output. In line 195 it is passed through a Dense layer, implementing the W_w * h_it + b_w part. My question concerns the next step.

According to the code, this output is now passed through the Attention layer. Note that we do not have any activation in line 195, so we proceed with only the inner linear part of equation 5, rather than with u_it. More specifically, the next operation takes place in the call() method of the layer:

(174) eij = K.tanh(K.dot(x, self.W))
(175)
(176) ai = K.exp(eij)
(177) weights = ai/K.sum(ai, axis=1).dimshuffle(0,'x')

Here x is the input of the layer, i.e. literally W_w * h_it + b_w. The next thing that happens, in the inner parentheses of line 174, is (W_w * h_it + b_w) * u_w, where u_w == self.W is the context vector. According to the paper, however, this product only appears in equation 6; we have skipped the tanh operation.

Only then, in line 174, do we apply the tanh to the product. Note that in the paper this product is fed directly into the exp of equation 6, without any non-linearity in between.
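
And, for contrast, a sketch of what lines 195 and 174-177 effectively compute under this reading (same made-up sizes, single example):

```python
import numpy as np

h = np.random.randn(15, 100)       # GRU outputs
W_w = np.random.randn(100, 200)    # Dense weights, line 195 (no activation)
b_w = np.random.randn(200)
u_w = np.random.randn(200)         # self.W, the context vector

x = h @ W_w + b_w                  # what AttLayer.call() receives
eij = np.tanh(x @ u_w)             # line 174: tanh only *after* the dot with self.W
ai = np.exp(eij)
weights = ai / ai.sum()            # line 177, for one example
```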

To my understanding, this is a different procedure from the one described in the paper. I may be wrong, or perhaps it somehow leads to similar behavior, but I'd just like to hear why :)

Thanks!

richliao commented 6 years ago

Ha, you found a HUGE bug in my code that I didn't realize was there. I'm quite sure you are the first one to point this out, even though someone did ask why I use the TimeDistributed dense function (deprecated).

The bug is that I placed the tanh in the wrong place and in the wrong order. TimeDistributed(Dense(200))(l_lstm_sent) is intended to be a one-layer MLP, and as you said, there should be a tanh activation before the dot product. The fix is either:

1) change line 195 to l_dense_sent = TimeDistributed(Dense(200, activation='tanh'))(l_lstm_sent) and use eij = K.dot(x, self.W) (removing the tanh from the layer), or
2) keep l_dense_sent = TimeDistributed(Dense(200))(l_lstm_sent) as it is and use eij = K.dot(K.tanh(x), self.W) (changing the order).
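
For readers landing here later, a minimal sketch of option 1; this is a fragment meant to slot into the existing code (l_lstm_sent, x and self.W are defined elsewhere in the repo), not a standalone script:

```python
from keras.layers import Dense, TimeDistributed
from keras import backend as K

# line 195: move the tanh into the Dense layer (eq 5)
l_dense_sent = TimeDistributed(Dense(200, activation='tanh'))(l_lstm_sent)

# inside AttLayer.call(): drop the tanh, keep the plain dot with the context vector (eq 6)
eij = K.dot(x, self.W)
ai = K.exp(eij)
weights = ai / K.sum(ai, axis=1).dimshuffle(0, 'x')
```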

It has been so long that I had to reread the paper to bring back the memory. I hope I didn't make a mistake again. Let me know :)