pochih / RL-Chatbot

🤖 Deep Reinforcement Learning Chatbot

Did you share the parameters of LSTM in encoder and decoder #4

Closed · yaolili closed 6 years ago

yaolili commented 6 years ago

In the python/model.py script, the decoding stage is:

```python
with tf.variable_scope("LSTM1"):
    output1, state1 = self.lstm1(padding, state1)

with tf.variable_scope("LSTM2"):
    output2, state2 = self.lstm2(tf.concat([current_embed, output1], 1), state2)
```

which is the same as in the encoding stage. Did you share the same parameters between the encoder and the decoder?

pochih commented 6 years ago

The answer is YES.

When dealing with sequence-to-sequence problems, people turn to the encoder-decoder model.

The encoder-decoder model encodes the input sequence, extracts a lower-dimensional feature, and decodes that feature into the output sequence.

The vanilla encoder-decoder model was proposed in the NIPS 2014 paper "Sequence to Sequence Learning with Neural Networks".

In the vanilla model, the encoder and the decoder are separate networks, which means they have different weights.

Both the encoder and the decoder are LSTMs: the encoder feeds its last hidden state into the decoder, and the decoder generates the output sequence based on it.
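
For concreteness, here is a minimal sketch of that vanilla setup in TensorFlow 1.x. This is not the repo's code; `dim_hidden` and the placeholder shapes are made up for illustration. Note the two cells are distinct objects in distinct scopes, so their weights are independent:

```python
import tensorflow as tf  # TensorFlow 1.x API

dim_hidden = 256  # hypothetical hidden size

# Two distinct cells in distinct scopes -> two independent sets of weights.
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(dim_hidden)
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(dim_hidden)

# [batch, time, feature] inputs; shapes are illustrative.
enc_inputs = tf.placeholder(tf.float32, [None, None, dim_hidden])
dec_inputs = tf.placeholder(tf.float32, [None, None, dim_hidden])

with tf.variable_scope("encoder"):
    _, enc_state = tf.nn.dynamic_rnn(encoder_cell, enc_inputs, dtype=tf.float32)

with tf.variable_scope("decoder"):
    # The decoder starts from the encoder's final state.
    dec_outputs, _ = tf.nn.dynamic_rnn(decoder_cell, dec_inputs,
                                       initial_state=enc_state)
```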

But considering only the last hidden state loses a lot of information, especially for long input sequences.

The attention mechanism was born to solve this problem. The first attention mechanism was proposed in the ICLR 2015 paper "Neural Machine Translation by Jointly Learning to Align and Translate".

The attention mechanism considers all encoder hidden states: every hidden state gets a weight, based on some scoring rule. We then compute the weighted sum of all hidden states and feed this weighted feature into the decoder.
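
As a rough NumPy sketch of that weighted-sum step, using a dot-product scoring rule for brevity (the ICLR 2015 paper actually uses an additive score, and all shapes here are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, H = 5, 4                              # hypothetical: 5 encoder steps, hidden size 4
encoder_states = np.random.randn(T, H)   # all encoder hidden states
decoder_state = np.random.randn(H)       # current decoder hidden state

scores = encoder_states @ decoder_state  # one score per encoder hidden state
weights = softmax(scores)                # the per-state weights
context = weights @ encoder_states       # weighted sum, fed into the decoder
```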

Now the disadvantage caused by long input sequences is reduced, but another problem arises: the connection between the encoder and the decoder is implicit, so the decoder may have no idea about the features generated by the encoder.

So somebody thought: why not just share weights between the encoder and the decoder? This idea performs amazingly well and has become a popular method.
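
In TensorFlow 1.x, sharing amounts to running both stages through the same cell under the same variable scope, reusing the variables on the second pass. A minimal sketch (again not the repo's exact code; it reuses the hypothetical `dim_hidden`, `enc_inputs`, and `dec_inputs` from the sketch above):

```python
shared_cell = tf.nn.rnn_cell.BasicLSTMCell(dim_hidden)

with tf.variable_scope("shared_lstm"):
    _, enc_state = tf.nn.dynamic_rnn(shared_cell, enc_inputs, dtype=tf.float32)

with tf.variable_scope("shared_lstm", reuse=True):
    # Same cell, same scope with reuse=True -> this decoding pass
    # uses exactly the weights created in the encoding pass.
    dec_outputs, _ = tf.nn.dynamic_rnn(shared_cell, dec_inputs,
                                       initial_state=enc_state)
```

This is also why the model.py snippet quoted above shares parameters: the same self.lstm1 and self.lstm2 cells are called under the same "LSTM1" / "LSTM2" scopes in both the encoding and decoding stages.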

Siddhant7 commented 4 years ago

> The connection between the encoder and the decoder is implicit, so the decoder may have no idea about the features generated by the encoder.

Why do you say the connection is implicit? The decoder network's first state is the final state of the encoder. In fact, if we share weights, then intuitively the connection between the encoder and the decoder is even weaker, since the decoder may not know at which time step the encoded input is ready (as all time steps are merged into one network). Or am I missing something?