pochih / RL-Chatbot

🤖 Deep Reinforcement Learning Chatbot

RL training #14

Open estelleaf opened 6 years ago

estelleaf commented 6 years ago

Hi, thanks for sharing your implementation of the RL chatbot. I might ask stupid questions since I am not an expert in RL or NLP, so sorry in advance!

1- In python/RL/train.py line 307, saver.restore(sess, os.path.join(model_path, model_name)) seems to initialize the model weights with some pretrained parameters, correct? Are these the weights of the seq2seq model trained in the usual supervised way? I can't find the 'model-55' checkpoint you use for this anywhere... Am I missing something?

2- In python/RL/rl_model.py, why do we have both build_model and build_generator? They seem to have the same setup but not the same output. Is this RL-specific?

3- Also, in the paper they specify that the reward is computed with a seq2seq model and not the RL model. Is this taken into account in your code?

Thanks a lot for your answers!

pochih commented 6 years ago
  1. If the checkpoint exists, the saver restores the trained parameters from it.

  2. build_model constructs the graph for training, while build_generator constructs the graph for inference. Most parts of the two graphs are the same; separating them makes development easier (see the first sketch below this list).

  3. In the paper, they first train the model with seq2seq until convergence, then continue training it with policy gradient. The seq2seq graph and the RL graph are similar, but the reward function is only used for the latter (the second sketch below illustrates this).
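To make point 2 concrete, here is a minimal sketch of the train/inference split. It is written in PyTorch for brevity rather than the repo's TensorFlow code, and the TinySeq2Seq class is a hypothetical simplification; only the method names build_model and build_generator mirror the ones discussed above. The training path consumes ground-truth targets (teacher forcing) and returns a loss, while the generation path shares the same weights but feeds back its own predictions and returns token ids.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Hypothetical simplification of a seq2seq model with a train/inference split."""

    def __init__(self, vocab_size=100, dim=64, bos_id=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, vocab_size)
        self.bos_id = bos_id

    def build_model(self, src, tgt):
        """Training path: teacher forcing on ground-truth targets, returns a loss."""
        _, h = self.encoder(self.embed(src))                 # h: (1, B, dim)
        h = h.squeeze(0)
        dec_in = torch.full((src.size(0),), self.bos_id, dtype=torch.long)
        loss = 0.0
        for t in range(tgt.size(1)):
            h = self.decoder(self.embed(dec_in), h)
            loss = loss + nn.functional.cross_entropy(self.out(h), tgt[:, t])
            dec_in = tgt[:, t]                               # feed the ground truth back in
        return loss / tgt.size(1)

    @torch.no_grad()
    def build_generator(self, src, max_len=20):
        """Inference path: same weights, but feeds back its own predictions."""
        _, h = self.encoder(self.embed(src))
        h = h.squeeze(0)
        dec_in = torch.full((src.size(0),), self.bos_id, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            h = self.decoder(self.embed(dec_in), h)
            dec_in = self.out(h).argmax(dim=-1)              # greedy decoding
            outputs.append(dec_in)
        return torch.stack(outputs, dim=1)                   # (B, max_len) token ids
```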
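And for point 3, a hedged sketch of the training sequence described in the paper, reusing TinySeq2Seq from above: the policy is initialized from the supervised seq2seq checkpoint (the analogue of saver.restore in question 1), a frozen copy serves as the reward model, and sampled replies are scored by that frozen model to drive a REINFORCE update. The checkpoint path and the single reward shown here are illustrative placeholders, not taken from this repo.

```python
import copy
import os
import torch

policy = TinySeq2Seq()
if os.path.exists("seq2seq_ckpt.pt"):                        # placeholder path, analogous to 'model-55'
    policy.load_state_dict(torch.load("seq2seq_ckpt.pt"))    # same role as saver.restore(...)

reward_model = copy.deepcopy(policy).eval()                  # frozen seq2seq used only for rewards
for p in reward_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)

def sample_reply(model, src, max_len=20):
    """Sample a reply token-by-token and keep the log-probs for REINFORCE."""
    _, h = model.encoder(model.embed(src))
    h = h.squeeze(0)
    dec_in = torch.full((src.size(0),), model.bos_id, dtype=torch.long)
    tokens, log_probs = [], []
    for _ in range(max_len):
        h = model.decoder(model.embed(dec_in), h)
        dist = torch.distributions.Categorical(logits=model.out(h))
        dec_in = dist.sample()
        tokens.append(dec_in)
        log_probs.append(dist.log_prob(dec_in))
    return torch.stack(tokens, dim=1), torch.stack(log_probs, dim=1)

src = torch.randint(3, 100, (4, 10))                         # dummy batch of source utterances
reply, log_probs = sample_reply(policy, src)

# Reward: how plausible the sampled reply is under the *frozen* seq2seq
# (one of several rewards in the paper; shown alone here for clarity).
with torch.no_grad():
    reward = -reward_model.build_model(src, reply)           # negative NLL of the reply

loss = -(reward * log_probs.sum(dim=1)).mean()               # REINFORCE objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key point for the follow-up question below is that reward_model is never updated: only the policy's weights change during RL training, while the converged seq2seq copy is kept fixed to score the samples.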

estelleaf commented 6 years ago

Thanks a lot for your answers, but I still don't get point 3.

1 - Which weights are used to compute the reward: the seq2seq weights after convergence, or the policy weights that keep being updated?

2 - When I test the RL method with model-56-3000, I don't get the same results as you show in the README. Is that normal?

3 - Here you have a file with sentences. Since there is no incoming stream of data, does this file act as a replay memory?