jiweil / Neural-Dialogue-Generation


Reward measurement in Adversarial learning for Neural Dialogue Generation #2


ghost commented 7 years ago

Many thanks for your great work. I am trying to reimplement your work in TensorFlow, but I am a little confused about the reward measurement. As mentioned in your paper, the policy gradient is $\nabla J(\theta) = \left[Q_{+}(\{x, y\}) - b(\{x, y\})\right] \nabla \sum_{t} \log p(y_{t} \mid x, y_{1:t-1})$. I have looked into your code and found where $Q_{+}(\{x, y\}) - b(\{x, y\})$ is computed here, but I cannot find how the term $\nabla \sum_{t} \log p(y_{t} \mid x, y_{1:t-1})$ is computed. Could you please tell me how to compute it? And given the policy gradient value, should I send it back to the generator directly as the optimization target? I have no prior experience with Lua, so I may have misunderstood your implementation. Thanks in advance!
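
To make my question concrete, here is a minimal sketch of what I think the update would look like in TensorFlow. This is my own guess, not your Torch/Lua implementation: the names `generator` and `reinforce_step` are hypothetical, and I assume the generator returns per-token logits of shape `[batch, time, vocab]` for a sampled response `y`.

```python
import tensorflow as tf

def reinforce_step(generator, optimizer, x, y, reward, baseline):
    """One REINFORCE update for a sampled response y given input x.

    Minimizing  L = -(Q+ - b) * sum_t log p(y_t | x, y_{1:t-1})
    makes autodiff produce exactly the policy gradient
    (Q+ - b) * grad sum_t log p(y_t | x, y_{1:t-1}).
    """
    advantage = reward - baseline  # [Q+({x,y}) - b({x,y})], shape [batch]
    with tf.GradientTape() as tape:
        # Hypothetical generator call: per-token logits [batch, time, vocab].
        logits = generator(x, y)
        # -log p(y_t | x, y_{1:t-1}) per token; padding positions
        # should be masked out in a real setup.
        neg_log_p = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=y, logits=logits)                      # [batch, time]
        seq_log_prob = -tf.reduce_sum(neg_log_p, axis=1)  # sum_t log p(y_t | ...)
        loss = -tf.reduce_mean(advantage * seq_log_prob)  # REINFORCE surrogate loss
    grads = tape.gradient(loss, generator.trainable_variables)
    optimizer.apply_gradients(zip(grads, generator.trainable_variables))
    return loss
```

If this is right, the advantage $Q_{+} - b$ just scales the standard log-likelihood loss on the sampled response, and automatic differentiation supplies the $\nabla \sum_{t} \log p(y_{t} \mid x, y_{1:t-1})$ term, so the gradient never has to be computed by hand. Is that how your implementation works?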