That is correct. We want to maximize log(P(y_t|Y_1:Y_{t-1})) * Q, i.e. encourage tokens that receive a high reward Q. But we minimize the loss, hence we add the negative of it to the loss.
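For concreteness, here is a minimal sketch of what that loss computes. The names out/target/reward mirror the snippet in the question, but the shapes are assumptions for illustration, not necessarily the repo's exact layout:

```python
import torch

# Sketch of the policy-gradient loss: minimize -(log P(y_t|Y_1:Y_{t-1}) * Q).
#   out:    log-probabilities from the generator, shape (batch, vocab_size)  [assumed]
#   target: sampled token ids, shape (batch,)                                [assumed]
#   reward: per-sample rewards Q from the discriminator, shape (batch,)      [assumed]
def pg_loss(out, target, reward):
    # log-probability of the tokens that were actually sampled
    log_p = out.gather(1, target.unsqueeze(1)).squeeze(1)
    # we want to maximize log_p * reward, so we minimize its negative
    return -(log_p * reward).sum()

# tiny usage example with random tensors
out = torch.log_softmax(torch.randn(4, 10), dim=1)
target = torch.randint(10, (4,))
reward = torch.rand(4)
print(pg_loss(out, target, reward))
```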
However, if we want to maximize out[j][target.data[i][j]] * reward[j]: since out[j][target.data[i][j]] is negative while reward is positive, the reward may get smaller and smaller, which is not what we expect.
I am not sure I understand. The Policy Gradients loss is used to train the generator. The Qs are obtained from the discriminator and are fixed when the generator is being trained. The discriminator has its own objective function, which tries to push Q to 1 if it is from the real dataset, and to 0 if it's a fake (from the generator).
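That discriminator objective could look roughly like the binary cross-entropy below. This is a sketch assuming the discriminator outputs a probability in [0, 1] that its input is real; the repo's actual formulation may differ:

```python
import torch
import torch.nn.functional as F

# Push D(real) towards 1 and D(fake/generated) towards 0.
def discriminator_loss(d_real, d_fake):
    real_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

# tiny usage example with made-up scores
print(discriminator_loss(torch.tensor([0.9, 0.8]), torch.tensor([0.2, 0.1])))
```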
That is, the generator expects the reward to get larger, while out[j][target.data[i][j]] is negative and reward is positive. For example, out is -0.5 and reward is 0.5. Since we want to maximize out[j][target.data[i][j]] * reward[j], it may end up with out at -0.4 and reward at 0.4.
Reward will stay fixed at 0.5, and out[j], which is log(P(y_t|Y_1:Y_{t-1})), will try to go towards 0 (P(y_t|Y_1:Y_{t-1}) -> 1).
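A tiny numeric check of that claim (pure arithmetic, nothing repo-specific): with the reward pinned at 0.5, the per-token loss -log_p * reward only shrinks as the log-probability moves towards 0.

```python
# reward is held at 0.5; the loss falls as log P(y_t|Y_1:Y_{t-1}) -> 0
reward = 0.5
for log_p in (-0.5, -0.4, -0.1, -0.01):
    print(f"log_p={log_p:+.2f}  loss={-log_p * reward:.3f}")
# log_p=-0.50  loss=0.250
# log_p=-0.40  loss=0.200
# log_p=-0.10  loss=0.050
# log_p=-0.01  loss=0.005
```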
I don't know why... Why does the reward stay at 0.5?
Oh, the reward is just a constant while training the generator, right? BTW, thanks for your reply!
Yes, exactly! We train the generator and discriminator taking turns. So when we are training the generator the reward is fixed.
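To see that "fixed" concretely, here is a small self-contained snippet with stand-in tensors (assumptions, not the repo's actual models): computing the reward under torch.no_grad() makes it a constant for the generator update, so no gradient flows back into the discriminator.

```python
import torch

torch.manual_seed(0)
vocab_size, batch = 5, 3

# stand-ins for generator logits and discriminator scores; both are leaf
# tensors so we can inspect their gradients afterwards
logits = torch.randn(batch, vocab_size, requires_grad=True)
disc_score = torch.rand(batch, requires_grad=True)

log_p = torch.log_softmax(logits, dim=1)
target = torch.randint(vocab_size, (batch,))

with torch.no_grad():          # reward = Q is detached -> constant for this step
    reward = disc_score.clone()

loss = -(log_p.gather(1, target.unsqueeze(1)).squeeze(1) * reward).sum()
loss.backward()

print(logits.grad is not None)   # True:  the generator gets a gradient
print(disc_score.grad is None)   # True:  the discriminator is untouched
```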
I'm confused about loss += -out[j][target.data[i][j]] * reward[j]  # log(P(y_t|Y_1:Y_{t-1})) * Q: we want to minimize the loss, but the reward is something we want to maximize.