suragnair / seqGAN

A simplified PyTorch implementation of "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient." (Yu, Lantao, et al.)

curious about batchPGLoss #3

Closed GeneZC closed 6 years ago

GeneZC commented 6 years ago

I'm confused about the line loss += -out[j][target.data[i][j]] * reward[j]  # log(P(y_t|Y_1:Y_{t-1})) * Q. We want to minimize the loss, but the reward is something we want to maximize.

suragnair commented 6 years ago

That is correct. We want to maximize log(P(y_t|Y_1:Y_{t-1})) * Q, i.e. encourage tokens that receive a high reward Q. But since we minimize the loss, we add the negative of that term to the loss.
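For concreteness, here is a minimal sketch of that per-token policy-gradient loss. The tensor names and shapes are my own assumptions for illustration, not necessarily the exact signature of batchPGLoss in the repo; logits is assumed to be raw generator output that gets turned into log-probabilities.

```python
import torch
import torch.nn.functional as F

def batch_pg_loss_sketch(logits, target, reward):
    """
    logits: (seq_len, batch_size, vocab_size) raw generator outputs
    target: (seq_len, batch_size) sampled token indices
    reward: (batch_size,) Q values from the discriminator, treated as constants here
    """
    seq_len, batch_size, _ = logits.shape
    log_probs = F.log_softmax(logits, dim=-1)      # out[j][k] = log P(token k)
    loss = 0.0
    for i in range(seq_len):
        out = log_probs[i]                         # (batch_size, vocab_size)
        for j in range(batch_size):
            # maximize log P(y_t | Y_1:Y_{t-1}) * Q  <=>  minimize its negative
            loss += -out[j][target[i][j]] * reward[j]
    return loss / batch_size
```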

GeneZC commented 6 years ago

However, if we want to maximize out[j][target.data[i][j]] * reward[j], then since out[j][target.data[i][j]] is negative while reward is positive, the reward may just get smaller and smaller, which is not what we expect.

suragnair commented 6 years ago

I am not sure I understand. The Policy Gradients loss is used to train the generator. The Qs are obtained from the discriminator and are fixed when the generator is being trained.

The discriminator has its own objective function, which pushes Q towards 1 if the sequence is from the real dataset and towards 0 if it is fake (from the generator).
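A rough sketch of that discriminator objective, assuming a standard binary cross-entropy setup (D, the batch variables, and the helper name are hypothetical, not the repo's API):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def discriminator_step(D, real_batch, fake_batch, optimizer):
    """One update of the discriminator; the generator is frozen during this step."""
    optimizer.zero_grad()
    q_real = D(real_batch)            # push Q towards 1 for real sequences
    q_fake = D(fake_batch.detach())   # push Q towards 0 for generated sequences
    loss = bce(q_real, torch.ones_like(q_real)) + bce(q_fake, torch.zeros_like(q_fake))
    loss.backward()
    optimizer.step()
    return loss.item()
```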

GeneZC commented 6 years ago

That is, the generator wants the reward to be larger, while out[j][target.data[i][j]] is negative and reward is positive. For example, say out is -0.5 and reward is 0.5. Since we want to maximize out[j][target.data[i][j]] * reward[j], couldn't the optimizer end up at out = -0.4 and reward = 0.4?

suragnair commented 6 years ago

Reward will stay fixed at 0.5, and out[j], which is log(P(y_t|Y_1:Y_{t-1})), will be pushed towards 0 (i.e. P(y_t|Y_1:Y_{t-1}) -> 1).
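A tiny illustration of why, using the example numbers above (the scalars are just for demonstration): the reward carries no gradient, so the only quantity the optimizer can move is out.

```python
import torch

out = torch.tensor(-0.5, requires_grad=True)   # log P(y_t | Y_1:Y_{t-1})
reward = torch.tensor(0.5)                     # fixed Q from the discriminator

loss = -out * reward
loss.backward()

print(out.grad)      # -0.5: a gradient-descent step increases out towards 0 (P -> 1)
print(reward.grad)   # None: reward has requires_grad=False, so it never changes
```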

GeneZC commented 6 years ago

I don't know why...

GeneZC commented 6 years ago

Why does the reward stay at 0.5?

GeneZC commented 6 years ago

Oh, the reward is just a constant while training the generator, right? By the way, thanks for your reply!

suragnair commented 6 years ago

Yes, exactly! We train the generator and discriminator in turns, so when we are training the generator the reward is fixed.
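Roughly, the alternating schedule looks like the sketch below. All helper names (generator.sample, generator.pg_step, train_discriminator) and step counts are placeholders for illustration, not the repo's actual functions or values.

```python
for epoch in range(ADV_EPOCHS):
    # Generator turn: sample sequences, score them with the frozen discriminator,
    # and take a policy-gradient step. The rewards are detached constants here.
    samples = generator.sample(BATCH_SIZE)
    rewards = discriminator(samples).detach()
    generator.pg_step(samples, rewards)

    # Discriminator turn: the generator is frozen while D learns real -> 1, fake -> 0.
    for _ in range(D_STEPS):
        train_discriminator(discriminator, real_data_loader, generator)
```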