npark closed this issue 7 years ago.
Yes, the intent of `normalized_rewards` is to be "how much better/worse is this sequence than the average sequence".
Could you please explain how you calculate the reward and loss? Thanks.
The loss for the discriminator is a standard sigmoid cross-entropy loss on its per-token binary prediction of whether the sequence is real or fake.
The reward for the generator at each token is the "real"-ness prediction (between 0 and 1) that the discriminator gave the sequence at that token. That is, we reward the generator for making the discriminator think that the sequence is real.
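The two quantities described above can be sketched as follows. This is a minimal numpy illustration, not the repo's actual code; the function names, and the assumption that the discriminator emits one logit per token, are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(d_logits, is_real):
    # Standard per-token sigmoid cross-entropy: label 1 for real
    # sequences, 0 for generated ones, averaged over tokens.
    labels = np.ones_like(d_logits) if is_real else np.zeros_like(d_logits)
    p = sigmoid(d_logits)
    return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

def generator_rewards(d_logits):
    # Per-token reward for the generator: the discriminator's
    # "real"-ness probability, which lies strictly between 0 and 1.
    return sigmoid(d_logits)
```

So the generator is rewarded exactly to the extent that the discriminator is fooled at each token.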
Hi, the `normalized_rewards` value sometimes becomes negative. I think this is because `self.expected_reward` is larger than `rewards / _backwards_cumsum(decays, self.sequence_length)`. Is this okay?
```python
normalized_rewards = \
    rewards / _backwards_cumsum(decays, self.sequence_length) \
    - self.expected_reward
```
Thanks.