This is a very minor change that corrects the policy gradient when calculating z_normalization. I think rewards should be normalized not only per sequence but also per item in the minibatch. So, the number of items in a minibatch really impacts the learning behaviour of the policy gradient.
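To illustrate the point about minibatch size, here is a minimal sketch in plain Python. It is not the code touched by this change: `pg_loss` and its `per_item` flag are hypothetical names, and I am assuming the normalizer divides a REINFORCE-style loss (negative log-probability times reward) by the sequence length and, optionally, by the number of items in the minibatch.

```python
def pg_loss(log_probs, rewards, per_item=True):
    """Toy policy-gradient loss for a minibatch of variable-length sequences.

    log_probs, rewards: one list of floats per minibatch item (hypothetical layout).
    per_item: if True, also divide by the number of items in the minibatch.
    """
    total = 0.0
    for seq_log_probs, seq_rewards in zip(log_probs, rewards):
        # Per-sequence normalization: average the weighted terms over time steps.
        weighted = sum(lp * r for lp, r in zip(seq_log_probs, seq_rewards))
        total += -weighted / len(seq_log_probs)
    if per_item:
        # Per-item normalization: average over the items in the minibatch.
        total /= len(log_probs)
    return total


log_probs = [[-0.5, -1.0], [-2.0, -1.0, -3.0]]
rewards = [[1.0, 1.0], [0.5, 0.5, 0.5]]

# Repeating the same items doubles the minibatch without changing its content.
print(pg_loss(log_probs, rewards, per_item=False))          # 1.75
print(pg_loss(log_probs * 2, rewards * 2, per_item=False))  # 3.5
print(pg_loss(log_probs, rewards))                          # 0.875
print(pg_loss(log_probs * 2, rewards * 2))                  # 0.875
```

With per-sequence normalization only, the loss (and so the gradient scale) grows with the number of items in the minibatch. Dividing by the item count as well keeps it the same for both minibatch sizes in this toy case.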