PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR), and Generative Adversarial Imitation Learning (GAIL).
Why add entropy to loss when its gradient is zero? #194
Since the differential entropy of a Gaussian distribution depends only on the standard deviation, how does adding it to the actor loss function promote exploration? Isn't the gradient of the differential entropy of a Gaussian distribution w.r.t. the policy parameters zero?
$$h(X) = \ln\left(\sigma\sqrt{2\pi}\right) + \frac{1}{2}$$

(σ is a constant vector of ones)
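A minimal sketch of the premise (assuming, as stated above, that σ is a fixed vector of ones, with a hypothetical `mean` tensor standing in for the policy output): the entropy of `torch.distributions.Normal` carries no gradient back to the mean.

```python
import torch
from torch.distributions import Normal

# Hypothetical minimal setup matching the question's assumption:
# the policy outputs a mean, and sigma is a constant vector of ones.
mean = torch.zeros(3, requires_grad=True)
std = torch.ones(3)

# Per-dimension entropy: ln(sigma * sqrt(2*pi)) + 1/2
entropy = Normal(mean, std).entropy().sum()

# The entropy expression never references the mean, so no gradient
# flows back to it (allow_unused=True lets autograd report that).
grad = torch.autograd.grad(entropy, mean, allow_unused=True)[0]
print(grad)  # None: with sigma fixed, entropy is independent of the mean
```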