Why add entropy to loss when its gradient is zero?

ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

MIT License

3.56k stars 831 forks source link

Why add entropy to loss when its gradient is zero? #194

Closed Akella17 closed 5 years ago

Akella17 commented 5 years ago

As the differential entropy of a Gaussian distribution is dependent only on the standard deviation, how does adding it to the actor loss function promote exploration? Isn't the gradient of the differential entropy of a Gaussian distribution w.r.t policy parameters 0?

h(X) = ln[σ(2π)^1/2] + 0.5 (σ is a constant vector of ones)

ikostrikov commented 5 years ago

In this implementation, a Gaussian distribution has a learnable standard deviation parameter: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/distributions.py#L79