ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

No softmax before categorical loss? #282

Open · opened by nirweingarten 3 years ago

nirweingarten commented 3 years ago

Hi, thanks so much for sharing this, what a great repo. I've noticed that the final actor layer is not really activated; instead, its raw outputs are passed to a distribution object (say, Categorical), and the log probabilities are later taken from it to compute the actor's loss. Don't we lose the desired shaping that the softmax function gives us in this case? I.e., we encourage good actions and discourage bad actions less than if we'd used softmax, right? I just wanted to ask: is this on purpose, or did I misunderstand the code?
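
To make the question concrete, here is a minimal sketch of the two paths I'm comparing (the tensors are hypothetical, and the repo wraps the distribution in its own class, so this is just my reading of the code):

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs of the final actor layer (no activation applied).
logits = torch.tensor([[2.0, 0.5, -1.0]])
action = torch.tensor([0])

# What the repo does, as I read it: feed the raw logits straight into
# a Categorical distribution and take log_prob for the actor's loss.
dist = torch.distributions.Categorical(logits=logits)
log_prob_dist = dist.log_prob(action)

# What I expected: apply softmax first, then take the log probability.
probs = F.softmax(logits, dim=-1)
log_prob_softmax = probs.gather(1, action.unsqueeze(1)).squeeze(1).log()

print(log_prob_dist, log_prob_softmax)
```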

Thanks!