ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

Does the ACKTR implementation use the empirical Fisher? #222

Closed. joshromoff closed this issue 4 years ago.

joshromoff commented 4 years ago

Does the current implementation use the empirical Fisher?

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/a6afcb3d24f57cdafdbddc305e5213f8444c8678/a2c_ppo_acktr/algo/a2c_acktr.py#L54

Using the empirical Fisher is probably not the best idea; see this paper:

https://papers.nips.cc/paper/8669-limitations-of-the-empirical-fisher-approximation-for-natural-gradient-descent

Also, the implementation accompanying the ACKTR paper (OpenAI Baselines) uses the actual Fisher:

https://github.com/openai/baselines/blob/9ee399f5b20cd70ac0a871927a6cf043b478193f/baselines/acktr/acktr.py#L47
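For context, the distinction can be summarized in a short sketch (hypothetical names, not code from either repository): the empirical Fisher evaluates the log-probability gradient at the actions that were actually taken, whereas the true Fisher takes the expectation over actions drawn from the current policy itself. `dist` is assumed to be a `torch.distributions` object from the current policy and `taken_actions` the actions recorded in the rollout.

```python
import torch

def empirical_fisher_loss(dist, taken_actions):
    # Empirical Fisher: curvature is built from grad log pi(a_t | s_t)
    # evaluated at the actions a_t that were actually observed.
    return -dist.log_prob(taken_actions).mean()

def true_fisher_loss(dist):
    # True Fisher (Monte Carlo estimate): actions are drawn from the
    # current policy, so the expectation is under pi(. | s_t) itself.
    sampled_actions = dist.sample()
    return -dist.log_prob(sampled_actions).mean()
```

In a KFAC-style optimizer, the backward pass of whichever loss is chosen is what populates the Kronecker-factored curvature statistics, so the choice of actions determines which Fisher is being approximated.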

ikostrikov commented 4 years ago

I thought that OpenAI used the empirical Fisher as well, because of the sampling here: https://github.com/openai/baselines/blob/9ee399f5b20cd70ac0a871927a6cf043b478193f/baselines/acktr/acktr.py#L49

joshromoff commented 4 years ago

It's a poor naming convention; the sampling does not imply that they're using the empirical Fisher.

ikostrikov commented 4 years ago

I took a closer look. Both implementations use the actual Fisher. Although the actions are stored in some replay buffer, for ACKTR they correspond to samples from the current policy.
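To illustrate the point (a hedged sketch with toy names, not the repository's code): since A2C/ACKTR is on-policy and no parameter update happens between collecting the rollout and building the Fisher loss, the stored actions are themselves samples from the current policy, so using them yields the same Monte Carlo estimate of the true Fisher as re-sampling fresh actions would.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
logits = torch.randn(8, 4, requires_grad=True)  # toy policy outputs for 8 states
dist = Categorical(logits=logits)

# On-policy rollout: actions are sampled from the *current* policy and stored.
stored_actions = dist.sample()

# Fisher loss built from the stored actions.  Because the policy has not been
# updated between sampling and this backward pass, this is a Monte Carlo
# estimate of the true Fisher, not the empirical Fisher.
pg_fisher_loss = -dist.log_prob(stored_actions).mean()
pg_fisher_loss.backward()
```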

joshromoff commented 4 years ago

I think you're right. It might be best to remove or change the comment here:

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/a6afcb3d24f57cdafdbddc305e5213f8444c8678/a2c_ppo_acktr/algo/a2c_acktr.py#L54

ikostrikov commented 4 years ago

Done in https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/commit/62e8d5896db155839056deb0fe60e0c05db0bf16