ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

GAIL uses AIRL reward function #236

Open HareshKarnan opened 4 years ago

HareshKarnan commented 4 years ago

I noticed that the predict reward function uses log(D(.)) - log(1 - D(.)) as the reward to update the generator. However, this is the reward function proposed in the AIRL paper, which minimizes the reverse KL divergence instead of the JS divergence as in GAIL. Is it common for implementations to swap out the GAIL reward for the AIRL reward?

https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/84a7582477fb0d5c82ad6d850fe476829dddd2e1/a2c_ppo_acktr/algo/gail.py#L103
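For reference, here is a minimal sketch of the reward variants being compared in this thread, assuming `d` is the raw (pre-sigmoid) discriminator output as in the linked `predict_reward`. The helper name `discriminator_reward` and the `variant` argument are illustrative, not part of the repository:

```python
import torch


def discriminator_reward(d, variant="airl", eps=1e-8):
    # d: raw discriminator logits for (state, action) pairs, shape (batch, 1),
    #    with higher values meaning "more expert-like" in this repo's convention.
    s = torch.sigmoid(d)  # D(s, a) in (0, 1)

    if variant == "airl":
        # log D - log(1 - D): the AIRL-style reward used at the linked line
        return s.log() - (1 - s).log()
    if variant == "gail":
        # log D: the reward from the GAIL paper's algorithm box
        return (s + eps).log()
    if variant == "gan_alt":
        # -log(1 - D): the alternative (non-saturating) GAN-style reward
        return -(1 - s + eps).log()
    raise ValueError(f"unknown variant: {variant}")
```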

ruleGreen commented 3 years ago

I am also confused. What should I do if I just want the GAIL loss? Just reward = -(1 - s).log()?

HareshKarnan commented 3 years ago

[image: the algorithm section from the GAIL paper]

If we look at the algorithm section of the GAIL paper, the proposed reward is log(D(.)), so just use that. For numerical stability, add 1e-8 inside the log term, i.e. log(D(.) + 1e-8), so you don't get a huge negative reward when the discriminator output is zero.

You can also try -log(1 - D(.) + 1e-8), the alternative GAN reward; both variants are sketched below.
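Putting the suggestions together, a hedged sketch of how the reward line in `Discriminator.predict_reward` (the line linked above) could be swapped out. Only the reward computation is shown; names follow the linked file at the commit in the URL, so verify against your checkout:

```python
# a2c_ppo_acktr/algo/gail.py, inside Discriminator.predict_reward (sketch only)
s = torch.sigmoid(d)   # d: discriminator output for the (state, action) batch
eps = 1e-8

# current AIRL-style reward:
# reward = s.log() - (1 - s).log()

# GAIL-paper reward, with the stability epsilon suggested above:
reward = (s + eps).log()

# alternative GAN-style reward:
# reward = -(1 - s + eps).log()
```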