ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

I couldn't get good results for GAIL in any environment except HalfCheetah. #204

Open slee01 opened 5 years ago

slee01 commented 5 years ago

Hi, first of all, thank you for sharing your code.

I've been trying to train GAIL using the expert demonstrations from your Google Drive. I used the hyper-parameters from gail_experts/readme and got good results for HalfCheetah, but worse results than I expected for the others, such as Hopper, Ant, and Walker2d. (I couldn't test Reacher; I suspect the expert data, which is only 240KB, has some problem.) I tried again with different hyper-parameters, including the seed, but unfortunately still got the same results. Could you share the parameters you used for the environments that failed for me? It would help the comparison tests for my research a lot.

ikostrikov commented 5 years ago

For the moment, the easiest way to fix the problem is to change the reward function and turn normalization off: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/algo/gail.py#L98

See the comments here: https://github.com/openai/imitation/blob/99fbccf3e060b6e6c739bdf209758620fcdefd3c/policyopt/imitation.py#L146

You need to use this reward specifically:

rewards_B = -tensor.log(1.-tensor.nnet.sigmoid(scores_B))
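
In PyTorch terms, here is a minimal sketch of that reward, assuming the discriminator outputs raw logits as in gail.py (the helper name gail_reward and the small epsilon are illustrative, not part of the repository):

import torch

def gail_reward(disc_logits):
    # r = -log(1 - sigmoid(D(s, a))), which equals softplus(D(s, a)).
    # Unlike log(D) - log(1 - D), this reward is strictly positive.
    s = torch.sigmoid(disc_logits)
    return -torch.log(1.0 - s + 1e-8)

If I understand the linked line correctly, turning normalization off means returning this value directly rather than dividing it by the running standard deviation of the returns.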
slee01 commented 5 years ago

This was very helpful to me.

I figured out that the standard deviation of the reward from the discriminator is much higher than that of the reward from the MuJoCo simulators.

I also understood that the reward range should be different depending on how episode termination is handled.

I finally got good results after modifying the reward function.

But I'm not sure why the value network can be trained without reward normalization.

And I'm wondering whether there is a reason you normalize the reward from the discriminator, given that its standard deviation is so high.

I think clipping is more appropriate than normalization for the discriminator reward.

Could you comment on these questions, please?

Thanks!
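
For concreteness, a rough sketch of the two options being compared, assuming a tensor of per-step discriminator rewards; the function names and the clip bound are illustrative assumptions, not the repository's code:

import torch

def normalize_reward(reward, running_return_var, eps=1e-8):
    # Rescale by the running standard deviation of the discounted returns,
    # which appears to be what the repo's predict_reward does by default.
    return reward / ((running_return_var + eps) ** 0.5)

def clip_reward(reward, bound=10.0):
    # Alternative raised above: clamp the raw discriminator reward instead of
    # rescaling it, so its scale is preserved but outliers are bounded.
    return reward.clamp(-bound, bound)

Clipping keeps the reward on a fixed scale throughout training, whereas dividing by a running statistic changes the effective scale as the discriminator improves.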

wang88256187 commented 4 years ago

Hi, I've run into a similar problem; my results with GAIL are always bad. Could you share your experience with this problem in more detail? Thank you very much!