ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

GAIL with Wasserstein distance #205

Closed slee01 closed 5 years ago

slee01 commented 5 years ago

Hi, thank you for your great repo!

I've been trying to implement InfoGAIL on top of your repo, and I'm wondering about your opinion on GAIL with Wasserstein distance. InfoGAIL is based on GAIL with Wasserstein distance, and I checked that the reward function of the discriminator in the authors' repo was just the scaled discriminator output. However, as far as I know, even a perfectly trained discriminator can produce only positive, or only negative, values for both agent and expert data, because the WGAN loss considers only the gap between its outputs on fake and real data. I'm not sure, but I guess this is why you didn't implement the discriminator with Wasserstein distance in your repo...

Could you share your opinion on what I'm confused about? I would really appreciate any help.
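To make the invariance concrete: the WGAN critic loss depends only on the gap between the mean scores for expert and agent data, so adding any constant to every critic output leaves the loss unchanged. A minimal NumPy sketch (illustrative only, not code from either repo):

```python
import numpy as np

def wgan_critic_loss(expert_scores, agent_scores):
    # WGAN critic objective: maximize E[f(expert)] - E[f(agent)],
    # i.e. minimize the negative gap. Only the *difference* matters.
    return -(np.mean(expert_scores) - np.mean(agent_scores))

expert = np.array([0.5, 1.0, 1.5])
agent = np.array([-1.0, -0.5, 0.0])

# Shifting every critic output by the same constant leaves the loss
# unchanged, so a perfectly trained critic may assign only positive
# (or only negative) scores to both expert and agent data alike.
shifted_expert = expert + 100.0
shifted_agent = agent + 100.0

assert np.isclose(wgan_critic_loss(expert, agent),
                  wgan_critic_loss(shifted_expert, shifted_agent))
```

This is why the raw critic output has no fixed sign or scale, and why using it directly as a reward is not straightforward.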

Kailiangdong commented 4 years ago

Hello, I'm trying to implement GAIL with Wasserstein distance, but I don't know what to do with the reward from the discriminator. I've tried many variants and none of them work. I also looked at the InfoGAIL code but couldn't understand it well. Can you explain more about "in the authors' repo was just the scaled discriminator output"?

Kailiangdong commented 4 years ago

In the openai/baselines code, the GAIL reward is `self.reward_op = -tf.log(1 - tf.nn.sigmoid(generator_logits) + 1e-8)`, where `generator_logits` is the output of the discriminator network for generator data. I read some papers and decided to change it to `self.reward_op = generator_logits`, but it doesn't work.
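One detail that may matter here: `-log(1 - sigmoid(x))` is mathematically identical to `softplus(x)`, so the baselines reward is always strictly positive, while the raw logits can go negative, which changes the agent's incentive to keep episodes alive. A NumPy sketch of the comparison (illustrative, not the baselines code itself):

```python
import numpy as np

def gail_reward(logits):
    # openai/baselines-style GAIL reward: -log(1 - sigmoid(logits) + 1e-8)
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(-logits)) + 1e-8)

logits = np.array([-3.0, 0.0, 3.0])
rewards = gail_reward(logits)

# -log(1 - sigmoid(x)) == log(1 + exp(x)) == softplus(x), so this
# reward is always positive. Swapping in the raw logits instead
# produces rewards that can be negative, so the agent may prefer
# ending episodes early rather than collecting negative reward.
softplus = np.log1p(np.exp(logits))
assert np.allclose(rewards, softplus, atol=1e-6)
assert (rewards > 0).all()
```

That sign difference alone can make a previously working agent fail, independent of whether the Wasserstein critic itself is trained correctly.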

Kailiangdong commented 4 years ago

```python
path["rewards"] = np.ones(path["raws"].shape[0]) * 1.2 + \
                  output_d.flatten() * 0.2 + \
                  np.sum(np.log(output_p) * path["encodes"], axis=1)
```

Is this the code in InfoGAIL you mentioned? Thank you
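For what it's worth, that line appears to combine three terms (this is only my reading, assuming markdown stripped the `*` operators; `infogail_reward` and its argument names are illustrative, not the authors' code):

```python
import numpy as np

# Hedged reading of the quoted InfoGAIL reward: a constant per-step
# survival bonus, the scaled discriminator (critic) output -- the
# "scaled discriminator output" mentioned above -- and the
# log-likelihood the posterior network assigns to the sampled code.
def infogail_reward(output_d, output_p, encodes, n_steps):
    survival = np.ones(n_steps) * 1.2                       # constant bonus
    wgan_term = output_d.flatten() * 0.2                    # scaled critic output
    code_term = np.sum(np.log(output_p) * encodes, axis=1)  # log q(c | s, a)
    return survival + wgan_term + code_term

# Toy example: two timesteps, two latent codes (one-hot `encodes`).
output_d = np.array([[1.0], [2.0]])              # critic outputs per step
output_p = np.array([[0.5, 0.5], [0.25, 0.75]])  # posterior over codes
encodes = np.array([[1.0, 0.0], [0.0, 1.0]])     # sampled one-hot codes
rewards = infogail_reward(output_d, output_p, encodes, 2)
```

The constant offset sidesteps the sign ambiguity of the raw critic output, and the `0.2` scale keeps the critic term from dominating the survival bonus.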