uidilr / gail_ppo_tf

TensorFlow implementation of Generative Adversarial Imitation Learning (GAIL) with discrete actions
MIT License

Rollout of the policy and collecting data, especially reward #20

Closed: mehdimashayekhi closed this issue 5 years ago

mehdimashayekhi commented 6 years ago

Hi, thanks for sharing this. Quick question: here, where you collect rewards, https://github.com/uidilr/gail_ppo_tf/blob/1dc3c3400d5b6329b49f3f18bdc89f3e475a023a/run_gail.py#L59, the appended reward does not correspond to the appended action; you are pairing the previous step's reward with the new action. I think you need to move the reward append to after the environment step, https://github.com/uidilr/gail_ppo_tf/blob/1dc3c3400d5b6329b49f3f18bdc89f3e475a023a/run_gail.py#L62, and append the reward there. For run_gail.py this happens to be harmless, because the environment reward is not used for updates (the discriminator reward is used instead), but for run_ppo.py I believe it is incorrect, especially when initially training the expert.
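To make the misalignment concrete, here is a minimal sketch of a rollout loop, assuming the classic gym step API; the environment name and the random-action stand-in for the policy are placeholders, not the repository's code:

```python
import gym

env = gym.make("CartPole-v0")

observations, actions, rewards = [], [], []
obs = env.reset()

for _ in range(200):
    # stand-in for the policy's action selection
    act = env.action_space.sample()

    observations.append(obs)
    actions.append(act)
    # BUG (original code): rewards.append(reward) here would store the
    # reward of the *previous* step next to the current action,
    # producing misaligned (s_t, a_t, r_{t-1}) tuples.

    obs, reward, done, _ = env.step(act)

    # FIX: append after env.step() so the reward belongs to the action
    # just taken, giving aligned (s_t, a_t, r_t) tuples.
    rewards.append(reward)

    if done:
        obs = env.reset()
```

This off-by-one only matters when the stored rewards feed the policy update, which is why run_gail.py (which learns from the discriminator's reward) masks the bug while run_ppo.py would not.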

uidilr commented 5 years ago

Hello @mehdimashayekhi, sorry for the late reply and thank you for pointing it out!

I just solved this issue and confirmed that it learns appropriately.