uidilr / gail_ppo_tf

TensorFlow implementation of Generative Adversarial Imitation Learning (GAIL) with discrete actions
MIT License
111 stars 29 forks

Question about the setting of 'stochastic' #1

Closed LinBornRain closed 6 years ago

LinBornRain commented 6 years ago

Hi Yusuke San: I really admire your coding skills. I have reviewed another GAIL implementation that uses TRPO, and after reading your GAIL code I noticed a common setting of the parameter 'stochastic'. In run_ppo or run_gail you call 'Policy.act()' with 'stochastic = True', but in test_policy you call 'Policy.act()' with 'stochastic = False'. So why do you use a STOCHASTIC policy during training but a DETERMINISTIC policy during testing? Thanks!
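
To make sure I understand the difference, here is a minimal stand-in for what the two call sites amount to, i.e. sampling vs. argmax over the action probabilities (a simplified sketch, not your actual Policy class):

```python
import numpy as np

def act(action_probs, stochastic=True):
    # Simplified stand-in for Policy.act(): `action_probs` is the
    # categorical distribution over discrete actions from the policy net.
    if stochastic:
        # Training (run_ppo / run_gail): sample an action for exploration.
        return int(np.random.choice(len(action_probs), p=action_probs))
    else:
        # Testing (test_policy): always pick the most likely action.
        return int(np.argmax(action_probs))

# Example with CartPole's two actions:
print(act(np.array([0.7, 0.3]), stochastic=True))   # random, biased toward 0
print(act(np.array([0.7, 0.3]), stochastic=False))  # always 0
```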

uidilr commented 6 years ago

Hi LinBornRain san: Thank you for asking! As you said, it is common to use a stochastic policy both for training and for testing a model. A stochastic policy is preferred in the partially observable setting (POMDP), where an optimal deterministic policy may not exist.
When the environment is an MDP, an optimal deterministic policy always exists. The reason I used a deterministic policy is that I was interested in whether the policy learned via GAIL and PPO is usable as an optimal deterministic policy in CartPole-v0 (an MDP). If you want to test a policy for a POMDP, you should test with a stochastic policy. Using a stochastic policy during training is for learning stability, by the way.

EDIT: I added a --stochastic argument (default True) to test_policy.py for consistency. Thanks for the suggestion!
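
Something along these lines (a simplified sketch of how the flag could be wired, not the exact code in test_policy.py):

```python
import argparse

def str2bool(v):
    # Parse 'True'/'False'-style strings so the flag can be toggled from the CLI.
    return str(v).lower() in ('true', '1', 'yes')

def argparser():
    parser = argparse.ArgumentParser()
    # Hypothetical wiring; the actual test_policy.py may differ.
    parser.add_argument('--stochastic', type=str2bool, default=True)
    return parser.parse_args()

if __name__ == '__main__':
    args = argparser()
    print('stochastic =', args.stochastic)
    # later, inside the rollout loop:
    # action, value = Policy.act(obs=obs, stochastic=args.stochastic)
```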

LinBornRain commented 6 years ago

Thank you for your detailed reply, and you are welcome! I am really inspired by your point about stochastic policies for POMDPs. I haven't tried a stochastic policy at test time with your code yet. But when I tried to reproduce the results of the GAIL TRPO version (https://github.com/andrewliao11/gail-tf), the results turned out to be quite weird whenever I evaluated with a stochastic policy, even though that is the consistent choice. So I am still confused about using a deterministic policy for evaluation. I guess I should check it one more time...

uidilr commented 6 years ago

I have just tested the stochastic GAIL policy 3 times in my code, and the number of episodes needed to beat the env was 227, 99 (optimal), and 99 (optimal). When I use the deterministic policy, I believe the result was 99 episodes every time. From this experience, I think it is easier to evaluate a policy in deterministic mode, in the sense of how many runs are needed for a reliable evaluation. Hope it helps!
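
For reference, a minimal sketch of the kind of evaluation loop I mean, assuming a gym CartPole-v0 env and a `policy_fn` stand-in for Policy.act() (the solved threshold and episode cap are illustrative, not the repo's exact criterion):

```python
import numpy as np
import gym

def episodes_to_solve(policy_fn, stochastic, n_episodes=300, solved_reward=195.0):
    # Run CartPole-v0 until the average reward over the last 100 episodes
    # reaches the solved threshold, and report how many episodes that took.
    env = gym.make('CartPole-v0')
    rewards = []
    for episode in range(1, n_episodes + 1):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy_fn(obs, stochastic=stochastic)
            obs, reward, done, _ = env.step(action)
            total += reward
        rewards.append(total)
        if len(rewards) >= 100 and np.mean(rewards[-100:]) >= solved_reward:
            return episode
    return None  # not solved within n_episodes

# Comparing the two modes over a few runs:
# for _ in range(3):
#     print(episodes_to_solve(trained_policy, stochastic=True))
#     print(episodes_to_solve(trained_policy, stochastic=False))
```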

LinBornRain commented 6 years ago

Really appreciate your efforts. Actually, I have come to the same conclusion: evaluating the stochastic policy in deterministic mode seems to give better and more reasonable performance when reproducing GAIL-TRPO, even though, as you said, it is not consistent with training. By the way, I tried using a deterministic policy while training GAIL on the MuJoCo Hopper-v1 env, and it turned out badly in GAIL-TRPO. Looking forward to seeing your next work!