nikhilbarhate99 / PPO-PyTorch

Minimal implementation of clipped objective Proximal Policy Optimization (PPO) in PyTorch
MIT License

Learning from scratch without using pre-trained model #15

Closed EnnaSachdeva closed 4 years ago

EnnaSachdeva commented 4 years ago

I tried running test.py (PPO.py) from scratch on the LunarLander-v2 environment, without using the pre-trained model, but it does not seem to learn: the episodic returns are still negative after 15000 episodes. How many episodes did it take to get the trained model?

nikhilbarhate99 commented 4 years ago

Hey, have you tried training it multiple times? Or did you change the hyper-parameters? I have been able to train it within 1500 episodes on average with the current hyper-parameters (although it sometimes gets stuck in a local maximum). Also, I added 2 commits to address some issues mentioned in #10 and #8, and have not tested the algorithm since. Can you please try with the earlier version and let me know?
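For anyone following along, the clipped-surrogate update this repo is named after can be sketched as below. The tensors and the 0.2 clip range are illustrative assumptions for one update batch, not the repo's exact code:

```python
import torch

# Hypothetical tensors standing in for one PPO update batch.
old_logprobs = torch.tensor([-1.2, -0.7, -2.0])  # log-probs under the old policy
new_logprobs = torch.tensor([-1.0, -0.9, -1.5])  # log-probs under the current policy
advantages   = torch.tensor([ 0.5, -0.3,  1.0])  # estimated advantages
eps_clip = 0.2  # assumed clip range

# Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
ratios = torch.exp(new_logprobs - old_logprobs)

# Clipped surrogate: take the pessimistic (minimum) of the unclipped
# and clipped objectives, then negate to get a loss to minimize.
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
loss = -torch.min(surr1, surr2).mean()
```

The clamp keeps any single update from moving the policy too far from the one that collected the data, which is what makes PPO comparatively stable with fixed hyper-parameters.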

EnnaSachdeva commented 4 years ago

I am running test.py and PPO.py from the master branch (I hope all the recent changes are pushed there), and I ran the code as is, with no changes to the hyperparameters; I just commented out the `load_state_dict` line. These are some of the rewards I am getting:

Episode: 14994 Reward: -51
Episode: 14995 Reward: -188
Episode: 14996 Reward: -214
Episode: 14997 Reward: -403
Episode: 14998 Reward: -169
Episode: 14999 Reward: -64
Episode: 15000 Reward: -252

Also, I am using this version of the code with a small grid-world environment, and it does not seem to learn there either.

nikhilbarhate99 commented 4 years ago

Ahh... I see. The test.py file is NOT for training; it is a utility file to load and run pre-trained policies. Please run the PPO.py file for training.
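The split between the two scripts comes down to which side of the save/load boundary they sit on. A minimal sketch of that pattern, using a hypothetical tiny network in place of the repo's actor-critic model and an in-memory buffer in place of the .pth checkpoint file:

```python
import io
import torch
import torch.nn as nn

# Hypothetical tiny policy standing in for the repo's actor-critic network.
# LunarLander-v2 has an 8-dim observation and 4 discrete actions.
policy = nn.Linear(8, 4)

# The training script ends by saving the learned weights;
# an in-memory buffer stands in for the checkpoint file here.
buffer = io.BytesIO()
torch.save(policy.state_dict(), buffer)

# The evaluation script starts by loading those weights into a fresh
# network -- this is the load_state_dict line mentioned in this thread.
buffer.seek(0)
fresh_policy = nn.Linear(8, 4)
fresh_policy.load_state_dict(torch.load(buffer))
```

Commenting out the `load_state_dict` call in an evaluation-only script therefore just runs a randomly initialized policy, which matches the persistently negative rewards reported above.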

Also, I ran some tests now on the Lunar Lander env and it seems to train just fine.

EnnaSachdeva commented 4 years ago

Ohh, my bad. I was using only PPO.py for my custom environment (with the obvious hyperparameter changes), and it does not seem to work there. Anyway, thanks!