Closed EnnaSachdeva closed 4 years ago
Hey, have you tried training it multiple times? Or did you change the hyper-parameters? With the current hyper-parameters I have been able to train it within 1500 episodes on average (although it sometimes gets stuck in a local optimum). Also, I added 2 commits to address some issues mentioned in #10 and #8, and have not tested the algorithm since. Can you please try with the earlier version and let me know?
I am running test.py and PPO.py from the master branch (I hope all the recent changes are pushed in these), and I ran the code as-is; I only commented out the `load_state_dict` line, with no changes to the hyperparameters. These are some of the rewards I am getting:
Episode: 14994 Reward: -51
Episode: 14995 Reward: -188
Episode: 14996 Reward: -214
Episode: 14997 Reward: -403
Episode: 14998 Reward: -169
Episode: 14999 Reward: -64
Episode: 15000 Reward: -252
Also, I am using this version of the code with a small grid-world environment, and it does not seem to learn there either.
Ahh, I see. The test.py file is NOT for training; it is a utility file to load and run pre-trained policies. Please run the PPO.py file for training.
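For reference, the split between the two files follows the standard PyTorch checkpoint pattern: training saves a `state_dict`, and evaluation loads it. This is just an illustrative sketch; the network shape and checkpoint filename below are placeholders, not the repo's actual code.

```python
import torch
import torch.nn as nn

# Placeholder policy network (LunarLander-v2 has an 8-dim observation
# and 4 discrete actions; the hidden size here is arbitrary).
def make_policy():
    return nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

# What PPO.py-style training does at the end: save the learned weights.
policy = make_policy()
torch.save(policy.state_dict(), "ppo_checkpoint.pth")

# What test.py-style evaluation does: load the saved weights into a
# fresh network and switch to eval mode -- no gradient updates happen.
eval_policy = make_policy()
eval_policy.load_state_dict(torch.load("ppo_checkpoint.pth"))
eval_policy.eval()
```

Commenting out the `load_state_dict` call in test.py just means you are rolling out an untrained, randomly initialized policy, which is why the returns stay negative.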
Also, I ran some tests just now on the LunarLander env and it seems to train just fine.
Ohh, my bad. I was using only PPO.py for my custom environment (with the obvious hyperparameter changes), and it does not seem to work. Anyway, thanks!
I tried running test.py (PPO.py) from scratch on the LunarLander-v2 environment, without using the pre-trained model, but it does not seem to learn even after 15000 episodes; the episodic returns are still negative. How many episodes did it take to get the trained model?