rajfly opened this issue 1 month ago
@sven1977 Hi, I was wondering if you could take a look at this? I've tried running PPO via the tuned examples given and it works fine. However, it fails to train when I customize it to use certain hyperparameters from the original PPO paper, which makes me suspect that either something is wrong with my configuration or with one of the RLlib functions used during the customization. It might be the latter, because I've already found two potential bugs, one with `preprocessor_pref` and another with `.evaluation`, both discussed further below. I posted this on discuss.ray.io here about 3 weeks ago and have yet to receive a response.
What happened + What you expected to happen
I can't seem to replicate the original PPO algorithm's performance when using RLlib's PPO implementation. The hyperparameters used are listed below; they follow the settings discussed in an ICLR blog post that aims to replicate the results from the original PPO paper (without LSTM).
Hyperparameters
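For reference, here is a minimal sketch of how such a config might look in RLlib. The hyperparameter values are the Atari settings from the ICLR blog; the builder and argument names follow the Ray 2.20 API as I understand it, so treat the exact names as assumptions rather than the literal reproduction script:

```python
# Minimal config sketch (values from the ICLR blog's Atari settings;
# builder/argument names assume the Ray 2.20 API).
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("ALE/Alien-v5", clip_rewards=True)
    .env_runners(num_env_runners=8, rollout_fragment_length=128)
    .training(
        lr=2.5e-4,                 # linearly annealed in the blog's setup
        gamma=0.99,                # discount factor
        lambda_=0.95,              # GAE lambda
        clip_param=0.1,            # PPO surrogate clip coefficient
        entropy_coeff=0.01,        # entropy bonus
        vf_loss_coeff=0.5,         # value-loss weight
        grad_clip=0.5,             # global gradient-norm clipping
        num_sgd_iter=4,            # epochs per train batch
        sgd_minibatch_size=256,    # 4 minibatches per batch
        train_batch_size=8 * 128,  # 8 env runners x 128-step fragments
    )
)
algo = config.build()
```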
I have tried these same hyperparameters with the Baselines, Stable Baselines3, and CleanRL implementations of PPO, and they all achieved the expected results: on the Alien and Pong environments, for example, the agents reached mean rewards above 1000 and 20 respectively. The RLlib agent, however, fails to train at all, achieving only approx. 240 and -20 mean rewards on Alien and Pong respectively, as seen in the training curves below. Am I missing something in my RLlib configuration (see reproduction script 1), or is there a bug (or intentional discrepancy) in RLlib's PPO implementation?
RLlib PPO Alien Mean Reward Training Curve
RLlib PPO Pong Mean Reward Training Curve
![image](https://github.com/ray-project/ray/assets/35727146/241c2806-a168-48ba-a3e1-0333c6471c52)
Other Issues Found
On a side note, I also found two other issues during my experiments.
Firstly, setting `preprocessor_pref="deepmind"` in `.env_runners()` does not seem to work at all, so I had to manually configure the environment via a custom function `make_atari` and attach the wrappers there (see the sketch after this paragraph). To test this issue, use reproduction script 2. The modification with respect to reproduction script 1 is that script 2 does not use the custom `make_atari` function with the required wrappers and instead uses `preprocessor_pref="deepmind"`, which demonstrates the issue.
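A hedged sketch of that manual workaround: register a custom env creator that applies RLlib's DeepMind-style Atari wrappers directly instead of relying on `preprocessor_pref="deepmind"`. The name `make_atari` mirrors the custom function from the reproduction script, and `wrap_deepmind`'s signature is assumed from RLlib's `atari_wrappers` module:

```python
# Workaround sketch: apply the DeepMind wrappers manually via a
# registered env creator, bypassing the preprocessor_pref path.
import gymnasium as gym
from ray.tune.registry import register_env
from ray.rllib.env.wrappers.atari_wrappers import wrap_deepmind

def make_atari(env_config):
    # frameskip=1 disables ALE's built-in skipping so the wrappers handle it.
    env = gym.make("ALE/Alien-v5", frameskip=1)
    # Assumed signature: 84x84 grayscale resize plus frame stacking.
    return wrap_deepmind(env, dim=84, framestack=True)

register_env("alien_deepmind", make_atari)
# The config can then use .environment("alien_deepmind") in place of the
# raw ALE id.
```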
Secondly, while evaluating, specifying the number of episodes in `.evaluation()` seems to have no effect at all: the evaluation returns results over a different number of episodes than what was specified. This is why there is a `while` loop towards the end of both reproduction scripts, to accumulate the correct number of episodes (sketched below). You can easily observe this in the `info.json` file automatically generated after running the scripts, by looking at the `initial_eval_episodes` field.
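For concreteness, a rough sketch of that `while`-loop workaround, continuing from the config sketch above. The result-dict keys are assumptions based on Ray 2.20's old API stack metrics, not a confirmed layout:

```python
# Request N evaluation episodes up front, then keep calling evaluate()
# until at least that many episodes have actually been collected.
config = config.evaluation(
    evaluation_duration=100,
    evaluation_duration_unit="episodes",
)
algo = config.build()

target_episodes = 100
episode_returns = []
while len(episode_returns) < target_episodes:
    results = algo.evaluate()["evaluation"]  # assumed result layout
    # Assumed key: per-episode returns from the evaluation rollouts.
    episode_returns.extend(results["hist_stats"]["episode_reward"])

mean_return = sum(episode_returns[:target_episodes]) / target_episodes
print(f"Mean return over first {target_episodes} episodes: {mean_return}")
```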
I have also posted all of the aforementioned issues on discuss.ray.io here. Any help/feedback on this would be greatly appreciated. Thanks in advance!
Versions / Dependencies
Python version: 3.11
Ray version: 2.20.0
OS: Ubuntu LTS
Reproduction script
To run the Python file for both scripts:
`python file_name.py --env Alien --gpu 0 --trials 1`
Reproduction script 1
Reproduction script 2
Issue Severity
High: It blocks me from completing my task.