Open rajfly opened 1 month ago
@rajfly Are you able to use examples.atari.atari_ppo.py, passing the arguments from your custom configuration, or do you need to build your own custom PPO trainer on Atari?
@dantp-ai Hi, thanks for your prompt reply. I did use examples.atari.atari_ppo.py as a base and modified it from there to fit the original PPO implementation. For example, in the original Baselines implementation of PPO, the two output heads of the network are initialized in the same way as the convolutional layers, but with an orthogonal std of 0.01 instead of np.sqrt(2). Stable Baselines3 and CleanRL also do this by default, and as far as I could tell, Tianshou does not. I therefore had to add this custom functionality, as seen in the actor_init() and critic_init() functions in ppo_atari.py above. So you might see some similarities when comparing the code above with examples.atari.atari_ppo.py.
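For reference, here is a minimal sketch of that initialization scheme in the style of CleanRL's layer_init helper; this is an illustrative assumption, not the exact actor_init()/critic_init() code from the attached ppo_atari.py:

```python
import numpy as np
import torch
from torch import nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight init with gain `std` and constant bias, as done by
    # Baselines, Stable Baselines3, and CleanRL.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer

n_actions = 6  # e.g. the discrete action count for Pong (illustrative)

# Hidden/convolutional layers keep the default gain of sqrt(2) ...
conv = layer_init(nn.Conv2d(4, 32, kernel_size=8, stride=4))
# ... while the output heads use the small gain described above.
policy_head = layer_init(nn.Linear(512, n_actions), std=0.01)
value_head = layer_init(nn.Linear(512, 1), std=0.01)
```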
TL;DR: No, I was not able to just use examples.atari.atari_ppo.py by passing in arguments, since the available arguments were insufficient; instead, I had to modify examples.atari.atari_ppo.py to better fit the original PPO implementation.
Thanks! I will look into it and see how I can help.
@dantp-ai Thanks for your help! Also, perhaps this might help narrow down the issue: I tested on 56 Atari games, and Tianshou failed to learn anything at all on the majority of them, only doing well on very simple games (approximately 5 out of the 56). For example, see the Boxing game below, with Tianshou in green. So the agent can learn and can compete with the Baselines, Stable Baselines3, and CleanRL implementations, but only in very simple environments, which is strange.
Thanks for reporting! Keeping the performance of the algorithms and examples in good shape is our highest priority (otherwise, what's the point ^^).
It seems the small performance tests that run in CI were not enough to catch this. I have been training PPO agents on MuJoCo with the current Tianshou version with no issues, so maybe it only affects discrete envs.
We will look into it ASAP. The first thing to clarify is whether the problem was caused by the recent refactorings, by going back to version 0.5.1 and running on Atari there.
Btw, before the 2.0.0 release of Tianshou we will implement #935 and #1110, as well as check in a script that reproduces the results currently displayed in the docs. From there on, all releases will be guaranteed to have no performance regressions. At the moment we're not there yet.
@rajfly Were you able to verify that the reward scalings and reward outputs are consistent across the experiments with the different RL libraries (OpenAI Baselines, Stable Baselines3, CleanRL)?
@dantp-ai Yes. I used the same Atari wrappers for all of the experiments, so the rewards were only clipped, and this clipping was applied identically across all RL libraries. Furthermore, when comparing the reward outputs, I used statistical techniques such as stratified bootstrap confidence intervals (SBCI) to combat stochasticity and obtain more accurate estimates. In particular, for each RL library tested, I ran 5 trials for each of the 56 Atari environments, which amounts to 56 × 5 = 280 trials per RL library. I then took the mean reward over the last 100 training episodes as the score for a single trial and human-normalized it. The plot below compares the human-normalized scores attained by the different RL libraries across the 56 environments, using SBCI. The bands are 95% confidence intervals, and it can be seen that Baselines, Stable Baselines3, and CleanRL have consistent scores across most metrics (IQM refers to the interquartile mean).
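For clarity, a rough sketch of the scoring and aggregation described above (the function names, the use of scipy, and the fake data are assumptions for illustration, not the actual evaluation code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def human_normalize(score, random_score, human_score):
    # Human-normalized score for a single trial, as described above.
    return (score - random_score) / (human_score - random_score)

def iqm(scores):
    # Interquartile mean: 25%-trimmed mean over all (game, trial) scores.
    return stats.trim_mean(scores.reshape(-1), proportiontocut=0.25)

def stratified_bootstrap_ci(scores, reps=2000, alpha=0.05):
    # scores: (n_games, n_trials); resample trials within each game (stratified).
    n_games, n_trials = scores.shape
    estimates = []
    for _ in range(reps):
        idx = rng.integers(0, n_trials, size=(n_games, n_trials))
        estimates.append(iqm(np.take_along_axis(scores, idx, axis=1)))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return iqm(scores), (lo, hi)

# Hypothetical per-game reference scores (illustrative values only).
normalized = human_normalize(score=1500.0, random_score=200.0, human_score=7000.0)

# Example with fake data: 56 games x 5 trials of human-normalized scores.
fake_scores = rng.random((56, 5))
point_iqm, (ci_lo, ci_hi) = stratified_bootstrap_ci(fake_scores)
```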
I can’t seem to replicate the original PPO algorithm's performance when using Tianshou's PPO implementation. The hyperparameters used are listed below; they follow the hyperparameters discussed in an ICLR blog post that aims to replicate the results of the original PPO paper (without LSTM).
Hyperparameters
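(For reference, a sketch of the commonly cited Atari PPO settings from that blog post and CleanRL's defaults; these are assumed here for illustration and may differ from the exact values used in the runs below.)

```python
# Assumed reference values (not necessarily the exact configuration used below).
ppo_atari_hparams = {
    "total_frames": 40_000_000,   # 10M agent steps with frame skip 4
    "num_envs": 8,
    "rollout_length": 128,        # environment steps per env per update
    "learning_rate": 2.5e-4,      # linearly annealed to 0
    "adam_eps": 1e-5,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "num_minibatches": 4,
    "update_epochs": 4,
    "clip_coef": 0.1,
    "ent_coef": 0.01,
    "vf_coef": 0.5,
    "max_grad_norm": 0.5,
}
```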
I have tried these same hyperparameters with the Baselines, Stable Baselines3, and CleanRL implementations of the PPO algorithm and they all achieved the expected results. However, the Tianshou agent fails to train at all, as seen in the training curves below (Tianshou's PPO trials are shown in green). Am I missing something in my Tianshou configuration (see reproduction scripts) or is there a bug (or intentional discrepancy) in Tianshou's PPO implementation?
Tianshou training curves in green for 5 games when compared to other implementations
NOTE: The y-axis and x-axis represent the mean reward and in-game frames (40 million in total), respectively.
Other Issues Found
Another issue is that for some games, such as Atlantis, BankHeist, or YarsRevenge, training can sometimes stop unexpectedly with the following error, though I am not entirely sure why:
Reproduction Scripts
Run command:
python ppo_atari.py --gpu 0 --env Alien --trials 5
Main Script (ppo_atari.py):
Dependencies of Main Script (include these 3 scripts in same directory as the main script):
atari_network.py
atari_wrapper.py
common.py