openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
MIT License

Reward is much lower when using "--play" #1171

Open Llermy opened 3 years ago

Llermy commented 3 years ago

I'm training models on Mujoco environments with the PPO2 algorithm on the tf2 branch of the project. During training, the reward slowly gets higher, as expected. What is not expected is that when training has finished (or when I load a previously trained model), the model does not seem to perform well and the reward is much lower than the one reported during the training phase.

As an example, I trained a model for 2e5 timesteps in the HalfCheetah-v2 environment, and during training the reward increased to somewhere between 200 and 300. However, right after training finished and the model was run, the reported reward was only a little more than 10. You can see the results in this picture:

[Screenshot 2021-09-17 152224]

These results were obtained by running: python -m baselines.run --alg=ppo2 --env=HalfCheetah-v2 --network=mlp --num_timesteps=2e5 --log_path=logs\cheetah --play

Training on different Mujoco environments or for more timesteps doesn't change this. Or am I doing something wrong?

198808xc commented 3 years ago

Solution: look into the info variable and check for the existence of info[...]['episode'] -- if it exists, the entire game has ended and we can output info[...]['episode']['r'] as the reward.

======

I also had the same issue.

I dug into the running process and found that the step-wise rew value returned in --play mode is always 0 or 1, regardless of the actual reward.

To make things clear, run.py contains the following line: obs, rew, done, _ = env.step(actions)

When I test the SpaceInvaders game, killing an invader should give a reward of 5, 10, ..., 30, but the returned rew value is always 1.0. The same thing happens when I test the Breakout game, where an upper-level block should be rewarded with 4, but the returned rew value is always 1.

Also, the done variable is set to True every time the agent is killed, even when it still has more lives (so the game has not actually ended). That is to say, the reward value shown in --play mode is actually the reward obtained with each single life, not over the entire game.

I managed to work around the second bug by using obs, rew, done, info = env.step(actions) and checking the info[...]['ale.lives'] value to see whether the game is REALLY done. However, I have not found a way to solve the first bug yet. Still working.
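For reference, here is a minimal sketch of that intermediate check, assuming the same loop variables as above and a vectorized env where info is a list of per-env dicts (the 'ale.lives' key is what the Atari environments report):

```python
# Intermediate fix sketch: keep `info` and only treat the game as over when no
# lives remain ('ale.lives' is reported by the Atari environments; `done`
# fires on every lost life, as observed above).
obs, rew, done, info = env.step(actions)
for i, env_info in enumerate(info):      # one dict per vectorized sub-env
    if done[i] and env_info.get('ale.lives', 0) == 0:
        print('game in env %d is really over' % i)
```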

======

I finally found the key to this.

The fact is, when the entire game ends, the result is returned through the info variable. It looks something like: info = [{'ale.lives': 0, 'episode': {'r': 65.0, 'l': 273, 't': 3.922398}}], where the field r contains the REAL reward, i.e. the score shown on the screen.

So, continuing the previous post, my solution is to check the info[...]['ale.lives'] value -- if it is 0, I then check the info[...]['episode']['r'] value, which should be what we want (it aligns with the eprewmean value in the training log and matches the results reported in RL papers).

So, the final answer is: directly check for the existence of info[...]['episode'] -- if it exists, the entire game has ended and we can output info[...]['episode']['r'] as the reward.

BTW, I just made it work on SpaceInvaders and Breakout, but I am not sure if the solution works for all games.
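For anyone who wants to try this, here is a minimal sketch of a --play loop along these lines. Variable names (model, env, state) follow the pattern in baselines/run.py, but the exact loop may differ between branches, so treat it as a sketch rather than a drop-in patch:

```python
import numpy as np

obs = env.reset()
state = model.initial_state if hasattr(model, 'initial_state') else None
dones = np.zeros((1,))

while True:
    if state is not None:
        actions, _, state, _ = model.step(obs, S=state, M=dones)
    else:
        actions, _, _, _ = model.step(obs)

    # Keep `info` instead of discarding it with `_`.
    obs, rew, dones, info = env.step(actions)
    env.render()

    # The Monitor wrapper adds an 'episode' entry only when the whole episode
    # is over; its 'r' field is the real score shown on the screen.
    for env_info in info:                # one dict per vectorized sub-env
        if 'episode' in env_info:
            print('episode reward: {:.1f}'.format(env_info['episode']['r']))
```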

Llermy commented 3 years ago

Thanks @198808xc, I've tested your answer on some Mujoco environments and it behaves the same way there. During training, the reported reward mean is computed from info[...]['episode']['r'], while when playing afterwards the reported reward is taken from the reward returned by the outermost env.step(actions).

Mujoco environments are wrapped in a Monitor wrapper and then in an outer VecEnv wrapper (VecNormalize). Digging through the code, it seems info[...]['episode']['r'] is populated by the Monitor wrapper, which takes the raw reward of the Mujoco environment and only rounds it. But the VecEnv wrapper then applies the following transformation to the reward:

rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)

which is a normalization by the running standard deviation of the discounted return, followed by clipping. This is the reward that comes out of the outer env.step(actions) and is reported during --play, which is why it differs from the reward reported during training.
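To make the effect concrete, here is a small self-contained numpy sketch of this kind of return-based normalization. It is a simplified, single-environment stand-in for baselines' RunningMeanStd / VecNormalize, with made-up reward values: raw rewards in the hundreds come out as single-digit normalized values, which matches the gap between eprewmean in the training log and the reward printed during --play.

```python
import numpy as np

# Simplified, single-environment sketch of the reward normalization applied by
# VecNormalize (the real code uses RunningMeanStd over vectorized envs).
class RewardNormalizer:
    def __init__(self, gamma=0.99, cliprew=10.0, epsilon=1e-8):
        self.gamma = gamma        # discount used for the running return
        self.cliprew = cliprew    # clip range for the normalized reward
        self.epsilon = epsilon    # numerical stability term
        self.ret = 0.0            # running discounted return
        self.count, self.mean, self.var = 0, 0.0, 1.0

    def _update(self, x):
        # Welford-style running mean/variance of the discounted return.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, rew):
        self.ret = self.ret * self.gamma + rew
        self._update(self.ret)
        # Same shape as the line quoted above: divide by the std of the
        # discounted return, then clip.
        return float(np.clip(rew / np.sqrt(self.var + self.epsilon),
                             -self.cliprew, self.cliprew))

norm = RewardNormalizer()
for raw in [200.0, 250.0, 300.0]:         # made-up HalfCheetah-scale rewards
    print(raw, '->', round(norm.normalize(raw), 2))
```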