Llermy opened this issue 3 years ago
Solution: look into the `info` variable and check for the existence of `info[...]['episode']` -- if it exists, then the entire game has ended and we can output `info[...]['episode']['r']` as the reward.
======
I also had the same issue.
I dug into the running process and found that the step-wise `rew` value returned in `--play` mode is always 0 or 1, regardless of the actual reward.
To make things clear, run.py contains the following code:
`obs, rew, done, _ = env.step(actions)`
When I test the SpaceInvaders game, killing an invader should give a reward of 5, 10, ..., 30, but the returned `rew` value is always 1.0. The same thing happens when I test the Breakout game, where an upper-level block should be worth 4 points, but the returned `rew` value is always 1.
Also, the `done` variable is set to True every time the agent is killed, even though it sometimes still has lives left (so the game has not actually ended). That is to say, the reward value shown in `--play` mode is actually the reward obtained with each single life, not over the entire game.
I managed to work around the second problem by using `obs, rew, done, info = env.step(actions)` and checking the `info[...]['ale.lives']` value to see whether the game is REALLY done. However, I have not found a way to solve the first problem yet. Still working on it.
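The always-0/1 step rewards and the per-life `done` are consistent with what the standard Atari wrappers do (clip rewards to their sign and signal `done` on every lost life). Here is a minimal sketch of the lives-based check described above -- not the actual run.py loop. It assumes a single vectorized Atari env as built by baselines (so `rew`, `done`, `info` are per-env arrays/lists) and a trained `model`; how the actions are produced is only illustrative.

```python
# Minimal sketch (not the actual run.py code): keep playing until the agent
# has no lives left instead of stopping at the first done signal.
obs = env.reset()
clipped_sum = 0.0
while True:
    actions = model.step(obs)[0]              # illustrative; use your own policy call
    obs, rew, done, info = env.step(actions)
    clipped_sum += rew[0]                     # still the clipped 0/1 per-step reward
    if done[0] and info[0].get('ale.lives', 0) == 0:
        # done fires on every lost life; only stop once no lives remain
        print('game really over, clipped reward sum:', clipped_sum)
        break
```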
======
I finally found the key.
When the entire game ends, the system returns the result through the `info` variable. It looks something like:
`info = [{'ale.lives': 0, 'episode': {'r': 65.0, 'l': 273, 't': 3.922398}}]`,
where the field `r` contains the REAL reward, i.e. the score shown on the screen.
So, continuing my previous post, my solution is to check the `info[...]['ale.lives']` value -- if it is 0, I read the `info[...]['episode']['r']` value, which should be what we want (it aligns with the `eprewmean` value in the training log, and matches the results reported in RL papers).
So, the final answer: directly check for the existence of `info[...]['episode']` -- if it exists, the entire game has ended and we can output `info[...]['episode']['r']` as the reward.
BTW, I have only verified this on SpaceInvaders and Breakout, so I am not sure whether the solution works for all games.
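A minimal sketch of that final check, under the same assumptions as the snippet above (vectorized env, `info` as a list of dicts, illustrative `model`): just step until an 'episode' entry shows up in `info` and report its 'r' field.

```python
# Minimal sketch: report the real game score by waiting for the 'episode'
# entry that appears in info once the whole game has ended.
obs = env.reset()
while True:
    actions = model.step(obs)[0]              # illustrative; use your own policy call
    obs, rew, done, info = env.step(actions)
    ep = info[0].get('episode')
    if ep is not None:
        # 'r' is the unclipped episode return, 'l' the episode length in steps
        print('episode reward:', ep['r'], 'length:', ep['l'])
        break
```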
======
Thanks @198808xc, I've tested your answer on some MuJoCo environments and it behaves the same way there. During training, the reported reward mean is calculated from `info[...]['episode']['r']`, while when playing afterwards the reported reward is taken from the reward returned by the outermost `env.step(actions)`.
MuJoCo environments are wrapped in a `Monitor` wrapper and then in an outer `VecEnv` wrapper. Digging into the code, it seems `info[...]['episode']['r']` is populated in the `Monitor` wrapper, which takes the MuJoCo environment's reward directly and only rounds it. The `VecEnv` wrapper, however, then applies the following transformation to the reward:
`rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)`,
which seems to be some kind of normalization plus clipping. This is the reward that comes out of the outer `env.step(actions)` and is reported in the `--play` part, which is why it is different.
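To illustrate how much this changes the numbers, here is a standalone re-implementation of that single line with made-up values -- `ret_var`, `epsilon` and `cliprew` here are assumptions for the example, not values read from the real wrapper (there, the variance appears to be a running estimate over discounted returns):

```python
import numpy as np

# Assumed example values; in the real wrapper the variance is a running
# statistic and the clipping bound / epsilon come from its constructor.
ret_var = 900.0    # assumed running variance of the discounted returns
epsilon = 1e-8     # assumed numerical-stability constant
cliprew = 10.0     # assumed clipping bound

def normalize_reward(rews):
    """The quoted transformation: scale by the return std, then clip."""
    return np.clip(rews / np.sqrt(ret_var + epsilon), -cliprew, cliprew)

raw = np.array([3.0])                 # some raw per-step environment reward
print(normalize_reward(raw))          # ~[0.1], much smaller than the raw reward
```

So the per-step reward printed in `--play` is on this normalized scale, while `info[...]['episode']['r']` from the `Monitor` wrapper keeps the environment's raw scale, hence the mismatch with the training log.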
======
I'm training models on MuJoCo environments with the PPO2 algorithm on the tf2 branch of the project. During training, the reward slowly increases, as expected. What is not expected is that, once training has finished (or when loading a previously trained model), the model does not seem to perform well and the reward is much lower than what was reported during the training phase.
As an example, I trained a model for 2e5 timesteps on the HalfCheetah-v2 environment, and during training the reward increased to roughly between 200 and 300. However, right after training finished and the model was run, the reported reward was only a little more than 10. You can see the results in the attached picture.
These results were obtained by running:
python -m baselines.run --alg=ppo2 --env=HalfCheetah-v2 --network=mlp --num_timesteps=2e5 --log_path=logs\cheetah --play
Training on different MuJoCo environments or for more timesteps doesn't change this. Or am I doing something wrong?