hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[Question] total_episode_reward_logger is handled incorrectly due to the way dones are stored #1049

Closed · zhaopansong closed this issue 3 years ago

zhaopansong commented 3 years ago

Take the PPO2 code as an example. Suppose we experience N steps and get done=True from the environment at the N-th step, i.e. at index n_step = N-1. In the code that computes the advantage in reverse order: at subscript t, i.e. step t, the environment returns done_t = False, which is stored in dones[t+1], and you use 1 - dones[t+1] to multiply values[t+1].
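For reference, the reverse loop looks roughly like this (a paraphrase of the baselines-style GAE computation with toy inputs of my own, not a verbatim quote of the source):

```python
import numpy as np

# Roughly the baselines-style GAE backward pass (paraphrased, not the
# exact stable-baselines source; the toy inputs below are mine).
n_steps, gamma, lam = 4, 0.99, 0.95
rewards = np.ones(n_steps)
mb_values = np.zeros(n_steps)
mb_dones = np.array([False, False, False, False])  # done from step t-1
final_done = True   # done returned at the last step, held outside the array
last_values = 0.0   # bootstrap value of the state after the rollout

mb_advs = np.zeros(n_steps)
last_gae_lam = 0.0
for t in reversed(range(n_steps)):
    if t == n_steps - 1:
        nextnonterminal = 1.0 - final_done   # final done lives outside mb_dones
        nextvalues = last_values
    else:
        # index shift: the done *returned* at step t is *stored* at t + 1
        nextnonterminal = 1.0 - mb_dones[t + 1]
        nextvalues = mb_values[t + 1]
    delta = rewards[t] + gamma * nextvalues * nextnonterminal - mb_values[t]
    mb_advs[t] = last_gae_lam = delta + gamma * lam * nextnonterminal * last_gae_lam

print(mb_advs)
```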

All of the above is fine. However, the final done at subscript N-1 is never stored in the dones array, because `mb_dones.append(self.dones)` runs before `self.obs[:], rewards, self.dones, infos = self.env.step(clipped_actions)` (PPO2, lines 477 and 482): the first done stored is the default value, and the done returned at the last step is never stored. The total_episode_reward_logger function scans dones[] for any True value to decide whether an episode has completed, so at least the first episode's reward is never logged. The problem is caused by the way dones are stored. Stable-Baselines3 seems to store dones the same way, so if my understanding is right, SB3 should be adjusted accordingly.
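A minimal sketch of the storage pattern (illustrative only, not the actual stable-baselines source; names are mine):

```python
# Sketch of the PPO2-style rollout loop (illustrative, not the real
# source). The done flag is appended *before* env.step(), so
# mb_dones[t] holds the done produced at step t-1, and the done from
# the final step is never appended at all.
n_steps = 4
done = False                   # default value before the first step
mb_dones = []
for t in range(n_steps):
    mb_dones.append(done)      # stores the *previous* step's done
    done = (t == n_steps - 1)  # pretend the episode ends on the last step

print(mb_dones)       # [False, False, False, False]
print(any(mb_dones))  # False: a logger scanning mb_dones misses the episode end
```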

araffin commented 3 years ago

Hello,

For total_episode_reward_logger, there are already some open issues: https://github.com/hill-a/stable-baselines/issues/236

For SB3, you are saying that the advantage computation is wrong?

Probably related: https://github.com/DLR-RM/stable-baselines3/issues/105

zhaopansong commented 3 years ago

> Hello,
>
> For total_episode_reward_logger, there are already some open issues: #236
>
> For SB3, you are saying that the advantage computation is wrong?
>
> Probably related: DLR-RM/stable-baselines3#105

My question is not the same as #236: that one asks for an average reward. My point is that even if the first episode completes, the final done is never stored in the dones[] array, so the logger function never sees that the first episode is done. This repeats on every rollout if the environment is turn-based with the number of rounds exactly equal to n_steps (as in my case), as the toy demo below shows. So if you address this by changing the way dones[] is stored, the value-calculation code should be changed accordingly.
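```python
# Hypothetical demo: episode length equals n_steps, so the terminal done
# of each rollout only lands at index 0 of the *next* rollout's mask.
# The first rollout contains no True at all, and every later episode
# boundary is shifted by one full rollout.
n_steps = episode_len = 5
done = False
for rollout in range(3):
    mask = []
    for t in range(n_steps):
        mask.append(done)             # previous step's done, PPO2-style
        done = (t == episode_len - 1)
    print("rollout", rollout, mask)
# rollout 0 [False, False, False, False, False]
# rollout 1 [True, False, False, False, False]
# rollout 2 [True, False, False, False, False]
```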

araffin commented 3 years ago

> My question is not the same as #236: that one asks for an average reward.

but then, what do you want? It is still unclear to me...

What I wanted to say is that total_episode_reward_logger is buggy (and it's a known issue); please use a wrapper, such as the Monitor wrapper, for proper logging.
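For example (a minimal sketch using the SB2 `Monitor` wrapper; the log-file prefix is an arbitrary choice of mine):

```python
import gym

from stable_baselines import PPO2
from stable_baselines.bench import Monitor

# Wrap the env with Monitor *before* training, so episode rewards and
# lengths are recorded at the wrapper level, independently of how the
# rollout buffer stores dones. "ppo2_monitor" is just a log-file prefix.
env = Monitor(gym.make("CartPole-v1"), filename="ppo2_monitor")

model = PPO2("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

print(env.get_episode_rewards())  # per-episode returns seen by the wrapper
```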

zhaopansong commented 3 years ago

> > My question is not the same as #236: that one asks for an average reward.
>
> but then, what do you want? It is still unclear to me...
>
> What I wanted to say is that total_episode_reward_logger is buggy (and it's a known issue); please use a wrapper, such as the Monitor wrapper, for proper logging.

OK, I will switch to SB3. Thank you!