Closed zhaopansong closed 3 years ago
Hello,
for total_episode_reward_logger
, there are already some issues opened: https://github.com/hill-a/stable-baselines/issues/236
For SB3, you are saying that the advantage computation is wrong?
Probably related: https://github.com/DLR-RM/stable-baselines3/issues/105
Hello,
for
total_episode_reward_logger
, there are already some issues opened: #236For SB3, you are saying that the advantage computation is wrong?
Probably related: DLR-RM/stable-baselines3#105
My question is not same as #236. he would like to an average reward. My point is, even if the first episode is completed, since the final done is not stored in array dones[], the logger function never see that the first episode is done. This situation would repeat if the envrionment turn-based with num_round equal to num_step exactly (in m case). So if you addreess this by chaing the way of storing done[], the value calculation code should be changed accordingly.
My question is not same as #236. he would like to an average reward.
but then, what do you want? It is still unclear to me...
What I wanted to say is that total_episode_reward_logger
is buggy (and it's known), please use a wrapper, like a Monitor
wrapper for proper logging.
My question is not same as #236. he would like to an average reward.
but then, what do you want? It is still unclear to me...
What I wanted to say is that
total_episode_reward_logger
is buggy (and its known), please use a wrapper, like aMonitor
wrapper for proper logging.
OK, I will switch to SB3,thank you
Take example of code in PPO2. Suppose the we experience N steps and get done =True from envirnment at step N or n_step=N-1. In the code handling reversel calculating advantage: at subscrip t, i.e, step t, env return done t=False which is store in done[t+1] you use 1-done[t+1] to multiply value[t+1]
All above is fine. However,since the final done at subscript N-1 is not store in the array done[],
mb_dones.append(self.dones) self.obs[:], rewards, self.dones, infos = self.env.step(clipped_actions)
(PPO2 Line 477,482) (the first done stored is default, the last done stored is not the final done) in the total_episode_reward_logger function, you use done[] to judge any True existed to see whether a episode is completed and that is a problem causing at least the first episode reward is not logged. This problem is caused by the way of storing done. It seems in Stable-Baseline3, you use a same way of storing dones. If my understanding is right, you can adjust SB3 accordingly.