vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

PPO Envpool doesn't account for episode change #475

Open pseudo-rnd-thoughts opened 1 month ago

pseudo-rnd-thoughts commented 1 month ago

Problem Description

For PPO rollouts, it is critical to account for the episode changes that occur when an environment autoresets. In Gym (and Gymnasium < 1.0), the vector environment autoresets within `step`, puts the actual final observation into `info`, and the observation returned alongside done==True is already the next episode's first observation. As a result, the CleanRL Gym-based rollouts handle this correctly: the next episode's observation is nullified by next_done. This is discussed in https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

However, this strategy doesn't work for EnvPool (and, most critically, the Gymnasium 1.0 alphas). These vector environments shift the reset to the step after done==True. With the current implementation, the rollout therefore stores obs=[t_0, t_1, t_2, t_0] and done=[False, False, True, False] for an episode of three observations followed by one observation of the next episode.
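To make the difference concrete, here is an illustrative sketch (the arrays are hypothetical, not CleanRL's actual buffers) of what a rollout stores for one environment under the two autoreset styles:

```python
# Episode observations t_0, t_1, t_2 (t_2 terminal); the next episode starts at t0_new.

# Gym / Gymnasium < 1.0: the reset happens inside the same step() call, the true
# final observation t_2 only appears in `info`, and the obs stored alongside
# done=True is already the new episode's first obs, so next_done cuts the
# bootstrap correctly.
gym_obs = ["t_0", "t_1", "t0_new"]
gym_done = [False, False, True]

# EnvPool / Gymnasium 1.0 alphas: the reset is delayed by one step, so the
# terminal obs t_2 is stored with done=True, and the sample at that index pairs
# t_2 with t0_new as its next obs even though the action taken from t_2 is
# never actually applied. That boundary sample is the one that needs masking.
envpool_obs = ["t_0", "t_1", "t_2", "t0_new"]
envpool_done = [False, False, True, False]
```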

In my head (I still need to confirm this with a working example), we need to ignore the loss for the done+1 sample, if that makes sense, i.e., the obs/next_obs pair that sits just after an episode ends. I've been scrolling through the Gym and EnvPool code and don't see anything that already does this (I might be wrong, apologies if I am).

TL;DR

Current behaviour: EnvPool PPO computes the loss between the last observation of one episode and the first observation of the next (critical because Gymnasium 1.0 is shifting to an EnvPool-style reset).

Expected behaviour: the loss between the last and first observations of consecutive episodes should be zeroed / masked out.

Possible Solution

Include a mask on the loss / advantage for the done+1 samples so that these values are ignored; a sketch follows.
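A hedged sketch of what that could look like inside the optimisation loop of `ppo_atari_envpool.py`, reusing the script's variable names and assuming the rollout `dones` are also flattened into a hypothetical `b_dones = dones.reshape(-1)` (not done in the current script); the value loss is shown for the unclipped case:

```python
mb_mask = 1.0 - b_dones[mb_inds]        # 0 for episode-boundary (done+1) samples
n_valid = mb_mask.sum().clamp(min=1.0)  # avoid dividing by zero

# Swap the plain .mean() reductions for masked means so the boundary samples
# contribute nothing to the gradients.
pg_loss = (torch.max(pg_loss1, pg_loss2) * mb_mask).sum() / n_valid
v_loss = 0.5 * (((newvalue.view(-1) - b_returns[mb_inds]) ** 2) * mb_mask).sum() / n_valid
entropy_loss = (entropy * mb_mask).sum() / n_valid
loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef
```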

Steps to Reproduce

To do; this is largely theoretical thinking on my side. I suspect this hasn't been noticed because EnvPool is relatively niche and episodes are long enough that the effect is difficult to spot.

sdpkjc commented 1 month ago

I encountered this problem while using the EnvPool official interface in my self-driving project. I decided to mask the done+1 samples, implementing the masking as the samples are extracted from the replay buffer, so that the continuity of the environment iteration remains unaffected.
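That alternative might look something like the following sketch (hypothetical names, not sdpkjc's actual code): build minibatch indices only from non-boundary samples, so the update never sees a done+1 transition:

```python
import numpy as np

b_dones = dones.reshape(-1).cpu().numpy()   # flattened rollout done flags
valid_inds = np.flatnonzero(b_dones == 0)   # drop the done+1 boundary samples
np.random.shuffle(valid_inds)
for mb_inds in np.array_split(valid_inds, args.num_minibatches):
    ...  # run the usual PPO update on b_obs[mb_inds], b_actions[mb_inds], etc.
```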

EdanToledo commented 2 weeks ago

Hey, I thought I'd also chime in here. I noticed this difference and simply made a wrapper to achieve the same auto-reset style as the Gym API. My wrapper is here if that helps.
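For reference, the idea behind such a wrapper (this is a minimal single-env Gymnasium sketch of the concept, not the linked wrapper itself) is to perform the reset inside the same step that reports done and to stash the true final observation in `info`:

```python
import gymnasium as gym


class SameStepAutoReset(gym.Wrapper):
    """Hypothetical sketch: restore Gym-style autoreset semantics, so the
    observation returned alongside a done signal is already the next
    episode's first observation."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated or truncated:
            info["final_observation"] = obs   # keep the true terminal obs
            obs, _ = self.env.reset()         # return the new episode's first obs
        return obs, reward, terminated, truncated, info
```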