Closed: JulianoLagana closed this issue 5 years ago
This is still an issue. Any idea if this will be addressed?
The baselines TimeLimit wrapper seems to solve this: https://github.com/openai/baselines/blob/d80acbb4d1f8b3359c143543dd3f21a3b12679c8/baselines/common/retro_wrappers.py#L6
Would that work for you? We may want to consider putting that in gym.
Their wrapper has exactly the same problem, since it overwrites the `done` flag when a timeout occurs.
By the way, I already solved it for my project by overriding gym's `time_limit` wrapper and adding some custom code. I created the issue because, based on my current understanding, the current implementation makes all time-limited environments unnecessarily more difficult to train. This is not a matter of adding functionality; it's about implementing time limits in the "correct" way (as in, time limits that don't transform the MDP into a POMDP, for instance).
Well, if you can recover (`env_done`, `timeout_done`) from the `info` dict (which you can almost do with the one I linked), then the information would not be lost, right? If you include the current step count along with the observation, that seems like it would keep the MDP gym environments MDPs. Though, for instance, the Atari environments are POMDPs.
You are right, @christopherhesse. I overlooked the fact that the `info` dictionary has the relevant information for deciding whether the `done` flag was overwritten.
You are also correct that adding the current step count to the observation retains observability in the MDP. However, the article I mentioned explains that this will actually yield a time-dependent policy (e.g., one that uses the knowledge that it is close to the end of an episode to change its behavior). There are cases where this is indeed what we want (if the original environment is actually supposed to end after a certain number of time-steps). However, in most scenarios we add the `TimeLimit` wrapper because the original environment does not terminate after a certain number of time-steps, but we still want episodes not to take too long, so that the training data has more diversity.
To illustrate what I wrote above, imagine a continuing environment where the agent is supposed to learn how to move horizontally as fast as possible (something like Hopper, for instance). The optimal policy we're looking for is something that can be applied for an infinite number of steps and would give us the highest expected return. Probably something that learns how to move efficiently. If we add a time limit to this environment (maybe to speed up training), at, say, 500 time-steps, and add time as part of the observation, the agent could probably learn to make a big leap forward when the episode is close to its end, to maximize its horizontal displacement in exchange for probably falling. This is because the consequence of falling would never be accounted for by the agent, since after falling (or maybe right before) the episode would end. This would definitely not be the policy we're looking for.
It would be much better in this particular case to do what they prescribe in Time Limits in Reinforcement Learning, which is bootstrapping from all transitions that ended in a terminal state due to a timeout (and not bootstrapping from all other terminal states, as usual). They showed that this actually yields the time-invariant policy we're looking for in this type of environment.
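The prescription above could be sketched as follows for a value-based target computation. This is only an illustrative sketch, not the paper's or gym's API; `td_target` and its arguments are names I made up, and the point is simply that a true terminal state grounds the target to the immediate reward while a timeout keeps bootstrapping:

```python
def td_target(reward, next_value, terminal, timeout, gamma=0.99):
    """One-step TD target that bootstraps through timeouts.

    terminal: the episode ended because the MDP reached a terminal state.
    timeout:  the episode was cut short by an artificial time limit.
    """
    if terminal and not timeout:
        # True terminal state: its value is zero by definition,
        # so the target is just the immediate reward.
        return reward
    # Non-terminal transition, or one cut off by the time limit:
    # keep bootstrapping from the value of the next state.
    return reward + gamma * next_value
```

Note that this only works if the transition carries separate `terminal` and `timeout` flags, which is exactly what the current wrapper conflates.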
Sounds good, but I guess my question is, what needs to change in the TimeLimit wrapper to make it possible to do that? I think the Gym TimeLimit wrapper should likely support this case, but bootstrapping from particular transitions sounds like it's up to the algorithm.
One idea is: when a time-out occurs, instead of overwriting the `done` flag, add another key to the `info` dictionary that is set to `True`.
Additionally, I agree with you that deciding whether or not to bootstrap from a particular transition is a decision made by the algorithm, independent of the environment. However, this decision will depend on whether or not a terminal state was reached or if a time-out occurred. At the moment the environment conflates both cases, making it impossible for the algorithm to decide according to what I described previously.
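The idea above could look roughly like the following. This is a self-contained sketch, not gym's actual wrapper: the class name and the `timeout` info key are made up, and it is written as a plain class (a real version would subclass `gym.Wrapper`) so it runs without gym installed:

```python
class InfoTimeLimit:
    """Ends episodes after `max_episode_steps`, but records in `info`
    whether `done` was caused by the timeout rather than by the env,
    so the algorithm can decide whether to bootstrap."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps and not done:
            # The env itself did not terminate: this `done` is artificial,
            # so flag it instead of silently conflating the two cases.
            info['timeout'] = True
            done = True
        return obs, reward, done, info
```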
@JulianoLagana Maybe it's a bit off topic here, I was quite curious about the effects on Time Limit in policy optimization.
I've tested VPG (+GAE) on `HalfCheetah-v2` and `Hopper-v2`, where bootstrapping continues through episodes cut by the time limit, i.e. V(s_T) is not set to zero when the done=True at step T comes from the time limit.
But it seems this does not give better final performance. I am wondering if this is expected.
@zuoxingdong I suggest you read the Time-limits paper I mentioned. It's not a hard read at all, and skimming it would probably give you a lot of insight already. In short, correctly handling the time limits won't necessarily give you a performance boost (it may even do the opposite, like I mentioned in the Hopper example previously). Sometimes it does facilitate learning, since learning an MDP is easier than a POMDP in general, which can result in better performance. But that's not the point of it. It's more about making sure that the policy you learn does not exploit artificial time-limits that you created only to increase data variety (or making sure that it does exploit time-limits that are not artificial, but are actually important parts of the task).
@JulianoLagana @zuoxingdong has a PR here: https://github.com/openai/gym/pull/1402 any objections?
@christopherhesse I'm glad this is being addressed! I've added my thoughts to the PR.
Just wondering, why make it so complicated? A quick fix is to simply never reach the time limit during training. Honestly, 1000 time-steps for a typical MuJoCo environment is too long for training purposes.
@ZhaomingXie People usually cap the allowed number of time-steps as an easy way to increase data diversity.
The new TimeLimit wrapper from https://github.com/openai/gym/pull/1402 should let you recover the underlying `done` of the environment now.
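If I remember the merged version correctly, it records the cutoff under `info['TimeLimit.truncated']` before setting `done = True`. A small helper (the function name is mine, only the info key is taken from the PR) to recover both flags on the consumer side might look like:

```python
def split_done(done, info):
    """Split the wrapped `done` flag into (env_done, timeout_done),
    using the 'TimeLimit.truncated' info key set by the wrapper."""
    truncated = info.get('TimeLimit.truncated', False)
    # The env truly terminated only if done was not caused by truncation.
    return done and not truncated, truncated
```

An algorithm can then bootstrap from `s_prime` whenever `timeout_done` is set, and treat the state as terminal only when `env_done` is.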
I recently read the paper Time Limits in Reinforcement Learning, where the authors discuss the correct ways of dealing with time limits in reinforcement learning. Unfortunately, it seems that `gym` is not adhering to these recommendations. As it currently stands, the `time_limit` wrapper overwrites the `done` flag returned by the environment to `True` if a timeout occurred. This makes it impossible for an agent to know whether an episode is being terminated because of a timeout or because the agent reached a terminal state, thus conflating both scenarios. The problem with this is that time-limited environments are no longer fully observable, hence not MDPs, since their termination function depends on something (time, in this case) that is not observable to the agent.
In the paper the authors discuss the consequences of this to learning algorithms, and also show two ways to fix this issue. The first one is to simply add time as part of the state of the agent, if you really want the optimal policy of the MDP to be time-aware. I don't think this is in the spirit of most environments, and it increases the dimensionality of the state, which creates its own problems.
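The first fix could be sketched as an observation wrapper like the one below. This is an illustrative sketch, not the paper's or gym's code: the class name is made up, observations are assumed to be flat lists of floats, and it is a plain class (a real version would subclass `gym.ObservationWrapper`) so it runs standalone:

```python
class TimeAwareObservation:
    """Appends normalized elapsed time (t / T) to each observation,
    so a time-limited task stays fully observable (fix #1 from the
    paper), at the cost of one extra observation dimension."""

    def __init__(self, env, max_episode_steps):
        self.env = env
        self._max = max_episode_steps
        self._t = 0

    def _augment(self, obs):
        # Assumes obs is a flat list/tuple of floats.
        return list(obs) + [self._t / self._max]

    def reset(self, **kwargs):
        self._t = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._t += 1
        return self._augment(obs), reward, done, info
```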
The second one works for value-based RL algorithms, like DDQN. Here, whenever the agent computes its gradients from a transition whose resulting state (`s_prime`) is terminal, it computes its target as only the immediate reward, not bootstrapping from the value of the next state (so as to ground the value of the terminal next state to zero). However, if the agent uses a transition that resulted in a timeout, it should bootstrap from the value of the next state. This is why it seems necessary to either separate timeout flags from terminal flags, or leave the responsibility of restarting the episode to the user of the environment.