openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Time limits being handled incorrectly? #1230

Closed · JulianoLagana closed this issue 5 years ago

JulianoLagana commented 5 years ago

I recently read the paper Time Limits in Reinforcement Learning, where the authors discuss the correct ways of dealing with time limits in reinforcement learning. Unfortunately, it seems that gym is not adhering to these recommendations.

As it currently stands, the TimeLimit wrapper overwrites the done flag returned by the environment to True if a timeout occurred. This makes it impossible for an agent to know whether an episode is being terminated because of a timeout or because the agent reached a terminal state, thus conflating both scenarios.

The problem with this is that time-limited environments are then no longer fully observable, and hence no longer MDPs, since their termination depends on something (time, in this case) that is not observable to the agent.

In the paper the authors discuss the consequences of this for learning algorithms, and also show two ways to fix the issue. The first is to simply add time as part of the agent's state, if you really want the optimal policy of the MDP to be time-aware. I don't think this is in the spirit of most environments, and it increases the dimensionality of the state, which creates its own problems.

The second works for value-based RL algorithms, like DDQN. There, whenever the agent computes gradients from a transition whose resulting state (s_prime) is terminal, it computes its target as only the immediate reward, not bootstrapping from the value of the next state (so as to ground the value of the terminal next state to zero). However, if the transition resulted from a timeout, the agent should still bootstrap from the value of the next state. This is why it seems necessary to either separate timeout flags from terminal flags, or leave the responsibility of restarting the episode to the user of the environment.
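Roughly, what I mean for the value-based case is something like the sketch below. The separate timed_out flag and the replay-buffer layout are just assumptions for illustration, not anything gym currently provides:

```python
import numpy as np

def ddqn_targets(rewards, dones, timed_out, q_next, gamma=0.99):
    """Compute 1-step targets, bootstrapping through timeouts.

    rewards, dones, timed_out: arrays of shape (batch,)
    q_next: target-network value estimate of s_prime, shape (batch,)
    """
    # Bootstrap unless s_prime is a *true* terminal state of the MDP.
    # A timeout is not a terminal state, so in that case we keep the
    # bootstrap term instead of grounding the next value to zero.
    terminal = np.logical_and(dones, np.logical_not(timed_out))
    return rewards + gamma * np.where(terminal, 0.0, q_next)
```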

JulianoLagana commented 5 years ago

This is still an issue. Any idea if this will be addressed?

christopherhesse commented 5 years ago

The baselines TimeLimit wrapper seems to solve this: https://github.com/openai/baselines/blob/d80acbb4d1f8b3359c143543dd3f21a3b12679c8/baselines/common/retro_wrappers.py#L6

Would that work for you? We may want to consider putting that in gym.

JulianoLagana commented 5 years ago

Their wrapper has exactly the same problem, since it overwrites the done flag when a timeout occurs.

By the way, I already solved it for my project by overriding gym's TimeLimit wrapper with some custom code. I created the issue because, based on my current understanding, the current implementation makes all time-limited environments unnecessarily more difficult to train. This is not a matter of adding functionality; it's about implementing time limits in the "correct" way (that is, time limits that don't transform the MDP into a POMDP, for instance).

christopherhesse commented 5 years ago

Well, if you can recover (env_done, timeout_done) from the info dict (which you can almost do with the one I linked), then the information would not be lost, right? And if you include the current step count along with the observation, that seems like it would keep the gym MDP environments MDPs. (Though, for instance, the Atari environments are POMDPs anyway.)

JulianoLagana commented 5 years ago

You are right, @christopherhesse. I overlooked the fact that the info dictionary has the relevant information for deciding if the done flag was overwritten.

You are also correct in saying that adding the current step count to the observation retains observability in the MDP. However, the article I mentioned explains that this will actually yield a policy that is time-dependent (i.e., one that uses the knowledge of whether it is close to the end of an episode to change its behavior). There are cases where this is indeed what we want (if the original environment is actually supposed to end after a certain number of time-steps). However, in most scenarios we add the TimeLimit wrapper because the original environment does not terminate after a certain number of time-steps, but we still want episodes not to take too long, so that the training data has more diversity.

To illustrate what I wrote above, imagine a continuing environment where the agent is supposed to learn how to move horizontally as fast as possible (something like Hopper, for instance). The optimal policy we're looking for is one that can be applied for an infinite number of steps and would give us the highest expected return, probably something that moves efficiently. If we add a time limit to this environment (maybe to speed up training) at, say, 500 time-steps, and add time as part of the observation, the agent could well learn to make a big leap forward when the episode is close to its end, maximizing its horizontal displacement in exchange for probably falling. This is because the consequence of falling would never be accounted for by the agent, since the episode would end right after (or maybe right before) the fall. This is definitely not the policy we're looking for.

It would be much better in this particular case to do what they prescribe in Time Limits in Reinforcement Learning, which is to bootstrap from all transitions that ended because of a timeout (and not to bootstrap from transitions that ended in a true terminal state, as usual). They showed that this actually yields the time-invariant policy we're looking for in this type of environment.

christopherhesse commented 5 years ago

Sounds good, but I guess my question is, what needs to change in the TimeLimit wrapper to make it possible to do that? I think the Gym TimeLimit wrapper should likely support this case, but bootstrapping from particular transitions sounds like it's up to the algorithm.

JulianoLagana commented 5 years ago

One idea is: when a time-out occurs, instead of overwriting the done flag, add another key to the info dictionary that is set to True.

Additionally, I agree with you that deciding whether or not to bootstrap from a particular transition is a decision made by the algorithm, independent of the environment. However, that decision depends on whether a terminal state was reached or a time-out occurred, and at the moment the environment conflates both cases, making it impossible for the algorithm to decide in the way I described previously.
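A rough sketch of that first idea, assuming gym's four-tuple step API; the wrapper name, the 'timeout' info key, and the max_episode_steps argument are illustrative, not the wrapper gym actually ships:

```python
import gym


class TimeLimitInfo(gym.Wrapper):
    """Signal timeouts through `info` instead of overwriting `done`."""

    def __init__(self, env, max_episode_steps):
        super().__init__(env)
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        if self._elapsed_steps >= self._max_episode_steps:
            # Leave `done` untouched; the caller is responsible for
            # resetting the environment when it sees this key.
            info['timeout'] = True
        return obs, reward, done, info
```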

zuoxingdong commented 5 years ago

@JulianoLagana Maybe it's a bit off-topic here, but I was quite curious about the effect of time limits on policy optimization.

I've tested VPG (+GAE) on HalfCheetah-v2 and Hopper-v2, where the bootstrapping continues when the episode is cut by the time limit, i.e. I do not set V(s_T) to zero if the done=True at step T is a result of the time limit.

But it seems this does not give better final performance. I am wondering if this is expected.
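For reference, the variant I'm describing looks roughly like the sketch below (a single trajectory segment, with per-step timed_out flags assumed to be available; this is not my actual code or any particular library's):

```python
import numpy as np

def gae_advantages(rewards, values, dones, timed_out, last_value,
                   gamma=0.99, lam=0.95):
    """GAE that keeps bootstrapping through time-limit cuts.

    rewards, dones, timed_out: shape (T,) for one trajectory segment
    values: V(s_0) .. V(s_{T-1}), shape (T,)
    last_value: V(s_T), the value of the state reached after the last step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else last_value
        # Only a *true* terminal state zeroes the bootstrap term;
        # a time-limit cut keeps it (partial-episode bootstrapping).
        terminal = dones[t] and not timed_out[t]
        nonterminal = 0.0 if terminal else 1.0
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```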

JulianoLagana commented 5 years ago

@zuoxingdong I suggest you read the time-limits paper I mentioned. It's not a hard read at all, and even skimming it would probably give you a lot of insight. In short, correctly handling time limits won't necessarily give you a performance boost (it may even do the opposite, as in the Hopper example I mentioned previously). Sometimes it does facilitate learning, since learning an MDP is generally easier than learning a POMDP, which can result in better performance. But that's not the point. It's more about making sure that the policy you learn does not exploit artificial time limits that you created only to increase data variety (or making sure that it does exploit time limits that are not artificial, but are actually an important part of the task).

christopherhesse commented 5 years ago

@JulianoLagana: @zuoxingdong has a PR here: https://github.com/openai/gym/pull/1402. Any objections?

JulianoLagana commented 5 years ago

@christopherhesse I'm glad this is being addressed! I've added my thoughts to the PR.

ZhaomingXie commented 5 years ago

Just wondering, why make it so complicated? A quick fix is to just never reach the time limit during training. Honestly, 1000 time steps for a typical MuJoCo environment is too long for training purposes.

JulianoLagana commented 5 years ago

@ZhaomingXie People usually cap the allowed number of time-steps as an easy way to increase data diversity.

christopherhesse commented 5 years ago

The new TimeLimit wrapper from https://github.com/openai/gym/pull/1402 should let you recover the underlying done of the environment now.
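On the consumer side that looks something like the sketch below. It assumes the 'TimeLimit.truncated' info key introduced by the PR and a hypothetical agent object with act/store methods; treat it as a sketch rather than canonical usage:

```python
def collect_episode(env, agent):
    """Roll out one episode, storing the *underlying* terminal signal.

    `env` is assumed to be wrapped in the new TimeLimit wrapper;
    `agent` is a hypothetical object with `act` and `store` methods.
    """
    obs = env.reset()
    done = False
    while not done:
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        # The wrapper still sets done=True on a timeout, but records the cut
        # in info; only a true terminal should stop bootstrapping.
        truncated = info.get('TimeLimit.truncated', False)
        agent.store(obs, action, reward, next_obs, done and not truncated)
        obs = next_obs
```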