openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev
Other
34.23k stars 8.59k forks source link

[Bug Report] TimeLimit wrapper logic has changed #3102

Open araffin opened 1 year ago

araffin commented 1 year ago

Describe the bug Before gym 0.26: https://github.com/openai/gym/blob/3498617bf031538a808b75b932f4ed2c11896a3e/gym/wrappers/time_limit.py#L51-L56

After: https://github.com/openai/gym/blob/54b406b7991e49ab917ed355d12971009e8248b7/gym/wrappers/time_limit.py#L53-L54

It should be truncated = truncated or not terminated to match previous behavior.

test was also updated in https://github.com/openai/gym/commit/54b406b7991e49ab917ed355d12971009e8248b7#diff-15620e7669a4869759c87e2b6a4f51ad87609ed2680a54047e2da2ff7fcc4149

If the change is intended, then it should also be mentioned in the release.

pseudo-rnd-thoughts commented 1 year ago

This was an intentional change that we should have mentioned in the release notes more. We did this as there is no reason why to restrict the truncation (in the general case) to only be True when terminated is False.

I could see us modify this assumption in the TimeLimit wrapper but I don't see why other than backward compatibility which was due to the limitations of the previous implementation

araffin commented 1 year ago

I would disagree on the semantic of it. If an environment is already terminated due to some condition, it is not over because of truncation, you should not bootstrap, so truncated should be False.

See comment by @XuehaiPan in https://github.com/openai/gym/issues/2510#issuecomment-1070027709 "truncated (the sampling is ended due to time limit): if True, stop further step() call due to time limit, the next observation should be valid (and done=True)"

The current TimeLimit implementation seems also to be inconsistent with https://github.com/openai/gym/blob/master/gym/utils/step_api_compatibility.py

EDIT: I agree terminated and truncated can be true, but here we know that termination is not due to truncation, so I think it should be False

pseudo-rnd-thoughts commented 1 year ago

I think a comment has gone missing in the compatibility file which explains I think the issue

As the old API only allowed 3 states, not allowing (termination=True and truncation=True) then in the conversion we favor termination over truncation if both are true. However, termination and truncation refer to different reasons that an episode might have ended. If users need to use bootstrapping or not, then users should just use terminated and ignore the truncation result.

Is there a separate issue to the backward compatibility?

araffin commented 1 year ago

We did this as there is no reason why to restrict the truncation (in the general case) to only be True when terminated is False.

who is "we"? You seem to disagree with @vwxyzjn (please read https://github.com/openai/gym/issues/2510#issuecomment-1092214830) who disagrees with @XuehaiPan (see https://github.com/openai/gym/issues/2510#issuecomment-1070027709)

This API change was meant to makes things clearer...

For me a clear option (that would also match previous behavior) would be:

However, termination and truncation refer to different reasons that an episode might have ended.

In the case here of the time limit, if an episode is already terminated, then you should not set truncated=True because it is already terminated (termination is not due to truncation), marking it truncated=True would mean you could bootstrap, which is not the case here. And because the episode is already terminated, the reason for termination is not the timelimit, so truncated should be False.

XuehaiPan commented 1 year ago

For the old API, the variable done = terminated or truncated. The TimeLimit wrapper overrides the inner environment's done output:

env.done=False env.done=True
truncated=False valid next obs wrapper.done=False invalid next obs wrapper.done=True
truncated=True valid next obs wrapper.done=True N/A

Now, in the new API, terminated is a new separate variable. So, we do not need to flip it in TimeLimit anymore.

terminated=False terminated=True
truncated=False valid next obs invalid next obs, need reset (gamma=0)
truncated=True valid next obs (reset is optional) invalid next obs, need reset (gamma=0)

I think terminated has a higher priority than truncated, so we should always check it in if statement before truncated.

For API conversion:

   old                  new

done      = terminated or truncated
truncated = not terminated and truncated
pseudo-rnd-thoughts commented 1 year ago

The "we" is myself, @RedTachyon and @arjun-kg who discussed this on discord.

This API change was meant to makes things clearer...

I agree this was the aim, but it seems that clarifying the meaning of the variables has caused confusion that I suspect already existed but did not have a place to show itself.

And because the episode is already terminated, the reason for termination is not the timelimit, so truncated should be False.

This seems to confuse the meanings of terminated and truncated as I believe that terminated means that a condition internal to the environment has ended an episode whereas truncated means a condition outside of the environment results in the episode ends. Therefore, I agree that it can be confusing for the episode to end with two results but they are for separate reasons that shouldn't interact with each other.

As @XuehaiPan notes above, I don't believe this a bug. It does break backward compatibility and requires training libraries from correctly only using terminated in their bootstraping (they shouldn't need to think about the truncation beyond stepping through the environments so this shouldn't be an issue imo). Therefore, I am 50\50 on if we should change the current implementation.

RedTachyon commented 1 year ago

My perspective is just that by allowing truncated==terminated==True we get strictly more information. If the terminal state happens to be reached at the same time as truncation would have occurred, we get the full picture with this semantic. The alternative is merging this outcome with regular termination, which I'll agree is probably good enough for a vast majority of uses. But having more information is generally preferable over actively removing some of that information.

araffin commented 1 year ago

The variable truncated means: hint time limit, reset is optional

truncation is not only due to timelimit, see below:

This seems to confuse the meanings of terminated and truncated as I believe that terminated means that a condition internal to the environment has ended an episode whereas truncated means a condition outside of the environment results in the episode ends.

I would disagree with that. For instance, in the car racing example, truncated=True after finishing a lap, this is internal to the env. The same goes for a robot that goes outside the tracking limits.

If the terminal state happens to be reached at the same time as truncation would have occurred, we get the full picture with this semantic.

but the truncation did not occur... so truncation should be False. If you follow the same logic, you would output at every timestep each termination condition that may have occurred? (and if you want additional/debug info, that what the info dict is meant for)

pseudo-rnd-thoughts commented 1 year ago

You are right that internal and external to the environment reasoning is incorrect, I think internal or external to the agent's observation would be more correct.

Thinking about this more, I understand the disagreement to be if the termination and truncation states are logically / causally linked.

My understanding is that you think that if when a state transitions, it reaches a termination state and the episode ends, therefore never reaches the truncation state making terminated=True and truncated=False.

An alternative interpretation is that the state transitions to a new state and step returns if this new state is a terminal state and if this state is a truncated state making terminated=True and truncated=True. With this interpretation, the terminated and truncated conditions are unlinked from each other meaning that either does not logically affect the other.

@araffin Is there any issue with the second interpretation? Or why do you favour your interpretation?

araffin commented 1 year ago

My understanding is that you think that if when a state transitions, it reaches a termination state and the episode ends, therefore never reaches the truncation state making terminated=True and truncated=False.

exactly

Or why do you favour your interpretation?

One reason is consistency, gym has behaved like that in the last 5 years and that would prevent surprises (this would break SB3 code silently for instance).

Another reason is the semantic of the timelimit wrapper. We use timelimit wrappers for different reasons: 1) prevent the agent from exploring uninteresting regions (for instance termination is normally due to reaching a goal), linked with 2): 2) improve the exploration of the agent (by exploring more reset states) 3) avoid infinite length for episode (for instance you usually train after each episode on a real robot) 4) ease comparison/evaluation of the agent (without timelimit, it's harder to compare and may take much longer, also linked with 2) and 3) where you want to check that your agent performs well for many starting states)

If you reach termination before the time limit, then none of 1), 2), 3) or 4) need a TimeLimit wrapper, you don't need to know if the timeout may have been reached (and timeout is an artificial way of terminating an episode) and that would also be confusing in that case (truncated=True but the episode didn't end because of that).

Note that I'm fine with terminated=True and truncated=True in other cases (for instance finishing a lap with the racing car). I would prefer for consistency and for simplicity to have terminated=truncated or terminated, if not then I would disallow terminated=True and truncated=True .

pseudo-rnd-thoughts commented 1 year ago

One reason is consistency, gym has behaved like that in the last 5 years and that would prevent surprises (this would break SB3 code silently for instance).

This is true but the limitation was solely due to the implementation forced only termination or truncation to be true. With our new API, this limitation no longer exists so I don't think we should limit ourselves just because the old API did.

Additionally, how is this causing issues with SB3? Training algorithms should be ignoring the truncated variable beyond stepping through an environment, therefore, should not cause issues in the training stage.

Could you explain why you are happy with terminated=True and truncated=True in other cases but not with the Time Limit wrapper? This seems contradictory.

I would prefer for consistency and for simplicity to have terminated=truncated or terminated, if not then I would disallow terminated=True and truncated=True .

The first option is just the old API and I still don't understand limiting the states for good theoretical or practical reasons.

araffin commented 1 year ago

With our new API, this limitation no longer exists so I don't think we should limit ourselves just because the old API did.

A software with more features doesn't mean better software. We should aim for the most intuitive/user friendly API, especially as gym is used by practitioners with various level of experience, and avoid silent logic change with the previous API (old API doesn't mean bad API). Gym is not new, having subtle changes like that will just bring confusion.

I also don't understand where you want to use that additional information? (as it won't be useful for resetting the agent, nor for training it). If you want to have additional information, then the previous and current API provides you with the info dict. In that case, you can add a info["time_limit_reached"]=True if you want that information. So I don't see it limited.

To answer the rest of your questions, let's talk about the intuitive API, in my opinion, there are two possibles intuitive use of terminated and truncated:

option 1 terminated is only about real termination and truncated is when the episode end is due to truncation. Bootstrapping is done if not terminated is true and conversion of code is: done = terminated or truncated, info["TimeLimit.truncated"]=truncated

With this option, terminated=truncated=True is not allowed.

This is the most intuitive explanation already used with the new API in some codes, for instance: https://github.com/perrin-isir/xpag/blob/main/xpag/wrappers/brax_vec_env.py#L136-L141

option 2 terminated is about when an episode ends and truncated is when the episode end is due to truncation. Bootstrapping is done if not terminated or truncated is true and conversion of code is: done = terminated, info["TimeLimit.truncated"]=truncated

With this option, terminated=truncated=True is allowed, but only when terminated signal is due to truncated=True (not the case here).

With what you propose, truncated=True but the episode truncation did not play a role... so not intuituve and breaking behavior (this behavior breaks SB3 code which relies on info["TimeLimit.truncated"]=truncated for bootstrapping).

XuehaiPan commented 1 year ago

option 2 terminated is about when an episode ends and truncated is when the episode end is due to truncation. Bootstrapping is done if not terminated or truncated is true and conversion of code is: done = terminated, info["TimeLimit.truncated"]=truncated

With this option, terminated=truncated=True is allowed, but only when terminated signal is due to truncated=True (not the case here).

@araffin As posted in https://github.com/openai/gym/issues/3102#issuecomment-1265196682, the correct conversion is:

done                        = terminated or truncated
info["TimeLimit.truncated"] = not terminated and truncated
araffin commented 1 year ago

This is the current implementation in the compat method yes, but:

[1] most importantly, the argument that it is done that way because we get more information is weak if this information is not used (and for debug, there is the info dict)

pseudo-rnd-thoughts commented 1 year ago

Sorry, I forgot about this thread.

  1. You show below in the documentation that this case of both being true is intended. I agree this is not well publicised but we are aiming to do that better. This is a blog post explaining the decisions including a section on the suggested code changes https://farama.org/New-Step-API
  2. For the second documentation quote, the episode ends for two reasons so the documentation seems true to. We can make this more specify to say that the episode might not end solely due to the time limit but that is clear to the documentation quote above.
  3. We will make the blog post more public in the coming days which should deal with this problem.

The problem is partially language I think because in SB3, if to run bootstrapping is called timeout inferring that if a timeout occurs then bootstrap should not happen. However, with the new API, this intuition as you note is incorrect. I would suggest this variable should be renamed to bootstrap = terminated and not truncated which makes more sense I believe.

An additional thought is that all training libraries should be compatible with dealing with this case (terminated=True and truncated=True) due to environments possibly implementing internal truncation this way.

araffin commented 1 year ago

let me try to rephrase...

You show below in the documentation that this case of both being true is intended

the doc is consistent with the code but there is no explicit migration guide and mention of the breaking change. And even if it's the current behavior, I still disagree and argue for a change for all the reasons I mentioned above.

For the second documentation quote, the episode ends for two reasons so the documentation seems true to.

here, and that's mainly what this issue is about, I disagree with the logic. "the episode ends because of timeout" is not true if terminated=True, the time limit was reached yes, but if you remove the time limit, the episode still ends, so you cannot say it ends because of the timelimit, and for me, you should not say that the episode is truncated, because it is not (it has reached the state of time limit but is not truncated).

The problem is partially language

I also think it's partially language but mostly logic for me (for instance saying "timelimit was reached" instead of "episode ends because of timelimit" would partially solve this issue).

An additional thought is that all training libraries should be compatible with dealing with this case (terminated=True and truncated=True) due to environments possibly implementing internal truncation this way.

this goes directly against of what you propose, no? what you describe here is option 2 and if we follow the current documentation, if terminated=truncated=True for internal truncation, then the episode is not considered truncated (in the sense that we will not bootstrap and truncated=True will be ignored), which defeats the point of having two booleans...

EDIT: another instance that shows the current allowed states is not the most intuitive: https://github.com/sail-sg/envpool/pull/205/files#diff-ce18b2383ca873c94323ec7719ea561774f6e64c044e7874b2f47f3746351deaR89 (terminated and truncated are exclusive)

araffin commented 1 year ago

@pseudo-rnd-thoughts gym 0.26 has been released for almost two months already and this ~bug~ breaking change is still not mentioned in the migration guide, so it's still time to keep previous behavior and prevent confusion ;) https://gymnasium.farama.org/content/migration-guide/

I'm also still curious to hear why the current implementation is the most intuitive?

(Despite counter-examples (code written with gym 0.26):

(sidenote: I also cannot find the doc I quoted before anymore)

EDIT: for backward compatiblity and also to give more info, info["TimeLimit.truncated"] should be set anyway (so we know truncation is due to timelimit and not something else)

RedTachyon commented 1 year ago

so it's still time to keep previous behavior and prevent confusion ;)

Can we be adults and skip the passive aggressive comments? This decision was discussed a bunch before we made it, and a clever comeback is like the least likely thing to change it.

For me ultimately the argument is that this approach gives strictly more information, with the downside being that it requires library maintainers to actually maintain their libraries. Keeping an old design randomly made by OpenAI just to keep things the same is a terrible idea in the long run.

And of course this should have been explicitly stated in the docs, I'm sure making a PR would have taken less time than this thread, so if you want to contribute that, the PRs are always open, otherwise I can get to it when I sit down with a computer tomorrow or so.