Open araffin opened 2 years ago
This was an intentional change that we should have mentioned in the release notes more. We did this as there is no reason why to restrict the truncation (in the general case) to only be True when terminated is False.
I could see us modify this assumption in the TimeLimit
wrapper but I don't see why other than backward compatibility which was due to the limitations of the previous implementation
I would disagree on the semantic of it.
If an environment is already terminated due to some condition, it is not over because of truncation, you should not bootstrap, so truncated should be False
.
See comment by @XuehaiPan in https://github.com/openai/gym/issues/2510#issuecomment-1070027709 "truncated (the sampling is ended due to time limit): if True, stop further step() call due to time limit, the next observation should be valid (and done=True)"
The current TimeLimit implementation seems also to be inconsistent with https://github.com/openai/gym/blob/master/gym/utils/step_api_compatibility.py
EDIT: I agree terminated and truncated can be true, but here we know that termination is not due to truncation, so I think it should be False
I think a comment has gone missing in the compatibility file which explains I think the issue
As the old API only allowed 3 states, not allowing (termination=True
and truncation=True
) then in the conversion we favor termination over truncation if both are true.
However, termination and truncation refer to different reasons that an episode might have ended.
If users need to use bootstrapping or not, then users should just use terminated
and ignore the truncation result.
Is there a separate issue to the backward compatibility?
We did this as there is no reason why to restrict the truncation (in the general case) to only be True when terminated is False.
who is "we"? You seem to disagree with @vwxyzjn (please read https://github.com/openai/gym/issues/2510#issuecomment-1092214830) who disagrees with @XuehaiPan (see https://github.com/openai/gym/issues/2510#issuecomment-1070027709)
This API change was meant to makes things clearer...
For me a clear option (that would also match previous behavior) would be:
terminated=True
-> you must reset the env (truncated is included in the reasons for termination)truncated=True
-> episode is terminated but you can bootstrapHowever, termination and truncation refer to different reasons that an episode might have ended.
In the case here of the time limit, if an episode is already terminated, then you should not set truncated=True
because it is already terminated (termination is not due to truncation), marking it truncated=True
would mean you could bootstrap, which is not the case here.
And because the episode is already terminated, the reason for termination is not the timelimit, so truncated should be False.
For the old API, the variable done = terminated or truncated
. The TimeLimit
wrapper overrides the inner environment's done
output:
env.done=False |
env.done=True |
|
---|---|---|
truncated=False |
valid next obs wrapper.done=False |
invalid next obs wrapper.done=True |
truncated=True |
valid next obs wrapper.done=True |
N/A |
Now, in the new API, terminated
is a new separate variable. So, we do not need to flip it in TimeLimit
anymore.
terminated=False |
terminated=True |
|
---|---|---|
truncated=False |
valid next obs | invalid next obs, need reset (gamma=0) |
truncated=True |
valid next obs (reset is optional) | invalid next obs, need reset (gamma=0) |
terminated
means: the next observation is invalid, no more step()
should be called, and requires a reset()
call.truncated
means: hint time limit, whether the next observation is valid depends on the terminated
value.I think terminated
has a higher priority than truncated
, so we should always check it in if
statement before truncated
.
For API conversion:
old new
done = terminated or truncated
truncated = not terminated and truncated
The "we" is myself, @RedTachyon and @arjun-kg who discussed this on discord.
This API change was meant to makes things clearer...
I agree this was the aim, but it seems that clarifying the meaning of the variables has caused confusion that I suspect already existed but did not have a place to show itself.
And because the episode is already terminated, the reason for termination is not the timelimit, so truncated should be False.
This seems to confuse the meanings of terminated
and truncated
as I believe that terminated
means that a condition internal to the environment has ended an episode whereas truncated
means a condition outside of the environment results in the episode ends. Therefore, I agree that it can be confusing for the episode to end with two results but they are for separate reasons that shouldn't interact with each other.
As @XuehaiPan notes above, I don't believe this a bug. It does break backward compatibility and requires training libraries from correctly only using terminated
in their bootstraping (they shouldn't need to think about the truncation
beyond stepping through the environments so this shouldn't be an issue imo). Therefore, I am 50\50 on if we should change the current implementation.
My perspective is just that by allowing truncated==terminated==True
we get strictly more information. If the terminal state happens to be reached at the same time as truncation would have occurred, we get the full picture with this semantic. The alternative is merging this outcome with regular termination, which I'll agree is probably good enough for a vast majority of uses. But having more information is generally preferable over actively removing some of that information.
The variable truncated means: hint time limit, reset is optional
truncation is not only due to timelimit, see below:
This seems to confuse the meanings of terminated and truncated as I believe that terminated means that a condition internal to the environment has ended an episode whereas truncated means a condition outside of the environment results in the episode ends.
I would disagree with that.
For instance, in the car racing example, truncated=True
after finishing a lap, this is internal to the env.
The same goes for a robot that goes outside the tracking limits.
If the terminal state happens to be reached at the same time as truncation would have occurred, we get the full picture with this semantic.
but the truncation did not occur... so truncation should be False. If you follow the same logic, you would output at every timestep each termination condition that may have occurred? (and if you want additional/debug info, that what the info dict is meant for)
You are right that internal and external to the environment reasoning is incorrect, I think internal or external to the agent's observation would be more correct.
Thinking about this more, I understand the disagreement to be if the termination and truncation states are logically / causally linked.
My understanding is that you think that if when a state transitions, it reaches a termination state and the episode ends, therefore never reaches the truncation state making terminated=True
and truncated=False
.
An alternative interpretation is that the state transitions to a new state and step
returns if this new state is a terminal state and if this state is a truncated state making terminated=True
and truncated=True
. With this interpretation, the terminated and truncated conditions are unlinked from each other meaning that either does not logically affect the other.
@araffin Is there any issue with the second interpretation? Or why do you favour your interpretation?
My understanding is that you think that if when a state transitions, it reaches a termination state and the episode ends, therefore never reaches the truncation state making terminated=True and truncated=False.
exactly
Or why do you favour your interpretation?
One reason is consistency, gym has behaved like that in the last 5 years and that would prevent surprises (this would break SB3 code silently for instance).
Another reason is the semantic of the timelimit wrapper. We use timelimit wrappers for different reasons: 1) prevent the agent from exploring uninteresting regions (for instance termination is normally due to reaching a goal), linked with 2): 2) improve the exploration of the agent (by exploring more reset states) 3) avoid infinite length for episode (for instance you usually train after each episode on a real robot) 4) ease comparison/evaluation of the agent (without timelimit, it's harder to compare and may take much longer, also linked with 2) and 3) where you want to check that your agent performs well for many starting states)
If you reach termination before the time limit, then none of 1), 2), 3) or 4) need a TimeLimit wrapper, you don't need to know if the timeout may have been reached (and timeout is an artificial way of terminating an episode) and that would also be confusing in that case (truncated=True
but the episode didn't end because of that).
Note that I'm fine with terminated=True
and truncated=True
in other cases (for instance finishing a lap with the racing car).
I would prefer for consistency and for simplicity to have terminated=truncated or terminated
, if not then I would disallow terminated=True
and truncated=True
.
One reason is consistency, gym has behaved like that in the last 5 years and that would prevent surprises (this would break SB3 code silently for instance).
This is true but the limitation was solely due to the implementation forced only termination or truncation to be true. With our new API, this limitation no longer exists so I don't think we should limit ourselves just because the old API did.
Additionally, how is this causing issues with SB3? Training algorithms should be ignoring the truncated variable beyond stepping through an environment, therefore, should not cause issues in the training stage.
Could you explain why you are happy with terminated=True
and truncated=True
in other cases but not with the Time Limit wrapper? This seems contradictory.
I would prefer for consistency and for simplicity to have
terminated=truncated or terminated
, if not then I would disallowterminated=True
andtruncated=True
.
The first option is just the old API and I still don't understand limiting the states for good theoretical or practical reasons.
With our new API, this limitation no longer exists so I don't think we should limit ourselves just because the old API did.
A software with more features doesn't mean better software. We should aim for the most intuitive/user friendly API, especially as gym is used by practitioners with various level of experience, and avoid silent logic change with the previous API (old API doesn't mean bad API). Gym is not new, having subtle changes like that will just bring confusion.
I also don't understand where you want to use that additional information? (as it won't be useful for resetting the agent, nor for training it).
If you want to have additional information, then the previous and current API provides you with the info dict.
In that case, you can add a info["time_limit_reached"]=True
if you want that information. So I don't see it limited.
To answer the rest of your questions, let's talk about the intuitive API, in my opinion, there are two possibles intuitive use of terminated and truncated:
option 1
terminated
is only about real termination and truncated
is when the episode end is due to truncation.
Bootstrapping is done if not terminated
is true and conversion of code is:
done = terminated or truncated
, info["TimeLimit.truncated"]=truncated
With this option, terminated=truncated=True
is not allowed.
This is the most intuitive explanation already used with the new API in some codes, for instance: https://github.com/perrin-isir/xpag/blob/main/xpag/wrappers/brax_vec_env.py#L136-L141
option 2
terminated
is about when an episode ends and truncated
is when the episode end is due to truncation.
Bootstrapping is done if not terminated or truncated
is true and conversion of code is:
done = terminated
, info["TimeLimit.truncated"]=truncated
With this option, terminated=truncated=True
is allowed, but only when terminated
signal is due to truncated=True
(not the case here).
With what you propose, truncated=True
but the episode truncation did not play a role... so not intuituve and breaking behavior (this behavior breaks SB3 code which relies on info["TimeLimit.truncated"]=truncated
for bootstrapping).
option 2
terminated
is about when an episode ends andtruncated
is when the episode end is due to truncation. Bootstrapping is done ifnot terminated or truncated
is true and conversion of code is:done = terminated
,info["TimeLimit.truncated"]=truncated
With this option,
terminated=truncated=True
is allowed, but only whenterminated
signal is due totruncated=True
(not the case here).
@araffin As posted in https://github.com/openai/gym/issues/3102#issuecomment-1265196682, the correct conversion is:
done = terminated or truncated
info["TimeLimit.truncated"] = not terminated and truncated
This is the current implementation in the compat method yes, but:
[1] most importantly, the argument that it is done that way because we get more information is weak if this information is not used (and for debug, there is the info dict)
Sorry, I forgot about this thread.
The problem is partially language I think because in SB3, if to run bootstrapping is called timeout
inferring that if a timeout occurs then bootstrap should not happen. However, with the new API, this intuition as you note is incorrect. I would suggest this variable should be renamed to bootstrap = terminated and not truncated
which makes more sense I believe.
An additional thought is that all training libraries should be compatible with dealing with this case (terminated=True
and truncated=True
) due to environments possibly implementing internal truncation this way.
let me try to rephrase...
You show below in the documentation that this case of both being true is intended
the doc is consistent with the code but there is no explicit migration guide and mention of the breaking change. And even if it's the current behavior, I still disagree and argue for a change for all the reasons I mentioned above.
For the second documentation quote, the episode ends for two reasons so the documentation seems true to.
here, and that's mainly what this issue is about, I disagree with the logic.
"the episode ends because of timeout" is not true if terminated=True
, the time limit was reached yes, but if you remove the time limit, the episode still ends, so you cannot say it ends because of the timelimit, and for me, you should not say that the episode is truncated, because it is not (it has reached the state of time limit but is not truncated).
The problem is partially language
I also think it's partially language but mostly logic for me (for instance saying "timelimit was reached" instead of "episode ends because of timelimit" would partially solve this issue).
An additional thought is that all training libraries should be compatible with dealing with this case (terminated=True and truncated=True) due to environments possibly implementing internal truncation this way.
this goes directly against of what you propose, no?
what you describe here is option 2 and if we follow the current documentation, if terminated=truncated=True
for internal truncation, then the episode is not considered truncated (in the sense that we will not bootstrap and truncated=True
will be ignored), which defeats the point of having two booleans...
EDIT: another instance that shows the current allowed states is not the most intuitive: https://github.com/sail-sg/envpool/pull/205/files#diff-ce18b2383ca873c94323ec7719ea561774f6e64c044e7874b2f47f3746351deaR89 (terminated and truncated are exclusive)
@pseudo-rnd-thoughts gym 0.26 has been released for almost two months already and this ~bug~ breaking change is still not mentioned in the migration guide, so it's still time to keep previous behavior and prevent confusion ;) https://gymnasium.farama.org/content/migration-guide/
I'm also still curious to hear why the current implementation is the most intuitive?
(Despite counter-examples (code written with gym 0.26):
(sidenote: I also cannot find the doc I quoted before anymore)
EDIT: for backward compatiblity and also to give more info, info["TimeLimit.truncated"]
should be set anyway (so we know truncation is due to timelimit and not something else)
so it's still time to keep previous behavior and prevent confusion ;)
Can we be adults and skip the passive aggressive comments? This decision was discussed a bunch before we made it, and a clever comeback is like the least likely thing to change it.
For me ultimately the argument is that this approach gives strictly more information, with the downside being that it requires library maintainers to actually maintain their libraries. Keeping an old design randomly made by OpenAI just to keep things the same is a terrible idea in the long run.
And of course this should have been explicitly stated in the docs, I'm sure making a PR would have taken less time than this thread, so if you want to contribute that, the PRs are always open, otherwise I can get to it when I sit down with a computer tomorrow or so.
Describe the bug Before gym 0.26: https://github.com/openai/gym/blob/3498617bf031538a808b75b932f4ed2c11896a3e/gym/wrappers/time_limit.py#L51-L56
After: https://github.com/openai/gym/blob/54b406b7991e49ab917ed355d12971009e8248b7/gym/wrappers/time_limit.py#L53-L54
It should be
truncated = truncated or not terminated
to match previous behavior.test was also updated in https://github.com/openai/gym/commit/54b406b7991e49ab917ed355d12971009e8248b7#diff-15620e7669a4869759c87e2b6a4f51ad87609ed2680a54047e2da2ff7fcc4149
If the change is intended, then it should also be mentioned in the release.