(For reference, here's the link to how the dm_env API handles termination and truncation.)
If I recall correctly, the one class responsible for truncation is the TimeLimit wrapper, so all envs would return truncated=False all the time, which seems a bit weird... (the truncation is defined when registering the environment). One other solution, which in fact is mandatory for me, is good documentation with an explicit warning.
Documentation should always come before breaking changes in my opinion.
I am in favor of this breaking change, option 1. This is a pretty fundamental issue that should be made explicit. The truncated=False is not explicit enough in my opinion.
Wouldn't only envs that manually apply the TimeLimit wrapper actually return the correct truncated value then? For example, in ALE envs where time limits are handled internally, this might be a bit confusing, since truncated would only be set if triggered by the wrapper, not by ALE internal time limits (and even more confusing in -v4 envs that have different limits per game).
So there's a general problem with the ALE environments in that v0 and v4 don't implement correct (or reasonable) defaults, making properly reproducing results by the ALE group or DeepMind impossible. They also preclude using correct preprocessing schemes. The plan is to remove these accordingly, though you're right that removing them before making this change may be required, which I had not realized. In the case of strictly reproducing old code, people could use old versions of Gym.
This change could obviously be baked into any environment without the time limit wrapper.
I also think that the correct move here is to include truncation via an argument before 1.0 and make it mandatory at 1.0. That would provide the easiest transition and the most reasonable behavior.
My 2 cents: I would be for option 3 - Turn done into a custom three-state variable, or give it a method like done.is_truncated. Specifically an enumeration, which hopefully can be backwards compatible:
```python
from enum import Enum

class Done(Enum):
    not_done = 0
    terminated = 1
    truncated = 2
```
I find that a separate bool for truncated is not worth it because 1) it's always False if done is False, and 2) it's not succinct. The argument that it might be non-obvious to beginners doesn't outweigh that for me, because I'd like a library that can be beautiful once I understand it. Think about all the cases when you don't want to differentiate terminated / truncated: you will always have to unpack it properly at the very least.
I prefer option 3 (Turn done into a custom three-state variable) too. It's easy to keep backward compatibility with integers.

```python
NOT_DONE = 0
TERMINATED = 1
TRUNCATED = 2
```

- Why integers rather than a new class with property done.is_truncated: it's easy to convert integers to NumPy arrays or tensors.
- Why integers in range (0, 1, 2): only minimal migration effort is needed, which is shown below.

For testing whether we need to start a new episode:
```python
bool(NOT_DONE)    # -> False
bool(TERMINATED)  # -> True
bool(TRUNCATED)   # -> True
```
The following snippet works fine for both boolean-based done and integer-based done.
```python
observation = env.reset()
done = NOT_DONE
while not done:
    action = policy(observation)
    observation, reward, done, info = env.step(action)
```
For value iteration in training (e.g. DQN):
```python
abs(1 - NOT_DONE)    # -> 1
abs(1 - TERMINATED)  # -> 0
abs(1 - TRUNCATED)   # -> 1
```
Formerly, the target value is calculated by:

```python
Q_target = reward + gamma * (1 - done) * max(Q(next_observation, *))
```

With the three-state variable, we only need to add the abs operator:

```python
Q_target = reward + gamma * abs(1 - done) * max(Q(next_observation, *))
```
The operator abs can be applied to NumPy arrays or tensors too.
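As a quick, self-contained check of this trick (a sketch; the rewards, gamma and next_q_max names in the final comment are placeholders, not part of any API):

```python
import numpy as np

# Placeholder batch of three-state dones, using the constants proposed above.
NOT_DONE, TERMINATED, TRUNCATED = 0, 1, 2
dones = np.array([NOT_DONE, TERMINATED, TRUNCATED])

bootstrap_mask = np.abs(1 - dones)  # -> array([1, 0, 1]): bootstrap unless terminated
# Q_target = rewards + gamma * bootstrap_mask * next_q_max
```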
@XuehaiPan thanks for including the snippet. Option 3 then appears very convenient engineering-wise. Your explainer is very well-written, and I'd suggest adding it to the documentation and possibly the blog post that explains the new gym version.
@XuehaiPan
Sorry for the delayed reply. I spoke to @vwxyzjn and @RedTachyon about this offline at length. Neither of them had a very strong opinion between [0, 1, 2] or two booleans, but after the discussion I would very much prefer if we used two booleans.
This is for the following reasons:

- Two booleans is a more explicit API, and is instantly understandable for beginners from a code snippet instead of having to read the API docs.
- Two booleans results in more explicit code. I think that giving users the ability to write things like your abs(1 - done) is a bug, not a feature, for a general-purpose API that many people will be using.
- I think that two booleans will result in a more idiot-proof upgrading of old environments to the new API instead of [0, 1, 2], but this is a more minor concern.
This project has 26k stars and an untold greater number of users. I wouldn't underestimate how big of a hassle changing the return signature of a core method like env.step() is.
@XuehaiPan's suggestion adds the new behavior without requiring a single line to be changed in any existing user code. Personally I'm not a big fan of the enum consts and would prefer done to be an Optional[DoneType] which could have attributes, like:

```python
_, _, done, _ = env.step(_)
if done and done.truncated:
    ...
    print(done.reason)
```
Cheers, Chris
@ChrisCummins Indeed, the enum constants confuse users, especially beginners. They may need more detailed documentation and examples. Including a DQN implementation tutorial in the docs may be even better.
In https://github.com/openai/gym/issues/2510#issuecomment-1035960224:

I prefer option 3 (Turn done into a custom three-state variable) too. It's easy to keep backward compatibility with integers.

```python
NOT_DONE = 0
TERMINATED = 1
TRUNCATED = 2
```

- Why integers rather than a new class with property done.is_truncated: it's easy to convert integers to NumPy arrays or tensors.
- Why integers in range (0, 1, 2): only minimal migration effort is needed, which is shown below.

I choose int (not even Enum) for compatibility with NumPy arrays and tensors.
```python
dones = [ %%% list of DoneType %%% ]
dones = np.array(dones)
dones.dtype  # -> np.dtype('O')
```
```python
In [1]: from enum import Enum

In [2]: class DoneType(Enum):
   ...:     NOT_DONE = 0
   ...:     DONE = TERMINATED = 1
   ...:     TRUNCATED = 2
   ...:
   ...:     def __bool__(self):
   ...:         return bool(self.value)
   ...:
   ...:     def __int__(self):
   ...:         return int(self.value)
   ...:
   ...:     def __float__(self):
   ...:         return float(self.value)

In [3]: import numpy as np

In [4]: dones = np.array([DoneType.NOT_DONE, DoneType.DONE])

In [5]: dones
Out[5]: array([<DoneType.NOT_DONE: 0>, <DoneType.DONE: 1>], dtype=object)

In [6]: dones.dtype
Out[6]: dtype('O')

In [7]: dones.astype(int)
Out[7]: array([0, 1])
```
To support automatic type casting with NumPy, we should make DoneType inherit from int (Python does not allow subclassing bool):
```python
class DoneType(int):
    def __new__(cls, done, truncated=False):
        done = super().__new__(cls, bool(done))
        done.truncated = truncated
        return done

    @property
    def truncated(self):
        return self.__truncated

    @truncated.setter
    def truncated(self, value):
        self.__truncated = bool(value)

    def __str__(self):
        return f"DoneType({bool(self)}, truncated={self.truncated})"

    __repr__ = __str__

    def episode_ended(self):
        # The episode is over if it terminated (the underlying int is truthy) or was truncated.
        return bool(self) or self.truncated
```
```python
In [1]: import numpy as np

In [2]: class DoneType(int):
   ...:     ...

In [3]: done = DoneType(True)

In [4]: done
Out[4]: DoneType(True, truncated=False)

In [5]: done.truncated = True

In [6]: done
Out[6]: DoneType(True, truncated=True)

In [7]: dones = [
   ...:     DoneType(False, truncated=False),  # not done
   ...:     DoneType(True, truncated=False),   # terminated
   ...:     DoneType(False, truncated=True),   # not done but truncated
   ...:     DoneType(True, truncated=True),    # done and truncated at the same time
   ...: ]

In [8]: np.array(dones)
Out[8]: array([0, 1, 0, 1])

In [9]: np.array([done.truncated for done in dones])
Out[9]: array([False, False,  True,  True])
```
The problem is how to tensorize np.array([done.truncated for done in dones]), and how to handle the AttributeError for old environments.
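One possible way to handle old environments that still return plain bools would be to fall back to a default when the attribute is missing (a sketch, not an existing API; dones is the list from the snippet above):

```python
import numpy as np

# Plain bools from old environments have no `.truncated` attribute; treat them as not truncated.
truncated_flags = np.array([getattr(done, "truncated", False) for done in dones])
```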
but after the discussion I would very much prefer if we used two booleans.
This is for the following reasons:

- Two booleans is a more explicit API, and is instantly understandable for beginners from a code snippet instead of having to read the API docs.
- Two booleans results in more explicit code. I think that giving users the ability to write things like your abs(1 - done) is a bug, not a feature, for a general-purpose API that many people will be using.
- I think that two booleans will result in a more idiot-proof upgrading of old environments to the new API instead of [0, 1, 2], but this is a more minor concern.
@jkterry1 As shown in the third point, it's not backward compatible. I would rather leave step() returning the tuple (next_obs, reward, done, info).
I have another proposal (in addition to options 1-3 in https://github.com/openai/gym/issues/2510#issue-1075178315): add a new method ended() to gym.Env, and separate done from rollout sampling control.

Old API:
```python
obs = env.reset()
while True:
    action = policy(obs)
    obs, reward, done, info = env.step(action)
    if done:
        break
```
New API:
```python
obs = env.reset()
while True:
    action = policy(obs)
    obs, reward, done, info = env.step(action)
    if env.ended():
        break  # `done` may be `True` when the episode is truncated
```
We only need to add truncation behavior to TimeLimit.ended() by step counting.
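A rough sketch of what such an ended() method on a time-limit wrapper could look like (the class and attribute names below are assumptions for illustration, not gym's actual TimeLimit implementation):

```python
import gym

class TimeLimitWithEnded(gym.Wrapper):
    """Illustrative sketch: truncation exposed via an ended() method using step counting."""

    def __init__(self, env, max_episode_steps):
        super().__init__(env)
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0
        self._last_done = False

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        self._last_done = False
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        self._last_done = done
        return obs, reward, done, info

    def ended(self):
        # The rollout should stop on termination or once the step budget is exhausted.
        return self._last_done or self._elapsed_steps >= self._max_episode_steps
```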
cc @vwxyzjn @RedTachyon.
Hey, I spoke to Costa about this offline a bit again. Your core concern, along with @ChrisCummins's, appears to be about the breaking nature of this change. Costa shared some of these concerns, and if he does then clarifying further is clearly prudent.
To clarify, my intention would be that, until the 1.0 release, there would be a truncated argument to step that would be false by default but issue a warning while false, allowing code to upgrade. In the 1.0 release, along with the other breaking changes announced in the roadmap, this would just switch to always being on without an option to turn it off. I also think that the amount of changes required to adapt code to my proposed scheme would be fairly small (e.g. you'd just need to add a `, _` to the output of step and use your old code like before).
I also think that making a breaking change to do this incredibly important feature right in a permanent manner in 1.0 is more important than retaining backwards compatibility, especially given that Gym will almost certainly be used a ton more in the next 5 years than it has in the past 5.
Regarding how to do this if we're doing a breaking change, I don't think that adding the truncation method to the API is as clean or natural as the original options 1-3.
I also think that the amount of changes required to adapt code to my proposed scheme would be fairly small (e.g. you'd just need to add a , _ to the output of step and use your old code like before).
The above snippet is:

```python
observation, reward, done, _, info = env.step(action)
```

The return value of truncated is never used.
Besides, the main concern for me is that we never use truncated in training. We don't even need to save it to the replay buffer.
```python
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))  # `truncated` is not added
    obs = next_obs
    if done or truncated:
        break
```
For me, I would support 3 instead of 1:

- We never use truncated during training; it only affects sampling.
- With option 1, unpacking obs, reward, done, _, info means we ignore and never use truncated at all.

Either an enum or a new type DoneType is fine for me if we are not going to add a new method for truncation.
The following snippet supports backward compatibility and NumPy array conversion.
```python
class DoneType(int):
    def __new__(cls, done, truncated=False):
        done = super().__new__(cls, bool(done))
        done.truncated = truncated
        return done

    @property
    def truncated(self):
        return self.__truncated

    @truncated.setter
    def truncated(self, value):
        self.__truncated = bool(value)

    def __str__(self):
        return f"DoneType({bool(int(self))}, truncated={self.truncated})"

    __repr__ = __str__

    def episode_ended(self):
        # Use `int(self)` here: `bool(self)` would recurse once `__bool__` is overridden below.
        return bool(int(self)) or self.truncated

    __bool__ = episode_ended  # this means bool(int(self)) != int(bool(self))
```
```python
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))
    obs = next_obs
    if done:  # here we call `done.__bool__()`
        break
```
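For completeness, here is a hedged sketch of where such a DoneType could be constructed, e.g. in a time-limit style wrapper; the wrapper below is illustrative only (not gym's actual TimeLimit) and reuses the DoneType class defined above:

```python
import gym

class TimeLimitWithDoneType(gym.Wrapper):
    """Illustrative: wraps an old-API env and returns a DoneType instead of a plain bool."""

    def __init__(self, env, max_episode_steps):
        super().__init__(env)
        self._max_episode_steps = max_episode_steps
        self._elapsed_steps = 0

    def reset(self, **kwargs):
        self._elapsed_steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._elapsed_steps += 1
        truncated = not done and self._elapsed_steps >= self._max_episode_steps
        # DoneType(False, truncated=True) is truthy (episode_ended), so `if done: break` still works.
        return obs, reward, DoneType(done, truncated=truncated), info
```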
One thing is not clear to me, independent of the implementation, which is the semantics and the allowed cases (and this should be made clear in the doc, https://github.com/openai/gym/issues/2510#issuecomment-989734977):

- Is done=False and truncated=True possible?
- Is done=True and truncated=True possible?

If done=False and truncated=True is allowed, then the user will have to check both done and truncated to know if he/she should reset the env? And what is the semantic? The episode is not terminated but in fact it is truncated?

If done=True and truncated=True is allowed, what is the semantic? The episode is terminated because of truncation? Or the episode is terminated because of some reason and is also truncated?
@araffin
If done=False and truncated=True is allowed, then the user will have to check both done and truncated to know if he/she should reset the env? And what is the semantic? The episode is not terminated but in fact it is truncated?

Yes, done=False and truncated=True is allowed.
In training, the episode can be truncated by the horizon limit. Imagine an environment that has 10,000 steps for each episode (e.g. PendulumEnv without TimeLimit). Truncation means that the sampling has ended while the episode has not.
In testing or evaluation, the TimeLimit should be removed, and we always have truncated=False.
If done=True and truncated=True is allowed, what is the semantic? The episode is terminated because of truncation? Or the episode is terminated because of some reason and is also truncated?
The episode is terminated for some reason, and it also reaches the horizon limit and is truncated. We actually don't care about the value of truncated when done=True. That's why we can distinguish termination vs. truncation with only a 3-state enum instead of 4 states.
My preferred semantic would be for done=False and truncated=True to not be allowed. If an environment is truncated this is obviously a kind of done, and no information is lost via this that I'm aware of, and this would make retaining handling compatibility significantly easier.
My preferred semantic would be for done=False and truncated=True to not be allowed. If an environment is truncated this is obviously a kind of done, and no information is lost via this that I'm aware of
@jkterry1 I think the semantics of done mean something different to me.
Do you mean:

- done (the sampling should be ended): if True, stop further step() calls, whether or not the next observation is valid.
- truncated (the sampling is ended due to the time limit): if True, stop further step() calls due to the time limit; the next observation should be valid (and done=True).
Whether the next observation is valid or not:
|                 | done=False | done=True         |
|-----------------|------------|-------------------|
| truncated=False | valid      | invalid (gamma=0) |
| truncated=True  | N/A        | valid             |
```python
# Sampling phase
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, truncated, info))
    if done:
        break
    obs = next_obs

####################

# Training phase
if truncated or not done:
    target_value = reward + gamma * value_func(next_obs)
else:
    target_value = reward

value_loss = 0.5 * square(value_func(obs) - target_value)
```
This requires adding truncated to the replay buffer, which is not a trivial migration for downstream RL learning frameworks. Both sampling and training code will need to be updated.
```diff
 # Sampling phase
 obs = env.reset()
 while True:
     action = policy(obs)
-    next_obs, reward, done, info = env.step(action)
-    buffer.add((obs, next_obs, reward, done, info))
+    next_obs, reward, done, truncated, info = env.step(action)
+    buffer.add((obs, next_obs, reward, done, truncated, info))
     if done:
         break
     obs = next_obs

 ####################

 # Training phase
-if not done:
+if truncated or not done:
     target_value = reward + gamma * value_func(next_obs)
 else:
     target_value = reward

 value_loss = 0.5 * square(value_func(obs) - target_value)
```
For me, done=False and truncated=True can be allowed, and that is the most common case when truncated=True.

- done (i.e. terminated) (the episode is ended and the sampling should be ended): if True, the next observation is invalid; stop further step() calls.
- truncated (the sampling is ended): if True, the next observation can be valid (when done=False); stop further step() calls due to the time limit.
Whether the next observation is valid or not:
|                 | done=False | done=True         |
|-----------------|------------|-------------------|
| truncated=False | valid      | invalid (gamma=0) |
| truncated=True  | valid      | invalid (gamma=0) |
```python
# Sampling phase
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))
    if done or truncated:
        break
    obs = next_obs

####################

# Training phase
if not done:
    target_value = reward + gamma * value_func(next_obs)
else:
    target_value = reward

value_loss = 0.5 * square(value_func(obs) - target_value)
```
The only change is in the sampling phase.
```diff
 # Sampling phase
 obs = env.reset()
 while True:
     action = policy(obs)
-    next_obs, reward, done, info = env.step(action)
+    next_obs, reward, done, truncated, info = env.step(action)
     buffer.add((obs, next_obs, reward, done, info))
-    if done:
+    if done or truncated:
         break
     obs = next_obs
```
And we can implement the tricky DoneType in https://github.com/openai/gym/issues/2510#issuecomment-1066522592 without any changes in the downstream codebase.
@XuehaiPan just a quick thought regarding your previous comment,
This requires adding truncated to the replay buffer, which is not a trivial migration for downstream RL learning frameworks. Both sampling and training code will need to be updated.
In this case can we not add done and not truncated as one variable to the replay buffer, and no other changes would be necessary? This is in the situation where done means any kind of done.
In this case can we not add done and not truncated as one variable to the replay buffer, and no other changes would be necessary?
@arjun-kg You'd better not do that. That makes done have different meanings in sampling and training:

done_in_train = done_in_sample and not truncated

which makes the code harder to understand and causes trouble in future maintenance.
The meaning of truncated is not ambiguous, right? For me, both semantics of done make sense:

1. Indicator to stop the episode in the sampling phase (done):
   - done=True -> no more env.step() calls
   - done=False -> further env.step() calls
   Pros: easy to understand for RL beginners.
   Cons: more migration work in downstream codebases; confuses people dealing with gamma in training.

2. Indicator of whether next_obs is valid (terminated):
   - done=True -> gamma == 0
   - done=False -> gamma != 0
   Pros: more straightforward for writing training code and less migration effort.
   Cons: needs more documentation, maybe example code.

IMO, I prefer the second one. We should reach an agreement before submitting a PR to implement it. Any thoughts?
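To make the difference concrete, here is a sketch of the one-step TD target under each semantic (function and variable names are illustrative placeholders, not part of gym):

```python
def td_target_semantic_1(reward, gamma, next_value, done, truncated):
    # Semantic 1: `done` means "stop sampling"; bootstrap unless the episode truly terminated.
    return reward + gamma * float(truncated or not done) * next_value

def td_target_semantic_2(reward, gamma, next_value, done):
    # Semantic 2: `done` means "terminated"; bootstrap whenever not terminated
    # (`truncated` is not needed in the training code at all).
    return reward + gamma * float(not done) * next_value
```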
One thing is not clear to me, independent of the implementation, which is the semantics and allowed cases
I think the semantics of done mean something different to me.
Any thoughts?
The answers just confirmed what I was afraid of: this API change brings confusion even for the ones implementing the feature :/ (which means even more confusion for newcomers, in addition to a major breaking change).
In my opinion, a better alternative is to use the current API and make use of the infos dict to output the termination reason as a good practice (this should be highlighted in the doc), so we don't cover only termination due to timeouts and the semantics are clear:

- done=True means the end of an episode, be it because of timeout or not (as before)
- infos["termination.truncated_episode"] = True (we can probably find a better name than the current TimeLimit.truncated) means that the episode was terminated due to timeout (or something else that should not be treated as a real termination, for instance reaching tracker limits on a real robot)
- infos["termination.termination_condition_name"] = True may be used to tell the user the termination was due to something else (in fact, several conditions may be true at the same time)

Hey @araffin, that is a good point. I think the new API at this point would be
```python
obs, reward, terminated, truncated, info = env.step(action)
```

So there is no ambiguous done anymore.
Hey @araffin, that is a good point. I think the new API at this point would be obs, reward, terminated, truncated, info = env.step(action). So there is no ambiguous done anymore.
hmm, not sure how renaming done to terminated solves the confusion from https://github.com/openai/gym/issues/2510#issuecomment-1068026884?
Renaming done to terminated makes it more explicit. And we can make the example code like:

```python
obs, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
    obs = env.reset()
```
I think that by just removing done, which has dual meanings, the code becomes a lot more explicit. Would you agree @araffin?

If termination and truncation happen at the same time, I think it will return terminated=True, truncated=False, because truncation ultimately did not happen.
All in all, the possible cases are:

- terminated=False, truncated=False: environment is still ongoing
- terminated=True, truncated=False: environment is terminated
- terminated=False, truncated=True: environment is truncated
- terminated=True, truncated=True: should not occur under this convention (see above)
I think just removing the done which has dual meanings, the code becomes a lot more explicit. Would you agree @araffin?
@vwxyzjn mmh, for me done has only one meaning: the episode is terminated, need to call reset(). The only difference is that an episode can be terminated for many reasons, one of them being the time limit.
Renaming done to terminated makes it more explicit.
If you look at the actual PR https://github.com/openai/gym/pull/2752, you will notice a weird pattern that I described some months ago (here: https://github.com/openai/gym/issues/2510#issuecomment-989734977): all the envs return truncated=False all the time!

I don't think that will help users reading the code.

Either an infos["termination_reason"] = Terminations.TRUNCATED or an infos["termination.truncated"] = True would actually be more explicit and cover a larger range of use cases: I consider outputting the termination reason a good practice, usually much more meaningful than the reward value or the done signal.
Either a infos["termination_reason"] = Terminations.TRUNCATED or infos["termination.truncated"] = True would actually be more explicit and cover a larger range of use cases: I consider outputting the termination reason as a good practice, usually much more meaningful than the reward value or the done signal.
@araffin I think an explicit boolean variable truncated is better than info["termination.truncated"], because it's easier to convert into tensors. It's more deep-RL friendly.

- Converting a batch of dicts into a NumPy array will give np.array([list of infos]).dtype -> np.dtype('O').
- Since we don't have an attribute env.info_space, it's hard to prepare padded batches for RNNs. Actually, a general RL framework cannot handle training with extra data in info; the framework doesn't know the potential keys and the corresponding spaces.

@XuehaiPan
Converting a batch of dict into NumPy array will give np.array([list of infos]).dtype -> np.dtype('O').
well, in that case, you do the same as for a batch of dict observations: you filter and then convert it (see https://github.com/openai/gym/blob/da7b8ae8fc6f5ff97c5f7303c03a26c8b4b3f4e2/gym/vector/utils/numpy_utils.py#L65). Most custom envs by users are single envs (not already vectorized like Isaac Gym), so it doesn't make much difference in the end.
Anyway, from an algorithm point of view, you need to do a for loop on each env, otherwise you might do computation that is not needed (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/on_policy_algorithm.py#L193).
Since we don't have an attribute env.info_space, it's hard to prepare padded batches for RNNs.
I don't know what env.info_space is...
Actually, a general RL framework cannot handle training with extra data in info
why not?
For RNN and on-policy algorithms, you don't need to know about timeouts for padding. For reference, timeout handling for RNNs is done here: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/feat/ppo-lstm/sb3_contrib/common/recurrent/buffers.py#L129
EDIT: looking at your issue on RLlib, it looks like a very specific use case (an additional loss function for RNNs using separate rewards) that very few people would encounter. For such a specific use case, I would recommend forking the library and adapting it to your needs.
@araffin
Converting a batch of dict into NumPy array will give np.array([list of infos]).dtype -> np.dtype('O').
well, in that case, you do the same as for a batch of dict observations: you filter and then convert it (see the link above).
That's exactly what I meant:
Since we don't have an attribute env.info_space, it's hard to prepare padded batches for RNNs.
I don't know what env.info_space is...
You need a spaces.Dict instance to convert dicts into NumPy arrays, right? In addition, the info dict is unordered rather than an OrderedDict.
Actually, a general RL framework cannot handle training with extra data in info
why not?
EDIT: looking at your issue on RLlib, it looks like a very specific use case (an additional loss function for RNNs using separate rewards) that very few people would encounter. For such a specific use case, I would recommend forking the library and adapting it to your needs.
We can save almost anything in the info dict. It's hard for a general RL framework to extract the values of a user-defined key in info and then stack them into a tensor for training.
If we are only talking about an officially supported key "TimeLimit.truncated", then indeed it's the same as a separate boolean truncated. But I think it's not well documented.
BTW, search results for the key TimeLimit.truncated:
If we are only talking about a officially supported key "TimeLimit.truncated", indeed it's the same with a separate boolean truncated. But I think it's not well documented.
@XuehaiPan
I'm glad we arrived at the same conclusion =)!
That's exactly my point since the beginning (https://github.com/openai/gym/issues/2510#issuecomment-989734977):

- the mechanism already exists (the TimeLimit wrapper) and is unofficial (already used by different codebases) but is badly documented (I do mention truncation in our tips and tricks video and associated notebook, but it should also be added in the text)

BTW, search results for the key TimeLimit.truncated:

This is mostly an issue with GitHub search...
(ray doesn't seem to support truncation: https://github.com/ray-project/ray/issues/10093)
You need a spaces.Dict instance to convert dicts into NumPy arrays, right? In addition, the info dict is unordered rather than an OrderedDict.
I don't think so, you can just stack along the batch axis for the given keys (in the truncation case, this is just booleans), see: https://github.com/openai/gym/blob/da7b8ae8fc6f5ff97c5f7303c03a26c8b4b3f4e2/gym/vector/utils/numpy_utils.py#L53. And since Python 3.6, dictionaries preserve insertion order (like an ordered dict). But this is off topic; this issue is about truncation, not about passing arbitrary additional arrays to training. (Also, as mentioned, you need to do a for loop anyway for truncation, unless you want to add some masking, which is more efficient but less readable.)
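As a sketch of that filtering/stacking step for a single boolean key across a batch of per-env info dicts (stack_info_key is a hypothetical helper, not part of gym or SB3):

```python
import numpy as np

def stack_info_key(infos, key, default=False):
    """Stack one info key (e.g. 'TimeLimit.truncated') across a batch of per-env info dicts."""
    return np.array([info.get(key, default) for info in infos])

# e.g. truncations = stack_info_key(infos, "TimeLimit.truncated")
```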
Closing in favor of https://github.com/openai/gym/pull/2752
So there's a general problem with the ALE environments in that v0 and v4 don't implement correct (or reasonable) defaults, making properly reproducing results by the ALE group or DeepMind impossible. They also preclude using correct preprocessing schemes. The plan is to remove these accordingly, though you're right that removing them before making this change may be required, which I had not realized. In the case of strictly reproducing old code, people could use old versions of Gym.
This change could obviously be baked into any environment without the time limit wrapper.
Could you please elaborate on this? I've been asking myself how to reproduce the Atari setting used by DeepMind in their publications, and I've got the impression that <Game>NoFrameskip-v4 together with the following wrappers should be pretty similar (DeepMind does not use interpolation=cv2.INTER_AREA but INTER_LINEAR for downscaling the image to 84x84, but I guess that will not really make a measurable difference):
- AtariPreprocessing (with terminal_on_life_loss=False)
- TransformReward with f(x)=np.sign(x)
- FrameStack with num_stack=4
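As a sketch, the wrapper chain described above might be assembled like this (the game choice is arbitrary and the Atari ROMs are assumed to be installed):

```python
import gym
import numpy as np
from gym.wrappers import AtariPreprocessing, FrameStack, TransformReward

env = gym.make("BreakoutNoFrameskip-v4")
env = AtariPreprocessing(env, terminal_on_life_loss=False)  # grayscale, 84x84, frame skip of 4
env = TransformReward(env, np.sign)                         # clip rewards to {-1, 0, +1}
env = FrameStack(env, num_stack=4)
```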
I know terminal_on_life_loss=True is not what Machado et al. recommend; however, the referenced DQN paper seems to make use of that, as do, I suppose, subsequent publications like those on QRDQN and IQN.
You wrote that with v4 it is impossible to reproduce their results, but I do not see a reason why, as outlined above (the only difference in v4 I see is that the episode limit is 100k, while for DeepMind it should be 108k; v5 seems to fix this). I would be glad if you could point out why my proposal does not replicate their setup.
Btw, in terms of the new -v5 variant, to reproduce their setup I would use frameskip=1 (the wrapper handles this), repeat_action_probability=0 and full_action_space=False in the env, and the wrappers as above.
@jkterry1 After looking at the source, I get the impression that it is rather hard to reproduce the setup with the v5 version due to https://github.com/mgbellemare/Arcade-Learning-Environment/issues/464, since I'm not sure the internal frame skipping of the environment does pooling as done by the frame-skip implementation in the AtariPreprocessing wrapper. So I would rather do frame skipping in the wrapper than in the env directly.
Right now, one of the biggest weaknesses of the Gym API is that done is used for both truncation and termination. The problem is that algorithms in the Q-learning family (and I assume others) depend on the differentiation between a terminal state and a truncated state to function as intended. Right now, Gym solves this for some (but not all) built-in environments via the TimeLimit wrapper inserting this information into infos. No third-party environments I'm aware of use this, and most learning code uses this. This has resulted in a few problems:
This sort of thing, in the opinion of myself and those I've spoken to at OpenAI, warrants a breaking change in the pursuit of a 1.0 release. There are three options for making the breaking change:

1. observation, reward, done/terminated, truncated, info = env.step(action)
3. Turn done into a custom three-state variable, or give it a method like done.is_truncated

My preference is option 1. I think that option 3 is too confusing and wouldn't generally be used because it's not explicit. I also think that option 2 adds additional confusion and complexity over option 1 without really adding any additional utility. I'm creating this issue so that people can discuss it before a PR is made. This feature is intended to be released in the release of gym after the current one.
Regarding breaking changes and backwards compatibility: to adapt old code, you'd essentially just need to add a `, _` to the output of step and use it like before.
In other words, it's my view that the level of breaking this update entails is a minor annoyance, and I think that this is a very important feature. This change will be the second of three planned small changes to the base API before the 1.0 release (the first was to the seed method, the last will be to the render method). I am obviously aware of the profound role Gym plays in the community, however Gym will almost certainly be used a lot more over the next 5 years than the past 5, and I think that a series of minor breaking changes to clean things up now in the pursuit of a proper 1.0 release will be very worth it.
Please offer your thoughts on which of the three options here is best in the comments if you would like (or new options if I'm missing something) and remain civil and inviting to others.