[Proposal] Formal API handling of truncation vs termination

jkterry1 commented 2 years ago

Right now, one of the biggest weaknesses of the Gym API is that Done is used for both truncation and termination. The problem is that algorithms in Q learning family (and I assume others), depend on the differentiation between a terminal state and a truncated state to function as intended. Right now, Gym solves this for some (but not all) environments with infos for built in environments via the time limiting wrapper inserting this information into infos. No third party environments I'm aware of use this, and most learning code uses this. This has resulted in a few problems:

The DQN people are deeply unamused because reimplementations of their algorithms are very frequently done wrong because of this
Many people who are very qualified in RL now have complained about being wildly confused by this while first learning RL
Many large institutions (e.g. some large groups at Google brain) refuse to use Gym almost entirely over this design issue, which is bad

This sort of thing in the opinion of myself and those I've spoken to at OpenAI warrants a breaking change in the pursuit of a 1.0 release. There are three options for making the breaking change:

Add have step return an extra boolean value in addition to done, e.g. something like observation, reward, done/terminated, truncated, info = env.step(action)
That option, but return the discount value like dm_env does instead of a boolean
Turn done into a custom three state variable, or give it a method like done.is_truncated

My preference is option 1. I think that option 3 is too confusing and wouldn't generally be used because it's not explicit. I also think that option 2 adds additional confusion and complexity over option 1 without really adding any additional utility. I'm creating this issue so that people can discuss it before a PR is made. This feature is intended to be released in the release of gym after the current.

Regarding breaking changes and backwards compatibility:

Old style environments can be made compliant given very minor modifications and a wrapper that we'll include in gym.wrappers that can do most of the conversion work
Learning code can be compliant with the new API simply via an extra _,

In other words, it's my view that the level of breaking this update will be is a minor annoyance, and I think that this is a very important feature. This change will be the second of three planned small changes to the base API before the 1.0 release (the first was to the seed method, the last will be to the render method). I am obviously aware of the profound role Gym plays in the community, however Gym will almost certainly be used a lot more over the next 5 years than the past 5, and I think that a series of minor breaking changes to clean things up now in the pursuit of a proper 1.0 release will be very worth it.

Please offer your thoughts on which of the three options here is best in the comments if you would like (or new options if I'm missing something) and remain civil and inviting to others.

Rohan138 commented 2 years ago

(For reference, here's the link to the how the dm_env API handles termination and truncation)

araffin commented 2 years ago

If I recall, the one class responsible for the truncation is the TimeLimit wrapper, so that all envs would return truncated=False all the time which seems a bit weird... (the truncation is defined when registering the environment). One other solution, which in fact is mandatory for me, is good documentation with explicit warning. Documentation should always come before breaking changes in my opinion.

vwxyzjn commented 2 years ago

I am in favor of this breaking change option 1. This is a pretty fundamental issue that should be made explicit. The truncated=False is not explicit enough in my opinion.

schmidtdominik commented 2 years ago

Wouldn't only envs that manually apply the TimeLimit wrapper actually return the correct truncated value then? For example, in ALE envs where time limits are handled internally, this might be a bit confusing, since truncated would only be set if triggered by the wrapper, not by ALE internal time limits. (and even more confusing in -v4 envs that have different limits per game)

jkterry1 commented 2 years ago

So there's a general problem with the ALE environments in that v0 and v4 don't implement correct (or reasonable), defaults, making properly reproducing results by the ALE group or DeepMind impossible. They also preclude using correct preprocessing schemes. The plan is to remove these accordingly, though you're right that removing them before making this change may be required, which I had not realized. In the case of strictly reproducing old code, people could use old versions of Gym.

This change could obviously be baked into any environment without the time limit wrapper.

jkterry1 commented 2 years ago

I also think that the correct move here is to include truncation via argument before 1.0 and make it mandatory at 1.0. That would provide the easiest transition and most reasonable behavior.

ikamensh commented 2 years ago

My 2 cents: I would be for option 3 - Turn done into a custom three state variable, or give it a method like done.is_truncated. Specifically an enumeration, which hopefully can be backwards compatible:

class Done(Enum):
  not_done = 0
  terminated = 1
  truncated = 2

I find that separate bool for truncated is not worth it because 1) it's always False if done is False, 2) it's not succinct. The argument that it might be non-obvious to beginners is not outweighing it to me because I'd like a library that can be beautiful once I understand it. Think about all the cases when you don't want to differentiate terminated / truncated, you will always have to unpack it properly at the very least.

XuehaiPan commented 2 years ago

I prefer option 3 (Turn done into a custom three-state variable) too. It's easy to be backward compatible with integers.

NOT_DONE   = 0
TERMINATED = 1
TRUNCATED  = 2

Why integers rather than a new class with property done.is_trunctated: it's easy to convert integers to NumPy arrays or tensors.
Why integers in range (0, 1, 2): only need minimal migration effort which is shown below.

For testing whether we need to start a new episode:

bool(NOT_DONE)   # -> False
bool(TERMINATED) # -> True
bool(TRUNCATED)  # -> True

The following snippet works fine for both boolean-based done and integer-based done.

observation = env.reset()
done = NOT_DONE
while not done:
    action = policy(observation)
    observation, reward, done, info = env.step(action)

For value iteration in training (e.g. DQN):

abs(1 - NOT_DONE)   # -> 1
abs(1 - TERMINATED) # -> 0
abs(1 - TRUNCATED)  # -> 1

Formerly, the target value is calculated by:

Q_target = reward + gamma * (1 - done) * max(Q(next_observation, *))

With the three-state variable, we only need to add abs operator:

Q_target = reward + gamma * abs(1 - done) * max(Q(next_observation, *))

The operator abs can be applied to NumPy arrays or tensors too.

vwxyzjn commented 2 years ago

@XuehaiPan thanks for including the snippet. Option 3 then appears very convenient engineering-wise. Your explainer is very well-written, and I'd suggest adding it to the documentation and possibly the blog post that explains the new gym version.

jkterry1 commented 2 years ago

@XuehaiPan

Sorry for the delayed reply. I spoke to @vwxyzjn and @RedTachyon about this off line at length. Neither of them had a very strong opinion between [0,1,2] or two booleans, but after the discussion I would very much prefer if we used two booleans.

This is for the following reasons:

Two booleans is a more explicit API, and is instantly understandable for beginners from a code snippet instead of having to read the API docs
Two booleans results in more explicit code. I think that giving the ability to write code to write things like your abs(1- done) is a bug not a feature for a general purpose API that many people will be using
I think that two booleans will result in a more idiot-proof upgrading of old environments to the new API instead of [0,1,2], but this is a more minor concern.

ChrisCummins commented 2 years ago

I think that two booleans will result in a more idiot-proof upgrading of old environments to the new API instead of [0,1,2], but this is a more minor concern.

This project has 26k stars and an untold greater number of users. I wouldn't underestimate how big of a hassle changing the return signature of a core method like env.step() is.

@XuehaiPan's suggestion adds the new behavior without requiring a single line to be changed in any existing user code. Personally I'm not a big fan of the enum consts and would prefer done to be an Optional[DoneType] which could have attributes, like:

_, _, done, _ = env.step(_)
if done and done.truncated:
  ...
print(done.reason)

Cheers, Chris

XuehaiPan commented 2 years ago

@XuehaiPan's suggestion adds the new behavior without requiring a single line to be changed in any existing user code. Personally I'm not a big fan of the enum consts and would prefer done to be an Optional[DoneType] which could have attributes, like:
_, _, done, _ = env.step(_)
if done and done.truncated:
  ...
print(done.reason)

@ChrisCummins Indeed the enum constants confuse users, especially beginners. It may need much detailed documentation and examples. Include a DQN implementation tutorial in the docs may even be better.

in https://github.com/openai/gym/issues/2510#issuecomment-1035960224

I prefer option 3 (Turn done into a custom three-state variable) too. It's easy to be backward compatible with integers.
NOT_DONE   = 0
TERMINATED = 1
TRUNCATED  = 2
Why integers rather than a new class with property done.is_trunctated: it's easy to convert integers to NumPy arrays or tensors.

Why integers in range (0, 1, 2): only need minimal migration effort which is shown below.

I choose int (even not Enum) for capability with NumPy arrays and tensors.

dones = [ %%% list of DoneType %%% ]
dones = np.array(dones)
dones.dtype # -> np.dtype('O')

In [1]: from enum import Enum

In [2]: class DoneType(Enum):
   ...:     NOT_DONE = 0
   ...:     DONE = TERMINATED = 1
   ...:     TRUNCATED = 2
   ...: 
   ...:     def __bool__(self):
   ...:         return bool(self.value)
   ...: 
   ...:     def __int__(self):
   ...:         return int(self.value)
   ...: 
   ...:     def __float__(self):
   ...:         return float(self.value)

In [3]: import numpy as np

In [4]: dones = np.array([DoneType.NOT_DONE, DoneType.DONE])

In [5]: dones
Out[5]: array([<DoneType.NOT_DONE: 0>, <DoneType.DONE: 1>], dtype=object)

In [6]: dones.dtype
Out[6]: dtype('O')

In [7]: dones.astype(int)
Out[7]: array([0, 1])

To support auto type casting with NumPy, we should inherit DoneType with int (Python does not allow subclassing with bool)

class DoneType(int):
    def __new__(cls, done, truncated=False):
        done = super().__new__(cls, bool(done))
        done.truncated = truncated
        return done

    @property
    def truncated(self):
        return self.__truncated

    @truncated.setter
    def truncated(self, value):
        self.__truncated = bool(value)

    def __str__(self):
        return f"DoneType({bool(self)}, truncated={self.truncated})"

    __repr__ = __str__

    def episode_ended(self):
        return self.done or self.truncated

In [1]: import numpy as np

In [2]: class DoneType(int):
            ...

In [3]: done = DoneType(True)

In [4]: done
Out[4]: DoneType(True, truncated=False)

In [5]: done.truncated = True

In [6]: done
Out[6]: DoneType(True, truncated=True)

In [7]: dones = [
   ...:     DoneType(False, truncated=False),  # not done
   ...:     DoneType(True, truncated=False),   # terminated
   ...:     DoneType(False, truncated=True),   # not done but truncated
   ...:     DoneType(True, truncated=True),    # done and truncated at same time
   ...: ]

In [8]: np.array(dones)
Out[8]: array([0, 1, 0, 1])

In [9]: np.array([done.truncated for done in dones])
Out[9]: array([False, False,  True,  True])

The problem is how to tensorize np.array([done.truncated for done in dones]). And handle AttributeError for old environments.

XuehaiPan commented 2 years ago

but after the discussion I would very much prefer if we used two booleans.

This is for the following reasons:

Two booleans is a more explicit API, and is instantly understandable for beginners from a code snippet instead of having to read the API docs

Two booleans results in more explicit code. I think that giving the ability to write code to write things like your abs(1- done) is a bug not a feature for a general purpose API that many people will be using

I think that two booleans will result in a more idiot-proof upgrading of old environments to the new API instead of [0,1,2], but this is a more minor concern.

@jkterry1 Showed in the third point, it's not backward compatible. I would rather leave the step() returns tuple (next_obs, reward, done, info).

I have another proposal (1-3 in https://github.com/openai/gym/issues/2510#issue-1075178315):

Add method ended() to gym.Env. Separate done from rollout sampling control.

Old API:

obs = env.reset()
while True:
    action = policy(obs)
    obs, reward, done, info = env.step(action)

    if done:
        break

New API:

obs = env.reset()
while True:
    action = policy(obs)
    obs, reward, done, info = env.step(action)

    if env.ended():
        break  # `done` maybe `True` when the episode is truncated

We only need to add truncation behavior to TimeLimit.ended() by step counting.

cc @vwxyzjn @RedTachyon.

jkterry1 commented 2 years ago

Hey, I spoke to Costa about this offline a bit again. Your core concern, along with @ChrisCummins 's, appears to be about the breaking nature of this change. Costa shared some of these, and if he does then clarifying further is clearly prudent.

To clarify, my intention would be that, until the 1.0 release, there would be a truncated argument to step` that would be false by default but issue a warning while false, allowing code to upgrade. In the 1.0 release, along with the other breaking changes announced in the roadmap, this would just switch to always being on without an option to turn it off. I also think that the amount of changes required to adapt code to my proposed scheme would be fairly small (e.g. you'd just need to add a , _ to the output of step and use your old code like before).

I also think that making a breaking change to do this incredibly important feature right in a permanent manner in 1.0 is more important than retaining backwards compatibility, especially given that the Gym will almost certainly be used a ton more in the next 5 years than it has in the past 5.

Regarding how to do this if we're doing a breaking change, I don't think that adding the truncation method to the API is a clean or natural way compared to the original options 1-3

XuehaiPan commented 2 years ago

I also think that the amount of changes required to adapt code to my proposed scheme would be fairly small (e.g. you'd just need to add a , _ to the output of step and use your old code like before).

The above snippet is

observation, reward, done, _, info = env.step(action)

The return value of truncated is never used.

Besides, the main concern for me is that we never use truncated in training. We even don't need to save it to the reply buffer.

obs = env.reset()

while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step()

    buffer.add((obs, next_obs, reward, done, info))  # `truncated` is not added
    obs = next_obs
    if done or truncated:
        break

For me, I would support 3 instead of 1:

useless truncated during training, it only affects sampling.
not backward compatible (the migration snippet obs, reward, done, _, info means we ignore and never use truncated at all).

Either enumerate or new type DoneType is fine for me if we are not going to add a new method for truncation.

The following snippet supports backward compatibility and NumPy array conversion.

class DoneType(int):
    def __new__(cls, done, truncated=False):
        done = super().__new__(cls, bool(done))
        done.truncated = truncated
        return done

    @property
    def truncated(self):
        return self.__truncated

    @truncated.setter
    def truncated(self, value):
        self.__truncated = bool(value)

    def __str__(self):
        return f"DoneType({bool(int(self))}, truncated={self.truncated})"

    __repr__ = __str__

    def episode_ended(self):
        return self.done or self.truncated

    __bool__ = episode_ended  # this means bool(int(self)) != int(bool(self))

obs = env.reset()

while True:
    action = policy(obs)
    next_obs, reward, done, info = env.step()

    buffer.add((obs, next_obs, reward, done, info))
    obs = next_obs
    if done:  # here we call `done.__bool__()`
        break

araffin commented 2 years ago

One thing is not clear to me, independent of the implementation, which is the semantic and allowed cases (and should made clear in the doc https://github.com/openai/gym/issues/2510#issuecomment-989734977):

is done=False and truncated=True possible?
same for done=True and truncated=True?

If done=False and truncated=True is allowed, then the user will have to check for both done and truncated to know if he/she should resets the env? and what is the semantic? the episode is not terminated but in fact it is truncated?

If done=True and truncated=True is allowed, what is the semantic? the episode is terminated because of truncation? or the episode is terminated because of some reason and is also truncated?

XuehaiPan commented 2 years ago

@araffin

If done=False and truncated=True is allowed, then the user will have to check for both done and truncated to know if he/she should resets the env? and what is the semantic? the episode is not terminated but in fact it is truncated?

Yes, done=False and truncated=True is allowed.

In training, the episode can be truncated by the horizon limit. Imagine an environment that has 10,000 steps for each episode (e.g. PendulumEnv without TimeLimit). Truncation is that the sampling is ended while the episode is not.

In testing or evaluation, the TimeLimit should be removed. And we always have truncated=False.

If done=True and truncated=True is allowed, what is the semantic? the episode is terminated because of truncation? or the episode is terminated because of some reason and is also truncated?

The episode is terminated because of some reason. And it also reaches the horizon limit and is been truncated. We actually don't care about the value of truncated when done=True. That's why we can determine termination vs. truncation with only 3-state enum instead of 4.

jkterry1 commented 2 years ago

My preferred semantic would be for done=False and truncated=True to not be allowed. If an environment is truncated this is obviously a kind of done, and no information is lost via this that I'm aware of, and this would make retaining handling compatibility significantly easier.

XuehaiPan commented 2 years ago

My preferred semantic would be for done=False and truncated=True to not be allowed. If an environment is truncated this is obviously a kind of done, and no information is lost via this that I'm aware of

@jkterry1 I think the semantic done means different for me.

Do you mean:

done (the sampling should be ended): if True, stop further step() call, whatever the next observation is valid or not.
truncated (the sampling is ended due to time limit): if True, stop further step() call due to time limit, the next observation should be valid (and done=True).

Whether the next observation is valid or not:

	`done=False`	`done=True`
`truncated=False`	valid	invalid (gamma=0)
`truncated=True`	N/A	valid

# Sampling phase
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, truncated, info))

    if done:
        break
    next_obs = obs

####################

# Training phase

if truncated or not done:
    target_value = reward + gamma * value_func(next_obs)
else:
    target_value = reward

value_loss = 0.5 * square( value_func(obs) - target_value )

This needs to add truncated to the replay buffer, which is not trivial migration on the subsequence RL learning frameworks. Both sampling and training code will need to update.

 # Sampling phase
 obs = env.reset()
 while True:
     action = policy(obs)
-    next_obs, reward, done, info = env.step(action)
-    buffer.add((obs, next_obs, reward, done, info))
+    next_obs, reward, done, truncated, info = env.step(action)
+    buffer.add((obs, next_obs, reward, done, truncated, info))

     if done:
         break
     next_obs = obs

 ####################

 # Training phase

-if not done:
+if truncated or not done:
     target_value = reward + gamma * value_func(next_obs)
 else:
     target_value = reward

 value_loss = 0.5 * square( value_func(obs) - target_value )

For me:

done=False and truncated=True can be allowed. And it's the most cases when truncated=True.

done (i.e. terminated) (the episode is ended and the sampling should be ended): if True, the next observation is invalid, stop further step() call.
truncated (the sampling is ended): if True, the next observation can be valid (when done=False), stop further step() call due to time limit.

Whether the next observation is valid or not:

	`done=False`	`done=True`
`truncated=False`	valid	invalid (gamma=0)
`truncated=True`	valid	invalid (gamma=0)

# Sampling phase
obs = env.reset()
while True:
    action = policy(obs)
    next_obs, reward, done, truncated, info = env.step(action)
    buffer.add((obs, next_obs, reward, done, info))

    if done or truncated:
        break
    next_obs = obs

####################

# Training phase

if not done:
    target_value = reward + gamma * value_func(next_obs)
else:
    target_value = reward

value_loss = 0.5 * square( value_func(obs) - target_value )

The only change is in the sampling phase.

 # Sampling phase
 obs = env.reset()
 while True:
     action = policy(obs)
-    next_obs, reward, done, info = env.step(action)
+    next_obs, reward, done, truncated, info = env.step(action)
     buffer.add((obs, next_obs, reward, done, info))

-    if done:
+    if done or truncated:
         break
     next_obs = obs

And we can implement tricky DoneType in https://github.com/openai/gym/issues/2510#issuecomment-1066522592 without any changes in the downstream codebase.

arjun-kg commented 2 years ago

@XuehaiPan just a quick thought regarding your previous comment,

This needs to add truncated to the replay buffer, which is not trivial migration on the subsequence RL learning frameworks. Both sampling and training code will need to update.

In this case can we not add done and not truncated as one variable to the replay buffer and no other changes would be necessary? This is in the situation where done means any kind of done.

XuehaiPan commented 2 years ago

In this case can we not add done and not truncated as one variable to the replay buffer and no other changes would be necessary?

@arjun-kg You'd better not do that. That makes done have different meanings in sampling and training.

done_in_train = done_in_sample and not truncated

which makes the code harder to understand and causes trouble in future maintenance.

The meaning of truncated is not ambiguous, right?

For me, both semantic of done make sense:

indicator to stop the episode in sampling phase. (done)
- done=True -> no more env.step() call
- done=False -> further more env.step() call
- Pros: easy to understand for RL beginners
- Cons: more migration works in downstream codebase, confuse people dealing with gamma in training.
indicator whether the next_obs is valid. (terminated)
- done=True -> gamma == 0
- done=False -> gamma != 0
- Pros: more straightforward for writing training code and less migration effort
- Cons: more documentation, maybe an example code

IMO, I prefer the second one. We should reach an agreement before submitting a PR to implement it. Any thoughts?

araffin commented 2 years ago

One thing is not clear to me, independent of the implementation, which is the semantic and allowed cases I think the semantic done means different for me. Any thoughts?

The answers just confirmed what I was afraid of: this API change brings confusion even for the ones implementing the feature :/ (which mean even more confusion for new comers, in addition to a major breaking change).

In my opinion, a better alternative is to use the current API and make use of the infos dict to output termination reason as a good practice (should be highlighted in the doc), so we don't cover only termination due to timeouts and the semantic is clear:

done=True means the end of an episode, be it because of timeout or not (as before)
infos["termination.truncated_episode"] = True (we can probably find a better name than the current TimeLimit.truncated) means that the episode was terminated due to timeout (or something else that should not be treated as a real termination, for instance reaching tracker limits on a real robot)
infos["termination.termination_condition_name"] = True may be used to tell the user the termination was due to something else (in fact, several conditions may be true at the same time)

vwxyzjn commented 2 years ago

Hey @araffin, that is a good point. I think the new API at this point would be

obs, reward, terminated, truncated, info = env.step(action)

So there is no ambiguous done anymore.

araffin commented 2 years ago

Hey @araffin, that is a good point. I think the new API at this point would be

obs, reward, terminated, truncated, info = env.step(action)

So there is no ambiguous done anymore.

hmm, not sure how renaming done to terminated solves the confusion from https://github.com/openai/gym/issues/2510#issuecomment-1068026884 ?

vwxyzjn commented 2 years ago

Renaming done to terminated makes it more explicit. And we can make the example code like

obs, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
    obs = env.reset()

I think just removing the done which has dual meanings, the code becomes a lot more explicit. Would you agree @araffin?

If termination and truncation happen at the same time, in which case I think it will return terminated=True, truncated=False because truncation ultimately did not happen.

All in all, the possible cases are

terminated=False, truncated=False environment is still ongoing
terminated=True, truncated=False environment is terminated
terminated=False, truncated=False environment is truncated
~~terminated=True, truncated=True~~ (this case cannot happen, by definition this case would mean the environment is terminated)

araffin commented 2 years ago

I think just removing the done which has dual meanings, the code becomes a lot more explicit. Would you agree @araffin?

@vwxyzjn mmh, for me done has only one meaning: episode is terminated, need to call reset(). The only difference is that an episode can be terminated for many reasons, one of them being time limit.

Renaming done to terminated makes it more explicit.

If you look at the actual PR https://github.com/openai/gym/pull/2752, you will notice a weird pattern that I described some months ago (here https://github.com/openai/gym/issues/2510#issuecomment-989734977): all the envs return truncated=False all the time!

I don't think that will help users reading the code.

Either a infos["termination_reason"] = Terminations.TRUNCATED or infos["termination.truncated"] = True would actually be more explicit and cover a larger range of use cases: I consider outputting the termination reason as a good practice, usually much more meaningful than the reward value or the done signal.

XuehaiPan commented 2 years ago

Either a infos["termination_reason"] = Terminations.TRUNCATED or infos["termination.truncated"] = True would actually be more explicit and cover a larger range of use cases: I consider outputting the termination reason as a good practice, usually much more meaningful than the reward value or the done signal.

@araffin I think explicit boolean variable truncated is better than info["termination.truncated"], because it's easier to convert it into tensors. It's more Deep RL friendly.

Converting a batch of dict into NumPy array will give np.array([list of infos]).dtype -> np.dtype('O').
Since we don't have an attribute env.info_space, it's hard to prepare padded batches for RNNs. Actually, a general RL framework cannot handle training with extra data in info. The framework doesn't know the potential keys and corresponding spaces.

See also ray-project/ray#21991

araffin commented 2 years ago

@XuehaiPan

Converting a batch of dict into NumPy array will give np.array([list of infos]).dtype -> np.dtype('O').

well, in that case, you do the same as for a batch of dict observation, you filter and then convert it (see https://github.com/openai/gym/blob/da7b8ae8fc6f5ff97c5f7303c03a26c8b4b3f4e2/gym/vector/utils/numpy_utils.py#L65). Most of the custom envs by users are single env (not already vectorized as isaac gym), so it doesn't much difference at the end.

Anyway, from an algorithm point of view, you need to do a for loop on each env, otherwise you might do computation that is not needed (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/on_policy_algorithm.py#L193).

Since we don't have an attribute env.info_space, it's hard to prepare padded batches for RNNs.

I don't know what is env.info_space...

Actually, a general RL framework cannot handle training with extra data in info

why not?

For RNN and on-policy algorithms, you don't need to know about timeout for padding): https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/blob/feat/ppo-lstm/sb3_contrib/common/recurrent/buffers.py#L129 For reference, timeout for RNN is handled here

EDIT: looking at your issue on RLlib, it look like a very specific usecase (additional loss function for RNN using separate rewards) that very few people would encounter. For such a specific use case, I would recommend to fork the library and adapt it to your needs

XuehaiPan commented 2 years ago

@araffin

Converting a batch of dict into NumPy array will give np.array([list of infos]).dtype -> np.dtype('O').

well, in that case, you do the same as for a batch of dict observation, you filter and then convert it (see

https://github.com/openai/gym/blob/da7b8ae8fc6f5ff97c5f7303c03a26c8b4b3f4e2/gym/vector/utils/numpy_utils.py#L65

).

That's what exactly I meant:

Since we don't have an attribute env.info_space, it's hard to prepare padded batches for RNNs.

I don't know what is env.info_space...

You need a spaces.Dict instance to convert dicts into NumPy arrays, right? In addition, the info dict is unordered rather than an OrderedDict.

Actually, a general RL framework cannot handle training with extra data in info

why not?

EDIT: looking at your issue on RLlib, it look like a very specific usecase (additional loss function for RNN using separate rewards) that very few people would encounter. For such a specific use case, I would recommend to fork the library and adapt it to your needs

We can save almost anything in the info dict. It's hard for a general RL framework to extract the values of a user-defined key in the info, and then stack them into a tensor for training.

If we are only talking about a officially supported key "TimeLimit.truncated", indeed it's the same with a separate boolean truncated. But I think it's not well documented.

BTW, search result for key TimeLimit.truncated:

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3/search?q=TimeLimit.truncated&type=code
RLlib: https://github.com/ray-project/ray/search?q=TimeLimit.truncated&type=code

araffin commented 2 years ago

If we are only talking about a officially supported key "TimeLimit.truncated", indeed it's the same with a separate boolean truncated. But I think it's not well documented.

@XuehaiPan

I'm glad we arrive to the same conclusion =)!

That's exactly my point since the beginning (https://github.com/openai/gym/issues/2510#issuecomment-989734977):

we need documentation first before doing any breaking change (for instance renaming the key or changing the API)
the solution already exists (via the TimeLimit wrapper) and is unofficial (already used by different codebases) but is badly documented (I do mention truncation in our tips and tricks video and associated notebook but should be also added in the text)

BTW, search result for key TimeLimit.truncated:

This is mostly an issue with github search...

(ray doesn't seem to support truncation: https://github.com/ray-project/ray/issues/10093)

You need a spaces.Dict instance to convert dicts into NumPy arrays, right? In addition, the info dict is unordered rather than an OrderedDict.

I don't think so, you can just stack along the batch axis for the given keys (in the truncation case, this is just booleans), see: https://github.com/openai/gym/blob/da7b8ae8fc6f5ff97c5f7303c03a26c8b4b3f4e2/gym/vector/utils/numpy_utils.py#L53 And since Python 3.6, dictionaries are preserving the order (as for ordered dict). But this is off topic, this issue is about truncation, not passing any additional arrays to the training. (also mentioned, you need to do a for loop anyway for truncation, unless you want to add some masking, which is more efficient but less readable)

jkterry1 commented 2 years ago

Closing in favor of https://github.com/openai/gym/pull/2752

hh0rva1h commented 2 years ago

So there's a general problem with the ALE environments in that v0 and v4 don't implement correct (or reasonable), defaults, making properly reproducing results by the ALE group or DeepMind impossible. They also preclude using correct preprocessing schemes. The plan is to remove these accordingly, though you're right that removing them before making this change may be required, which I had not realized. In the case of strictly reproducing old code, people could use old versions of Gym.

This change could obviously be baked into any environment without the time limit wrapper.

Could you please elaborate on this? I've been asking myself the question how to reproduce the Atari setting as done by Deepmind in their publications and I've got the impression, that the <Game>NoFrameskip-v4 together with the following wrappers should be pretty similar (Deepmind does not use interpolation=cv2.INTER_AREA, but INTER_LINEAR for downscaling the image to 84x84, but I guess that will not make really a measurable difference):

AtariPreprocessing (with terminal_on_life_loss=False)
TransformReward with f(x)=np.sign(x)
FrameStack with num_stack=4

I know terminal_on_life_loss=True is not what Machado et al. recommend, however the referenced DQN paper seems to make use of that as well as I suppose subsequent publications, like those of QRDQN and IQN.

You wrote that with v4 it is impossible to reproduce their results, but I do not see a reason why as outlined above (only difference in v4 I see is that the episode limit is 100k, while for Deepmind it should be 108k, v5 seems to fix this). Would be glad if could point out why my proposal does not replicate their setup.

Btw in terms of the new -v5 variant to reproduce their setup I would use frameskip=1 (wrapper handles this), repeat_action_probability=0 and full_action_space=False in the env and the wrappers as above.

hh0rva1h commented 2 years ago

@jkterry1 After looking at the source, I get the impression, that is rather hard to reproduce the setup with the v5 version due to https://github.com/mgbellemare/Arcade-Learning-Environment/issues/464 since I'm not sure the internal frameskipping of the environment does pooling as done by the frameskip implementation in the AtariPreprocessing wrapper. So I would rather do frameksipping in the wrapper than the env directly.

openai / gym

[Proposal] Formal API handling of truncation vs termination #2510