openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev
Other
34.61k stars 8.6k forks source link

[Proposal] New `info` API for vectorized environments #2657

Closed vwxyzjn closed 2 years ago

vwxyzjn commented 2 years ago

Motivation

The current vector env API would return an info dict per environment. So if if I have 5 envs, then the dict could look like [{}, {}, {}, {}, {"truncated": True}]

While this API is intuitive, it's not a friendly interface to high-throughput environments such as brax (also see https://github.com/google/brax/issues/164) and envpool. In these envs, they prefer the following paradigm: instead of 5 dictionaries, it's just a single dictionary with an array of length 5.

{"truncated":[False, False, False, False, True]}

Pros and Cons

Brax and envpool's approach feels more ergonomic for high-throughput envs, whereas Gym's approach is maybe more efficient for "sparse" info keys that don't appear often (I could be wrong with this).

Proposal

I propose to change the info API for vectorized environments to adapt brax's current design for info. This might also have implications for wrappers and other APIs. For example, the current vectorized environment API would return info like this (with the RecordEpisodeStatistics used):

[
    {
        "episode": {
            "r": 75.0,
            "l": 75,
            "t": 36.770559,
        },
        "terminal_observation": array([0.2097086, -0.5355723, -0.21343598, -0.11173592], dtype=float32),
    },
    {},
    {},
    {},
    {},
]

I am not sure what's the best way to handle this... Maybe something like the following?

{
    "episode": {
        "r": [75.0, None, None, None, None],
        "l": [75, None, None, None, None],
        "t": [36.770559, None, None, None, None],
    },
    "terminal_observation": [
        array([0.2097086, -0.5355723, -0.21343598, -0.11173592], dtype=float32),
        None,
        None,
        None,
        None,
    ]
}

Alternative 1

Alternatively, it's possible to do this in a very hacky way: brax or env pool could always override the first sub-env's info dict to always provide their desired info so something like:

[{"truncated":[False, False, False, False, True]}, {}, {}, {}, {}]

This way, we don't have to change the current API, however, it's extremely hacky and unintuitive.

Alternative 2

Alternatively, maybe info should be reserved for something sparse like episode and terminal_observation. And for more frequent infos they should be accessed via attributes (i.e., envs.get_attr(name) (enabled with https://github.com/openai/gym/pull/1600). So something like

root_joints = brax_envs.get_attr("root_joints")
# instead of 
# root_joints = info["root_joints"]
ale_lives = envpool_envs.get_attr("ale_lives")
# instead of 
# ale_lives = info["ale_lives"]

What are your thoughts @RedTachyon @JesseFarebro, @tristandeleu @XuehaiPan @Trinkle23897 @erikfrey @araffin @Miffyli

Checklist

Trinkle23897 commented 2 years ago

I'd suggest only using None if necessary because it's hard to either directly convert the array into tensor, or generate data from venv executor. In general, tianshou's Batch has already proposed a potential solution like:

In [27]: from tianshou.data import Batch

In [28]: a = Batch(episode={'l': 75, 'r': 75.0, 't': 36.77, 'd': "abcd"})

In [29]: b = Batch(terminal_obs=0.2097)

In [30]: Batch.stack([a, b])
Out[30]: 
Batch(
    terminal_obs: array([0.    , 0.2097]),
    episode: Batch(
                 l: array([75,  0]),
                 r: array([75.,  0.]),
                 t: array([36.77,  0.  ]),
                 d: array(['abcd', None], dtype=object),
             ),
)

https://github.com/thu-ml/tianshou/issues/139

gianlucadecola commented 2 years ago

I'm currently looking at this; for wrappers like RecordEpisodeStatistics the most reasonable approach to me seems the first proposal

{
    "episode": {
        "r": [75.0, None, None, None, None],
        "l": [75, None, None, None, None],
        "t": [36.770559, None, None, None, None],
    },
    "terminal_observation": [
        array([0.2097086, -0.5355723, -0.21343598, -0.11173592], dtype=float32),
        None,
        None,
        None,
        None,
    ]
}

Since there are a lot of things to work on I think it is better to agree on the new API before starting to make changes.

Also the solution proposed by @Trinkle23897 seems good, my only doubt is that using 0's can be misleading regarding the fact that it is a "real" zero or a zero from the fact that there is no info for the env.

@pseudo-rnd-thoughts @jkterry1

pseudo-rnd-thoughts commented 2 years ago

There is a third option which I suspect to be inefficient however removes the need for no-data data, i.e. 0 or None. That is pairing the info data with the environment number. This I think would be efficient for sparse environment info but not for dense

{
   "key": [
         (data, environment number) 
         ...
   ], 
   ...
}

or

{
   "key": {
         environment number: data, 
         ...
   }, 
   ...
}

The primary problem is that it is not memory efficient as it doesn't compress the NumPy data into a single array however if you don't need to store the data then this is fine.

Otherwise, I think that Brax and tianshou have very similar implementations that we could follow making gym work with their modules easier.

Edit: One of my questions is what is the application of needed information from info, particularly given that truncation is moved from info to step return. FYI, I'm not an experienced person with distributed RL

jkterry1 commented 2 years ago

Closing in favor of PR