openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev
Other
34.58k stars 8.59k forks source link

Environment reset() should also return "info", like step() does. #1683

Closed keskival closed 2 years ago

keskival commented 5 years ago

To be symmetric with the observation returned from the step() method, I think reset() method should also return info for arbitrary information about the game state.

This would be useful for certain games where non-Space compliant information about the game state could be returned, especially for the subprocess vectorized environments where the environment internal state is not otherwise easily accessible by the agent.

Sure, in principle all observed state should be projected to the observation space, but that is not always possible, which is why I assume returned "info" exists in the step() method in the first place.

christopherhesse commented 5 years ago

The lack of symmetry is odd. But if we look at step() as actually obs = observe() and rew, done, info = act() combined, then reset() is reset() (which has no return value) + observe().

I think changing this might break the current implementation of vecenvs, since a reset step would need to return (pre-reset info, post-reset info).

christopherhesse commented 5 years ago

info is generally not meant to be part of the observation. If you can't represent your observation using a gym.Space you can just return some other object and not obey the space interface (though of course this means that anything that requires a valid space to work will not work with your environment).

keskival commented 5 years ago

Thank you for the quick response!

I appreciate the clarification of the purpose of the info return item. Currently the documentation states as follows:

"info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning."

This is given in association to other observation information, with observation, reward and done. But as you say, it specifically mentions state change which associates it to act.

I still think the interfaces should be symmetric, or that info should at least be retrievable from the initial state through the vectorized environment interface, because there is no other good way to access such debugging relevant state information in the environment when one does not want to incorporate it to the observation space for various reasons.

pzhokhov commented 4 years ago

How about __getattr__ method in vectorized env, that could query all sub-environments and returns a list of results?

pzhokhov commented 4 years ago

in fact, I think there is already a PR to that extent: https://github.com/openai/gym/pull/1600/files (maybe it is reasonable to limit its scope to read-only properties)

jkterry1 commented 3 years ago

@vwxyzjn @cpnota opinions on what to do with this?

cpnota commented 3 years ago

I agree that it should have been designed to return info to begin with, but obviously at this point changing the core API is a bad idea.

Regarding #1600 , it seems potentially useful but I have no strong opinions.

vwxyzjn commented 3 years ago

I don't have a strong opinion on this, but would prefer to keep things backward-compatible if we make this change (e.g. do gym.reset(info=True)). Also, if you want to get the relevant diagnostics / info, you could always just call a function from the env.

jkterry1 commented 3 years ago

I like the idea of gym.reset(info=True)

balisujohn commented 2 years ago

I'd be pretty happy to take on this issue. It seems like the most obvious change to make is to add the optional argument in the abstract reset function in the Env class. I think this won't break any existing implementations of the Env class. Also I think the default value for the optional info argument to the reset function should be False rather than True; if we default to True, then existing code which uses existing environments will break because the default behavior of the reset function will be to return two objects instead of one. I think the slight asymmetry that will be retained by having info default to False is probably worthwhile in terms of not breaking backwards compatibility.

Best I can tell though, the presence of optional arguments in the Env class is really just a style guide for each environment, so the core of this effort will actually be modifying all the environments to offer this functionality. In theory this shouldn't be a breaking change for any environment, or have any influence on environment behavior or determinism; so I'm interested to hear if this would be an okay change to make to all the environments without incrementing their version number.

If this all seems reasonable, feel free to assign this issue to me, and I'll start working on a pull request.

jkterry1 commented 2 years ago

Hey, I can't assign people due to an issue with github, but I'll have @RedTachyon review your plan and let you know

RedTachyon commented 2 years ago

So in general I like the idea, but there is one part that worries me a little bit, specifically the fact that with this approach, the return signature of env.reset changes depending on the input. Which would be doable (although mildly annoying) with some algebraic data type in a sanely typed language, and in python it can get by through its duck typing. At the same time, I'm not sure if there's a better way to do it at the moment.

To be precise, my problem is with the fact that sometimes you'd need to do obs = env.reset(...) and sometimes obs, info = env.reset(...)

I'd actually be in favor of moving towards the latter as the default behavior e.g. in a 1.0 release. Aside from people getting angry at any changes, I think it's a sane design choice, and the 1.0 version mark might make them complain a bit less.

Also as a word of warning - this will likely be a bit more tricky to properly implement across gym than you expect due to the existence of vector environments. The easy part will be passing the info argument into each individual environment. The more difficult part will be combining all the info outputs (although there should be code to do the same thing in step, so it'd probably be a matter of copying it and adjusting a little bit)

jkterry1 commented 2 years ago

I'm not changing the default of this before a 1.0 release and we have to add functionality for this so Atari can return seeds with the new PR. We can revisit changing the default with 1.0

Sounds good @balisujohn

wookayin commented 2 years ago

My two cents regarding #2546: Have you considered making this option class-wide? Because some may not prefer calling reset(info=false)` every time, and in light of a potential breaking change this would make the codes more transparent.

Another reason is that in many gym wrappers (e.g., dm_env, TFAgents, etc.) reset() is assumed to take no arguments, which means that in order to retrieve info from reset(), the downstream wrappers need to catch up with this change. However if returning dict were possible by changing arguments a constructor, we can achieve the same behavior without requiring those wrappers to be updated.

RedTachyon commented 2 years ago

@wookayin The intention is not that they'd call obs = env.reset(info=False), but that they'd call obs, info = env.reset(), or obs, _ = env.reset() if they don't need the info.

And yes, the downstream code will have to catch up with the 1.0 release when the default API is changed. Creating a compatibility wrapper isn't hard though, so they can do that, or maybe we'll even include it in a future release. Or they can do maintenance on the libraries they're maintaining.

jkterry1 commented 2 years ago

This has been merged into master as a flag to step

wookayin commented 2 years ago

Are we sure we want to make a breaking change? This is going to break all the previous codes that are using gym. I'm worried this is too harsh, and I suggest we keep the previous API. I still think there is a way to keep the previous behavior as a default one, but env.reset() would return a tuple when configured so (like env.reset(return_info=True) or return_info=True passed to the constructor).

UPDATE: It looks like the comment by @RedTachyon was simply misleading, #2546 is indeed NOT making any breaking changes. On 0.22.0 and 1.x, env.reset() will still return a single observation as return_info is by default False. I guess you probably have meant env.reset(return_info=True) for the latter (or without the argument if config were class-wide, if implemented).

RedTachyon commented 2 years ago

@wookayin My comment referred to what this will look like in the future, when we do intend to change the default API. In 0.xx, the defaults will remain the way they are right now. In 1.x, there are no such promises.

superxingzheng commented 1 year ago

This change broke a lot of code.

creativekids11 commented 5 months ago

All my codes are not working now