thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License
7.95k stars 1.13k forks

Setting seed, return_info, options for reset #605

Closed ultmaster closed 2 years ago

ultmaster commented 2 years ago

Gym environment API supports seed, return_info, and options in reset:

https://github.com/openai/gym/blob/dcae553204957a75db5e4169e33547ccac599609/gym/core.py#L93

But I find that this information is lost in tianshou:

https://github.com/thu-ml/tianshou/blob/e01385ea30d21b7efc3f9377d22d83e6afd39e2f/tianshou/env/worker/subproc.py#L90 https://github.com/thu-ml/tianshou/blob/e01385ea30d21b7efc3f9377d22d83e6afd39e2f/tianshou/env/worker/dummy.py#L34

Also, the return type is Union[ObsType, tuple[ObsType, dict]] in gym, but only ObsType in tianshou.
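To illustrate the dual return type: with the new gym API, reset can return either obs alone or (obs, info) depending on return_info. A minimal sketch (MyEnv is a hypothetical environment, not part of gym or tianshou):

```python
# Hedged sketch of the post-0.22 gym reset signature.
# `MyEnv` is a made-up stand-in environment, not a real gym env.
class MyEnv:
    def reset(self, *, seed=None, return_info=False, options=None):
        obs = [0.0, 0.0]  # placeholder observation
        if return_info:
            return obs, {"seed_used": seed}  # new-style: (obs, info)
        return obs  # old-style: obs only

env = MyEnv()
obs = env.reset()                        # old callers still get obs alone
obs, info = env.reset(return_info=True)  # new callers get (obs, info)
```

This is exactly the Union[ObsType, tuple[ObsType, dict]] shape that tianshou's workers currently discard.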

Trinkle23897 commented 2 years ago

We haven't adapted to the newest version of gym in tianshou.

Union[ObsType, tuple[ObsType, dict]]

The tuple was added for multi-agent support, at my suggestion.

However, it is a huge change that may break many existing customized environments. So I personally think it's better to wait a while before supporting the newest feature (probably when gym 1.0 is released?).

ultmaster commented 2 years ago

I'm asking this because I want to put some extra information (e.g., collect logs) on reset.

However, I didn't realize it's a feature gym added only recently. In that case, there's no hurry.

Trinkle23897 commented 2 years ago

I think the proper way is to support both. Can we add an extra flag that determines whether the env's reset returns only obs or a tuple of (obs, info)? It would require changing the collector as well.
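The flag idea could look roughly like this. All names here (WorkerSketch, env_returns_info) are made up for illustration and are not tianshou's actual EnvWorker API:

```python
# Hypothetical sketch of the flag-based approach: the worker normalizes
# reset() output so the collector always sees (obs, info).
class WorkerSketch:
    def __init__(self, env, env_returns_info=False):
        self.env = env
        self.env_returns_info = env_returns_info  # the proposed extra flag

    def reset(self, **kwargs):
        result = self.env.reset(**kwargs)
        if self.env_returns_info:
            obs, info = result       # new-style env: already (obs, info)
        else:
            obs, info = result, {}   # old-style env: synthesize empty info
        return obs, info

class OldStyleEnv:
    def reset(self):
        return "obs"  # old gym API: observation only

normalized = WorkerSketch(OldStyleEnv()).reset()
```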

Another solution is to add an environment wrapper:

import gym

class ForceReturnInfoAtReset(gym.Wrapper):  # note: gym.Wrapper, not gym.Wrappers
  def reset(self, **kwargs):
    result = self.env.reset(**kwargs)
    if isinstance(result, tuple):
      return result  # env already returns (obs, info)
    return result, {}  # fake info

And change all of tianshou's code to expect (obs, info).
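The normalization behavior can be checked with a self-contained sketch that duck-types the wrapper instead of subclassing gym.Wrapper, so it runs without gym installed (OldEnv and NewEnv are made-up stub environments):

```python
# Stand-in for the wrapper above; the real one would subclass gym.Wrapper.
class ForceReturnInfoAtReset:
    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        result = self.env.reset(**kwargs)
        if isinstance(result, tuple):
            return result    # env already returns (obs, info)
        return result, {}    # old-style env: attach a fake info dict

class OldEnv:
    def reset(self):
        return [1, 2, 3]     # old API: obs only

class NewEnv:
    def reset(self):
        return [1, 2, 3], {"note": "hi"}  # new API: (obs, info)

old_result = ForceReturnInfoAtReset(OldEnv()).reset()
new_result = ForceReturnInfoAtReset(NewEnv()).reset()
```

Either way, downstream code can unconditionally unpack obs, info.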

ultmaster commented 2 years ago

I think tianshou doesn't need to support this natively. Rather, it needs to have the interface to let users implement similar features (e.g., #322) on their own, in a natural way.

For the vector env, it should support arbitrary input arguments and return values. It can assume that the first element of the returned tuple is obs, but I don't think any other assumptions are necessary.
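A rough sketch of that suggestion: the vector env forwards arbitrary kwargs to each env and only assumes the first element of a tuple return is obs. VecEnvSketch is illustrative only, not tianshou's actual BaseVectorEnv:

```python
# Hypothetical vector env that passes kwargs through untouched and treats
# everything after obs as opaque, per the suggestion above.
class VecEnvSketch:
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self, **kwargs):
        # Forward kwargs to every env; no assumptions about their meaning.
        results = [env.reset(**kwargs) for env in self.envs]
        if results and isinstance(results[0], tuple):
            obs_list = [r[0] for r in results]  # only assumption: obs first
            rest = [r[1:] for r in results]     # keep any extras opaque
            return obs_list, rest
        return results

class _Env:  # made-up test env that echoes the kwargs it received
    def reset(self, **kwargs):
        return [0], {"kwargs": kwargs}

vec = VecEnvSketch([_Env, _Env])
obs, rest = vec.reset(seed=1)
```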

For the collector, I think the current implementation is a bit messy. Maybe the collect method needs to be broken down so that users can customize their own collectors more easily.

jkterry1 commented 2 years ago

For whatever good my input does here:

I'm largely fine with not using the new changes until the 1.0 release comes out (or, more specifically, what we're planning to call the 0.5.0 release, where all the breaking changes to the base API will become permanently enabled; the 1.0 release will have additional work on wrappers, hardware-accelerated environments, and vector environments). However, the breaking changes that aren't enabled by default really need to be updated, and the minimum version of Gym needs to be updated, per https://github.com/thu-ml/tianshou/pull/613. Perhaps it makes sense to leave this PR open for a few more months until the 0.5.0 release comes out, and to work on getting #613 merged?