openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

[Proposal] Custom arguments in step and reset methods #2399

Closed: RedTachyon closed this issue 2 years ago

RedTachyon commented 2 years ago

Proposal

Add `*args` and `**kwargs` to the signatures of `step` and `reset` (or just `**kwargs`)
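A rough sketch of what this could look like (illustrative only, not a settled API):

```python
import gym

class ConfigurableEnv(gym.Env):
    def reset(self, **kwargs):
        # e.g. kwargs = {"num_agents": 20, "map_layout": "plaza"}:
        # reconfigure the upcoming episode, then return the initial observation.
        ...

    def step(self, action, **kwargs):
        # e.g. kwargs = {"log_contacts": True}: environments that don't
        # need extra arguments can simply ignore them.
        ...
```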

Motivation

Lowkey inspired by #2396 and taking it further, but also by my previous work and thoughts.

Essentially, it is sometimes helpful to pass additional information to the environment that is non-agent-specific, either during the step, or when starting a new episode.

I'd argue this doesn't add any complexity, but rather opens the door to greater flexibility when defining custom environments. It can be ignored entirely, including by already-existing environments.

Alternatives

In principle, a solution is just ignoring the problem. Python doesn't care that much about signatures of functions in subclasses, so if you can ignore the IDE screaming at you when working with a custom environment, it should be fine.
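For illustration, a minimal sketch (not from the thread) of why ignoring the problem "works": Python accepts overrides whose signatures differ from the base class, and only static tooling complains:

```python
import gym

class MyEnv(gym.Env):
    # The signature differs from gym.Env.reset(); Python doesn't mind at
    # runtime, but IDEs and type checkers will flag the mismatch.
    def reset(self, difficulty=0.5):
        ...

env = MyEnv()
env.reset(difficulty=0.9)  # runs fine despite the base-class signature
```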

Other than that, there are a few possible hacks, like changing the action space, but they're exactly that: hacks.

The main alternative is probably creating a new environment whenever new kwargs need to be passed. That's fine for CartPole, but a massive nuisance when creating a new environment is a costly process (e.g. launching a game executable).

Additional context

There was a discussion of a similar nature in issue #337, but it's been 5 years and a pandemic, so I'd like to revisit this.


jkterry1 commented 2 years ago

Can you provide a few concrete examples of where this would be helpful? I'm not imagining them off the top of my head for some reason.

RedTachyon commented 2 years ago

So one example: experiments around the topic of generalization, e.g. you have N different configurations of the environment, want to train on some selection of them, and then evaluate performance on some other configurations. With this implemented, you can adjust the configuration between episodes via the reset calls.

A very concrete example from my research: a crowd simulation setup implemented in Unity. I want to be able to train with different numbers of agents and different map layouts, and to switch between those easily. For example, to see whether training in scenario A generalizes to good performance in scenario B, or whether training with 20 agents gives a good policy for 50 agents. It's also useful for curriculum learning: start training with 10 agents, then progressively scale up to 100 or so. In all of these cases, it'd be super convenient to just pass that to the reset method (which is what I actually do, except I dropped the gym dependency at that point).
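As a sketch, with a hypothetical CrowdSimEnv, made-up parameter names, and some policy(obs) callable (using the gym API of the time, where step returns a 4-tuple):

```python
# Curriculum learning: scale the number of agents up between episodes
# without rebuilding the (expensive) Unity environment each time.
env = CrowdSimEnv(executable_path="crowd_sim.x86_64")  # hypothetical env

for num_agents in (10, 20, 50, 100):
    obs = env.reset(num_agents=num_agents, map_layout="plaza")
    done = False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
```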

jkterry1 commented 2 years ago

It really seems like this gives reset too much power, though; the parameters you described should be passed to make (which does take arbitrary arguments like this) and an environment created based on them. Moving environment-configuration power from make to reset doesn't seem like something that should be done, to me.

johannespitz commented 2 years ago

Hi, I often need to pass flags that enable certain logging code which should only be executed in "eval mode". In particular, custom PyBullet or MuJoCo envs allow e.g. logging each simulation step (large arrays) or logging contact information (info that usually doesn't get queried from the engine). These arguments could be passed to reset(), but I find it more convenient to pass them to step() directly.
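A sketch of the kind of step-level flag described here (the env and its helper methods are made up):

```python
import gym

class ManipulationEnv(gym.Env):  # hypothetical MuJoCo-style env
    def step(self, action, log_contacts=False):
        self._apply_action(action)  # hypothetical helpers throughout
        info = {}
        if log_contacts:
            # Query the (expensive) contact buffer only in eval mode.
            info["contacts"] = self._read_contact_buffer()
        return self._get_obs(), self._compute_reward(), self._is_done(), info
```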

And just like @RedTachyon, I also pass (temporary) configuration details to reset(), for example some kind of noise scale or difficulty parameter. Often it is not practical to create new environments every time, because they would load large URDF files again and again.

In general I think the proposed change would be very convenient for detailed evaluations/benchmarks (more than for training).

jkterry1 commented 2 years ago

"For example some kind of noise scale, or difficulty parameter. Often it is not practical to create new environments every time because they would load large urdf files again and again."

I still don't understand why creating a new environment is not the proper and sensible thing to do here? Am I missing something? Regarding the loading issue, couldn't you just load it once and have the environment inherit from the preloaded data structure? That doesn't sound like it would be too difficult.
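A sketch of that suggestion, assuming the simulator exposes some way to reuse loaded assets (all names here are made up):

```python
# Pay the URDF parsing cost once...
shared_model = load_urdf("robot.urdf")  # hypothetical loader

# ...then hand the preloaded data to every new instance, so creating a
# fresh environment per configuration stays cheap.
def make_env(noise_scale):
    return RobotEnv(model=shared_model, noise_scale=noise_scale)  # hypothetical
```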

johannespitz commented 2 years ago

Maybe I have been using the gym envs in an unintended way, but I always thought of the environment as the main entity. Isn't it common practice to implement domain randomization in reset()? Sometimes it might be necessary to overrule such randomizations. I don't see why it would be more proper and sensible to instantiate new envs with sharp distributions than to overwrite the parameters in reset().

Out of curiosity, how would you implement the environment that inherits from a preloaded data structure? The problem in my case is that the simulator (which needs to load the URDF) has state, i.e. internal variables that speed up collision detection. If I pass the simulator to the environment (or have the environment inherit from some base class), the env can never be reproducible (even after seeding), because I cannot control the simulator state. Obviously I have a similar problem when I pass configurations in reset(), but at least a sequence of "make, seed, reset(config), reset(config)" will always be reproducible, since the simulator state is part of the environment.
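That sequence, spelled out as a sketch (reset(config=...) is the proposed kwarg; the env id and config are made up):

```python
import gym

hard_config = {"noise_scale": 0.3}

env = gym.make("CustomSim-v0")  # hypothetical env id
env.seed(42)
obs_a = env.reset(config=hard_config)  # episode 1
obs_b = env.reset(config=hard_config)  # episode 2
# Reproducible: the simulator state (e.g. collision-detection caches) is
# owned by this env instance, so seeding controls everything. With an
# externally constructed simulator passed in at make-time, it wouldn't be.
```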

jkterry1 commented 2 years ago

The more I think about this, the more I think it's not a good idea. @RedTachyon, if you feel very strongly that this solves a large problem that cannot be solved otherwise, and that I'm somehow not seeing, we can discuss offline.

jkterry1 commented 2 years ago

Per offline discussion with @RedTachyon, we should do this for reset

jkterry1 commented 2 years ago

I still remain unconvinced of doing this for step, per #2396

johannespitz commented 2 years ago

The main reason why I would like to have the option also in step is to run expensive logging instructions only in certain situations. For example, I might want to log all contact points of the simulation only if a gripper is close to a certain object (to calculate grip metrics afterwards). Or maybe someone would like to dump the game state when an agent reaches a new level. I also imagine that some day I might want to overwrite the randomization for a couple of steps for benchmark reasons, e.g. pull on an object. Of course, you could argue that should be part of the env in the first place, but I'd still find it useful.
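On the caller side, that first use case might look something like this sketch (assuming an env like the one sketched earlier; the policy and distance helper are made up):

```python
obs = env.reset()
done = False
while not done:
    action = policy(obs)  # hypothetical policy
    # Expensive contact logging only while the gripper is near the object.
    near_object = gripper_to_object_distance(obs) < 0.05  # hypothetical helper
    obs, reward, done, info = env.step(action, log_contacts=near_object)
```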

jkterry1 commented 2 years ago

So Ariel made a valid argument, when I spoke to him, for why this sort of thing belongs in reset. The problem I have with step is that this breaks the contract of what step *means* (taking an action). Moreover, why are custom step arguments even easier than using a custom environment method? That's what I've always done myself.
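The custom-method alternative mentioned here might look like this sketch (set_logging is a made-up, non-gym method):

```python
import gym

class ManipulationEnv(gym.Env):
    def set_logging(self, enabled):
        # Custom, non-standard method: toggles expensive instrumentation
        # for all subsequent steps.
        self._logging = enabled

# Caller brackets the step with enable/disable calls:
env.set_logging(True)
obs, reward, done, info = env.step(action)
env.set_logging(False)
```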

johannespitz commented 2 years ago

Of course, there are ways to work around the problem, but the reason I didn't use a custom environment method is that the simulation itself runs at a different frequency than the one at which the agent interacts with the environment. Therefore, I cannot simply query the desired state before or after step; I would have to explicitly enable logging before step and disable it afterwards. But if everyone else thinks passing arguments in reset is sufficient, I'm still happy that you decided to at least add them there. (Although I don't really see a downside to allowing arguments in step.)

jkterry1 commented 2 years ago

Closed in favor of Ariel's PR for organization

KFM135 commented 11 months ago

Hi, I am using gymnasium 0.28.1 and stable_baselines3. I got the following error on my custom environment:

```
Traceback (most recent call last):
  File "C:\Users\kzm0114\PycharmProjects\RL\plotting_test.py", line 149, in <module>
    model.learn(total_timesteps=int(timesteps), progress_bar=True)
  File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 308, in learn
    return super().learn(
  File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 246, in learn
    total_timesteps, callback = self._setup_learn(
  File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\base_class.py", line 424, in _setup_learn
    self._last_obs = self.env.reset()  # type: ignore[assignment]
  File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\vec_env\dummy_vec_env.py", line 77, in reset
    obs, self.reset_infos[env_idx] = self.envs[env_idx].reset(seed=self._seeds[env_idx])
  File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\monitor.py", line 83, in reset
    return self.env.reset(**kwargs)
TypeError: reset() got an unexpected keyword argument 'seed'
```

After reading this thread, I checked the Monitor wrapper's reset(); it does accept **kwargs. Isn't that strange? Can anyone please help me with this? Thanks!
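For reference, the traceback suggests the custom environment's reset() doesn't accept the seed keyword that Gymnasium-era wrappers forward to it; a Gymnasium-compatible override looks roughly like this (the observation helper is hypothetical):

```python
import gymnasium as gym

class MyCustomEnv(gym.Env):
    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        observation = self._compute_initial_obs()  # hypothetical helper
        return observation, {}  # Gymnasium's reset returns (obs, info)
```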

pseudo-rnd-thoughts commented 11 months ago

This seems to be an issue for SB3, not Gymnasium. Could you post it over there?