RedTachyon closed this issue 2 years ago
Can you provide a few concrete examples of where this would be helpful? I'm not imagining them off the top of my head for some reason.
So one example: experiments around the topic of generalization. Say you have N different configurations of the environment, want to train on some subset of them, and then evaluate performance on some other configurations. With this implemented, you can just adjust the configuration between episodes via the reset calls.
A very concrete example from my research: a crowd simulation setup implemented in Unity. I want to be able to train with different numbers of agents and different map layouts, and to switch between those easily, for example to see whether training in scenario A generalizes to good performance in scenario B, or whether training with 20 agents gives a good policy for 50 agents. The same goes for curriculum learning: start training with 10 agents and progressively scale up to 100 or so. In all of these cases, it'd be super convenient to just pass that to the reset method (which is what I actually do, except I dropped the gym dependency at that point).
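To make the idea concrete, here is a minimal sketch of what this could look like (class and parameter names are made up for illustration; this is not the actual implementation described above):

```python
# Hypothetical sketch: reconfiguring an environment between episodes via
# reset(options=...), without recreating the expensive underlying simulator.
class CrowdSimEnv:
    def __init__(self, n_agents=10, layout="A"):
        # In the real setup, an expensive simulator handle (e.g. a Unity
        # executable) would be created once here and reused across episodes.
        self.n_agents = n_agents
        self.layout = layout

    def reset(self, *, seed=None, options=None):
        # Apply per-episode overrides instead of rebuilding the whole env.
        if options:
            self.n_agents = options.get("n_agents", self.n_agents)
            self.layout = options.get("layout", self.layout)
        obs = [0.0] * self.n_agents  # placeholder observation
        return obs, {}

env = CrowdSimEnv()
obs, info = env.reset(options={"n_agents": 50, "layout": "B"})
```

With this, a generalization or curriculum experiment only needs to vary the options dict between episodes.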
It really seems like this gives reset too much power, though. The parameters you described should be passed to make (which does take arbitrary arguments like this), and an environment created based on them. Moving environment-configuration power from make to reset doesn't seem like something that should be done, to me.
Hi, I often need to pass flags that enable certain logging code, which should only be executed in "eval mode". In particular, custom PyBullet or MuJoCo envs allow e.g. logging each simulation step (large arrays) or logging contact information (info that usually doesn't get queried from the engine). These arguments could be passed to reset(), but I find it more convenient to pass them to step() directly.
And just like @RedTachyon, I also pass (temporary) configuration details to reset(): for example, some kind of noise scale or difficulty parameter. Often it is not practical to create new environments every time, because they would load large URDF files again and again.
In general I think the proposed change would be very convenient for detailed evaluations/benchmarks (more than for training).
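A sketch of the step()-kwargs idea described above (names are illustrative, not from any real engine API): the expensive query runs only when the caller opts in, e.g. during evaluation.

```python
# Hypothetical environment whose step() accepts an opt-in logging flag.
class LoggingEnv:
    def __init__(self):
        self.contact_log = []

    def step(self, action, *, log_contacts=False):
        obs, reward, terminated, truncated = 0.0, 0.0, False, False
        if log_contacts:
            # Stand-in for an expensive contact-information query that
            # would otherwise never be issued to the physics engine.
            self.contact_log.append({"t": len(self.contact_log), "contacts": []})
        return obs, reward, terminated, truncated, {}

env = LoggingEnv()
env.step(0)                      # training: no logging overhead
env.step(0, log_contacts=True)   # eval: contact information is recorded
```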
"For example some kind of noise scale, or difficulty parameter. Often it is not practical to create new environments every time because they would load large urdf files again and again."
I still don't understand why creating a new environment isn't the proper and sensible thing to do here. Am I missing something? Regarding the loading issue, couldn't you just load it once and have the environment inherit from the preloaded data structure? That doesn't sound like it would be too difficult.
Maybe I have been using the gym envs in an unintended way, but I always thought of the environment as the main entity. Isn't it common practice to implement domain randomization in reset()? Sometimes it might be necessary to overrule such randomizations. I don't see why it would be more proper and sensible to instantiate new envs with sharp distributions than to overwrite the parameters in reset().

Out of curiosity, how would you implement the environment that inherits from a preloaded data structure? The problem in my case is that the simulator (which needs to load the URDF) has state, i.e. internal variables that speed up collision detection. When I pass the simulator to the environment (or have the environment inherit from some base class), the env can never be reproducible (even after seeding), because I cannot control the simulator state. Obviously I have a similar problem when I pass configurations in reset(), but at least a sequence of "make, seed, reset(config), reset(config)" will always be reproducible, since the simulator state is part of the environment.
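The reproducibility argument can be illustrated with a toy sketch (class and option names invented): as long as the RNG state lives inside the env, seeding plus a fixed sequence of reset(config) calls is deterministic, even though reset() takes per-episode configuration.

```python
import random

# Toy env: the RNG is owned by the environment, so seed + options
# fully determine the episode start.
class SeededEnv:
    def reset(self, *, seed=None, options=None):
        if seed is not None:
            self.rng = random.Random(seed)
        noise_scale = (options or {}).get("noise_scale", 1.0)
        # Placeholder observation drawn from the env-owned RNG.
        return self.rng.random() * noise_scale, {}

a, _ = SeededEnv().reset(seed=42, options={"noise_scale": 0.5})
b, _ = SeededEnv().reset(seed=42, options={"noise_scale": 0.5})
# Identical seed and options yield identical episode starts.
```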
The more I think about this, the more I think it's not a good idea; @RedTachyon, if you feel very strongly that this solves a large problem that cannot be solved otherwise, and that I'm somehow not seeing, we can discuss offline.
Per offline discussion with @RedTachyon, we should do this for reset
I still remain unconvinced of doing this for step, per #2396
The main reason why I would like to have the option also in step is to run expensive logging instructions only in certain situations.
For example, I might want to log all contact points of the simulation only if a gripper is close to a certain object (to calculate grip metrics afterwards). Or maybe someone would like to dump the game state when an agent reaches a new level.
I also imagine that some day I might want to overwrite the randomization for a couple of steps for benchmark reasons, e.g. pull on an object. Of course, you could argue that should be part of the env in the first place but I'd still find it useful.
So Ariel made a valid argument, when I spoke to him, for why this sort of thing belongs in reset. The problem I have with step is that it breaks the contract of what step *means* (taking an action). Moreover, why are custom step arguments any easier than a custom environment method? That's what I've always done myself.
Of course, there are ways to work around the problem, but the reason I didn't use a custom environment method is that the simulation itself runs at a different frequency than the agent's interaction with the environment. Therefore, I cannot simply query the desired state before or after step; I would have to explicitly enable logging before step and disable it afterwards. But if everyone else thinks passing arguments in reset is sufficient, I'm still happy that you decided to at least add them to reset. (Although I don't really see a downside to allowing arguments in step.)
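The enable/disable workaround described above could be sketched like this (names invented; a toy stand-in for a real simulator): a context manager sets the flag before step() and clears it afterwards, so internal substeps see it.

```python
from contextlib import contextmanager

class SimEnv:
    def __init__(self):
        self._log = False
        self.records = []

    @contextmanager
    def logging_enabled(self):
        # Toggle logging around step() instead of passing a step kwarg.
        self._log = True
        try:
            yield
        finally:
            self._log = False

    def step(self, action):
        # The simulator runs several substeps per agent step; each substep
        # checks the flag, which a query before/after step() could not see.
        for substep in range(4):
            if self._log:
                self.records.append(substep)
        return 0.0, 0.0, False, False, {}

env = SimEnv()
with env.logging_enabled():
    env.step(0)   # substeps are logged
env.step(0)       # nothing logged outside the context
```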
Closed in favor of Ariel's PR for organization
Hi, I am using gymnasium 0.28.1 and stable_baselines3. I got the following error on my custom environment: Traceback (most recent call last):
File "C:\Users\kzm0114\PycharmProjects\RL\plotting_test.py", line 149, in <module>
model.learn(total_timesteps=int(timesteps), progress_bar=True)
File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\ppo\ppo.py", line 308, in learn
return super().learn(
File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 246, in learn
total_timesteps, callback = self._setup_learn(
File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\base_class.py", line 424, in _setup_learn
self._last_obs = self.env.reset() # type: ignore[assignment]
File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\vec_env\dummy_vec_env.py", line 77, in reset
obs, self.reset_infos[env_idx] = self.envs[env_idx].reset(seed = self._seeds[env_idx])
File "C:\Users\kzm0114\PycharmProjects\RL\venv\lib\site-packages\stable_baselines3\common\monitor.py", line 83, in reset
return self.env.reset(**kwargs)
TypeError: reset() got an unexpected keyword argument 'seed'
After reading this thread, I checked the reset() of the Monitor wrapper. It has **kwargs. Isn't that strange? Can anyone please help me with this? Thanks!
This seems to be an issue for sb3, not gymnasium. Could you post it there?
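For what it's worth, one common cause of this TypeError (not confirmed for this particular case) is a custom environment whose reset() doesn't accept the seed keyword that wrappers forward along. A minimal sketch of the gymnasium-style signature:

```python
# Sketch of a custom env whose reset() accepts the keyword arguments
# that gymnasium-style wrappers pass through (env name is illustrative).
class MyEnv:
    def reset(self, *, seed=None, options=None):
        # Accepting seed/options keyword-only matches the gymnasium API,
        # so a wrapper calling env.reset(seed=...) no longer raises.
        obs, info = [0.0], {}
        return obs, info

env = MyEnv()
obs, info = env.reset(seed=0)
```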
Proposal

Add *args and **kwargs to the signatures of step and reset (or just **kwargs).

Motivation
Lowkey inspired by #2396 and taking it further, but also by my previous work and thoughts.
Essentially, it is sometimes helpful to pass additional information to the environment that is non-agent-specific, either during the step, or when starting a new episode.
I'd argue this doesn't add any complexity, but rather opens the door to more flexibility when defining custom environments. It can be entirely ignored, including by already existing environments.
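A minimal sketch of the proposed signatures (illustrative only; environments are free to ignore the extra arguments entirely):

```python
class Env:
    def reset(self, **kwargs):
        # Non-agent-specific episode configuration arrives here;
        # environments may use it or ignore it.
        self.config = kwargs
        return [0.0], {}

    def step(self, action, **kwargs):
        # Extra per-step information (e.g. logging flags) arrives here.
        return [0.0], 0.0, False, False, kwargs

env = Env()
obs, info = env.reset(difficulty=2)  # extra config passed straight through
```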
Alternatives
In principle, a solution is just ignoring the problem. Python doesn't care that much about the signatures of overridden methods in subclasses, so if you can ignore the IDE screaming at you when working with a custom environment, it should be fine.
Other than that, there are a few hacks possible like changing the action space, but they're exactly that, hacks.
The main alternative is probably creating a new environment whenever one needs new kwargs passed to it. That's fine for CartPole, but a massive nuisance when creating a new environment is a costly process (e.g. running a game executable).
Additional context
There was a discussion of a similar nature in issue #337, but it's been 5 years and a pandemic, so I'd like to revisit this.