Closed: maxspahn closed this 1 year ago
@alxschwrz Would you mind testing these changes with a simple RL algorithm? Just to check whether it could work?
Ideally, use the FlattenObservation wrapper from gym.wrappers. That would be epic!
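A minimal sketch of what such a test could look like (the environment construction follows the urdfenvs examples; GenericUrdfReacher, the "urdf-env-v0" id, and the dt value are assumptions, not taken from this thread):

```python
import gym
from gym.wrappers import FlattenObservation
from stable_baselines3 import TD3
from urdfenvs.robots.generic_urdf import GenericUrdfReacher

# assumed setup following the urdfenvs examples
robots = [GenericUrdfReacher(urdf="pointRobot.urdf", mode="vel")]
env = gym.make("urdf-env-v0", dt=0.01, robots=robots, render=False)

# flatten the Dict observation into a single Box so MlpPolicy can consume it
env = FlattenObservation(env)
print(env.observation_space)  # expect a flat Box

model = TD3("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```

As the rest of the thread shows, with the Dict action space still in place the TD3 call fails until the action space is flattened or fixed.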
I tested my code from #170 with the package urdfenvs @ git+https://github.com/maxspahn/gym_envs_urdf.git@2365ecd62b60ede20408eb3ce178c97c2e7c1836 (which is the version for this branch).
The code ran as intended for flattening the observation space. For example, for pointRobot.urdf I got this for env.observation_space:
Box([-5. -5. -5. -2.175 -2.175 -2.175], [5. 5. 5. 2.175 2.175 2.175], (6,), float64)
which seems correct to me! But I couldn't run the RL algorithm because of this error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/tmp/ipykernel_1366/2669538420.py in <module>
1 # Define the TD3 agent and train it on the environment
----> 2 model = TD3("MlpPolicy", env, verbose=1)
3 model.learn(total_timesteps=100000)
~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/td3/td3.py in __init__(self, policy, env, learning_rate, buffer_size, learning_starts, batch_size, tau, gamma, train_freq, gradient_steps, action_noise, replay_buffer_class, replay_buffer_kwargs, optimize_memory_usage, policy_delay, target_policy_noise, target_noise_clip, tensorboard_log, policy_kwargs, verbose, seed, device, _init_setup_model)
96 ):
97
---> 98 super().__init__(
99 policy,
100 env,
~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/off_policy_algorithm.py in __init__(self, policy, env, learning_rate, buffer_size, learning_starts, batch_size, tau, gamma, train_freq, gradient_steps, action_noise, replay_buffer_class, replay_buffer_kwargs, optimize_memory_usage, policy_kwargs, tensorboard_log, verbose, device, support_multi_env, monitor_wrapper, seed, use_sde, sde_sample_freq, use_sde_at_warmup, sde_support, supported_action_spaces)
104 ):
105
--> 106 super().__init__(
107 policy=policy,
108 env=env,
~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/base_class.py in __init__(self, policy, env, learning_rate, policy_kwargs, tensorboard_log, verbose, device, support_multi_env, monitor_wrapper, seed, use_sde, sde_sample_freq, supported_action_spaces)
166
167 if supported_action_spaces is not None:
--> 168 assert isinstance(self.action_space, supported_action_spaces), (
169 f"The algorithm only supports {supported_action_spaces} as action spaces "
170 f"but {self.action_space} was provided"
AssertionError: The algorithm only supports <class 'gym.spaces.box.Box'> as action spaces but Dict(robot_0:Box(-2.175, 2.175, (3,), float64)) was provided
meaning our action space needs flattening too! After some digging on the internet, I came across a gist for doing exactly that. Using that FlattenAction wrapper, our action space was converted from Dict(robot_0:Box(-2.175, 2.175, (3,), float64)) to Box(-2.175, 2.175, (3,), float64) (and I don't know whether this is right or not).
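For reference, the FlattenAction wrappers that circulate as gists usually look roughly like the sketch below (an illustration, not necessarily the exact gist used here; it only relies on gym.spaces.flatten_space/flatten/unflatten):

```python
import gym
import gym.spaces as spaces

class FlattenAction(gym.ActionWrapper):
    """Expose a flat Box action space and convert flat actions back into the
    env's original (e.g. Dict) action format before stepping."""

    def __init__(self, env):
        super().__init__(env)
        self.action_space = spaces.flatten_space(env.action_space)

    def action(self, action):
        # flat array -> original (Dict) action, handed on to env.step
        return spaces.unflatten(self.env.action_space, action)

    def reverse_action(self, action):
        # original (Dict) action -> flat array
        return spaces.flatten(self.env.action_space, action)
```

Note that with this construction the underlying env still receives a Dict in step(), which is relevant for the error that follows.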
After doing this, we got a new error for our RL algorithm:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_1366/2669538420.py in <module>
1 # Define the TD3 agent and train it on the environment
2 model = TD3("MlpPolicy", env, verbose=1)
----> 3 model.learn(total_timesteps=100000)
~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/td3/td3.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)
212 ) -> SelfTD3:
213
--> 214 return super().learn(
215 total_timesteps=total_timesteps,
216 callback=callback,
~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/off_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)
332
333 while self.num_timesteps < total_timesteps:
--> 334 rollout = self.collect_rollouts(
335 self.env,
336 train_freq=self.train_freq,
~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/off_policy_algorithm.py in collect_rollouts(self, env, callback, train_freq, replay_buffer, action_noise, learning_starts, log_interval)
565
566 # Rescale and perform action
--> 567 new_obs, rewards, dones, infos = env.step(actions)
...
--> 168 action_robot = action[action_id : action_id + robot.n()]
169 robot.apply_action(action_robot, self.dt())
170 action_id += robot.n()
TypeError: unhashable type: 'slice'
which could stem from my implementation of the RL part. I don't know enough about Gym environments yet and will read up further on the subject.
@behradkhadem Thanks for the quick reply!
Indeed, the problem on urdfenvs is that the action space is a gym.spaces.Dict while the actions are plain arrays. I'll fix that in a second.
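To illustrate the mismatch (a standalone reproduction, not code from urdfenvs): step() slices the incoming action, which works on an array but not on the Dict that the declared action space implies:

```python
import numpy as np

flat_action = np.zeros(3)                # what the slicing in step() can handle
dict_action = {"robot_0": np.zeros(3)}   # what the Dict action space suggests

print(flat_action[0:3])                  # works: array slicing

try:
    dict_action[0:3]                     # dicts cannot be indexed by a slice
except TypeError as err:
    print(err)                           # -> unhashable type: 'slice'
```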
@behradkhadem I corrected the action spaces and added a check for the actions. I'll try it out myself with stable baselines in a second.
@behradkhadem With these updates, it is possible to run an RL algorithm. Let me know if it also works for you. Be aware that you need to implement some sort of reward function yourself.
By "it is possible to run an RL algorithm", I mean it doesn't crash, but whether it is doing anything meaningful, I don't know. :D
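For anyone picking this up, the reward has to be supplied on top of the env. A minimal sketch, assuming the old four-tuple gym step API used here; the goal, the wrapper name, and the index layout of the flattened observation are made up for illustration:

```python
import gym
import numpy as np

class GoalDistanceReward(gym.Wrapper):
    """Replace the env's reward with the negative distance of the robot's
    position to a fixed goal. Purely illustrative."""

    def __init__(self, env, goal):
        super().__init__(env)
        self._goal = np.asarray(goal, dtype=float)

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        # assumption: the first len(goal) entries of the flat observation
        # are the base/joint positions
        position = obs[: len(self._goal)]
        reward = -np.linalg.norm(position - self._goal)
        return obs, reward, done, info

# usage, after wrapping with FlattenObservation:
# env = GoalDistanceReward(env, goal=[2.0, 2.0, 0.0])
```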
Well, that's good enough for me! I should deepen my understanding of URDF robots and Gym environments first.
Thanks a lot @maxspahn, you helped a lot, and I really don't know how to thank you! If there is anything I can do, I would be more than happy to help.
@behradkhadem I am glad if I was able to help you.
If you want to thank me, just keep me informed if something related to urdfenvs goes wrong, so I can improve it. Also, just leave a little star here :star:
As flatten_observation did not work as intended (see #170 and #171), this PR makes sure that the FlattenObservation wrapper works and effectively replaces the flatten_observation argument. This required some changes in the structure of urdf_env.py.
The new setup for an environment would be env.set_spaces(), which effectively sets the action and observation spaces, followed by
env = gym.wrappers.flatten_observation.FlattenObservation(env)
env.reset()
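Put together, the new setup could look like the sketch below. The environment construction with GenericUrdfReacher and "urdf-env-v0" follows the urdfenvs examples and is an assumption; set_spaces() and the wrapper call are the ones described above:

```python
import gym
from urdfenvs.robots.generic_urdf import GenericUrdfReacher

# assumed setup following the urdfenvs examples
robots = [GenericUrdfReacher(urdf="pointRobot.urdf", mode="vel")]
env = gym.make("urdf-env-v0", dt=0.01, robots=robots, render=False)

env.set_spaces()  # sets the action and observation spaces (new in this PR)
env = gym.wrappers.flatten_observation.FlattenObservation(env)
ob = env.reset()

print(env.observation_space)  # a flat Box instead of a Dict
```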