maxspahn / gym_envs_urdf

URDF environments for gym
https://maxspahn.github.io/gym_envs_urdf/
GNU General Public License v3.0

Fix flatten observation #173

Status: Closed (maxspahn closed 1 year ago)

maxspahn commented 1 year ago

As flatten_observation did not work as intended (see #170 and #171), this PR makes sure that the FlattenObservation wrapper works and effectively replaces the flatten_observation argument.

This required some changes in the structure of urdf_env.py.

The new setup for an environment is as follows (a rough code sketch follows the list):

  1. Create the list of robots as before.
  2. Create the UrdfEnv with the list of robots and arguments.
  3. Add obstacles, goals and sensors.
  4. Finalize step 3 by calling env.set_spaces(), which sets the action and observation spaces.
  5. OPTIONAL: Flatten your observations using the OpenAI Gym wrapper: env = gym.wrappers.flatten_observation.FlattenObservation(env)
  6. That's it; you can now run your episodes, starting with env.reset().
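Put together, a minimal sketch of this workflow could look like the snippet below. It assumes the GenericUrdfReacher robot and the urdf-env-v0 id used in the repo's examples; exact constructor arguments and the registered id may differ for your version and robots.

import gym
from urdfenvs.robots.generic_urdf import GenericUrdfReacher

# 1. Create the list of robots as before.
robots = [GenericUrdfReacher(urdf="pointRobot.urdf", mode="vel")]

# 2. Create the UrdfEnv with the list of robots and arguments.
env = gym.make("urdf-env-v0", robots=robots, dt=0.01, render=False)

# 3. Add obstacles, goals and sensors here if needed, e.g.
#    env.add_obstacle(...), env.add_goal(...), env.add_sensor(...)

# 4. Finalize step 3 by setting the action and observation spaces.
env.set_spaces()

# 5. OPTIONAL: flatten the observations with the gym wrapper.
env = gym.wrappers.flatten_observation.FlattenObservation(env)

# 6. Run your episodes, starting with env.reset().
ob = env.reset()
for _ in range(100):
    ob, reward, done, info = env.step(env.action_space.sample())
env.close()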

Bug Fixes

Ft

Ft[sensors]

Ft[structure]

maxspahn commented 1 year ago

@alxschwrz Would you mind testing these changes with a simple RL algorithm, just to check whether it could work? Ideally, use the FlattenObservation wrapper from gym.wrappers. That would be epic!

behradkhadem commented 1 year ago

I tested my code in #170 for package urdfenvs @ git+https://github.com/maxspahn/gym_envs_urdf.git@2365ecd62b60ede20408eb3ce178c97c2e7c1836 (which is the version for this branch)

The code ran as intended for flattening the observation space. For example, for pointRobot.urdf I got this for env.observation_space:

Box([-5.    -5.    -5.    -2.175 -2.175 -2.175], [5.    5.    5.    2.175 2.175 2.175], (6,), float64)

which seems correct to me! But I couldn't run the RL algorithm due to this error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/tmp/ipykernel_1366/2669538420.py in <module>
      1 # Define the TD3 agent and train it on the environment
----> 2 model = TD3("MlpPolicy", env, verbose=1)
      3 model.learn(total_timesteps=100000)

~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/td3/td3.py in __init__(self, policy, env, learning_rate, buffer_size, learning_starts, batch_size, tau, gamma, train_freq, gradient_steps, action_noise, replay_buffer_class, replay_buffer_kwargs, optimize_memory_usage, policy_delay, target_policy_noise, target_noise_clip, tensorboard_log, policy_kwargs, verbose, seed, device, _init_setup_model)
     96     ):
     97 
---> 98         super().__init__(
     99             policy,
    100             env,

~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/off_policy_algorithm.py in __init__(self, policy, env, learning_rate, buffer_size, learning_starts, batch_size, tau, gamma, train_freq, gradient_steps, action_noise, replay_buffer_class, replay_buffer_kwargs, optimize_memory_usage, policy_kwargs, tensorboard_log, verbose, device, support_multi_env, monitor_wrapper, seed, use_sde, sde_sample_freq, use_sde_at_warmup, sde_support, supported_action_spaces)
    104     ):
    105 
--> 106         super().__init__(
    107             policy=policy,
    108             env=env,

~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/base_class.py in __init__(self, policy, env, learning_rate, policy_kwargs, tensorboard_log, verbose, device, support_multi_env, monitor_wrapper, seed, use_sde, sde_sample_freq, supported_action_spaces)
    166 
    167             if supported_action_spaces is not None:
--> 168                 assert isinstance(self.action_space, supported_action_spaces), (
    169                     f"The algorithm only supports {supported_action_spaces} as action spaces "
    170                     f"but {self.action_space} was provided"

AssertionError: The algorithm only supports  as action spaces but Dict(robot_0:Box(-2.175, 2.175, (3,), float64)) was provided

meaning our action space needs flattening too! After some digging on the internet I came across a gist for doing that (a rough sketch of such a wrapper is at the end of this comment). Using FlattenAction, the action space was converted from Dict(robot_0:Box(-2.175, 2.175, (3,), float64)) to Box(-2.175, 2.175, (3,), float64) (though I don't know whether this is right or not). After doing this, we got a new error from the RL algorithm:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_1366/2669538420.py in <module>
      1 # Define the TD3 agent and train it on the environment
      2 model = TD3("MlpPolicy", env, verbose=1)
----> 3 model.learn(total_timesteps=100000)

~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/td3/td3.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)
    212     ) -> SelfTD3:
    213 
--> 214         return super().learn(
    215             total_timesteps=total_timesteps,
    216             callback=callback,

~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/off_policy_algorithm.py in learn(self, total_timesteps, callback, log_interval, tb_log_name, reset_num_timesteps, progress_bar)
    332 
    333         while self.num_timesteps < total_timesteps:
--> 334             rollout = self.collect_rollouts(
    335                 self.env,
    336                 train_freq=self.train_freq,

~/anaconda3/envs/testarea/lib/python3.9/site-packages/stable_baselines3/common/off_policy_algorithm.py in collect_rollouts(self, env, callback, train_freq, replay_buffer, action_noise, learning_starts, log_interval)
    565 
    566             # Rescale and perform action
--> 567             new_obs, rewards, dones, infos = env.step(actions)
...
--> 168             action_robot = action[action_id : action_id + robot.n()]
    169             robot.apply_action(action_robot, self.dt())
    170             action_id += robot.n()

TypeError: unhashable type: 'slice'

which could be from my implementation of the RL part. I don't know enough about Gym environments yet and will read more on this subject.
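For reference, a minimal sketch of the kind of FlattenAction wrapper mentioned above, built on gym's space-flattening utilities. This is only an approximation of what the linked gist does, and it assumes gym.spaces.flatten_space and gym.spaces.unflatten are available in your gym version:

import gym
import gym.spaces


class FlattenAction(gym.ActionWrapper):
    """Flatten a Dict action space into a single flat Box."""

    def __init__(self, env):
        super().__init__(env)
        # Flat Box equivalent of the original (possibly nested) action space.
        self.action_space = gym.spaces.flatten_space(env.action_space)

    def action(self, action):
        # Map the flat array back to the structure the wrapped env expects,
        # e.g. {"robot_0": array([...])} for a Dict action space.
        return gym.spaces.unflatten(self.env.action_space, action)


# Usage: env = FlattenAction(env)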

maxspahn commented 1 year ago

@behradkhadem Thanks for the quick reply!

Indeed, the problem in urdfenvs is that the action space is a gym.spaces.Dict while the actions are plain arrays. I'll fix that in a second.

maxspahn commented 1 year ago

@behradkhadem I corrected the action spaces and added a check for the actions. I'll try it out myself with stable-baselines in a second.

maxspahn commented 1 year ago

@behradkhadem With these updates, it is possible to run an RL algorithm. Let me know if it also works for you. Be aware that you need to implement some sort of reward function yourself.

By "it is possible to run an RL algorithm", I mean it doesn't crash; whether it is doing anything meaningful, I don't know. :D
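To make the "reward function yourself" part concrete, one option is a small wrapper that overrides the reward in step() before handing the environment to stable-baselines3. Something along these lines would be a starting point; the goal position, the observation indexing and the class name below are just placeholders, not part of urdfenvs:

import gym
import numpy as np
from stable_baselines3 import TD3


class DistanceReward(gym.Wrapper):
    """Hypothetical reward: negative distance of the point robot to a fixed goal."""

    def __init__(self, env, goal=(2.0, 2.0, 0.0)):
        super().__init__(env)
        self._goal = np.array(goal)

    def step(self, action):
        ob, _, done, info = self.env.step(action)
        # Assumes a flattened observation whose first three entries are the
        # robot's position; check the layout of your observation space.
        reward = -float(np.linalg.norm(ob[:3] - self._goal))
        return ob, reward, done, info


# Assuming `env` is the flattened environment from the setup sketch above:
# env = DistanceReward(env)
# model = TD3("MlpPolicy", env, verbose=1)
# model.learn(total_timesteps=100000)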

behradkhadem commented 1 year ago

> @behradkhadem With these updates, it is possible to run an RL algorithm. Let me know if it also works for you. Be aware that you need to implement some sort of reward function yourself.
>
> By "it is possible to run an RL algorithm", I mean it doesn't crash; whether it is doing anything meaningful, I don't know. :D

Well, that's good enough for me! I should deepen my understanding of URDF robots and Gym environments first.

Thanks a lot @maxspahn, you helped a lot and I really don't know how to thank you! If there is anything I can do, I would be more than happy to help.

maxspahn commented 1 year ago

@behradkhadem I am glad I was able to help you.

If you want to thank me, just keep me informed if something related to urdfenvs goes wrong, so I can improve it. Also, just leave a little star here :star: