qgallouedec / panda-gym

Set of robotic environments based on PyBullet physics engine and gymnasium.
MIT License

Discussion about the future work #82

Closed: zichunxx closed this issue 7 months ago

zichunxx commented 7 months ago

Hi!

Thanks for your excellent work which has benefited me a lot.

In your paper, other input modalities for panda-gym, such as RGB, depth, force sensor, and tactile sensor input, were listed as future work.

For now, panda-gym has evolved to v3. Are these other modalities still planned?

Thanks!

qgallouedec commented 7 months ago

Hi, thanks for your question.

I looked into these extensions after the paper. My conclusion is that PyBullet doesn't let you do everything image-related in a reasonable time. When you render at every step, it takes a long time, even with the tiny renderer (you drop from 760 FPS to 4!):

import time

import gymnasium as gym
from gymnasium.wrappers import PixelObservationWrapper

import panda_gym

env = gym.make("PandaPickAndPlace-v3", renderer="Tiny")
env = PixelObservationWrapper(env, pixels_only=True)
env.reset()
start = time.time()
for i in range(100):
    env.step(env.action_space.sample())
stop = time.time()
print("FPS: {:.2f}".format(100 / (stop - start)))

To do this at a reasonable speed, you'd have to migrate to a more powerful rendering engine, like MuJoCo, and that's not planned.

As for the force sensor, I haven't implemented it, mainly because I think it's a matter for more specific studies, and users are free to fork and adapt the code to their needs. For the record, PyBullet does allow it. Here's a hacky example that prints the force at a finger joint:

import gymnasium as gym

import panda_gym

class PrintForceWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        # Look up the PyBullet body id of the robot and pick the first joint listed
        # in joint_indices, then enable the joint reaction force/torque sensor on it.
        self.body_id = self.env.sim._bodies_idx[self.env.robot.body_name]
        self.joint_idx = self.env.robot.joint_indices[0]
        self.env.sim.physics_client.enableJointForceTorqueSensor(self.body_id, self.joint_idx)

    def step(self, action):
        r = super().step(action)
        # getJointState returns (position, velocity, reaction forces, applied torque);
        # the first three reaction components are the force (Fx, Fy, Fz).
        joint_state = self.env.sim.physics_client.getJointState(self.body_id, self.joint_idx)
        force = joint_state[2][:3]
        print("Force: ", force)
        return r

env = gym.make("PandaPickAndPlace-v3")
env = PrintForceWrapper(env)

env.reset()
for _ in range(100):
    env.step(env.action_space.sample())

I'm still open to contributions, though. If the majority thinks it's an interesting feature to add, then we can add it.

zichunxx commented 7 months ago

Thanks for your detailed explanation.

Since most of my previous work was done with PyBullet, I guess I'll have to stick with this physics engine.

Can the rendering speed penalty be compensated for in other ways, for example with vectorized environments and multiprocessing?
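
For example, I have something like this in mind to check the throughput (just a rough sketch using gymnasium's AsyncVectorEnv, I'm not sure it's the right approach):

import time

import gymnasium as gym
from gymnasium.vector import AsyncVectorEnv
from gymnasium.wrappers import PixelObservationWrapper

import panda_gym

def make_env():
    env = gym.make("PandaPickAndPlace-v3", renderer="Tiny", render_width=84, render_height=84)
    return PixelObservationWrapper(env, pixels_only=True)

if __name__ == "__main__":
    n_envs = 8
    envs = AsyncVectorEnv([make_env for _ in range(n_envs)])  # one process per environment
    envs.reset()
    start = time.time()
    for _ in range(100):
        envs.step(envs.action_space.sample())  # batched random actions
    stop = time.time()
    print("FPS: {:.2f}".format(100 * n_envs / (stop - start)))
    envs.close()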

In my opinion, the most complex task in panda-gym is pick-and-place, and the training cost of this task with vision input is tolerable compared to tasks that have to be trained in the real world.

I'm not sure if I'm right and would appreciate your guidance.

Thanks.

qgallouedec commented 7 months ago

Sorry for the late reply.

You're probably right. I do know that reducing the size of the rendered observation makes the simulation faster:

import time

import gymnasium as gym
from gymnasium.wrappers import PixelObservationWrapper

import panda_gym

env = gym.make("PandaPickAndPlace-v3", renderer="OpenGL", render_width=84, render_height=84)
env = PixelObservationWrapper(env, pixels_only=True)
env.reset()
start = time.time()
for i in range(100):
    env.step(env.action_space.sample())
stop = time.time()
print("FPS: {:.2f}".format(100 / (stop - start)))

On a single process with 84×84 rendering I reach 70 FPS. In the best case, distributing over 16 workers gives around 4M frames/hour. The resulting images look like this:

[attached: example 84×84 rendered observation]

That said, you can play with the camera zoom and orientation for rendering; see the documentation.
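
For example, something along these lines (the render_* keyword arguments are the ones listed in the rendering section of the documentation, so double-check the exact names and defaults there):

import gymnasium as gym

import panda_gym

# Move the camera closer and change its angle to get a more useful 84x84 image.
env = gym.make(
    "PandaPickAndPlace-v3",
    renderer="OpenGL",
    render_width=84,
    render_height=84,
    render_target_position=[0.0, 0.0, 0.0],  # point the camera looks at
    render_distance=0.6,                     # "zoom": distance from the target
    render_yaw=45,                           # rotation around the vertical axis
    render_pitch=-30,                        # camera elevation angle
)
env.reset()
image = env.render()  # rgb array with the chosen camera settings
print(image.shape)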

zichunxx commented 7 months ago

Thanks for your generous guidance.

Is there any recommended document or repo to reference for the vision-based modality? I'm new to this area and don't know how to define the desired goal and observation space effectively so that this kind of environment converges.

Also, I hope this modality can be considered for a future update. Thanks!

qgallouedec commented 7 months ago

Not to my knowledge. If you find any references, please share them here. You can do basic training with a multimodal observation (the image plus the desired goal as the x, y, z target position) using sb3 like this:

import gymnasium as gym
from gymnasium.wrappers import FilterObservation, PixelObservationWrapper
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

import panda_gym

class ObsToImageWrapper(gym.Wrapper):
    def __init__(self, env):
        # Add a "pixels" key with the rendered image, then keep only the image
        # and the desired goal (x, y, z target position) as the observation.
        env = PixelObservationWrapper(env, pixels_only=False)
        env = FilterObservation(env, ["pixels", "desired_goal"])
        super().__init__(env)

def main():
    # 16 environments, one process each; the tiny renderer is required here
    # because PyBullet's OpenGL renderer can't be used in parallel workers.
    env = make_vec_env(
        "PandaReach-v3",
        n_envs=16,
        wrapper_class=ObsToImageWrapper,
        vec_env_cls=SubprocVecEnv,
        env_kwargs={"renderer": "Tiny", "render_width": 84, "render_height": 84},
    )

    model = PPO("MultiInputPolicy", env, verbose=1)
    model.learn(total_timesteps=1000000)

if __name__ == "__main__":
    main()

Note that you must use the tiny renderer because pybullet doesn't support multithreading with OpenGL.

As I suspected, the training is quite slow, due to the simulation (around 100 FPS with 16 workers, far from the best case mentioned above).

zichunxx commented 7 months ago

For the image input, the desired goal can also be represented as the vector of x, y, and z coordinates of the target. The main difference between the vision-based and other environments is that the observation space is replaced with a rendered image instead of object states obtained directly from the physics engine. So, is the rest of the information, such as the position and velocity of the end-effector, still necessary? If necessary, should the rendered image be compressed in advance via CNN to reduce dimensionality? Thanks.

qgallouedec commented 7 months ago

> For the image input, the desired goal can also be represented as the vector of x, y, and z coordinates of the target.

I'm not sure I understand. In the provided code, the agent's input is actually composed of (1) the image and (2) the desired goal as (x, y, z).

> So, is the rest of the information, such as the position and velocity of the end-effector, still necessary?

Again, it depends on what you want to do. In the provided code, these values aren't part of the input. In a scenario closer to reality, you could include the robot-related data in the observation, but not the object-related data, and let the agent infer the object's position from the image.
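
Something like this, roughly (an untested sketch: it assumes that env.unwrapped.robot.get_obs() returns only the robot-related values, which is worth double-checking):

import gymnasium as gym
import numpy as np
from gymnasium.wrappers import PixelObservationWrapper

import panda_gym

class RobotStateAndPixelsWrapper(gym.ObservationWrapper):
    """Keep the image, the desired goal, and the robot proprioception, and drop
    the object/task part of the observation (the agent must read it off the image)."""

    def __init__(self, env):
        env = PixelObservationWrapper(env, pixels_only=False)
        super().__init__(env)
        robot_obs = env.unwrapped.robot.get_obs().astype(np.float32)
        self.observation_space = gym.spaces.Dict(
            {
                "pixels": env.observation_space["pixels"],
                "desired_goal": env.observation_space["desired_goal"],
                "robot_state": gym.spaces.Box(-np.inf, np.inf, shape=robot_obs.shape, dtype=np.float32),
            }
        )

    def observation(self, observation):
        return {
            "pixels": observation["pixels"],
            "desired_goal": observation["desired_goal"],
            "robot_state": self.env.unwrapped.robot.get_obs().astype(np.float32),
        }

env = gym.make("PandaPickAndPlace-v3", renderer="Tiny", render_width=84, render_height=84)
env = RobotStateAndPixelsWrapper(env)
obs, info = env.reset()
print(obs["robot_state"], obs["desired_goal"], obs["pixels"].shape)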

> If necessary, should the rendered image be compressed in advance via CNN to reduce dimensionality? Thanks.

It depends on what you want to prove. For example, this work focuses on decoupling feature learning from policy learning: https://openreview.net/forum?id=Hkl-di09FQ
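
For what it's worth, in the sb3 example above, MultiInputPolicy should already pass the "pixels" key through a small CNN (the CombinedExtractor), so you don't have to compress the image yourself. If you want to control the size of the image embedding, something like this should work (a sketch, check the sb3 docs for the exact argument names):

import gymnasium as gym
from gymnasium.wrappers import FilterObservation, PixelObservationWrapper
from stable_baselines3 import PPO

import panda_gym

# Same observation setup as the training example above: image + desired goal.
env = gym.make("PandaReach-v3", renderer="Tiny", render_width=84, render_height=84)
env = PixelObservationWrapper(env, pixels_only=False)
env = FilterObservation(env, ["pixels", "desired_goal"])

# MultiInputPolicy builds a CombinedExtractor: a CNN for the image key and a
# flatten layer for the vector keys; cnn_output_dim sets the image embedding size.
model = PPO(
    "MultiInputPolicy",
    env,
    policy_kwargs=dict(features_extractor_kwargs=dict(cnn_output_dim=128)),
    verbose=1,
)
print(model.policy.features_extractor)  # shows the CNN applied to the "pixels" key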

zichunxx commented 7 months ago

Sorry, I wasn't aware that PixelObservationWrapper and FilterObservation override the observation space like that. I've spent some time reading through both wrappers and some related examples, and I think I understand what you mean. I will try to train different tasks with different modalities, and if I'm successful I'll post the results here. Thanks again for your patient help.

qgallouedec commented 7 months ago

I'm closing the issue. Feel free to share your results here even though it's closed.