haosulab / ManiSkill

SAPIEN Manipulation Skill Framework, a GPU parallelized robotics simulator and benchmark
https://maniskill.ai/
Apache License 2.0

How to add visual distractions? #685

Open AlexandreBrown opened 2 weeks ago

AlexandreBrown commented 2 weeks ago

Hi,
I would like to change the textures (randomly or via a PNG file) of the various objects in the scene (e.g., before every new episode).
I managed to change the base_color, but when I change the texture, nothing happens.
Any pointers are appreciated.
The objective is to change textures, camera FOV, and lighting, and if possible add new objects, to evaluate methods for visual generalization.

import sapien
from sapien.pysapien.render import RenderTexture2D

base_color_texture = RenderTexture2D(
    "/home/user/Downloads/cliff_side_4k.blend/textures/cliff_side_diff_4k.jpg"
)
for actor_name in self.base_env.unwrapped.scene.actors.keys():
    for part in self.base_env.unwrapped.scene.actors[actor_name]._objs:
        for triangle in (
            part.find_component_by_type(sapien.render.RenderBodyComponent)
            .render_shapes[0]
            .parts
        ):
            # triangle.material.set_base_color([0.8, 0.1, 0.1, 1.0])
            triangle.material.set_base_color_texture(base_color_texture)

obs_dict, _ = self.base_env.reset()
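
A variation I am also considering, in case a previously set base_color factor tints or masks the texture (the white base color here is just a guess on my part):

for actor in self.base_env.unwrapped.scene.actors.values():
    for part in actor._objs:
        body = part.find_component_by_type(sapien.render.RenderBodyComponent)
        for triangle in body.render_shapes[0].parts:
            # Guess: reset the color factor to white so it does not tint/mask the texture.
            triangle.material.set_base_color([1.0, 1.0, 1.0, 1.0])
            triangle.material.set_base_color_texture(base_color_texture)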

PS: I do not know much about textures; I downloaded a sample file from https://polyhaven.com/a/cliff_side.

StoneT2000 commented 2 weeks ago

Is this your own custom environment? What environment is this exactly? And are you planning to use the GPU sim + rendering?

AlexandreBrown commented 2 weeks ago

Hi @StoneT2000, I am using SimplerEnv and TorchRL.
The code is a TorchRL env that wraps a SimplerEnv environment so it can be used through TorchRL's unified interface.
TorchRL env wrapper:

import torch
import numpy as np
from tensordict import TensorDict, TensorDictBase
from torchrl.envs import EnvBase
from torchrl.data import Composite, Unbounded, Bounded
from sapien.pysapien.render import RenderTexture2D
import sapien

class SimplerEnvWrapper(EnvBase):
    """TorchRL EnvBase wrapper around a ManiSkill3 / SimplerEnv gym environment."""

    def __init__(self, base_env, **kwargs):
        super().__init__(**kwargs)
        self._device = torch.device(kwargs.get("device", "cpu"))
        self.base_env = base_env
        self.numpy_to_torch_dtype_dict = {
            bool: torch.bool,
            np.uint8: torch.uint8,
            np.int8: torch.int8,
            np.int16: torch.int16,
            np.int32: torch.int32,
            np.int64: torch.int64,
            np.float16: torch.float16,
            np.float32: torch.float32,
            np.float64: torch.float64,
        }
        self._make_specs()

    def _make_specs(self):
        raw_observation_spec = self.get_image_from_maniskill3_obs_dict(
            self.base_env, self.base_env.observation_space.spaces
        )
        height = raw_observation_spec.shape[-3]
        width = raw_observation_spec.shape[-2]
        self.channels = raw_observation_spec.shape[-1]
        shape = (height, width, self.channels)
        observation_spec = {
            "pixels": Bounded(
                low=torch.from_numpy(
                    raw_observation_spec.low[0, :, :, : self.channels]
                ).to(self._device),
                high=torch.from_numpy(
                    raw_observation_spec.high[0, :, :, : self.channels]
                ).to(self._device),
                shape=shape,
                dtype=torch.uint8,
                device=self._device,
            )
        }
        self.observation_spec = Composite(**observation_spec)

        action_space = self.base_env.action_space
        self.action_spec = Bounded(
            low=torch.from_numpy(action_space.low).to(self._device),
            high=torch.from_numpy(action_space.high).to(self._device),
            shape=action_space.shape,
            dtype=self.numpy_to_torch_dtype_dict[action_space.dtype.type],
            device=self._device,
        )

        self.reward_spec = Unbounded(
            shape=(1,), dtype=torch.float32, device=self._device
        )
        self.done_spec = Unbounded(shape=(1,), dtype=torch.bool, device=self._device)

    def get_image_from_maniskill3_obs_dict(self, env, obs, camera_name=None):
        if camera_name is None:
            if "google_robot" in env.unwrapped.robot_uids.uid:
                camera_name = "overhead_camera"
            elif "widowx" in env.unwrapped.robot_uids.uid:
                camera_name = "3rd_view_camera"
            else:
                raise NotImplementedError()
        img = obs["sensor_data"][camera_name]["rgb"]
        return img

    def _reset(self, tensordict: TensorDictBase = None):
        # Try to apply a new base color texture to every actor before resetting.
        base_color_texture = RenderTexture2D(
            "/home/user/Downloads/cliff_side_4k.blend/textures/cliff_side_diff_4k.jpg"
        )
        for actor_name in self.base_env.unwrapped.scene.actors.keys():
            for part in self.base_env.unwrapped.scene.actors[actor_name]._objs:
                for triangle in (
                    part.find_component_by_type(sapien.render.RenderBodyComponent)
                    .render_shapes[0]
                    .parts
                ):
                    # triangle.material.set_base_color([0.8, 0.1, 0.1, 1.0])
                    triangle.material.set_base_color_texture(base_color_texture)

        obs_dict, _ = self.base_env.reset()

        rgb_obs = (
            self.get_image_from_maniskill3_obs_dict(self.base_env, obs_dict)[
                0, :, :, : self.channels
            ]
            .to(torch.uint8)
            .squeeze(0)
        )
        text_instruction = self.base_env.unwrapped.get_language_instruction()
        done = torch.tensor(False, dtype=torch.bool, device=self._device)
        terminated = torch.tensor(False, dtype=torch.bool, device=self._device)

        return TensorDict(
            {
                "pixels": rgb_obs,
                "text_instruction": text_instruction,
                "done": done,
                "terminated": terminated,
            },
            batch_size=[],
            device=self._device,
        )

    def _step(self, tensordict: TensorDictBase):
        action = tensordict["action"]
        # gymnasium step returns (obs, reward, terminated, truncated, info);
        # terminated is used as the done signal here.
        obs_dict, reward, done, _, info = self.base_env.step(action)

        rgb_obs = (
            self.get_image_from_maniskill3_obs_dict(self.base_env, obs_dict)[
                0, :, :, : self.channels
            ]
            .to(torch.uint8)
            .squeeze(0)
        )
        text_instruction = self.base_env.unwrapped.get_language_instruction()

        return TensorDict(
            {
                "pixels": rgb_obs,
                "text_instruction": text_instruction,
                "reward": reward,
                "done": done,
            },
            batch_size=[],
            device=self._device,
        )

    def _set_seed(self, seed: int):
        self.base_env.seed(seed)

PS: I am not sure I am doing this right; should I apply the changes before the environment reset?
PS #2: Are there specific file requirements for the texture file? Do you have a test sample I can use as well? Or does any texture from publicly available texture websites work?

Where base_env is obtained using the ManiSkill3 gym integration:

import gymnasium as gym

from mani_skill.envs.sapien_env import BaseEnv
...

env_name = cfg["env"]["name"]

sensor_configs = dict()
sensor_configs["shader_pack"] = "default"

base_env: BaseEnv = gym.make(
    env_name,
    max_episode_steps=max_episode_steps,
    obs_mode="rgb+segmentation",
    num_envs=1,
    sensor_configs=sensor_configs,
    render_mode="rgb_array",
    sim_backend=cfg["env"]["device"],
)

I am testing the following existing environments from ManiSkill3 (using SimplerEnv):

My goal is to leverage the flexibility of ManiSkill3/SimplerEnv and be able to:

The more I can achieve from this list, the better.
Note that I am not familiar with ManiSkill3, so I have not tried to create anything custom yet.

Ideally I would like to apply these randomizations at the start of each episode.
I assume a video overlay would require a per-step update (if we treat a video as a sequence of frames where at each step we update the overlaid frame; rough sketch below).
I understand that GPU vectorization probably makes these use cases much harder, in which case I would prefer to go for the low-hanging fruit first (e.g., randomizations that are only applied at the start of the episode, if that's easier).
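
To make the per-step idea concrete, here is a rough, untested sketch of what I imagine inside the wrapper's _step; self.overlay_frames and self.t are hypothetical attributes holding pre-loaded video frames and a frame counter, and a proper overlay would presumably use the segmentation mask rather than plain alpha blending:

# Hypothetical per-step overlay inside SimplerEnvWrapper._step:
# self.overlay_frames: assumed (T, H, W, 3) uint8 tensor of pre-loaded video frames
# self.t: assumed frame counter, reset to 0 in _reset
frame = self.overlay_frames[self.t % len(self.overlay_frames)].to(rgb_obs.device)
alpha = 0.5  # blend strength
rgb_obs = (alpha * frame.float() + (1.0 - alpha) * rgb_obs.float()).to(torch.uint8)
self.t += 1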

Yes, I plan on using the GPU to improve simulation performance (FPS). I assume that sim_backend='cuda' is what needs to be set for this, but please feel free to tell me more about it. GPU vectorization is a strong motivation for me to use ManiSkill3 with SimplerEnv (via their maniskill3 branch) instead of the existing ManiSkill2/SimplerEnv.

StoneT2000 commented 1 week ago

Thanks for the extensive notes. All of what you suggest is possible, but it depends a little on what models you actually want to evaluate.

There are two ways forward. The easiest option actually is to build a new table-top environment (take one of the templates or e.g. the pick cube environment) and add the parallelizations / randomizations you want for a custom environment. Only choose this option if you don't need to verify real2sim alignment and just simply want a controllable robot and objects.
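
As a rough, untested sketch of what that route looks like (loosely following the structure of the PickCube task; exact import paths and signatures may differ slightly between ManiSkill3 versions, and success/reward logic is omitted):

import numpy as np
import torch

from mani_skill.envs.sapien_env import BaseEnv
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
from mani_skill.utils.building import actors
from mani_skill.utils.registration import register_env
from mani_skill.utils.scene_builder.table import TableSceneBuilder
from mani_skill.utils.structs.pose import Pose


@register_env("RandomizedPickCube-v0", max_episode_steps=50)
class RandomizedPickCubeEnv(BaseEnv):
    SUPPORTED_ROBOTS = ["panda"]

    def __init__(self, *args, robot_uids="panda", **kwargs):
        super().__init__(*args, robot_uids=robot_uids, **kwargs)

    @property
    def _default_sensor_configs(self):
        pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])
        return [CameraConfig("base_camera", pose, 128, 128, np.pi / 2, 0.01, 100)]

    def _load_scene(self, options: dict):
        # Static assets are built once and batched across parallel envs.
        self.table_scene = TableSceneBuilder(self)
        self.table_scene.build()
        self.cube = actors.build_cube(
            self.scene, half_size=0.02, color=[1, 0, 0, 1], name="cube"
        )

    def _load_lighting(self, options: dict):
        # Randomized lighting, applied when the scene is (re)configured.
        self.scene.set_ambient_light(torch.rand(3).tolist())
        self.scene.add_directional_light([0, 0, -1], [1, 1, 1])

    def _initialize_episode(self, env_idx: torch.Tensor, options: dict):
        # Per-episode randomization: re-sample the cube position on the table
        # for every parallel environment being reset.
        with torch.device(self.device):
            b = len(env_idx)
            self.table_scene.initialize(env_idx)
            xyz = torch.zeros((b, 3))
            xyz[:, :2] = torch.rand((b, 2)) * 0.2 - 0.1
            xyz[:, 2] = 0.02
            self.cube.set_pose(Pose.create_from_pq(p=xyz))

    def evaluate(self):
        # Placeholder; fill in task success criteria (and reward terms if you need dense rewards).
        return dict()

Note that lighting set in _load_lighting is applied at (re)configuration time, so re-randomizing it every episode may require reconfigure-on-reset (e.g. env.reset(options=dict(reconfigure=True))), whereas the pose randomization in _initialize_episode runs on every reset.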

Alternatively you can copy the code for the bridge dataset digital twins and modify the attributes in there to change the default RGB overlays, swap the overlay at each timestep when using video, modify the scene loader to add distractor objects etc.

https://github.com/haosulab/ManiSkill/blob/56dcd4cf1b1f04b7e7dfd82ec625c8428ce1f801/mani_skill/envs/tasks/digital_twins/bridge_dataset_eval (copy both).

Let me know which option you think is needed and I can suggest the relevant docs/code to do what you want.

AlexandreBrown commented 1 week ago

Thanks a lot @StoneT2000 for the amazing reply!

do you plan to train a model and evaluate it? Or evaluate off the shelf models?

I plan on training and evaluating models (training from scratch).

How realistic do you want the environment to look? Are you planning to try vision based sim2real or just do real2sim evaluation of a model trained on real world data?

Are you planning to try vision based sim2real

Yes.

I want to train in simulation using an environment that is as visually realistic as possible, but if this hinders training time I'm open to a hybrid approach where the training environments are still realistic but slightly less so (e.g., without ray tracing) to boost collection speed, and the visual generalization benchmark can be more realistic and slower.
Basically I will need to train agents from scratch in simulation, and once trained, I will evaluate the approach using aggressive visual domain randomization (aggressive sim2real visual changes like random camera FOV, random object colors, random textures, random lighting, and random objects if feasible). The model will only depend on image observations (RGB pixels) and will be trained in an online RL fashion.
I am focused on an approach that shows generalization over visual distractions, so the more visual distractions I can showcase, the better.
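
To illustrate the hybrid setup I have in mind, a sketch assuming the shader_pack option I am already passing also accepts the ray-tracing packs ("rt" / "rt-fast"); I have not verified this, and num_train_envs is just a placeholder:

import gymnasium as gym

# Fast rasterized rendering for online RL data collection.
train_env = gym.make(
    env_name,
    obs_mode="rgb+segmentation",
    num_envs=num_train_envs,
    sensor_configs=dict(shader_pack="default"),
    sim_backend="gpu",
)

# Slower, more photorealistic rendering for the visual generalization benchmark
# (assumption: the "rt-fast" ray-tracing shader pack is available in my install).
eval_env = gym.make(
    env_name,
    obs_mode="rgb+segmentation",
    num_envs=1,
    sensor_configs=dict(shader_pack="rt-fast"),
    sim_backend="gpu",
)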

The easiest option actually is to build a new table-top environment (take one of the templates or e.g. the pick cube environment) and add the parallelizations / randomizations you want for a custom environment.

This sounds interesting, as I want not just one environment but at least 2-3 that show increasing levels of difficulty (e.g., easy to hard).
Is it easier to create an environment from scratch or to start from an existing one? For context: I have very little experience in environment design. Where can I find a template and documentation for this? When you say "add the parallelizations", what do you mean exactly?

AlexandreBrown commented 4 days ago

@StoneT2000 After looking at the docs for ManiSkill3, I'm tempted to use ManiSkill3 directly instead of SimplerEnv. Would it be feasible to use ManiSkill3 directly while still being able to add the visual distractions?

Any help is appreciated!