Noisy or unreadable images when rendering #310

Closed vmoens closed 2 years ago

vmoens commented 2 years ago


In some instances (embodied algos in my case) the new mujoco rendering gives unreadable images after a little while, e.g. here's a grid of 3 views of the same body: image

This occurs after a little while, i.e. the first images rendered are perfectly fine. I tried to narrow it down to a minimal reproducible example but I couldn't find a way to do it (sorry about that!). When using the old bindings (mujoco-py and such) this issue disappears.

I'm using MUJOCO_GL=egl and have installed glew in my conda (working on a cluster where I have no sudo access).

I'm working with either G100 or A100 GPUs, and using them for training and rendering. Also worth mentioning: I'm running a bunch of envs in parallel (multiprocessing, not multithreading) for fast data collection.

Here is my conda env

ikostrikov commented 2 years ago

I have a similar problem. Sometimes it renders images correctly, but sometimes it renders only the background image (see the video). This issue is non-deterministic, and the video might be rendered correctly or incorrectly for the same seed.

OS: Ubuntu 20.04, MuJoCo version: 2.2.0. I use MUJOCO_GL=egl as well.

saran-t commented 2 years ago

Can you please try running with DISABLE_RENDER_THREAD_OFFLOADING=1 (environment variable)?

vmoens commented 2 years ago

I still get the same behaviour with DISABLE_RENDER_THREAD_OFFLOADING=1 :/

ikostrikov commented 2 years ago

@saran-t DISABLE_RENDER_THREAD_OFFLOADING=1 doesn't resolve the problem for me either.

saran-t commented 2 years ago

Can I please have a minimal repro code that I can run on my side?

saran-t commented 2 years ago

@vmoens @ikostrikov Gentle nudge on the request for minimal repro above. We'd like to try to get to the bottom of this.

vmoens commented 2 years ago

Hi @saran-t, I've been trying hard to reproduce this but it seems to only happen once the code reaches a certain level of complexity (e.g. GPUs are used for both training and rendering, etc.). Would it be ok if I point you to a specific commit on torchrl and give you the precise conda env setting, the machine config etc. so that you can reproduce it? It's going to be a bit messy but at least it's something!

saran-t commented 2 years ago

If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.

Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?

vmoens commented 2 years ago

Here's one 0e88eac27f1d01bfa1d260d52c051ab5fe514859

Here's the command line

conda create -n mbrl_dmcontrol3 python=3.10
conda activate mbrl_dmcontrol3
pip install dm_control
module load cuda/11.6 nccl/2.12.7-cuda.11.6 nccl_efa/1.15.1-nccl.2.12.7-cuda.11.6
pip3 install torch torchvision torchaudio --extra-index-url
pip install functorch
pip install hydra-core
# from torchrl root:
python develop
cd examples/dreamer/
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python frame_skip=2 init_env_steps=10000 logger=csv

Setting CHECK_IMAGES=1 makes sure an error is raised as soon as an image is more than half black or white (i.e. the render has collapsed).
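
For readers who want a standalone version of such a check, here is a minimal sketch. It is not the torchrl implementation (the thread only shows the failing assertion, not the check itself); the function name `render_has_collapsed` and the `max_fraction` parameter are hypothetical, and the frame is assumed to be a uint8 array of shape (height, width, 3), as suggested by the shape reported in the traceback.

```python
import numpy as np


def render_has_collapsed(frame, max_fraction=0.5):
    """Return True if more than `max_fraction` of the pixels are pure black,
    or more than `max_fraction` of the pixels are pure white.

    Assumes `frame` is a uint8 image of shape (height, width, 3).
    """
    frame = np.asarray(frame)
    if frame.ndim != 3 or frame.dtype != np.uint8:
        raise ValueError("expected a uint8 array of shape (H, W, C)")
    # A pixel counts as black (white) only if all of its channels are 0 (255).
    black_fraction = np.all(frame == 0, axis=-1).mean()
    white_fraction = np.all(frame == 255, axis=-1).mean()
    return bool(max(black_fraction, white_fraction) > max_fraction)
```

Calling this on every rendered frame and raising when it returns True gives an early failure at the first collapsed image instead of silently training on bad data.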

You should see an error like this during the first test rollout:

Traceback (most recent call last):
  File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/", line 411, in main
    call_record(logger, record, collected_frames, sampled_tensordict_save, stats, model_based_env, actor_model, cfg)
  File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/", line 132, in call_record
    td_record = record(None)
  File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/trainers/", line 907, in __call__
    td_record = self.recorder.rollout(
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/", line 503, in rollout
    tensordict = self.reset()
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/", line 346, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/transforms/", line 403, in _reset
    out_tensordict = self.base_env.reset(execute_step=False, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/", line 346, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/", line 122, in _reset
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/", line 136, in _read_obs
    observations = self.observation_spec.encode(observations)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/data/", line 1107, in encode
    out[key] = self[key].encode(item)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/data/", line 243, in encode
    assert v < 0.5, f"numpy: {val.shape}"
AssertionError: numpy: (240, 320, 3)

saran-t commented 2 years ago

Please point me to where the rendering context is set up and where the multiprocessing occurs.

vmoens commented 2 years ago

> If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.

got it!

> Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?

Let me rephrase: in one library where we used to rely on mujoco-py but switched to the new mujoco bindings, we have seen this issue appear. I ran the following experiment using an old version of dm_control with torchrl and the issue disappears. Here's the setup:

torchrl commit: 056699bd214937400c5cc7722669e7819a93bc1e


conda create -n mbrl_olddmc python=3.9
conda activate mbrl_olddmc
pip install mujoco_py
pip install dm-control==0.0.403778684  # works with mujoco 210
pip3 install torch torchvision torchaudio --extra-index-url
pip install functorch
pip install hydra-core
cd path/to/torchrl
python develop

conda env config vars set MJLIB_PATH=/data/home/vmoens/.mujoco/mujoco210/bin/ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/home/vmoens/.mujoco/mujoco210/bin MUJOCO_GL=egl PYOPENGL_PLATFORM=egl MUJOCO_PY_MUJOCO_PATH=/data/home/vmoens/.mujo
conda deactivate && conda activate mbrl_olddmc


EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python frame_skip=2 init_env_steps=10000 logger=csv env_per_collector=1 num_workers=1


For rendering, we use the dm_control pixels wrapper. When executing a step we create a torch.Tensor from the numpy array and send it to the device if needed.

In the example script I gave above, we first run a random rollout in the environment to get statistics about the observation. To do that, we have a function that creates an environment instance, runs the rollout and calculates the stats. Then we run another random rollout to get data to pass to the model (to initialize it): we have lazy layers that take the right shape once they see real data. In this example, that's where the issue happens (not even during training).

saran-t commented 2 years ago

I'm having trouble running python frame_skip=2 init_env_steps=10000 logger=csv on my machine.

Could you please make a repro script that just runs the dm_control environment without any agent in the loop, preferably without any dependency on Torch?

Note also that I don't have access to a SLURM cluster and I need to repro this on a local machine.
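
A Torch-free loop of the kind requested here might look like the sketch below. It is only an illustration under stated assumptions, not the script used in this thread: `make_pixel_env` and `count_bad_frames` are hypothetical helper names, the "cheetah"/"run" task is an arbitrary choice, and the 240x320 image size is taken from the shape in the traceback above. The default check only flags completely uniform frames; a stricter predicate can be passed as `is_bad`.

```python
import numpy as np


def make_pixel_env(domain="cheetah", task="run", height=240, width=320):
    """Build a dm_control env whose observations hold rendered frames.

    dm_control is imported lazily so the rest of the file works without it.
    MUJOCO_GL (e.g. egl) should be set in the environment before this is called.
    """
    from dm_control import suite
    from dm_control.suite.wrappers import pixels

    env = suite.load(domain_name=domain, task_name=task)
    return pixels.Wrapper(
        env,
        pixels_only=True,
        render_kwargs={"height": height, "width": width, "camera_id": 0},
    )


def count_bad_frames(env, num_steps=1000, seed=0, is_bad=None):
    """Step `env` with uniform random actions and return the step indices
    at which the rendered frame is flagged by `is_bad`."""
    if is_bad is None:
        # Default (an assumption): flag frames where every value is identical.
        is_bad = lambda frame: frame.min() == frame.max()
    rng = np.random.default_rng(seed)
    spec = env.action_spec()
    bad_steps = []
    timestep = env.reset()
    for step in range(num_steps):
        if is_bad(np.asarray(timestep.observation["pixels"])):
            bad_steps.append(step)
        if timestep.last():
            timestep = env.reset()
        else:
            action = rng.uniform(spec.minimum, spec.maximum, size=spec.shape)
            timestep = env.step(action)
    return bad_steps
```

With dm_control installed, `count_bad_frames(make_pixel_env())` would run the loop in a single process on a local machine; running several such processes in parallel would be closer to the multiprocessing setup described earlier in the thread.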

saran-t commented 2 years ago

OK, I have this running. I have zero familiarity with this code, but it seems that Hydra is creating some sort of default cfg and is forcing cfg.collector_devices to be ['cuda:1', 'cuda:1']. On my machine with only a single GPU, this causes an "invalid ordinal" CUDA error.

I had to go into torchrl/trainers/helpers/ and manually override device to 'cuda:0' which allows the script to run. However, now everything runs just fine and I cannot actually trigger the error.

vmoens commented 2 years ago

Let me write a single-gpu example for you

saran-t commented 2 years ago

I've managed to trigger the error. Still investigating, but it looks like something is copying the rendering context objects in Python, which isn't a supported operation.
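
As a toy illustration of how copying an object that wraps a native resource can go wrong, consider the sketch below. `FakeGLContext`, `Renderer` and `NonCopyableRenderer` are hypothetical stand-ins, not dm_control or MuJoCo classes, and this is neither the actual bug nor the fix that was applied. A shallow `copy.copy` duplicates the Python wrapper but not the resource behind it, so both wrappers share one handle, and freeing it through either one breaks the other.

```python
import copy


class FakeGLContext:
    """Stand-in for a native rendering context (hypothetical)."""

    def __init__(self):
        self.alive = True

    def free(self):
        self.alive = False


class Renderer:
    """Owns a context; rendering fails once the context has been freed."""

    def __init__(self):
        self._ctx = FakeGLContext()

    def render(self):
        if not self._ctx.alive:
            raise RuntimeError("rendering context was already freed")
        return "frame"

    def close(self):
        self._ctx.free()


class NonCopyableRenderer(Renderer):
    """One possible guard: refuse to be copied instead of sharing the handle."""

    def __copy__(self):
        raise TypeError("rendering contexts cannot be copied")

    def __deepcopy__(self, memo):
        raise TypeError("rendering contexts cannot be copied")


original = Renderer()
duplicate = copy.copy(original)   # shallow copy: both share the same _ctx
duplicate.close()                 # frees the context behind `original` too
try:
    original.render()
except RuntimeError as err:
    print("original is broken:", err)
```

Making the wrapper refuse copies, as `NonCopyableRenderer` does, turns a silent corruption into an immediate error at the point where the copy is attempted.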

saran-t commented 2 years ago

@vmoens Could you please try and see if it fixes your issue?

vmoens commented 2 years ago

It is running in a much more stable way than it used to. No noisy pixels, and runs that used to collapse after a couple of iterations are now running smoothly. For me this can be considered closed. Thanks so much for your help @saran-t! This is amazing

saran-t commented 2 years ago

I'll have this fixed in our 1.0.6 release later this week.

saran-t commented 2 years ago

This should now be fixed in version 1.0.6.