Closed vmoens closed 2 years ago
I have a similar problem. Sometimes it renders images correctly, but sometimes it renders only the background image (see the video). This issue is non-deterministic, and the video might be rendered correctly or incorrectly for the same seed.
OS: Ubuntu 20.04, MuJoCo version: 2.2.0. I use MUJOCO_GL=egl as well.
Can you please try running with DISABLE_RENDER_THREAD_OFFLOADING=1 (environment variable)?
I still get the same behaviour with DISABLE_RENDER_THREAD_OFFLOADING=1 :/
@saran-t DISABLE_RENDER_THREAD_OFFLOADING=1 doesn't resolve the problem for me either.
Can I please have a minimal repro code that I can run on my side?
@vmoens @ikostrikov Gentle nudge on the request for minimal repro above. We'd like to try to get to the bottom of this.
Hi @saran-t, I've been trying hard to reproduce this, but it seems to only happen once the code reaches a certain level of complexity (e.g. GPUs used for both training and rendering). Would it be OK if I pointed you to a specific commit on torchrl and gave you the precise conda env setting, the machine config, etc., for you to reproduce? It's going to be a bit messy, but at least it's something!
If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.
Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?
Here's one: torchrl commit 0e88eac27f1d01bfa1d260d52c051ab5fe514859. Here's the setup and command line:
```
conda create -n mbrl_dmcontrol3 python=3.10
conda activate mbrl_dmcontrol3
pip install dm_control
module load cuda/11.6 nccl/2.12.7-cuda.11.6 nccl_efa/1.15.1-nccl.2.12.7-cuda.11.6
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install functorch
pip install hydra-core
# from torchrl root:
python setup.py develop
cd examples/dreamer/
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv
```
CHECK_IMAGES=1 will make sure an error is raised as soon as an image is more than half black or white (i.e. the render has collapsed).
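For reference, a minimal sketch of what such a check could look like (check_image is a hypothetical helper, not the actual torchrl code; the saturation thresholds are assumptions):

```python
import numpy as np

def check_image(frame: np.ndarray) -> None:
    # frame is an (H, W, 3) uint8 render, as in the traceback below
    saturated = ((frame < 10) | (frame > 245)).mean()
    # fail as soon as more than half the pixels are near-black or near-white
    assert saturated < 0.5, f"render collapsed: {saturated:.0%} saturated, shape {frame.shape}"
```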
You should see an error like this during the first test rollout:
```
Traceback (most recent call last):
  File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/dreamer.py", line 411, in main
    call_record(logger, record, collected_frames, sampled_tensordict_save, stats, model_based_env, actor_model, cfg)
  File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/dreamer.py", line 132, in call_record
    td_record = record(None)
  File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/trainers/trainers.py", line 907, in __call__
    td_record = self.recorder.rollout(
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 503, in rollout
    tensordict = self.reset()
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 346, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/transforms/transforms.py", line 403, in _reset
    out_tensordict = self.base_env.reset(execute_step=False, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 346, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/gym_like.py", line 122, in _reset
    source=self._read_obs(obs),
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/gym_like.py", line 136, in _read_obs
    observations = self.observation_spec.encode(observations)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/data/tensor_specs.py", line 1107, in encode
    out[key] = self[key].encode(item)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/data/tensor_specs.py", line 243, in encode
    assert v < 0.5, f"numpy: {val.shape}"
AssertionError: numpy: (240, 320, 3)
```
Please point me to where the rendering context is set up and where the multiprocessing occurs.
> If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.

got it!
> Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?
Let me rephrase: in one library where we used to rely on mujoco-py but switched to the new mujoco bindings, we have seen this issue appear. I ran the following experiment using an old version of dm_control with torchrl and the issue disappears. Here's the setup:
torchrl commit: 056699bd214937400c5cc7722669e7819a93bc1e
Setup:
```
conda create -n mbrl_olddmc python=3.9
conda activate mbrl_olddmc
pip install mujoco_py
pip install dm-control==0.0.403778684  # works with mujoco 210
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install functorch
pip install hydra-core
cd path/to/torchrl
python setup.py develop
conda env config vars set MJLIB_PATH=/data/home/vmoens/.mujoco/mujoco210/bin/libmujoco210.so LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/home/vmoens/.mujoco/mujoco210/bin MUJOCO_GL=egl PYOPENGL_PLATFORM=egl MUJOCO_PY_MUJOCO_PATH=/data/home/vmoens/.mujoco/mujoco210
conda deactivate && conda activate mbrl_olddmc
```
Command:
```
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv env_per_collector=1 num_workers=1
```
Importantly:
env_per_collector=1 num_workers=1 async_collection=False
which tell our trainer to collect data on the same process where training occurs. For rendering, we use the dm_control pixels wrapper. When executing a step, we create a torch.Tensor from the numpy array and send it to the device if needed.
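As a rough illustration of that last step (read_pixels is a hypothetical helper, not the actual torchrl code):

```python
import torch

def read_pixels(obs: dict, device: str = "cpu") -> torch.Tensor:
    # obs["pixels"] is the (H, W, 3) uint8 numpy array produced by the
    # dm_control pixels wrapper
    pixels = torch.as_tensor(obs["pixels"])  # shares memory with the numpy array when possible
    return pixels.to(device)  # no-op if the tensor is already on `device`
```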
In the example script I gave above, we first run a random rollout in the environment to get statistics about the observations. To do that, we have a function that creates an environment instance, runs the rollout, and computes the stats. Then we run another random rollout to get data to pass to the model (to initialize it): we have lazy layers that take the right shape once they see real data, as illustrated below. In this example, that's where the issue happens (not even during training).
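For readers unfamiliar with the lazy-layer pattern, here's a small standalone illustration using torch.nn.LazyLinear (the sizes are made up, not the ones used in dreamer.py):

```python
import torch
from torch import nn

# input features are left unspecified and inferred from the first batch
encoder = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(64))
obs_batch = torch.randn(8, 240 * 320 * 3)  # e.g. flattened (240, 320, 3) frames
out = encoder(obs_batch)  # weight shapes are materialized on this first call
print(out.shape)  # torch.Size([8, 64])
```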
I'm having trouble running python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv on my machine.
Could you please make a repro script that just runs the dm_control environment without any agent in the loop, preferably without any dependency on Torch?
Note also that I don't have access to a SLURM cluster and I need to repro this on a local machine.
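(For reference, such a repro would look roughly like this; the task name and thresholds are placeholders, and it should be run with MUJOCO_GL=egl:)

```python
import numpy as np
from dm_control import suite

env = suite.load("cheetah", "run")  # any task with pixel rendering would do
spec = env.action_spec()
env.reset()
for step in range(1000):
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    env.step(action)
    frame = env.physics.render(height=240, width=320)
    # flag frames that come back mostly black or white
    saturated = ((frame < 10) | (frame > 245)).mean()
    if saturated > 0.5:
        raise RuntimeError(f"render collapsed at step {step} ({saturated:.0%} saturated)")
```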
OK, I have this running. I have zero familiarity with this code, but it seems that Hydra is creating some sort of default cfg and is forcing cfg.collector_devices to be ['cuda:1', 'cuda:1']. On my machine, which has only a single GPU, this causes an "invalid ordinal" CUDA error. I had to go into torchrl/trainers/helpers/envs.py and manually override device to 'cuda:0', which allows the script to run. However, now everything runs just fine and I cannot actually trigger the error.
Let me write a single-gpu example for you
I've managed to trigger the error. Still investigating, but it looks like something is copying the rendering context objects in Python, which isn't a supported operation.
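To illustrate the failure mode (hypothetical class, not dm_control's actual code): a Python wrapper around a native GL context holds driver-side state that cannot be duplicated from Python, so one defensive fix is to make copying fail loudly instead of yielding a silently broken duplicate:

```python
import copy

class RenderContext:
    """Stand-in for a wrapper around a native EGL rendering context."""

    def __init__(self):
        self._handle = object()  # placeholder for the driver-side context

    def __copy__(self):
        raise TypeError("rendering contexts cannot be copied")

    def __deepcopy__(self, memo):
        raise TypeError("rendering contexts cannot be deep-copied")

ctx = RenderContext()
copy.deepcopy(ctx)  # raises TypeError instead of corrupting GL state
```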
@vmoens Could you please try https://github.com/saran-t/dm_control/pull/1 and see if it fixes your issue?
It is running in a much more stable way than it used to: no noisy pixels, and runs that used to collapse after a couple of iterations are now running smoothly. For me this can be considered closed. Thanks so much for your help @saran-t! This is amazing.
I'll have this fixed in our 1.0.6 release later this week.
This should now be fixed in version 1.0.6.
Hi!
In some instances (embodied algos in my case) the new mujoco rendering gives unreadable images after a little while, e.g. here's a grid of 3 views of the same body:
This occurs after a little while, i.e. the first images rendered are perfectly fine. I tried to narrow it down to a minimal reproducible example, but I can't find a way to do it (sorry about that!). When using the old bindings (mujoco-py and such), this issue disappears.
I'm using MUJOCO_GL=egl and have installed glew in my conda env (working on a cluster where I have no sudo access). I'm working with either G100 or A100 GPUs, and using them for both training and rendering. Also worth mentioning: I'm running a bunch of envs in parallel (not multithreaded, but multiprocessing) for fast collection of data, as sketched below.
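A sketch of that parallel-collection layout (illustrative only, not the actual torchrl collector): each worker process constructs its own environment, and therefore its own EGL context, rather than inheriting one from the parent:

```python
import multiprocessing as mp
import numpy as np

def collect(n_steps: int) -> int:
    # import and construct inside the child so each process owns its
    # own EGL rendering context
    from dm_control import suite
    env = suite.load("cheetah", "run")
    spec = env.action_spec()
    env.reset()
    for _ in range(n_steps):
        env.step(np.random.uniform(spec.minimum, spec.maximum, size=spec.shape))
        env.physics.render(height=84, width=84)
    return n_steps

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # don't fork a process that holds a live GL context
    with ctx.Pool(processes=4) as pool:
        print(pool.map(collect, [100] * 4))
```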
Here is my conda env