Closed vmoens closed 2 years ago
I have a similar problem. Sometimes it renders images correctly, but sometimes it renders only the background image (see the video). This issue is non-deterministic, and the video might be rendered correctly or incorrectly for the same seed.
OS: Ubuntu 20.04, MuJoCo version: 2.2.0. I use MUJOCO_GL=egl as well.
Can you please try running with DISABLE_RENDER_THREAD_OFFLOADING=1 (environment variable)?
I still get the same behaviour with DISABLE_RENDER_THREAD_OFFLOADING=1 :/
@saran-t DISABLE_RENDER_THREAD_OFFLOADING=1 doesn't resolve the problem for me either.
Can I please have a minimal repro code that I can run on my side?
@vmoens @ikostrikov Gentle nudge on the request for minimal repro above. We'd like to try to get to the bottom of this.
Hi @saran-t, I've been trying hard to reproduce this, but it seems to only happen once the code reaches a certain level of complexity (e.g. GPUs used for both training and rendering). Would it be OK if I pointed you to a specific commit on torchrl and gave you the precise conda env setting, the machine config, etc., for you to reproduce? It's going to be a bit messy, but at least it's something!
If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.
Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?
Here's one: torchrl commit 0e88eac27f1d01bfa1d260d52c051ab5fe514859. Here's the setup and command line:
```
conda create -n mbrl_dmcontrol3 python=3.10
conda activate mbrl_dmcontrol3
pip install dm_control
module load cuda/11.6 nccl/2.12.7-cuda.11.6 nccl_efa/1.15.1-nccl.2.12.7-cuda.11.6
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install functorch
pip install hydra-core
# from torchrl root:
python setup.py develop
cd examples/dreamer/
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv
```
CHECK_IMAGES=1 will make sure an error is raised as soon as an image is more than half black or white (i.e. the render has collapsed).
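For reference, a minimal sketch of what such a check could look like (check_image is a hypothetical helper, not the actual torchrl code; the saturation thresholds are assumptions):

```python
import numpy as np

def check_image(frame: np.ndarray) -> None:
    # frame is an (H, W, 3) uint8 render, as in the traceback below
    saturated = ((frame < 10) | (frame > 245)).mean()
    # fail as soon as more than half the pixels are near-black or near-white
    assert saturated < 0.5, f"render collapsed: {saturated:.0%} saturated, shape {frame.shape}"
```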
You should see an error like this during the first test rollout:
```
Traceback (most recent call last):
  File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/dreamer.py", line 411, in main
    call_record(logger, record, collected_frames, sampled_tensordict_save, stats, model_based_env, actor_model, cfg)
  File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/dreamer.py", line 132, in call_record
    td_record = record(None)
  File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/trainers/trainers.py", line 907, in __call__
    td_record = self.recorder.rollout(
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 503, in rollout
    tensordict = self.reset()
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 346, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/transforms/transforms.py", line 403, in _reset
    out_tensordict = self.base_env.reset(execute_step=False, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 346, in reset
    tensordict_reset = self._reset(tensordict, **kwargs)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/gym_like.py", line 122, in _reset
    source=self._read_obs(obs),
  File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/gym_like.py", line 136, in _read_obs
    observations = self.observation_spec.encode(observations)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/data/tensor_specs.py", line 1107, in encode
    out[key] = self[key].encode(item)
  File "/fsx/users/vmoens/work/rl_mb/torchrl/data/tensor_specs.py", line 243, in encode
    assert v < 0.5, f"numpy: {val.shape}"
AssertionError: numpy: (240, 320, 3)
```
Please point me to where the rendering context is set up and where the multiprocessing occurs.
> If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.

got it!
> Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?
Let me rephrase: in one library where we used to rely on mujoco-py but switched to the new mujoco bindings, we have seen this issue appear. I ran the following experiment using an old version of dm_control with torchrl and the issue disappears. Here's the setup:
torchrl commit: 056699bd214937400c5cc7722669e7819a93bc1e
Setup:
```
conda create -n mbrl_olddmc python=3.9
conda activate mbrl_olddmc
pip install mujoco_py
pip install dm-control==0.0.403778684  # works with mujoco 210
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install functorch
pip install hydra-core
cd path/to/torchrl
python setup.py develop
conda env config vars set MJLIB_PATH=/data/home/vmoens/.mujoco/mujoco210/bin/libmujoco210.so LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/home/vmoens/.mujoco/mujoco210/bin MUJOCO_GL=egl PYOPENGL_PLATFORM=egl MUJOCO_PY_MUJOCO_PATH=/data/home/vmoens/.mujoco/mujoco210
conda deactivate && conda activate mbrl_olddmc
```
Command:
```
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv env_per_collector=1 num_workers=1
```
Importantly:
env_per_collector=1 num_workers=1 async_collection=False
which tell our trainer to collect data on the same process where training occurs. For rendering, we use the dm_control pixels wrapper. When executing a step, we create a torch.Tensor from the numpy array and send it to the device if needed.
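As a rough illustration of that last step (read_pixels is a hypothetical helper, not the actual torchrl code):

```python
import torch

def read_pixels(obs: dict, device: str = "cpu") -> torch.Tensor:
    # obs["pixels"] is the (H, W, 3) uint8 numpy array produced by the
    # dm_control pixels wrapper
    pixels = torch.as_tensor(obs["pixels"])  # shares memory with the numpy array when possible
    return pixels.to(device)  # no-op if the tensor is already on `device`
```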
In the example script I gave above, we first run a random rollout in the environment to get statistics about the observations. To do that, we have a function that creates an environment instance, runs the rollout, and computes the stats. Then we run another random rollout to get data to pass to the model (to initialize it): we have lazy layers that take the right shape once they see real data, as illustrated below. In this example, that's where the issue happens (not even during training).
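For readers unfamiliar with the lazy-layer pattern, here's a small standalone illustration using torch.nn.LazyLinear (the sizes are made up, not the ones used in dreamer.py):

```python
import torch
from torch import nn

# input features are left unspecified and inferred from the first batch
encoder = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(64))
obs_batch = torch.randn(8, 240 * 320 * 3)  # e.g. flattened (240, 320, 3) frames
out = encoder(obs_batch)  # weight shapes are materialized on this first call
print(out.shape)  # torch.Size([8, 64])
```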
I'm having trouble running python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv on my machine.
Could you please make a repro script that just runs the dm_control environment without any agent in the loop, preferably without any dependency on Torch?
Note also that I don't have access to a SLURM cluster and I need to repro this on a local machine.
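(For reference, such a repro would look roughly like this; the task name and thresholds are placeholders, and it should be run with MUJOCO_GL=egl:)

```python
import numpy as np
from dm_control import suite

env = suite.load("cheetah", "run")  # any task with pixel rendering would do
spec = env.action_spec()
env.reset()
for step in range(1000):
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    env.step(action)
    frame = env.physics.render(height=240, width=320)
    # flag frames that come back mostly black or white
    saturated = ((frame < 10) | (frame > 245)).mean()
    if saturated > 0.5:
        raise RuntimeError(f"render collapsed at step {step} ({saturated:.0%} saturated)")
```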
OK, I have this running. I have zero familiarity with this code, but it seems that Hydra is creating some sort of default cfg and is forcing cfg.collector_devices to be ['cuda:1', 'cuda:1']. On my machine, which has only a single GPU, this causes an "invalid ordinal" CUDA error. I had to go into torchrl/trainers/helpers/envs.py and manually override device to 'cuda:0', which allows the script to run. However, now everything runs just fine and I cannot actually trigger the error.
Let me write a single-gpu example for you
I've managed to trigger the error. Still investigating, but it looks like something is copying the rendering context objects in Python, which isn't a supported operation.
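To illustrate the failure mode (hypothetical class, not dm_control's actual code): a Python wrapper around a native GL context holds driver-side state that cannot be duplicated from Python, so one defensive fix is to make copying fail loudly instead of yielding a silently broken duplicate:

```python
import copy

class RenderContext:
    """Stand-in for a wrapper around a native EGL rendering context."""

    def __init__(self):
        self._handle = object()  # placeholder for the driver-side context

    def __copy__(self):
        raise TypeError("rendering contexts cannot be copied")

    def __deepcopy__(self, memo):
        raise TypeError("rendering contexts cannot be deep-copied")

ctx = RenderContext()
copy.deepcopy(ctx)  # raises TypeError instead of corrupting GL state
```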
@vmoens Could you please try https://github.com/saran-t/dm_control/pull/1 and see if it fixes your issue?
It is running in a much more stable way than it used to: no noisy pixels, and runs that used to collapse after a couple of iterations are now running smoothly. For me this can be considered closed. Thanks so much for your help @saran-t! This is amazing.
I'll have this fixed in our 1.0.6 release later this week.
This should now be fixed in version 1.0.6.
Hi!
In some instances (embodied algos in my case) the new mujoco rendering gives unreadable images after a little while, e.g. here's a grid of 3 views of the same body:
This occurs after a little while, i.e. the first images rendered are perfectly fine. I tried to narrow it down to a minimal reproducible example, but I can't find a way to do it (sorry about that!). When using the old bindings (mujoco-py and such), this issue disappears.
I'm using MUJOCO_GL=egl and have installed glew in my conda env (working on a cluster where I have no sudo access). I'm working with either G100 or A100 GPUs, and using them for both training and rendering. Also worth mentioning: I'm running a bunch of envs in parallel (not multithreaded, but multiprocessing) for fast collection of data, as sketched below.
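A sketch of that parallel-collection layout (illustrative only, not the actual torchrl collector): each worker process constructs its own environment, and therefore its own EGL context, rather than inheriting one from the parent:

```python
import multiprocessing as mp
import numpy as np

def collect(n_steps: int) -> int:
    # import and construct inside the child so each process owns its
    # own EGL rendering context
    from dm_control import suite
    env = suite.load("cheetah", "run")
    spec = env.action_spec()
    env.reset()
    for _ in range(n_steps):
        env.step(np.random.uniform(spec.minimum, spec.maximum, size=spec.shape))
        env.physics.render(height=84, width=84)
    return n_steps

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # don't fork a process that holds a live GL context
    with ctx.Pool(processes=4) as pool:
        print(pool.map(collect, [100] * 4))
```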
Here is my conda env