Updated with more details.
Does this still fail when you try Colab after the new fix, or is this just on your GPU server? On my side, Colab works fine at the moment (I tried running the visual RL code).
Do you have any details about your GPU server setup?
Colab runs fine (I'm training the Visual RL block at the moment).
The GPU setup uses `srun` to request resources.
I ssh into it from my Mac (using `-X` forwarding with XQuartz for graphics). After a GPU resource is allocated, I can ssh into that node, and I can confirm that X11 forwarding still works by running `xclock` on the server; the clock actually shows up on my Mac screen.
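As a quick sanity check from Python too (a minimal sketch; the expected value is just an example):

```python
import os

# If `ssh -X` forwarding is active on the allocated node, DISPLAY should be
# set to something like "localhost:10.0".
print(os.environ.get("DISPLAY"))
```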
I'm using a conda env with Python 3.8.
My `~/.bashrc` on the server has this configuration for Vulkan:

```bash
export VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json
export VK_LAYER_PATH=/etc/vulkan/implicit_layer.d/nvidia_layers.json
```
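To double-check those files from the job itself, a hedged sketch (the paths are simply the ones exported above):

```python
import json
from pathlib import Path

# If either file is missing or corrupt, Vulkan cannot find the NVIDIA driver.
for p in ["/etc/vulkan/icd.d/nvidia_icd.json",
          "/etc/vulkan/implicit_layer.d/nvidia_layers.json"]:
    path = Path(p)
    print(p, "exists:", path.exists())
    if path.exists():
        json.loads(path.read_text())  # raises if the JSON is malformed
```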
Since the regular (non-vectorized) env can run and render the scene, I think the graphical forwarding is working.
However, the failure only happens when I try to use `VecEnv`. There's an error/warning that says `Only 1 renderer is allowed per process. All previously created renderer resources are now invalid`. This makes me wonder whether the GPU-optimized vectorized environments are trying to create multiple Vulkan renderers within a single process. From the error, I interpret it as Vulkan only allowing one renderer per process, but this `VecEnv` is somehow attempting to create multiple renderers/resources per process.
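For reference, this is roughly how I trigger it (a minimal sketch based on the tutorial; the env id, reward mode, and control mode are stand-ins for my failing run, while `obs_mode="rgbd"` matches the `RGBDVecEnv` in the traceback below):

```python
import mani_skill2.envs  # noqa: F401  (registers the environments)
from mani_skill2.vector import make as make_vec_env

# Constructing the vectorized env starts a RenderServer plus worker
# processes; the svulkan2 warning about "Only 1 renderer ... per process"
# is printed during this step.
env = make_vec_env(
    "OpenCabinetDrawer-v1",  # stand-in for the failing env id
    num_envs=2,
    obs_mode="rgbd",
    reward_mode="dense",     # assumption
    control_mode="base_pd_joint_vel_arm_pd_joint_vel",  # assumption
)
env.reset()
env.close()
```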
`nvidia-smi` output:

```
Mon May 1 22:30:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN Xp On | 00000000:02:00.0 Off | N/A |
| 23% 18C P8 8W / 250W| 1MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
```
So I'm running into a similar issue when I try on a local Linux laptop with a GPU.
Here's my local Python 3.8 env setup (`pip list`):

```
Package Version Editable project location
------------------------ ---------- -------------------------------------------
absl-py 1.4.0
antlr4-python3-runtime 4.9.3
anyio 3.6.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
attrs 23.1.0
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 6.0.0
cachetools 5.3.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 3.1.0
cloudpickle 2.2.1
cmake 3.26.3
comm 0.1.3
contourpy 1.0.7
cycler 0.11.0
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
executing 1.2.0
fastjsonschema 2.16.3
filelock 3.12.0
fonttools 4.39.3
fqdn 1.5.1
gdown 4.7.1
gitdb 4.0.10
GitPython 3.1.31
google-auth 2.17.3
google-auth-oauthlib 1.0.0
grpcio 1.54.0
gym 0.21.0
h5py 3.8.0
hydra-core 1.3.2
idna 3.4
imageio 2.28.1
imageio-ffmpeg 0.4.8
importlib-metadata 4.13.0
importlib-resources 5.12.0
ipykernel 6.22.0
ipython 8.12.1
ipython-genutils 0.2.0
isoduration 20.11.0
jedi 0.18.2
Jinja2 3.1.2
jsonpointer 2.3
jsonschema 4.17.3
jupyter_client 8.2.0
jupyter_core 5.3.0
jupyter-events 0.6.3
jupyter_server 2.5.0
jupyter_server_terminals 0.4.4
jupyterlab-pygments 0.2.2
kiwisolver 1.4.4
lit 16.0.2
mani-skill2 0.4.2 /home/nt/Documents/RobotLearning/ManiSkill2
Markdown 3.4.3
MarkupSafe 2.1.2
matplotlib 3.7.1
matplotlib-inline 0.1.6
mistune 2.0.5
mpmath 1.3.0
nbclassic 0.5.6
nbclient 0.7.4
nbconvert 7.3.1
nbformat 5.8.0
nest-asyncio 1.5.6
networkx 3.1
notebook 6.5.4
notebook_shim 0.2.3
numpy 1.23.5
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
oauthlib 3.2.2
omegaconf 2.3.0
opencv-python 4.7.0.72
packaging 23.1
pandas 2.0.1
pandocfilters 1.5.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.5.0
pip 23.0.1
pkgutil_resolve_name 1.3.10
platformdirs 3.5.0
prometheus-client 0.16.0
prompt-toolkit 3.0.38
protobuf 4.22.3
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycparser 2.21
Pygments 2.15.1
pyparsing 3.0.9
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3
PyYAML 6.0
pyzmq 25.0.2
r3m 0.0.0 /home/nt/Documents/RobotLearning/r3m
requests 2.29.0
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rsa 4.9
Rtree 1.0.1
sapien 2.2.1
scipy 1.10.1
Send2Trash 1.8.2
setuptools 65.5.0
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
soupsieve 2.4.1
stable-baselines3 1.8.0
stack-data 0.6.2
sympy 1.11.1
tabulate 0.9.0
tensorboard 2.12.2
tensorboard-data-server 0.7.0
tensorboard-plugin-wit 1.8.1
terminado 0.17.1
tinycss2 1.2.1
torch 2.0.0
torchvision 0.15.1
tornado 6.3.1
tqdm 4.65.0
traitlets 5.9.0
transforms3d 0.4.1
trimesh 3.21.5
triton 2.0.0
typing_extensions 4.5.0
tzdata 2023.3
uri-template 1.2.0
urllib3 1.26.15
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.5.1
Werkzeug 2.3.3
wheel 0.38.4
zipp 3.15.0
```
And here's the error output when I create the vectorized env on that machine:

```
[2023-05-02 13:44:11.123] [svulkan2] [warning] Only 1 renderer is allowed per process. All previously created renderer resources are now invalid
2023-05-02 13:44:11,160 - mani_skill2 - INFO - RenderServer is running at: localhost:34585
2023-05-02 13:44:12,781 - mani_skill2 - ERROR - 'NoneType' object has no attribute 'vertices'
Traceback (most recent call last):
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/vec_env.py", line 56, in _worker
env = env_fn()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/registration.py", line 11, in _make_env
env = env_spec.make(**kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/utils/registration.py", line 34, in make
return self.cls(**_kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 38, in __init__
super().__init__(*args, **kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/base_env.py", line 55, in __init__
super().__init__(*args, **kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/sapien_env.py", line 178, in __init__
obs = self.reset(reconfigure=True)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 159, in reset
return super().reset(seed=seed, reconfigure=reconfigure, model_id=model_id)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/base_env.py", line 87, in reset
ret = super().reset(seed=self._episode_seed, reconfigure=reconfigure)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/sapien_env.py", line 473, in reset
self.reconfigure()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/sapien_env.py", line 359, in reconfigure
self._load_articulations()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 66, in _load_articulations
self._set_cabinet_handles_mesh()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 94, in _set_cabinet_handles_mesh
meshes.extend(get_visual_body_meshes(visual_body))
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/utils/trimesh_utils.py", line 40, in get_visual_body_meshes
vertices = render_shape.mesh.vertices * visual_body.scale # [n, 3]
AttributeError: 'NoneType' object has no attribute 'vertices'
Process ForkServerProcess-4:
Traceback (most recent call last):
File "/home/nt/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/nt/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/vec_env.py", line 86, in _worker
env.close()
UnboundLocalError: local variable 'env' referenced before assignment
2023-05-02 13:44:12,803 - mani_skill2 - ERROR - 'NoneType' object has no attribute 'vertices'
Traceback (most recent call last):
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/vec_env.py", line 56, in _worker
env = env_fn()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/registration.py", line 11, in _make_env
env = env_spec.make(**kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/utils/registration.py", line 34, in make
return self.cls(**_kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 38, in __init__
super().__init__(*args, **kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/base_env.py", line 55, in __init__
super().__init__(*args, **kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/sapien_env.py", line 178, in __init__
obs = self.reset(reconfigure=True)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 159, in reset
return super().reset(seed=seed, reconfigure=reconfigure, model_id=model_id)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/base_env.py", line 87, in reset
ret = super().reset(seed=self._episode_seed, reconfigure=reconfigure)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/sapien_env.py", line 473, in reset
self.reconfigure()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/sapien_env.py", line 359, in reconfigure
self._load_articulations()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 66, in _load_articulations
self._set_cabinet_handles_mesh()
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/envs/ms1/open_cabinet_door_drawer.py", line 94, in _set_cabinet_handles_mesh
meshes.extend(get_visual_body_meshes(visual_body))
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/utils/trimesh_utils.py", line 40, in get_visual_body_meshes
vertices = render_shape.mesh.vertices * visual_body.scale # [n, 3]
AttributeError: 'NoneType' object has no attribute 'vertices'
Process ForkServerProcess-5:
Traceback (most recent call last):
File "/home/nt/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/nt/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/nt/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/vec_env.py", line 86, in _worker
env.close()
UnboundLocalError: local variable 'env' referenced before assignment
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
Cell In[10], line 50
46 eval_env.reset()
48 # create num_envs training environments, with max_episode_steps=100
49 # instead of the default 200 to speed up training
---> 50 env: VecEnv = make_vec_env(
51 env_id,
52 num_envs,
53 obs_mode=obs_mode,
54 reward_mode=reward_mode,
55 control_mode=control_mode,
56 # specify wrappers for each individual environment e.g here we specify the
57 # Continuous task wrapper and pass in the max_episode_steps parameter via the partial tool
58 wrappers=[
59 partial(ContinuousTaskWrapper, max_episode_steps=100)
60 ]
61 )
62 env = ManiSkillRGBDVecEnvWrapper(env)
63 # use the maniskill provided SB3VecEnvWrapper to make the environment compatible with SB3
File ~/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/registration.py:81, in make(env_id, num_envs, server_address, wrappers, enable_segmentation, **kwargs)
77 else:
78 raise NotImplementedError(
79 f"Unsupported observation mode for VecEnv: {obs_mode}"
80 )
---> 81 venv = venv_cls([env_fn for _ in range(num_envs)], server_address=server_address)
82 venv.obs_mode = obs_mode
84 if "robot_seg" in obs_mode:
File ~/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/vec_env.py:435, in RGBDVecEnv.__init__(self, *args, **kwargs)
434 def __init__(self, *args, **kwargs):
--> 435 super().__init__(*args, **kwargs)
437 from mani_skill2.utils.wrappers.observation import RGBDObservationWrapper
439 RGBDObservationWrapper.update_observation_space(self.observation_space)
File ~/Documents/RobotLearning/ManiSkill2/mani_skill2/vector/vec_env.py:204, in VecEnv.__init__(self, env_fns, start_method, server_address, server_kwargs)
202 remote.send(("handshake", None))
203 for remote in self.remotes:
--> 204 remote.recv()
206 # Infer texture names
207 texture_names = set()
File ~/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/connection.py:250, in _ConnectionBase.recv(self)
248 self._check_closed()
249 self._check_readable()
--> 250 buf = self._recv_bytes()
251 return _ForkingPickler.loads(buf.getbuffer())
File ~/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/connection.py:414, in Connection._recv_bytes(self, maxsize)
413 def _recv_bytes(self, maxsize=None):
--> 414 buf = self._recv(4)
415 size, = struct.unpack("!i", buf.getvalue())
416 if size == -1:
File ~/miniconda3/envs/robotlearning38/lib/python3.8/multiprocessing/connection.py:379, in Connection._recv(self, size, read)
377 remaining = size
378 while remaining > 0:
--> 379 chunk = read(handle, remaining)
380 n = len(chunk)
381 if n == 0:
ConnectionResetError: [Errno 104] Connection reset by peer
```
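The immediate crash site is `get_visual_body_meshes` in `mani_skill2/utils/trimesh_utils.py`, where `render_shape.mesh` comes back as `None` inside the worker process. Just to localize the failure, here is a hedged sketch of that function with an explicit check (the loop structure and the SAPIEN 2.x attributes `get_render_shapes`/`mesh.indices` are assumptions reconstructed around the traceback's line 40; this is not a proposed fix):

```python
import trimesh

def get_visual_body_meshes_guarded(visual_body):
    """Variant of get_visual_body_meshes that fails loudly when a render
    shape has no mesh, instead of raising AttributeError deep in the loop."""
    meshes = []
    for render_shape in visual_body.get_render_shapes():
        if render_shape.mesh is None:
            # This is the exact condition that kills the VecEnv workers.
            raise RuntimeError("render shape has no mesh; asset failed to load")
        vertices = render_shape.mesh.vertices * visual_body.scale  # [n, 3]
        faces = render_shape.mesh.indices.reshape(-1, 3)
        meshes.append(trimesh.Trimesh(vertices=vertices, faces=faces))
    return meshes
```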
After further debugging, I wonder if the OpenCabinet env is the culprit.
When I default to the example "LiftCube-v0", Colab, the GPU server, and my local machine all seem to run fine!
```python
num_envs = 2  # you can increase this (and decrease the n_steps parameter) if you have more cores, to speed up training
env_id = "LiftCube-v0"
obs_mode = "state"
control_mode = "pd_ee_delta_pose"
reward_mode = "dense"
```
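With those settings, the vectorized env builds cleanly. A sketch of the call, pieced together from the notebook cell shown in the traceback above (the tutorial's `ContinuousTaskWrapper` is omitted here to keep the snippet self-contained):

```python
import mani_skill2.envs  # noqa: F401
from mani_skill2.vector import VecEnv, make as make_vec_env

# env_id, num_envs, obs_mode, reward_mode, control_mode come from the
# config block above.
env: VecEnv = make_vec_env(
    env_id,
    num_envs,
    obs_mode=obs_mode,
    reward_mode=reward_mode,
    control_mode=control_mode,
)
obs = env.reset()
env.close()
```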
Could you try using a sparse reward setting for the failing envs?
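For example, something like this hedged sketch (with `OpenCabinetDrawer-v1` standing in for whichever env id was failing):

```python
from mani_skill2.vector import make as make_vec_env

# Same construction as the failing run, with only the reward mode swapped.
env = make_vec_env(
    "OpenCabinetDrawer-v1",   # stand-in for the failing env id
    num_envs=2,
    obs_mode="rgbd",
    reward_mode="sparse",     # instead of "dense"
    control_mode="base_pd_joint_vel_arm_pd_joint_vel",
)
```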
Closing in favor of https://github.com/haosulab/ManiSkill2/issues/88
Current solution: https://github.com/haosulab/ManiSkill2/issues/88#issuecomment-1532194498
Hello!
Since there's a current issue with Google Colab (https://github.com/haosulab/ManiSkill2/issues/85), I decided to switch to a GPU server (ssh'd in from my local Mac, using XQuartz and X11 forwarding).
My test Jupyter notebook seems to render the env fine.
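A single-env sanity check along these lines works (a sketch; the env id and modes here are illustrative, not necessarily the notebook's):

```python
import gym
import mani_skill2.envs  # noqa: F401  (registers the ManiSkill2 envs with gym)

env = gym.make("LiftCube-v0", obs_mode="rgbd", control_mode="pd_ee_delta_pose")
env.reset()
env.render(mode="human")  # opens the SAPIEN viewer through X11
env.close()
```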
But when I tried the `VecEnv`, I got the errors shown above.
I'm wondering if it might be related to https://github.com/haosulab/ManiSkill2/issues/28. Unfortunately, I don't have `sudo` privileges on this GPU server, so I can only manage my local setup using `conda` and `pip`.