simpler-env / SimplerEnv

Evaluating and reproducing real-world robot manipulation policies (e.g., RT-1, RT-1-X, Octo) in simulation under common setups (e.g., Google Robot, WidowX+Bridge) (CoRL 2024)
https://simpler-env.github.io/
MIT License

Enabling ray tracing causes a crash #17

Closed (hilookas closed this issue 4 months ago)

hilookas commented 4 months ago

When I remove the --enable-raytracing option from SimplerEnv/scripts/openvla_drawer_variant_agg.sh, everything seems to work fine, but when I add the option back, the environment crashes with a segmentation fault and reports:

openvla/openvla-7b OpenTopDrawerCustomInScene-v0
2024-07-24 16:42:20.547214: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-24 16:42:20.588931: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-24 16:42:20.588982: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-24 16:42:20.590028: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-24 16:42:20.596251: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-24 16:42:21.357531: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Octo is not correctly imported.
No module named 'jax'
**** openvla ****
*** policy_setup: google_robot, unnorm_key: fractal20220817_data ***
/home/ubuntu/miniforge3/envs/openvla/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/home/ubuntu/miniforge3/envs/openvla/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.05it/s]
/home/ubuntu/miniforge3/envs/openvla/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Control mode:  arm_pd_ee_delta_pose_align_interpolate_by_planner_gripper_pd_joint_target_delta_pos_interpolate_by_planner
scripts/openvla_drawer_variant_agg.sh: line 21: 126570 Segmentation fault      (core dumped) python simpler_env/main_inference.py --policy-model openvla --ckpt-path ${ckpt_path} --robot google_robot_static --control-freq 3 --sim-freq 513 --max-episode-steps 113 --env-name ${env_name} --scene-name ${scene_name} --robot-init-x 0.65 0.85 3 --robot-init-y -0.2 0.2 3 --robot-init-rot-quat-center 0 0 0 1 --robot-init-rot-rpy-range 0 0 1 0 0 1 0.0 0.0 1 --obj-init-x-range 0 0 1 --obj-init-y-range 0 0 1 ${EXTRA_ARGS}
openvla/openvla-7b OpenMiddleDrawerCustomInScene-v0
2024-07-24 16:42:37.389224: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-24 16:42:37.431421: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-24 16:42:37.431470: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-24 16:42:37.432551: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-24 16:42:37.438887: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-24 16:42:38.197379: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Octo is not correctly imported.
No module named 'jax'
**** openvla ****
*** policy_setup: google_robot, unnorm_key: fractal20220817_data ***
/home/ubuntu/miniforge3/envs/openvla/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(

Env:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

$ dpkg -l | grep -i nvidia
ii  libnvidia-cfg1-550:amd64                   550.78-0ubuntu0.20.04.1               amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-550                       550.78-0ubuntu0.20.04.1               all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-550:amd64                550.78-0ubuntu0.20.04.1               amd64        NVIDIA libcompute package
ii  libnvidia-compute-550:i386                 550.78-0ubuntu0.20.04.1               i386         NVIDIA libcompute package
ii  libnvidia-decode-550:amd64                 550.78-0ubuntu0.20.04.1               amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-550:i386                  550.78-0ubuntu0.20.04.1               i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-550:amd64                 550.78-0ubuntu0.20.04.1               amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-550:i386                  550.78-0ubuntu0.20.04.1               i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-550:amd64                  550.78-0ubuntu0.20.04.1               amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-550:amd64                   550.78-0ubuntu0.20.04.1               amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-550:i386                    550.78-0ubuntu0.20.04.1               i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-550:amd64                     550.78-0ubuntu0.20.04.1               amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-550:i386                      550.78-0ubuntu0.20.04.1               i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  nvidia-compute-utils-550                   550.78-0ubuntu0.20.04.1               amd64        NVIDIA compute utilities
rc  nvidia-cuda-toolkit                        10.1.243-3                            amd64        NVIDIA CUDA development toolkit
ii  nvidia-dkms-550                            550.78-0ubuntu0.20.04.1               amd64        NVIDIA DKMS package
ii  nvidia-driver-550                          550.78-0ubuntu0.20.04.1               amd64        NVIDIA driver metapackage
ii  nvidia-firmware-550-550.78                 550.78-0ubuntu0.20.04.1               amd64        Firmware files used by the kernel module
ii  nvidia-kernel-common-550                   550.78-0ubuntu0.20.04.1               amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-550                   550.78-0ubuntu0.20.04.1               amd64        NVIDIA kernel source package
ii  nvidia-prime                               0.8.16~0.20.04.2                      all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            470.57.01-0ubuntu0.20.04.3            amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-550                           550.78-0ubuntu0.20.04.1               amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                    0.18build1                            all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-550              550.78-0ubuntu0.20.04.1               amd64        NVIDIA binary Xorg driver

$ vulkaninfo
'DISPLAY' environment variable not set... skipping surface info
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.2.131

Instance Extensions: count = 19
====================
        VK_EXT_acquire_xlib_display            : extension revision 1
        VK_EXT_debug_report                    : extension revision 10
        VK_EXT_debug_utils                     : extension revision 2
        VK_EXT_direct_mode_display             : extension revision 1
        VK_EXT_display_surface_counter         : extension revision 1
        VK_EXT_swapchain_colorspace            : extension revision 4
        VK_KHR_device_group_creation           : extension revision 1
        VK_KHR_display                         : extension revision 23
        VK_KHR_external_fence_capabilities     : extension revision 1
        VK_KHR_external_memory_capabilities    : extension revision 1
        VK_KHR_external_semaphore_capabilities : extension revision 1
        VK_KHR_get_display_properties2         : extension revision 1
        VK_KHR_get_physical_device_properties2 : extension revision 2
        VK_KHR_get_surface_capabilities2       : extension revision 1
        VK_KHR_surface                         : extension revision 25
        VK_KHR_surface_protected_capabilities  : extension revision 1
        VK_KHR_wayland_surface                 : extension revision 6
        VK_KHR_xcb_surface                     : extension revision 6
        VK_KHR_xlib_surface                    : extension revision 6

Layers: count = 5
=======
VK_LAYER_LUNARG_standard_validation (LunarG Standard Validation Layer) Vulkan version 1.0.131, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 5
                GPU id  : 0 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 1 (llvmpipe (LLVM 12.0.0, 256 bits))
                Layer-Device Extensions: count = 0

                GPU id  : 2 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 3 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 4 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

VK_LAYER_MESA_device_select (Linux device selection layer) Vulkan version 1.2.73, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 5
                GPU id  : 0 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 1 (llvmpipe (LLVM 12.0.0, 256 bits))
                Layer-Device Extensions: count = 0

                GPU id  : 2 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 3 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 4 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

VK_LAYER_MESA_overlay (Mesa Overlay layer) Vulkan version 1.1.73, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 5
                GPU id  : 0 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 1 (llvmpipe (LLVM 12.0.0, 256 bits))
                Layer-Device Extensions: count = 0

                GPU id  : 2 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 3 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 4 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

VK_LAYER_NV_optimus (NVIDIA Optimus layer) Vulkan version 1.2.155, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 5
                GPU id  : 0 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 1 (llvmpipe (LLVM 12.0.0, 256 bits))
                Layer-Device Extensions: count = 0

                GPU id  : 2 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 3 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 4 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

VK_LAYER_NV_optimus (NVIDIA Optimus layer) Vulkan version 1.3.277, layer version 1:
        Layer Extensions: count = 0
        Devices: count = 5
                GPU id  : 0 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 1 (llvmpipe (LLVM 12.0.0, 256 bits))
                Layer-Device Extensions: count = 0

                GPU id  : 2 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 3 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

                GPU id  : 4 (NVIDIA A800 80GB PCIe)
                Layer-Device Extensions: count = 0

Presentable Surfaces:
=====================
...

The server has 4x A800 GPUs.
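
The vulkaninfo output above lists a software rasterizer (llvmpipe) as device 1 alongside the four A800 cards. Whether that is related to the crash is not established in this thread, but it can help to see at a glance which devices Vulkan exposes. The following is a minimal Python sketch, assuming device lines keep the `GPU id  : N (name)` layout shown above; the function names and the list of software-driver markers are illustrative, not part of SimplerEnv or SAPIEN.

```python
import re

# Matches device lines such as "GPU id  : 0 (NVIDIA A800 80GB PCIe)".
GPU_LINE = re.compile(r"GPU id\s*:\s*(\d+)\s*\((.*)\)\s*$")

# Substrings that identify Mesa's CPU-based Vulkan drivers (assumed list).
SOFTWARE_MARKERS = ("llvmpipe", "lavapipe")


def list_vulkan_devices(vulkaninfo_text):
    """Return {gpu_id: device_name} for every device line in the text."""
    devices = {}
    for line in vulkaninfo_text.splitlines():
        match = GPU_LINE.search(line)
        if match:
            devices[int(match.group(1))] = match.group(2)
    return devices


def software_devices(devices):
    """Return the ids of devices that look like CPU rasterizers."""
    return sorted(
        gpu_id
        for gpu_id, name in devices.items()
        if any(marker in name.lower() for marker in SOFTWARE_MARKERS)
    )


# A three-line excerpt in the same layout as the output above.
sample = """\
                GPU id  : 0 (NVIDIA A800 80GB PCIe)
                GPU id  : 1 (llvmpipe (LLVM 12.0.0, 256 bits))
                GPU id  : 2 (NVIDIA A800 80GB PCIe)
"""
devices = list_vulkan_devices(sample)
print(devices)
print(software_devices(devices))
```

To use it on a real machine, the text could come from `subprocess.run(["vulkaninfo"], capture_output=True, text=True).stdout`.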

xuanlinli17 commented 4 months ago

Is your numpy version < 2.0? Also make sure the three NVIDIA JSON files listed in the troubleshooting guide exist. CUDA needs to be >= 11.8.
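
These three checks can be scripted. The sketch below is only illustrative: the JSON paths are the locations commonly used on Ubuntu and are an assumption here, so verify them against the troubleshooting guide. The helper names are hypothetical, and the CUDA version has to be supplied by the caller as a string.

```python
from pathlib import Path

# Assumed locations of the three NVIDIA JSON files; verify them against the
# troubleshooting guide, since they can differ between distributions.
ASSUMED_NVIDIA_JSON_FILES = (
    "/usr/share/vulkan/icd.d/nvidia_icd.json",
    "/usr/share/glvnd/egl_vendor.d/10_nvidia.json",
    "/etc/vulkan/implicit_layer.d/nvidia_layers.json",
)


def numpy_major_below_2(version):
    """True if a version string such as '1.26.4' is older than 2.0."""
    return int(version.split(".")[0]) < 2


def cuda_at_least(version, minimum=(11, 8)):
    """True if a version string such as '12.1' is >= the minimum."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# Report on the current machine; numpy is optional for this sketch.
try:
    import numpy
    print("numpy < 2.0:", numpy_major_below_2(numpy.__version__))
except ImportError:
    print("numpy is not installed in this environment")
print("missing NVIDIA JSON files:", missing_files(ASSUMED_NVIDIA_JSON_FILES))
```

The CUDA check is kept as a pure function because the version can come from different places depending on the setup, for example `cuda_at_least("11.8")`.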

xuanlinli17 commented 4 months ago

You can create a fake display using

tmux new -s 1
sudo X :0 &
[exit tmux]
export DISPLAY=:0

hilookas commented 4 months ago

Thanks for your solution! Because the server is down, I could not try this solution 😂. I will temporarily mark this issue as closed and reopen it when necessary in the future.

ChenYi99 commented 3 months ago

I have the same problem on a V100, and the solution quoted below did not work for me.

You can create a fake display using

tmux new -s 1
sudo X :0 &
[exit tmux]
export DISPLAY=:0

xuanlinli17 commented 3 months ago

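Try CUDA_VISIBLE_DEVICES=0 xvfb-run -a python {}? (The environment variable goes before xvfb-run; placed after it, xvfb-run would try to execute the assignment as a command.)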

ChenYi99 commented 3 months ago

I have tried the suggested xvfb-run method, but it still does not work. Additionally, I have successfully run the mani_skill demo programs using X11-Forwarding with the following commands:

python -m mani_skill.utils.download_asset "ReplicaCAD"
python -m mani_skill.examples.demo_random_action -e "ReplicaCAD_SceneManipulation-v1" --render-mode="rgb_array" --record-dir="videos" # run headless and save video
python -m mani_skill.examples.demo_random_action -e "ReplicaCAD_SceneManipulation-v1" --render-mode="human" # run with GUI (recommended!)

However, when I add ray-tracing and run the following commands, I still encounter an error:

python -m mani_skill.utils.download_asset "ReplicaCAD"
python -m mani_skill.examples.demo_random_action -e "ReplicaCAD_SceneManipulation-v1" --render-mode="human" --shader="rt-fast"

The output I receive is:

opts: []
env_kwargs: {}
Segmentation fault (core dumped)

Any further suggestions or assistance would be greatly appreciated.

Best regards,

xuanlinli17 commented 3 months ago

Oh, render_mode="human" opens a window for visualization, so you can't use it on a headless server.

ChenYi99 commented 3 months ago

The following headless command causes the same problem:

python -m mani_skill.examples.demo_random_action -e "ReplicaCAD_SceneManipulation-v1" --render-mode="rgb_array" --shader="rt-fast"
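
When comparing runs with and without ray tracing from a script, it helps to tell a segmentation fault apart from an ordinary error exit. The following is a minimal Python sketch, assuming a POSIX system; `describe_exit` is a hypothetical helper, not part of ManiSkill or SimplerEnv.

```python
import signal


def describe_exit(returncode):
    """Explain a subprocess returncode or a shell exit status.

    subprocess reports death by signal N as -N, while shells report it
    as 128 + N. A program can also exit with a status above 128 on its
    own, so the second case is a heuristic.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return f"exited with error code {returncode}"
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        # Not a known signal number on this platform.
        return f"exited with code {returncode}"


print(describe_exit(-signal.SIGSEGV))
print(describe_exit(1))
```

One way to use it is to launch the same demo command twice with `subprocess.run`, once with `--shader="rt-fast"` and once without, and pass each `returncode` to `describe_exit`.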