`Segmentation fault (core dumped)` error when calling `env.step()` inside a docker

MasterXiong commented 1 week ago

Hi,

Thanks for sharing this brilliant package for real-2-sim evaluation!

I'm trying to run SimplerEnv inside a Docker container on a Linux server with GPU support. The environment can be successfully created and reset, but a Segmentation fault (core dumped) error shows up when calling env.step(). I have followed the troubleshooting instructions in the README, but still can't solve this issue. Could you please help me figure out what the issue might be? Thanks a lot for your help!

Below is the Dockerfile I use (modified from ManiSkill's Dockerfile):

# Base Image
FROM nvidia/cudagl:11.3.1-devel-ubuntu20.04

# Args need to be below FROM!
ARG USER_ID
ARG PYTHON_VERSION=3.10

ENV NVIDIA_DRIVER_CAPABILITIES=all

# Install os-level packages
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
    bash-completion \
    build-essential \
    ca-certificates \
    cmake \
    curl \
    git \
    htop \
    libegl1 \
    libxext6 \
    libjpeg-dev \
    libpng-dev  \
    libvulkan1 \
    rsync \
    tmux \
    unzip \
    vim \
    vulkan-utils \
    wget \
    xvfb \
    # lib for SAPIEN rendering
    libglvnd-dev \
    && rm -rf /var/lib/apt/lists/*

# Install (mini) conda
RUN curl -o ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    /opt/conda/bin/conda init && \
    /opt/conda/bin/conda install -y python="$PYTHON_VERSION" && \
    /opt/conda/bin/conda clean -ya

ENV PATH=/opt/conda/bin:$PATH
SHELL ["/bin/bash", "-c"]

# https://github.com/haosulab/ManiSkill/issues/9
COPY docker/nvidia_icd.json /usr/share/vulkan/icd.d/nvidia_icd.json
COPY docker/nvidia_layers.json /etc/vulkan/implicit_layer.d/nvidia_layers.json

# env
RUN git clone https://github.com/simpler-env/SimplerEnv --recurse-submodules
RUN pip install -e ./SimplerEnv/ManiSkill2_real2sim
RUN pip install -e ./SimplerEnv

# Change permissions
RUN useradd --shell /bin/bash -u ${USER_ID} -o -d /user user

# Set python ENV variables
ENV PYTHONUNBUFFERED=1

The Docker build seems to work fine, and I'm using the same test script as given in the README.

xuanlinli17 commented 1 week ago

Does vulkaninfo run w/o error on your end?

MasterXiong commented 1 week ago

Hi @xuanlinli17, thanks for your help! Yes, vulkaninfo runs normally on my end. However, some attributes are reported as false, and I'm not sure whether that is normal.

xuanlinli17 commented 1 week ago

What's the NVIDIA driver version? I'd recommend 535+. It also needs to be recent enough to support the CUDA version you are using.
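For anyone who wants to check this from a script, here is a minimal sketch that reads the driver version through nvidia-smi and compares its major number against the 535 threshold mentioned above. The helper names are hypothetical and not part of SimplerEnv or ManiSkill; the only external tool assumed is nvidia-smi with its `--query-gpu=driver_version` option.

```python
import shutil
import subprocess
from typing import Optional

MIN_DRIVER_MAJOR = 535  # threshold recommended in this thread


def parse_driver_major(version: str) -> int:
    """Return the major component of a version string such as '545.23.08'."""
    return int(version.strip().split(".")[0])


def driver_is_recent_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version is at least `minimum`."""
    return parse_driver_major(version) >= minimum


def query_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the driver version; return None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0 or not proc.stdout.strip():
        return None
    # nvidia-smi prints one line per GPU; take the first line.
    return proc.stdout.strip().splitlines()[0]


version = query_driver_version()
if version is None:
    print("nvidia-smi not found or returned no driver version")
elif driver_is_recent_enough(version):
    print(f"driver {version}: meets the recommended minimum")
else:
    print(f"driver {version}: older than the recommended {MIN_DRIVER_MAJOR}")
```

Inside a container, the value reported is that of the host driver exposed by the NVIDIA container runtime, so running this in both places should give the same answer.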

MasterXiong commented 1 week ago

Thanks! I upgraded my nvidia driver version to 545 but still got the same error.

Also, when calling vulkaninfo, I got the following messages at the beginning: 'DISPLAY' environment variable not set... skipping surface info, and error: XDG_RUNTIME_DIR not set in the environment. Could these be causing the error I got?

And is there any requirement on the minimal version of cuda?

xuanlinli17 commented 1 week ago

CUDA 11.8 is needed for RT-1 and Octo to run properly on GPU. See the README for more details.

MasterXiong commented 1 week ago

Thanks! But the current issue happens when just using random actions, so I think it should not be caused by CUDA? Do you have any other suggestions on what to check in addition to the nvidia driver version? Thanks!

xuanlinli17 commented 1 week ago

I'm not sure. If something is wrong with the setup, the core dump usually occurs as soon as you create the environment, before you step any actions.

MasterXiong commented 1 week ago

Yeah, that's quite weird. Is there anything that happens only in step and not in reset? That might hint at which operation causes the core dump. Thanks!
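One generic way to narrow this down is Python's standard faulthandler module, which prints the Python-level traceback when the interpreter receives SIGSEGV. The sketch below is only an illustration: it does not call SimplerEnv at all, and uses ctypes.string_at(0) as a stand-in for whatever native call crashes inside env.step(). The function and variable names are made up for the example.

```python
import subprocess
import sys
import textwrap

# Stand-in for a script that crashes in native code. In practice the body
# would be the real test script; ctypes.string_at(0) just forces a crash.
CRASHING_SCRIPT = textwrap.dedent(
    """
    import ctypes
    import faulthandler

    faulthandler.enable()  # dump the Python traceback on a fatal signal

    def step():
        ctypes.string_at(0)  # read from address 0, which crashes

    step()
    """
)


def run_and_capture(script: str) -> subprocess.CompletedProcess:
    """Run `script` in a child interpreter so the crash cannot kill this one."""
    return subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )


result = run_and_capture(CRASHING_SCRIPT)
print("return code:", result.returncode)
print(result.stderr)  # traceback naming the Python frames above the crash
```

For an existing script, the same effect should be available without editing it, by running `python -X faulthandler script.py` or setting the PYTHONFAULTHANDLER environment variable.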

zhiyuan-zhang0206 commented 1 week ago

Hi, I encountered a similar bug using a conda env on my machine (Ubuntu 22.04 & NVIDIA 4090 GPU). The command vulkaninfo | head -n 5 gives:

WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 0. Skipping ICD.
Vulkan Instance Version: 1.3.204

Running example.ipynb then crashes the kernel. I converted it into a Python file, and the env.step line yielded: Segmentation fault (core dumped)

I installed the available system updates and reinstalled the NVIDIA driver, but it's still not working.
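Since the loader message above says an ICD failed to create an instance, one thing that can be inspected is which ICD manifests the Vulkan loader sees and whether the driver library each one names can be loaded at all. The following is a rough sketch under stated assumptions: it looks only at the two common Linux manifest directories (the loader searches more), the helper names are hypothetical, and a library that loads successfully does not prove the ICD works.

```python
import ctypes
import glob
import json
import os
from typing import List, Optional, Tuple

# Two common manifest locations on Linux; the loader also checks others.
ICD_DIRS = ["/usr/share/vulkan/icd.d", "/etc/vulkan/icd.d"]


def read_icd_library(manifest_path: str) -> Optional[str]:
    """Return the 'library_path' named in an ICD manifest, or None."""
    try:
        with open(manifest_path, "r", encoding="utf-8") as f:
            manifest = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None
    return manifest.get("ICD", {}).get("library_path")


def resolve_library(manifest_path: str, library_path: str) -> str:
    """Bare names go to the dynamic loader; relative paths are manifest-relative."""
    if os.path.isabs(library_path) or os.sep not in library_path:
        return library_path
    base = os.path.dirname(manifest_path)
    return os.path.normpath(os.path.join(base, library_path))


def can_load(library: str) -> bool:
    """True if the shared library can be opened by the dynamic loader."""
    try:
        ctypes.CDLL(library)
        return True
    except OSError:
        return False


def check_icds(dirs: List[str] = ICD_DIRS) -> List[Tuple[str, Optional[str], bool]]:
    """List (manifest, library_path, loadable) for every manifest found."""
    results = []
    for d in dirs:
        for path in sorted(glob.glob(os.path.join(d, "*.json"))):
            lib = read_icd_library(path)
            ok = lib is not None and can_load(resolve_library(path, lib))
            results.append((path, lib, ok))
    return results


for path, lib, ok in check_icds():
    print(f"{path}: {lib} -> {'loadable' if ok else 'NOT loadable'}")
```

A manifest whose library cannot be loaded, or no NVIDIA manifest at all, would point to a driver or container setup problem rather than to SimplerEnv itself.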

xuanlinli17 commented 1 week ago

These are typically setup issues related to, e.g., Vulkan. If neither https://maniskill.readthedocs.io/en/latest/user_guide/getting_started/installation.html#troubleshooting nor apt install nvidia-driver-xxx (some newer version) solves the problem, then this is tricky and might be due to something specific to your setup...

xuanlinli17 commented 6 days ago

Additionally, try something like https://github.com/haosulab/SAPIEN/issues/115#issuecomment-1434899965 ?

zhiyuan-zhang0206 commented 3 days ago

I tried both, and they did not work, still the same error. I will try something else.

zhiyuan-zhang0206 commented 2 days ago

I have found that the problem lies in the IK computation of mani_skill2_real2sim. The call obs, reward, done, truncated, info = env.step(action) leads to:

BaseEnv class, line 548: self.step_action(action)
then line 561: self.agent.set_action(action)
then BaseAgent, line 165: self.controller.set_action(action)
then CombinedController, line 269: controller.set_action(action[start:end])
then PDEEPosController, line 114: self._target_qpos = self.compute_ik(self._target_pose)
then line 61: result, success, error = self.pmodel.compute_inverse_kinematics(self.ee_link_idx, target_pose, initial_qpos=self.articulation.get_qpos(), active_qmask=self.qmask, max_iterations=max_iterations)

This last line yields the segmentation fault. The argument values are:

self.ee_link_idx: 13
target_pose: Pose([1.38064, -0.348417, 1.1829], [0.178008, -0.693489, 0.567687, -0.406346])
initial: array([-0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573, -0.26394573], dtype=float32)
active_mask: array([ True, True, True, True, True, True, True, False, False, True, True])
max_iterations: 100
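Before calling into native code, it can help to sanity-check the arguments in plain Python. The sketch below is a hypothetical helper, not part of mani_skill2_real2sim or SAPIEN: it takes the values printed above as ordinary lists and reports array lengths, finiteness, whether all joint positions are identical, and the norm of the pose quaternion. It only describes the inputs and does not claim that any of these properties causes the crash.

```python
import math
from typing import Dict, Sequence


def summarize_ik_inputs(
    qpos: Sequence[float],
    qmask: Sequence[bool],
    quat: Sequence[float],
) -> Dict[str, object]:
    """Collect basic consistency facts about IK inputs (illustrative only)."""
    return {
        "qpos_len": len(qpos),
        "qmask_len": len(qmask),
        "lengths_match": len(qpos) == len(qmask),
        "all_finite": all(math.isfinite(v) for v in list(qpos) + list(quat)),
        "qpos_all_identical": len(set(qpos)) == 1,
        "quat_norm": math.sqrt(sum(c * c for c in quat)),
    }


# Values copied from the report above.
qpos = [-0.26394573] * 11
qmask = [True] * 7 + [False, False] + [True, True]
quat = [0.178008, -0.693489, 0.567687, -0.406346]

report = summarize_ik_inputs(qpos, qmask, quat)
for key, value in report.items():
    print(f"{key}: {value}")
```

With the reported values, the lengths agree and the quaternion is close to unit norm, while every entry of the initial joint vector is the same number. Whether that last point matters is an open question.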

xuanlinli17 commented 2 days ago

This is quite strange: if the setup isn't right, env.reset() usually causes the core dump directly, rather than env.step() and pmodel.compute_inverse_kinematics(). I don't know what's happening.

simpler-env / SimplerEnv

`Segmentation fault (core dumped)` error when calling `env.step()` inside a docker #6