qcr / benchbot

BenchBot is a tool for seamlessly testing & evaluating semantic scene understanding tools in both realistic 3D simulation & on real robots
BSD 3-Clause "New" or "Revised" License

Controller Crash & A Naive Solution #99

Open AronCao49 opened 1 year ago

AronCao49 commented 1 year ago

Hi, I recently tried to install BenchBot on two host machines. On the first one, with a 3070, BenchBot works smoothly. However, when I run the same command on the second machine, with a 4080, a controller crash similar to #92 happens. Specifically, the Isaac Sim window pops up and stays for about 10 seconds, then vanishes as soon as starting the robot controller reports "Ready".

I spent some time trying to fix this with the potential solutions from #92, but none of them worked. Fortunately, when I made the Isaac Sim window full screen and maximised the Console window to check its output... Isaac Sim survived!

Although it sounds a bit silly, I ran quite a few test rounds and the Isaac Sim window survived every time. It could also be used to run the provided example demos such as "hello_passive". In short, the key is to hide the Viewport window while Isaac Sim is initializing, as in the screenshot below:

Screenshot from 2023-05-12 13-17-28

Comparing the console outputs from the 3070 and the 4080, the issue may lie in two error messages:

Screenshot from 2023-05-12 10-35-40

which I can only find on the 4080 and not on the 3070, and which are possibly causing the controller crash.
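If the console output of each machine is saved to a text file, the comparison above can be partly automated. The following is a minimal sketch, not something used in this thread: the function name `errors_only_in_second` and the log file names are hypothetical, and it assumes error lines contain the word "error". Lines that carry timestamps will always differ, so those would need to be stripped first.

```shell
# Print the error lines that appear in the second log but not in the first.
# Assumption: both arguments are plain-text logs saved from the two machines.
errors_only_in_second() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    # Keep only lines mentioning "error" (case-insensitive), sorted and deduplicated.
    grep -i 'error' "$1" | sort -u > "$tmp_a"
    grep -i 'error' "$2" | sort -u > "$tmp_b"
    # comm -13 hides lines unique to the first file and lines common to both.
    comm -13 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

For example, `errors_only_in_second console_3070.log console_4080.log` would list the error lines seen only on the 4080.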

Some other console outputs also differ. I have attached screenshots for further discussion. The specifications of my host machines are also listed below.

  1. 3070 spec

    Core host system checks:
    Ubuntu version >= 20.04: Passed (20.04)

    Running Nvidia related system checks:
    NVIDIA GPU available: Found card of type '10de:2484'
    NVIDIA driver is running: Found
    NVIDIA driver version valid: Valid (530.30.02)
    NVIDIA driver from a standard PPA: PPA is valid
    CUDA drivers installed: Drivers found
    CUDA drivers version valid: Valid (530.30.02-1)
    CUDA drivers from the NVIDIA PPA: PPA is valid
    CUDA is installed: CUDA found
    CUDA version valid: Valid (12.1)
    CUDA is from the NVIDIA PPA: PPA is valid

    Running Docker related system checks:
    Docker is available: Found
    Docker version valid: Valid (20.10.12)
    NVIDIA Container Toolkit installed: Found (1.13.1)
    Docker runs without root: Passed

    Running checks of filesystem used for Docker:
    /var/lib/docker on ext4 filesystem: Yes (/dev/sdb6)
    /var/lib/docker supports suid: Enabled
    /var/lib/docker drive space check: Sufficient space (203G)

    Miscellaneous requirements:
    Pip python package manager available: Found (21.3.1)
    Tkinter for Python installed: Found
    PIL (with ImageTk) for Python installed: Found

    Manual installation steps for Omniverse-powered Isaac Sim:
    License accepted for Omniverse: Yes
    Access to nvcr.io Docker registry: Yes


  2. 4080 spec

    Core host system checks:
    Ubuntu version >= 20.04: Passed (20.04)

    Running Nvidia related system checks:
    NVIDIA GPU available: Found card of type '10de:2704'
    NVIDIA driver is running: Found
    NVIDIA driver version valid: Valid (530.30.02)
    NVIDIA driver from a standard PPA: PPA is valid
    CUDA drivers installed: Drivers found
    CUDA drivers version valid: Valid (530.30.02-1)
    CUDA drivers from the NVIDIA PPA: PPA is valid
    CUDA is installed: CUDA found
    CUDA version valid: Valid (12.1)
    CUDA is from the NVIDIA PPA: PPA is valid

    Running Docker related system checks:
    Docker is available: Found
    Docker version valid: Valid (23.0.6)
    NVIDIA Container Toolkit installed: Found (1.13.1)
    Docker runs without root: Passed

    Running checks of filesystem used for Docker:
    /var/lib/docker on ext4 filesystem: Yes (/dev/sda1)
    /var/lib/docker supports suid: Enabled
    /var/lib/docker drive space check: Sufficient space (785G)

    Miscellaneous requirements:
    Pip python package manager available: Found (23.1.2)
    Tkinter for Python installed: Found
    PIL (with ImageTk) for Python installed: Found

    Manual installation steps for Omniverse-powered Isaac Sim:
    License accepted for Omniverse: Yes
    Access to nvcr.io Docker registry:



3. 3070 Isaac Sim Console Output:
![3070_Isaac_Sim_Output](https://github.com/qcr/benchbot/assets/50131988/22ec841d-4394-4075-91d9-fc72743e5941)

4. 4080 Isaac Sim Console Output:
![4080_Isaac_Sim_Output](https://github.com/qcr/benchbot/assets/50131988/df3b26d9-dea4-43cb-8376-ec010d8553d9)

david2611 commented 1 year ago

Thanks for bringing this to our attention and for all your supplied information. We will have a look into it. I hope in the meantime that the 3070 is adequate for you to make some headway in the challenge.

AronCao49 commented 1 year ago

Thanks for your prompt reply. Currently, I can run BenchBot on the 4080 by hiding the Viewport window while Isaac Sim is initializing, and then restoring the Viewport window after the GPU crash error has passed. I think the controller instability is due to this GPU crash error at startup, which does not affect the controller afterwards (i.e., my guess is that as long as no rendering work is in progress when the GPU crash error occurs, Isaac Sim can operate normally).

Anyway, I hope this observation helps to further improve the project. Looking forward to your great work in the future!

david2611 commented 1 year ago

Hmmm, the current advice we have been given for Omniverse is that the newer graphics drivers might actually be causing the issue. Have you tried downgrading your graphics driver to 525 (you would also need to downgrade CUDA to 12.0)? There is some advice on how to do this for BenchBot here: https://github.com/qcr/benchbot/issues/92#issuecomment-1505011875
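One quick way to see which driver branch a machine is actually running is to query `nvidia-smi` and compare the major version with the expected branch. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `driver_major` and `is_branch` are made up for illustration and are not part of BenchBot.

```shell
# Print the major part of a driver version string, e.g. 530.30.02 -> 530.
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeed if the version string belongs to the given driver branch.
is_branch() {
    [ "$(driver_major "$1")" = "$2" ]
}

# Only query the GPU when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if is_branch "$ver" 525; then
        echo "Driver $ver is on the 525 branch"
    else
        echo "Driver $ver is not on the 525 branch"
    fi
fi
```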

AronCao49 commented 1 year ago

Yes, I have. The modification suggested in #92 does not work in my case. Although it installs driver 525 and CUDA 12 at first, the subsequent cuda-driver installation forces the driver to update to 530. The cuda-driver validation step does not accept any version other than the newest one. A possible workaround is to change the cuda-driver/ in ben/benchbot_install to cuda-driver-*. However, even after passing all validation, the installation still failed. Unfortunately, I did not keep any log or output from this step...

Based on my experience, the 3070 works perfectly with the newest versions of CUDA (12.1) and the driver (530), but the 4080 does not. Maybe an incompatibility between 40xx GPUs and this CUDA or driver version causes the issue.

darkain84 commented 1 year ago

Even if you follow the guide and install cuda=12.0.0-1, the nvidia-driver is still upgraded to the latest one. The reason is that the cuda package depends on cuda-drivers without a specific version, so apt tries to install the latest cuda-drivers, which in turn upgrades the nvidia-driver.
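The dependency behaviour described above can be checked by looking at the `Depends:` field that `apt-cache show` prints for a package. The sketch below is only illustrative: `loose_deps` is a hypothetical helper that lists the dependencies not pinned to one exact version with `(= ...)`, which are the ones apt is free to resolve to a newer release.

```shell
# Read `apt-cache show` output on stdin and print every dependency that is
# not pinned to one exact version with "(= ...)".
loose_deps() {
    sed -n 's/^Depends: //p' | tr ',' '\n' | sed 's/^ *//' | grep -v '(= '
}

# Usage (assumes the NVIDIA apt repository is configured on the machine):
#   apt-cache show cuda=12.0.0-1 | loose_deps
```

If cuda-drivers shows up in that list, installing the cuda package alone will not keep the driver on the 525 branch.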

I found a way to downgrade the nvidia-driver and cuda packages, and succeeded in launching BenchBot without the GPU crash on a 4080 card. My solution is:

Uninstall cuda and nvidia-driver, then install the packages below manually before running the BenchBot install script.

# Install nvidia-driver 525
$ sudo apt install nvidia-driver-525
# Check the latest cuda-drivers for 525 (e.g. 525.125.06-1) at 
# https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and install it.
$ sudo apt install cuda-drivers=525.125.06-1
# Run benchbot installer
$ ./install  # or: benchbot_install

I hope this helps others who use 40xx graphics cards.
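The step of looking up the latest cuda-drivers release for the 525 branch can also be done from the command line rather than by browsing the repository listing. This is a rough sketch, assuming the NVIDIA apt repository is already configured and GNU `sort` (for `-V`) is available; `latest_for_branch` is a hypothetical helper name.

```shell
# Read `apt-cache madison <package>` output on stdin and print the highest
# available version whose major number matches the given branch.
latest_for_branch() {
    awk -F'|' -v branch="$1" '
        { gsub(/ /, "", $2) }                       # strip padding around the version column
        index($2, branch ".") == 1 { print $2 }     # keep versions starting with "<branch>."
    ' | sort -V | tail -n 1
}

# Usage:
#   apt-cache madison cuda-drivers | latest_for_branch 525
```

Version sort matters here because a plain lexical sort would order 525.125 before 525.60.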

david2611 commented 1 year ago

Thanks @darkain84 for pinging this issue again. I have just pushed an update to benchbot_install to try and fix the cuda driver/cuda version issues that people have been encountering.

Could someone with hardware known to cause the crashes please try a fresh install (without pre-existing CUDA and NVIDIA drivers) and confirm whether this problem has been resolved?