Robot controller error due to Vulkan crash

cheneeheng commented 3 years ago

Hi there,

I have just installed the benchbot successfully on a machine with RTX2080 8GB, 32GB ram, i7-9700K CPU.

But when I try to run benchbot_run --robot carter --env miniroom:1 --task semantic_slam:passive:ground_truth, I keep getting a robot controller error (a small snippet is below, and the full log is in the attached file).

I'm wondering if you guys ever encountered this.

Thanks!

Chen.

... Supervisor is now available @ 'http://0.0.0.0:10000' ...

Waiting until a robot controller is found @ 'http://benchbot_robot:10000' ... Found
Sending environment data & robot config to controller ... Ready

################################################################################
####################### BENCHBOT ROBOT CONTROLLER ERROR ########################
################################################################################

ERROR: The BenchBot Robot Controller container has exited unexpectedly. This should not happen under normal operating conditions. Please see the complete log below for a dump of the crash output:

Robot controller is now available @ 'http://0.0.0.0:10000' ...
Waiting to receive valid config data...
172.20.0.102 - - [2021-03-22 15:04:04] "GET // HTTP/1.1" 200 152 0.000542
172.20.0.102 - - [2021-03-22 15:04:05] "POST //configure HTTP/1.1" 200 137 0.066839
Starting the requested real robot ROS stack ...
THE PROCESS STARTED BY THE FOLLOWING COMMAND HAS CRASHED:
sed -i "0,/\"pose\":/{s/(\"pose\": )(.)/\1[0.7, 0, 0, -0.7, 1.2, 1.5, 0.3]/}" /benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full_config.json && perl -0777 -i -pe 's/\"static_mesh\".?]/\"static_mesh\":[{"name": "bottle"}, {"name": "cup"}, {"name": "knife"}, {"name": "bowl"}, {"name": "wine glass"}, {"name": "fork"}, {"name": "spoon"}, {"name": "banana"}, {"name": "apple"}, {"name": "orange"}, {"name": "cake"}, {"name": "potted plant"}, {"name": "mouse"}, {"name": "keyboard"}, {"name": "laptop"}, {"name": "cell phone"}, {"name": "book"}, {"name": "clock"}, {"name": "chair"}, {"name": "table"}, {"name": "couch"}, {"name": "bed"}, {"name": "toilet"}, {"name": "tv"}, {"name": "microwave"}, {"name": "toaster"}, {"name": "refrigerator"}, {"name": "oven"}, {"name": "sink"}, {"name": "person"}]/s' /benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full_config.json && cd "/benchbot/addons/benchbot_addons/benchbot-addons/envs_isaac_develop/environments" && .sim_package/IsaacSimProject.sh "/Game/AI_vol3_03_base/Maps/AI_vol3_scene_03" -isaac_sim_config_json= "/benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full.json" -windowed -ResX=960 -ResY=540 -vulkan -game

...

log.txt

btalb commented 3 years ago

Thanks for reporting @cheneeheng .

We've seen this arbitrary Segmentation fault (core dumped) issue occur before when running on machines with non-standard graphics configurations. Unfortunately it comes from Isaac Sim and doesn't give us much detail regarding the cause (everything that says "Error" in that log is part of a normal working run...).

Can we confirm how you are using the machine specified:

1. Are you directly on the machine, and does it have a physical screen attached?
2. Are you connecting via SSH with window forwarding?
3. Are you using other remote software like remote desktop or alternatives?

cheneeheng commented 3 years ago

Hi @btalb,

aiks that does not sound good.

But what do you mean by non-standard?
A setup other than the ones mentioned in the prerequisites?
What did you do the last time you encountered this issue?

As for your questions:

1. Are you directly on the machine, and does it have a physical screen attached? Yes and yes.

2. Are you connecting via SSH with window forwarding? No.

3. Are you using other remote software like remote desktop or alternatives? No. Although the plan is to do so once everything is running.

Thanks !

btalb commented 3 years ago

The core of the issue is that Vulkan only seems to be happy when it is using a discrete GPU to render to a physical screen.

The reason I ask all of those questions is that we have had these issues when using configurations that tamper with that relationship. For example:

I have a working system next to me (see log_success.txt),
but when I instead SSH into it with window forwarding I get the same SegFault (see log_failure.txt)

I can see Vulkan is the cause because the following extra lines appear in the failure log when I diff the two logs:

[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkDestroySurfaceKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceSupportKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceCapabilitiesKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceFormatsKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfacePresentModesKHR
[2021.03.23-20.25.55:959][  0]LogLinux: Warning: MessageBox: Failed to find all required Vulkan entry points! Try updating your driver.: No Vulkan entry points found!:
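For anyone wanting to run the same check on their own crash log, here is a minimal sketch that pulls out these lines. The function name `vulkan_entry_point_failures` and the file name `log.txt` in the usage comment are illustrative and not part of BenchBot; the two patterns are simply taken from the warnings above.

```shell
#!/usr/bin/env bash
# Sketch only: print the Vulkan entry-point failures found in a crash log.
# The function name is hypothetical; the patterns come from the log excerpt.

vulkan_entry_point_failures() {
  local log_file="$1"
  # Matches the per-symbol warnings and the final message-box line.
  # grep exits non-zero when nothing matches, so the function's exit
  # status tells you whether the log shows this failure at all.
  grep -E 'Failed to find entry point for vk|No Vulkan entry points found' "$log_file"
}

# Example usage (log.txt is assumed to be a saved benchbot_run crash log):
#   if vulkan_entry_point_failures log.txt; then
#     echo "Vulkan could not find its surface entry points"
#   fi
```

An empty result does not prove Vulkan is healthy; it only means this particular signature is absent from the log.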

I can also see those lines in the log you provided me. We need to dig a little deeper though to try and figure out why Vulkan is throwing those errors for the simulator:

Run a barebones Vulkan command to show a spinning cube:

https://user-images.githubusercontent.com/3508780/112221489-38fa7d80-8c73-11eb-918a-2b35820bdae0.mp4

Here's the command you need:

docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'

If that fails, can you show me the output of the following diagnostic command for Vulkan:

docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vulkaninfo'

Thanks; I wish these things were simpler...

cheneeheng commented 3 years ago

I shall try them out once I get back to the lab on Friday.

I regularly work with programs using CUDA, so this is not the worst I have seen :smiley:

tyou1 commented 3 years ago

Hi,

I encountered the same error log as @cheneeheng today when I tried to run

benchbot_run --robot carter --env miniroom:1 --task semantic_slam:passive:ground_truth

Last week when I ran this command, there was no error, but no simulator window appeared after this output:

Supervisor is now available @ 'http://0.0.0.0:10000' ...
Waiting until a robot controller is found @ 'http://benchbot_robot:10000' ... Found
Sending environment data & robot config to controller ... Ready

I tried this command as suggested:

docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'

Unfortunately, I get this error:

'Cannot find a compatible Vulkan installable client driver (ICD)'

I am using x2goclient via SSH to reach a remote machine with an RTX 3080. Could the remote desktop be the reason the simulator window doesn't appear?

Thanks in advance!

btalb commented 3 years ago

Thanks for the information @tyou1 .

The behaviour you experienced last week with the window showing up and disappearing was a bug which was hiding the error log. We weren't correctly bubbling the crash log up to benchbot_run's stdout. That's been fixed in v2.1 (benchbot_run --version), so you should always see the crash log now when the simulator crashes.

What's important to understand with any remote access system is how it actually performs the rendering. I don't know much about x2go, but here are some examples I know of where a laptop is the client and the GPU machine is the server:

SSH with X forwarding: this passes the X rendering commands to your client machine (i.e. if I SSH into my GPU server from my laptop, my laptop does the rendering)
VNC servers & clients: this forwards whatever the server renders on its screen to your client, which just shows an image of what's been rendered by the server
Remote desktop systems with virtual screens: these often create virtual X servers & forward their contents to the client. Where things get tricky is the nvidia driver can only be tied to a single X server & can't be swapped without killing the entire X server. So when you turn your computer on & sign in, your GPU is locked away in that default X server and can't be used by the virtual X server created for remote desktops. TL;DR: remote desktop software almost never provides hardware-accelerated rendering, and my guess is x2go falls in this bucket

How does this tie in with BenchBot? NVIDIA's Isaac Simulator relies on hardware-accelerated rendering powered by Vulkan. If the system doing the rendering doesn't meet those requirements, then we get a crash from the simulator (a crash that should be much more verbose & explicit.... but a crash nonetheless).

So from those requirements, there are only a couple of solutions I would expect to work for cases where your GPU is on a remote machine:

VNC: set up a VNC server on your GPU machine & connect to it via a VNC client. Make sure the VNC server is attached to the same X server as your GPU. This is the easiest option to use BenchBot on a remote machine, and is extremely simple if the remote machine has a physical screen attached (@david2611 uses a solution like this one all the time).
SSH with tweaked X forwarding: you essentially tell benchbot_run to render the simulator on your GPU machine's screen, not your laptop's. The downside is you won't see the simulator remotely on your laptop, but it will run successfully (I use this solution all the time). To tell benchbot_run where to render you simply adjust the DISPLAY environment variable. For example, terminals opened on my GPU machine show this:
```
ben@gpu-machine:~$ echo $DISPLAY
:1
```
Then when I'm SSHing from home, I manually set the DISPLAY target to :1 via:
```
ben@home-machine:~$ ssh -X ben@gpu-machine
ben@gpu-machine:~$ echo $DISPLAY
localhost:10.0
ben@gpu-machine:~$ export DISPLAY=:1
ben@gpu-machine:~$ benchbot_run ...
```
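The DISPLAY check above can be sketched as a small guard to run before benchbot_run. This is only a sketch: it assumes an SSH-forwarded display always starts with `localhost:` (as in the session above), and `is_forwarded_display` is a made-up helper name, not something BenchBot provides.

```shell
#!/usr/bin/env bash
# Sketch only: detect a DISPLAY value that points at an SSH-forwarded X display.
# Assumption: forwarded displays look like "localhost:10.0" and local ones
# like ":0" or ":1". The function name is hypothetical.

is_forwarded_display() {
  # Use the first argument if given, otherwise the current $DISPLAY.
  local display="${1-${DISPLAY-}}"
  case "$display" in
    localhost:*) return 0 ;;  # set by `ssh -X`: the client would do the rendering
    *)           return 1 ;;
  esac
}

# Example usage before starting the simulator:
#   if is_forwarded_display; then
#     echo "DISPLAY=$DISPLAY is forwarded; point it at the GPU machine's screen" >&2
#     echo "e.g. export DISPLAY=:1" >&2
#   fi
```

The `:1` in the usage comment is just the value from the example session; the right number is whatever `echo $DISPLAY` prints in a terminal opened directly on the GPU machine.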

Hope this helps. I know it's not an ideal solution, but hardware-accelerated rendering under Linux with Vulkan support is something that's traditionally caused enough challenges by itself. Crisp solutions for remote use on top of this unfortunately aren't quite there yet.

We're always interested in better solutions though. If anyone knows of better ways to enable remote hardware-accelerated rendering, especially on headless machines, we'd love to hear them. Unfortunately, it's not something I have time to dig too far into at the moment.

cheneeheng commented 3 years ago

@btalb the vulkaninfo command is returning this error:

No protocol specified
WARNING: [Loader Message] Code 0 : loader_icd_scan: Can not find 'ICD' object in ICD JSON file /usr/share/vulkan/icd.d/nvidia_layers.json. Skipping ICD JSON
error: XDG_RUNTIME_DIR not set in the environment.
No protocol specified
XCB failed to connect to the X server due to error:1.
ERROR at /build/vulkan-tools-1.2.162.1~rc1-1lunarg18.04/vulkaninfo/vulkaninfo.h:847: AppCreateXcbSurface failed to establish connection

Update 1: Reinstalled all nvidia drivers and cuda just to be safe. Gave root access to the X server with xhost local:root, and both Vulkan debug commands now work, but the original error still persists.

Update 2: So I ran xhost local:root, and commented out the line xhost -local:root > /dev/null and it works. :smiley: Somehow running the original script removes the root access to the X-server and causes the error at the beginning of this comment to occur again.

Update 2.1: Worked through the tutorials, everything is working fine. (though some commands seem to be outdated :smiley: )

Update 2.2: It seems that removing this line xhost -local:root > /dev/null makes the error go away.

Issue can be closed if @btalb doesn't need anything more from my side.

btalb commented 3 years ago

That's excellent @cheneeheng, great to hear!

I'm not sure how that series of errors is related (I've never seen the first error before, even before xhost local:root was added into the scripts). Did a reboot fix that error?

It's a little odd that the line is causing issues with containers, as it runs after all of the containers have started and so shouldn't affect them. But maybe there is some asynchronous behaviour causing race conditions. Thanks for pointing that out though, that's a really good find.

I'll close this issue here, but feel free to open a new issue with any outdated commands you find in the documentation / tutorials. I'm always keen to fix those when they're found. Unfortunately, I'm a little documentation blind by this point.

cheneeheng commented 3 years ago

Reboot (x3) did not fix the error. Only the xhost command did.

tyou1 commented 3 years ago

Hi

May I ask where the line xhost -local:root > /dev/null that you commented out is, @cheneeheng? Unfortunately, I still get this crash log after switching to Remmina to connect to the remote machine (RDP). So, I wonder if something other than the remote access problem is causing the crash.

Or does the simulator window only appear successfully with a VNC server & client? May I ask which specific VNC server & client @david2611 uses to run BenchBot smoothly?

Thanks a lot! :)

cheneeheng commented 3 years ago

@tyou1

Here is the line : https://github.com/qcr/benchbot/blob/783e9ca27866ae64c891262aa5f46a7a53ee37c7/bin/benchbot_run#L347

You could try running xhost +local:root in a terminal and then running the vkcube command to check whether the access rights are the problem. Also make sure the DISPLAY environment variable is correctly set when you do this :smiley:
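The two failure messages quoted earlier in this thread point in different directions, so a rough sketch like the one below may help tell them apart. The function name and the hint wording are made up for illustration; only the matched messages come from the thread, and the suggested causes are the ones discussed here rather than a definitive diagnosis.

```shell
#!/usr/bin/env bash
# Sketch only: map the vkcube / vulkaninfo failure messages seen in this
# thread to the cause discussed for each. Hypothetical helper, not BenchBot.

vulkan_failure_hint() {
  local output="$1"
  case "$output" in
    *"No protocol specified"*)
      # Seen when root in the container was refused access to the X server.
      echo "X access denied: try xhost +local:root and check DISPLAY" ;;
    *"Cannot find a compatible Vulkan installable client driver"*)
      # Seen when running through a remote desktop session.
      echo "No usable Vulkan driver: check for a virtual or forwarded X server" ;;
    *)
      echo "Unrecognised failure: share the full output" ;;
  esac
}

# Example usage with output saved to a (hypothetical) file:
#   vulkan_failure_hint "$(cat vulkaninfo_output.txt)"
```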

btalb commented 3 years ago

Hi @tyou1 , good question.

Only VNC will work as RDP generally creates a virtual X server which won't have the hardware accelerated rendering.

@david2611 uses NoMachine; just make sure it's not using a virtual screen.

There's plenty of simple VNC options out there also like:

TigerVNC (I've had success with this many years ago)
TightVNC
RealVNC
Xvnc
etc.

The crucial thing is just to make sure it is mirroring a physical screen, and not creating a virtual one.

btalb commented 3 years ago

Remmina should also be fine as a VNC client to connect to a server.

qcr / benchbot

Robot controller error due to Vulkan crash #18