Closed cheneeheng closed 3 years ago
Thanks for reporting @cheneeheng .
We've seen this arbitrary Segmentation fault (core dumped) issue occur before when running on machines with non-standard graphics configurations. Unfortunately it comes from Isaac Sim and doesn't give us much detail regarding the cause (everything that says "Error" in that log is part of a normal working run...).
Can we confirm how you are using the machine specified:
Hi @btalb,
Yikes, that does not sound good.
As for your questions:
1. Are you directly on the machine, and does it have a physical screen attached? Yes and yes.
2. Are you connecting via SSH with window forwarding? No.
3. Are you using other remote software like remote desktop or alternatives? No. Although the plan is to do so once everything is running.
Thanks !
The core of the issue is Vulkan only seems to be happy when it is using a discrete GPU to render to a physical screen.
The reason I ask all of those questions is that we have had these issues when using configurations that tamper with that relationship. For example:
SegFault (see log_failure.txt)
I can see Vulkan is the cause, as the following extra lines appear in the failure log when I diff the logs:
[2021.03.23-20.25.55:954][ 0]LogRHI: Warning: Failed to find entry point for vkDestroySurfaceKHR
[2021.03.23-20.25.55:954][ 0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceSupportKHR
[2021.03.23-20.25.55:954][ 0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceCapabilitiesKHR
[2021.03.23-20.25.55:954][ 0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceFormatsKHR
[2021.03.23-20.25.55:954][ 0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfacePresentModesKHR
[2021.03.23-20.25.55:959][ 0]LogLinux: Warning: MessageBox: Failed to find all required Vulkan entry points! Try updating your driver.: No Vulkan entry points found!:
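A quick way to check another log for the same symptom is to search it for these warnings. The sketch below assumes the crash output has been saved to a file; `count_vulkan_entry_point_warnings` is a hypothetical helper name, not a BenchBot command.

```shell
# Count the "Failed to find entry point" Vulkan warnings in a saved log.
# count_vulkan_entry_point_warnings is a hypothetical helper for illustration.
count_vulkan_entry_point_warnings() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function's status at 0.
    grep -c 'Failed to find entry point for vk' "$1" || true
}
```

For example, `count_vulkan_entry_point_warnings log.txt` prints how many of these warnings the attached log contains; a non-zero count suggests the same Vulkan surface problem.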
I can also see those lines in the log you provided me. We need to dig a little deeper though to try and figure out why Vulkan is throwing those errors for the simulator:
https://user-images.githubusercontent.com/3508780/112221489-38fa7d80-8c73-11eb-918a-2b35820bdae0.mp4
Here are the commands you need:
docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'
docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vulkaninfo'
Thanks; I wish these things were simpler...
I shall try them out once I get back to the lab on Friday.
I regularly work with programs using CUDA, so this is not the worst I have seen :smiley:
Hi,
I encountered the same error log as @cheneeheng today when I tried to run
benchbot_run --robot carter --env miniroom:1 --task semantic_slam:passive:ground_truth
Last week when I ran this command, there was no error, but no simulator window appeared after this:
Supervisor is now available @ 'http://0.0.0.0:10000' ...
Waiting until a robot controller is found @ 'http://benchbot_robot:10000' ... Found
Sending environment data & robot config to controller ... Ready
I tried this command as suggested:
docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'
Unfortunately, I get this error:
'Cannot find a compatible Vulkan installable client driver (ICD)'
I am using an x2goclient via SSH to a remote machine with an RTX 3080. I wonder if the remote desktop has anything to do with the simulator window not appearing?
Thanks in advance!
Thanks for the information @tyou1 .
The behaviour you experienced last week with the window showing up and disappearing was a bug which was hiding the error log. We weren't correctly bubbling the crash log up to benchbot_run's stdout. That's been fixed in v2.1 (benchbot_run --version), so you should always see the crash log now when the simulator crashes.
What's important to understand with any remote access system is how it actually performs the rendering. I don't know much about x2go, but here are some examples I know of where a laptop is the client and the GPU machine is the server:
The nvidia driver can only be tied to a single X server & can't be swapped without killing the entire X server. So when you turn your computer on & sign in, your GPU is locked away in that default X server and can't be used by the virtual X server created for remote desktops. TL;DR: remote desktop software almost never provides hardware-accelerated rendering, and my guess is x2go falls in this bucket.
How does this tie in with BenchBot? NVIDIA's Isaac Simulator relies on hardware-accelerated rendering powered by Vulkan. If the system doing the rendering doesn't meet those requirements, then we get a crash from the simulator (a crash that should be much more verbose & explicit... but a crash nonetheless).
So from those requirements, there are only a couple of solutions I would expect to work for cases where your GPU is on a remote machine:
Tell benchbot_run to render the simulator on your GPU machine's screen, not your laptop's. The downside is you won't see the simulator remotely on your laptop, but it will run successfully (I use this solution all the time). To tell benchbot_run where to render, you simply adjust the DISPLAY environment variable. For example, terminals opened on my GPU machine show this:
ben@gpu-machine:~$ echo $DISPLAY
:1
Then when I'm SSHing from home, I manually set the DISPLAY target to :1 via:
ben@home-machine:~$ ssh -X ben@gpu-machine
ben@gpu-machine:~$ echo $DISPLAY
localhost:10.0
ben@gpu-machine:~$ export DISPLAY=:1
ben@gpu-machine:~$ benchbot_run ...
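The same check can be scripted. This is a minimal sketch, assuming a `DISPLAY` value that starts with `:` (such as `:1`) names an X server on the machine itself, while a value with a host part (such as `localhost:10.0`, as set by `ssh -X` in the session above) does not. `display_kind` is a hypothetical helper, not part of BenchBot.

```shell
# Classify a DISPLAY value as "local", "remote", or "unset".
# display_kind is a hypothetical helper for illustration only.
display_kind() {
    case "$1" in
        "")  echo "unset" ;;
        :*)  echo "local" ;;    # e.g. :0 or :1, an X server on this machine
        *)   echo "remote" ;;   # has a host part, e.g. localhost:10.0 from ssh -X
    esac
}
```

A wrapper could then run something like `[ "$(display_kind "$DISPLAY")" = "local" ] || export DISPLAY=:1` before calling benchbot_run. The `:1` comes from the example session above; the right value is whatever `echo $DISPLAY` prints in a terminal opened directly on the GPU machine.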
Hope this helps. I know it's not an ideal solution, but hardware-accelerated rendering under Linux with Vulkan support is something that's traditionally caused enough challenges by itself. Crisp solutions for remote use on top of this unfortunately aren't quite there yet.
We're always interested in better solutions though. If anyone knows of better ways to enable remote hardware-accelerated rendering, especially on headless machines, we'd love to hear them. Unfortunately, it's not something I have time to dig too far into at the moment.
@btalb the vulkaninfo command is returning this error:
No protocol specified
WARNING: [Loader Message] Code 0 : loader_icd_scan: Can not find 'ICD' object in ICD JSON file /usr/share/vulkan/icd.d/nvidia_layers.json. Skipping ICD JSON
error: XDG_RUNTIME_DIR not set in the environment.
No protocol specified
XCB failed to connect to the X server due to error:1.
ERROR at /build/vulkan-tools-1.2.162.1~rc1-1lunarg18.04/vulkaninfo/vulkaninfo.h:847: AppCreateXcbSurface failed to establish connection
Update 1:
Reinstalled all nvidia drivers and cuda just to be safe.
Added root access to the X server with xhost local:root, and both commands to debug Vulkan are now working, but the original error still persists.
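For anyone hitting similar output, the two failure messages reported so far in this thread point at different problems. The sketch below maps each message to the area worth checking first. `vulkan_error_hint` is a hypothetical helper, and the hints only summarise what has been reported above, so treat them as starting points rather than a diagnosis.

```shell
# Map a known error message from this thread to a hint about where to look.
# vulkan_error_hint is a hypothetical helper; the hints are based only on the
# reports in this thread.
vulkan_error_hint() {
    case "$1" in
        *"No protocol specified"*)
            echo "X server refused the connection: check xhost access and DISPLAY" ;;
        *"Cannot find a compatible Vulkan installable client driver"*)
            echo "No usable Vulkan driver (ICD): check the GPU driver and how the display is rendered" ;;
        *)
            echo "Unrecognised message" ;;
    esac
}
```

Calling it with a line copied from the failing output, for example `vulkan_error_hint "No protocol specified"`, prints the matching hint.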
Update 2:
So I ran xhost local:root, and commented out the line xhost -local:root > /dev/null and it works. :smiley:
Somehow running the original script removes the root access to the X-server and causes the error at the beginning of this comment to occur again.
Update 2.1: Worked through the tutorials, everything is working fine. (though some commands seem to be outdated :smiley: )
Update 2.2:
It seems that removing this line xhost -local:root > /dev/null makes the error go away.
Issue can be closed if @btalb doesn't need anything more from my side.
That's excellent @cheneeheng, great to hear!
I'm not sure how that series of errors is related (I've never seen the first error before, even before xhost local:root was added into the scripts). Did a reboot fix that error?
It's a little odd that the line is causing issues with containers, as it runs after all of the containers have started and so shouldn't affect them. But maybe there is some asynchronous behaviour causing race conditions. Thanks for pointing that out though, that's a really good find.
I'll close this issue here, but feel free to open a new issue with any outdated commands you find in the documentation / tutorials. I'm always keen to fix those when they're found. Unfortunately, I'm a little documentation blind by this point.
Reboot (x3) did not fix the error. Only the xhost command did.
Hi
May I ask where the line xhost -local:root > /dev/null is that you mentioned you commented out, @cheneeheng? Unfortunately, I still get this crash log after switching to Remmina to connect to the remote machine (RDP). So I wonder if there is something else besides the remote access problem that causes the crash.
Or does the simulator window only appear successfully with a VNC server & client? May I ask which specific VNC server & client @david2611 uses to run BenchBot smoothly?
Thanks a lot! :)
@tyou1
Here is the line : https://github.com/qcr/benchbot/blob/783e9ca27866ae64c891262aa5f46a7a53ee37c7/bin/benchbot_run#L347
You could try running xhost +local:root in a terminal and then run the vkcube command to check whether the access rights are the problem. Also make sure the DISPLAY environment variable is correctly set when you do this :smiley:
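Those two checks can be combined into one small script. This is only a sketch: `check_vulkan_access` is a hypothetical name, the `XHOST` and `DOCKER` variables exist purely so the commands can be swapped out for a dry run, and the docker invocation is the vkcube command quoted earlier in this thread.

```shell
# Commands are overridable so the check can be dry-run, e.g. XHOST=echo DOCKER=echo.
# These variable names are assumptions made for this sketch.
XHOST="${XHOST:-xhost}"
DOCKER="${DOCKER:-docker}"

check_vulkan_access() {
    # Refuse to continue when DISPLAY is empty or unset.
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set" >&2
        return 1
    fi
    # Grant X server access as suggested above.
    "$XHOST" +local:root || return 1
    # Run the Vulkan test cube inside the BenchBot backend image.
    "$DOCKER" run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix \
        --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'
}
```

If the cube window appears after running `check_vulkan_access`, access rights and DISPLAY are probably fine and the problem lies elsewhere.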
Hi @tyou1 , good question.
Only VNC will work as RDP generally creates a virtual X server which won't have the hardware accelerated rendering.
@david2611 uses NoMachine, just make sure it's not using a virtual screen.
There are also plenty of simple VNC options out there, like:
The crucial thing is just to make sure it is mirroring a physical screen, and not creating a virtual one.
Remmina should also be fine as a VNC client to connect to a server.
Hi there,
I have just installed BenchBot successfully on a machine with an RTX 2080 (8GB), 32GB RAM, and an i7-9700K CPU.
But when I try to run
benchbot_run --robot carter --env miniroom:1 --task semantic_slam:passive:ground_truth
I keep getting a robot controller error (a small snippet is below and the full log is in the attached file). I'm wondering if you guys have ever encountered this.
Thanks!
Chen.
... Supervisor is now available @ 'http://0.0.0.0:10000' ...
Waiting until a robot controller is found @ 'http://benchbot_robot:10000' ... Found Sending environment data & robot config to controller ... Ready
################################################################################
####################### BENCHBOT ROBOT CONTROLLER ERROR ########################
################################################################################
ERROR: The BenchBot Robot Controller container has exited unexpectedly. This should not happen under normal operating conditions. Please see the complete log below for a dump of the crash output:
Robot controller is now available @ 'http://0.0.0.0:10000' ...
Waiting to receive valid config data...
172.20.0.102 - - [2021-03-22 15:04:04] "GET // HTTP/1.1" 200 152 0.000542
172.20.0.102 - - [2021-03-22 15:04:05] "POST //configure HTTP/1.1" 200 137 0.066839
Starting the requested real robot ROS stack ...
THE PROCESS STARTED BY THE FOLLOWING COMMAND HAS CRASHED:
sed -i "0,/\"pose\":/{s/(\"pose\": )(.)/\1[0.7, 0, 0, -0.7, 1.2, 1.5, 0.3]/}" /benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full_config.json && perl -0777 -i -pe 's/\"static_mesh\".?]/\"static_mesh\":[{"name": "bottle"}, {"name": "cup"}, {"name": "knife"}, {"name": "bowl"}, {"name": "wine glass"}, {"name": "fork"}, {"name": "spoon"}, {"name": "banana"}, {"name": "apple"}, {"name": "orange"}, {"name": "cake"}, {"name": "potted plant"}, {"name": "mouse"}, {"name": "keyboard"}, {"name": "laptop"}, {"name": "cell phone"}, {"name": "book"}, {"name": "clock"}, {"name": "chair"}, {"name": "table"}, {"name": "couch"}, {"name": "bed"}, {"name": "toilet"}, {"name": "tv"}, {"name": "microwave"}, {"name": "toaster"}, {"name": "refrigerator"}, {"name": "oven"}, {"name": "sink"}, {"name": "person"}]/s' /benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full_config.json && cd "/benchbot/addons/benchbot_addons/benchbot-addons/envs_isaac_develop/environments" && .sim_package/IsaacSimProject.sh "/Game/AI_vol3_03_base/Maps/AI_vol3_scene_03" -isaac_sim_config_json= "/benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full.json" -windowed -ResX=960 -ResY=540 -vulkan -game
...
log.txt