microsoft / AirSim

Open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft AI & Research
https://microsoft.github.io/AirSim/

Seg Fault when running Airsim in Docker #3703

Open tiusty opened 3 years ago

tiusty commented 3 years ago

Bug report

What's the issue you encountered?

Seg fault when following the instructions for running UE4 + AirSim in Docker. I tried to run the Blocks environment as described in the docs.

[Screenshot taken 2021-05-18, 1:39 PM]

I have not changed any of the files and am running the commands exactly as given on the Docker instructions page: https://microsoft.github.io/AirSim/docker_ubuntu/

Note: I was able to get UE4 + AirSim running with the Blocks binary outside Docker by following the directions at https://microsoft.github.io/AirSim/build_linux/, but we would like to get it running in Docker.

CUDA version: output of the following command: sudo docker run --rm --gpus all nvidia/cudagl:10.0-devel-ubuntu18.04 nvidia-smi

[Screenshot of the nvidia-smi output, taken 2021-05-18, 1:41 PM]

Note: We are running without a monitor, i.e. headless.

Settings

Default from the Blocks environment in VCS

{
  "SeeDocsAt": "https://github.com/Microsoft/AirSim/blob/master/docs/settings_json.md",
  "SettingsVersion": 1.2,
  "SimMode": "Multirotor",
  "ClockSpeed": 1.0,
  "Vehicles": {
    "SimpleFlight": {
      "VehicleType": "SimpleFlight",
      "DefaultVehicleState": "Armed",
      "EnableCollisionPassthrogh": false,
      "EnableCollisions": true,
      "AllowAPIAlways": true,
      "RC": {
        "RemoteControlID": 0,
        "AllowAPIWhenDisconnected": false
      }
    }
  }
}

How can the issue be reproduced?

Following the instructions from https://microsoft.github.io/AirSim/docker_ubuntu/

  1. Installed nvidia-docker2
  2. Build AirSim: python build_airsim_image.py --base_image=nvidia/cudagl:10.0-devel-ubuntu18.04 --target_image=airsim_binary:10.0-devel-ubuntu18.04
  3. Download the Blocks environment: ./download_blocks_env_binary.sh
  4. Run binary: ./run_airsim_image_binary.sh airsim_binary:10.0-devel-ubuntu18.04 Blocks/Blocks.sh -windowed -ResX=1080 -ResY=720
  5. Also tried headless but same issue: ./run_airsim_image_binary.sh airsim_binary:10.0-devel-ubuntu18.04 Blocks/Blocks.sh -- headless

Include full error message in text form

Increasing per-process limit of core file size to infinity.
- Existing per-process limit (soft=18446744073709551615, hard=18446744073709551615) is enough for us (need only 18446744073709551615)
Signal 11 caught.
Malloc Size=131076 LargeMemoryPoolOffset=131092 
CommonLinuxCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=196655 
Malloc Size=78272 LargeMemoryPoolOffset=274944 
Segmentation fault (core dumped)
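
When the binary is launched from a script, a crash like the one above can be detected from the captured output. The following is only a minimal sketch: the function name `crash_signals` is hypothetical, and it assumes the crash handler prints a line of the form "Signal N caught." as in the log above.

```python
import re
import signal


def crash_signals(log_text):
    """Return the names of signals reported as 'Signal N caught.' in the log."""
    names = []
    for num in re.findall(r"^Signal (\d+) caught\.", log_text, flags=re.MULTILINE):
        try:
            # Map the number to its symbolic name, e.g. 11 -> SIGSEGV
            names.append(signal.Signals(int(num)).name)
        except ValueError:
            # Number not known on this platform: keep it as plain text
            names.append(f"signal {num}")
    return names
```

Calling `crash_signals` on the captured stdout/stderr of the run script returns an empty list when no such line is present.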


tiusty commented 3 years ago

This seems similar to, but slightly different from, this issue: https://github.com/microsoft/AirSim/issues/3450

roy860328 commented 3 years ago

Downloading a higher version seems to work, but then I get other log errors.

huyaoyu commented 3 years ago

I'm assuming that @tiusty wants to run a packaged UE project on a remote headless server and perform some offscreen rendering. I am trying to do something similar and immediately ran into trouble with projects packaged by UE4.25 and UE4.26. I just found a solution, after endless searching and trying, and this seems like a good place to share and discuss it.

My setup: Development computer: Ubuntu 20.04 + Unreal Engine 4.26.2 + Vulkan 1.2.131 + CUDA 11.2

Remote headless server: Docker 19.03.5 + CUDA 11.0

Presumably the most important key points:

I know that removing SDL_VIDEODRIVER=offscreen sounds like it directly conflicts with the AirSim instructions, but in my testing with my Docker images it is a key step. It seems that the Unreal Engine developers decided to use their own version of SDL (ref1, ref2), and in later releases of Unreal Engine, especially 4.26, SDL does not work on a headless server when SDL_VIDEODRIVER is set to "offscreen" or even to an empty value.

For running packaged UE projects/binaries, the base Docker image recommended by AirSim is nvidia/cudagl:10.0-devel-ubuntu18.04. This image supports CUDA and OpenGL. Since UE4.26 seems to favor Vulkan and to drop OpenGL support by default, this base image no longer works. Luckily, NVIDIA provides a Docker image based on Ubuntu 18.04 that supports CUDA, OpenGL, and Vulkan (1.1.121). I have tested this image and it works for a project packaged by UE4.26. Alternatively, if you want a newer CUDA or Vulkan, or Ubuntu 20.04, you can build your own image. I tried using nvidia/cudagl:11.2.1-devel-ubuntu20.04 as a base image to build a new one that supports Vulkan 1.2.131. For these two options, I prepared two sample Dockerfiles as GitHub Gists.

Naively adding Vulkan support to these images is not enough. If we run vulkaninfo, we get errors such as ERROR_INITIALIZATION_FAILED or ERROR_INCOMPATIBLE_DRIVER. If you run a packaged UE project inside the Docker container, it refuses to start and reports SDL-related errors, e.g. InitSDL() failed. It seems that freshly built Docker images do not have the right ICD files configured. People using the CARLA simulator have come up with a working solution, which I have already borrowed in the Dockerfiles above.
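
To see whether a container has any ICD manifests at all, one can list the JSON files the Vulkan loader would read. This is a rough sketch and not taken from the Gists mentioned above: the function name `find_icd_manifests` is hypothetical, and the two directories are assumed to be the usual loader search paths on Linux, so adjust them for your image.

```python
import glob
import json
import os

# Assumption: common Vulkan loader search directories on Linux
DEFAULT_ICD_DIRS = ("/etc/vulkan/icd.d", "/usr/share/vulkan/icd.d")


def find_icd_manifests(dirs=DEFAULT_ICD_DIRS):
    """Return (path, library_path, api_version) for each readable ICD manifest."""
    found = []
    for directory in dirs:
        for path in sorted(glob.glob(os.path.join(directory, "*.json"))):
            try:
                with open(path) as f:
                    manifest = json.load(f)
            except (OSError, ValueError):
                # Unreadable or malformed manifest: skip it
                continue
            if not isinstance(manifest, dict):
                continue
            icd = manifest.get("ICD", {})
            found.append((path, icd.get("library_path"), icd.get("api_version")))
    return found
```

An empty result inside the container would be consistent with the missing ICD configuration described above, although it does not by itself prove that this is the cause of the vulkaninfo errors.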

Once we start a Docker container with the appropriate arguments, such as nvidia-docker run --gpus all or nvidia-docker run --runtime=nvidia, depending on your Docker version, the remaining issue is how to run the packaged UE project. I learned from unrealcontainers.com that we have to pass the -RenderOffscreen argument when starting a packaged project for off-screen rendering with Vulkan (typically UE4.25 and UE4.26 projects). In my tests with the Docker images built from the Dockerfiles above, the packaged project could not be started without the -RenderOffscreen option. I also found that the SDL_HINT_CUDA_DEVICE environment variable, as used here, is not critical: my projects run without it inside the Docker container.
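
The two launch-related points (no SDL_VIDEODRIVER in the environment, and -RenderOffscreen on the command line) could be combined in a small launcher. This is a sketch under assumptions: the helper name `offscreen_launch` is hypothetical, and the binary path in the usage note is only an example.

```python
import os


def offscreen_launch(binary, extra_args=(), base_env=None):
    """Build (argv, env) for starting a packaged UE project off-screen."""
    # Copy the environment so the caller's mapping is left untouched
    env = dict(os.environ if base_env is None else base_env)
    # Unset SDL_VIDEODRIVER entirely; an empty value is reported not to work either
    env.pop("SDL_VIDEODRIVER", None)
    # Pass -RenderOffscreen ahead of any additional arguments
    argv = [binary, "-RenderOffscreen", *extra_args]
    return argv, env
```

Inside the container, the result can be handed to `subprocess.run(argv, env=env)`, for example with `offscreen_launch("./Blocks/Blocks.sh")`.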

More references:

tiusty commented 3 years ago

Thanks for the feedback! I hope to try out what you suggested soon to see if it resolves the issue for me.

kishanpb commented 3 years ago

I get an error similar to the one here with a ClockSpeed of 50 and above, though not frequently. Is this expected for such high ClockSpeed values? Is testing done only for ClockSpeed=1?

[Screenshot of the error]

zimmy87 commented 2 years ago

Hi @tiusty, are you still experiencing this issue?

tiusty commented 2 years ago

I haven't tried the fixes, since we pivoted to running on bare metal rather than in Docker. At some point, we will switch to Docker again.