osrf / subt

This repository contains software for the virtual track of the DARPA SubT Challenge. Within this repository you will find Gazebo simulation assets, ROS interfaces, support scripts and plugins, and documentation needed to compete in the SubT Virtual Challenge.

Image Solution runs with catkin setup but not using docker compose #192

Closed osrf-migration closed 5 years ago

osrf-migration commented 5 years ago

Original report (archived issue) by Hector Escobar (Bitbucket: hector_escobar).

The original report had attachments: docker-compose.yml, run.bash


We are experiencing a problem with our solution image. It runs fine using the catkin method: we run both cloudsim_sim and cloudsim_bridge, and then start our image with ./run.bash our_image:latest. With this setup our whole system runs fine. We are using CUDA for our solution, and when we put our image in docker-compose.yml and run ./run_docker_compose.sh as specified here, we get an error about CUDA not being found. Are there any parameters that could be modified in the yml file to allow CUDA? I think we would have the same issue if we used the actual Cloudsim. Our image was built FROM nvidia/cudagl:10.1-devel-ubuntu18.04, so we know our image has CUDA.

Any suggestions?

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Hi Hector,

I guess this is a CUDA+OpenGL support issue from either Nvidia or Docker. Let’s hope we can find out. Are you running the catkin method and the docker-compose method on the same machine/system? Can you provide the exact error you are getting?

Can you also run the following commands and post the output back here?

$ docker -v

$ docker-compose -v

Finally, make sure you added the content below into the /etc/docker/daemon.json file for --runtime=nvidia to work.

$ cd /etc/docker/
$ sudo vim daemon.json

# Add this into the file:

{
          "runtimes": {
              "nvidia": {
                  "path": "/usr/bin/nvidia-container-runtime",
                  "runtimeArgs": []
              }
          }
}

$ sudo service docker restart
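
A minimal sketch of how the daemon.json check above could be automated, assuming Python 3 is available on the host. The function name has_nvidia_runtime is hypothetical and not part of the SubT tooling. It only confirms that the file registers a runtime named nvidia with a path; it does not verify that the runtime binary itself works.

```python
import json


def has_nvidia_runtime(daemon_json_text):
    """Return True if the daemon.json text registers a runtime named 'nvidia'."""
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    entry = runtimes.get("nvidia")
    # A usable entry needs a non-empty "path" pointing at the runtime binary.
    return isinstance(entry, dict) and bool(entry.get("path"))
```

Passing it the contents of /etc/docker/daemon.json should give True once the snippet above has been added. Docker still needs the restart shown above to pick up the change.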

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Hi Alfredo Bencomo (bencomo) ,

Thanks for taking the time. To answer your first question: yes, I am running the catkin method and docker-compose on the same local machine. The error comes from a check I have for CUDA-capable devices:

cudaError_t status = cudaGetDevice(&n);

assert(status == cudaSuccess);

And that gives me the error of assert(0), meaning there are no devices.
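
One way to narrow this kind of failure down from inside the container, sketched in Python under some assumptions: when the NVIDIA runtime is active, GPU device nodes such as /dev/nvidia0 and /dev/nvidiactl are normally visible in the container. The helper name visible_nvidia_devices is hypothetical and was not used in this thread.

```python
import glob
import os


def visible_nvidia_devices(dev_dir="/dev"):
    """List NVIDIA device nodes (nvidia0, nvidiactl, ...) present under dev_dir."""
    pattern = os.path.join(dev_dir, "nvidia*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```

An empty list suggests the container was started without GPU access. A non-empty list does not prove that CUDA will work, because a privileged container can see the host device nodes without having the driver libraries mounted.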

docker -v

Docker version 19.03.2, build 6a30dfc

docker-compose -v

docker-compose version 1.23.2, build 1110ad01

And I DO have the content on the daemon.json file.

I believe I tested this capability with you in the past, regarding another issue, when you created the ./run_docker_compose.sh file, but now it seems not to work anymore.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


“I believe I tested this capability with you in the past, regarding another issue, when you created the ./run_docker_compose.sh file, but now it seems not to work anymore.”

Did you install new updates or packages since then?

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


I don’t think I have updated anything, but I did run the latest hg pull && hg update on the tunnel_circuit repository. I’m rebuilding my images step by step to see if there’s any indication of the issue. The strange thing is that I can run my image with the catkin setup but not with docker-compose, which makes me believe the image lacks permission to use CUDA.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Which image do you believe has a permission issue? The cloudsim_sim, the cloudsim_bridge, or your solution image?

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


I am rebuilding my solution image. I downloaded fresh cloudsim_sim and cloudsim_bridge images with the ./run_docker_compose.sh.

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Hi Alfredo Bencomo (bencomo) ,

I am still having the same issue. Is there a way to give the solution image permission to use CUDA? My image has CUDA enabled, as I am able to compile it, but when I run it with ./run_docker_compose.sh it gives me the error that there are no CUDA-enabled devices.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Hi Hector,

You get the assertion when your code reaches the statement below, correct?

cudaError_t status = cudaGetDevice(&n);

You are not running on an ARM system like the Jetson, right?

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Alfredo Bencomo (bencomo) ,

You are correct: I get the error on that line, and I am running on a laptop that has a GPU. I’m able to run the same code with catkin, and with the mix of catkin sim/bridge and ./run.bash my_image, but not using ./run_docker_compose.sh.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Hector,

How did you install CUDA and which version?

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Alfredo Bencomo (bencomo) ,

To install cuda I use the Nvidia provided image to start my dockerfile as

FROM nvidia/cudagl:10.1-devel-ubuntu18.04

Which is version 10.1.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Hector,

Can you please attach here your modified docker-compose.yml file, the Dockerfile for your solution, the exact commands you enter in the terminal to build and launch your solution, and the exact console output you get when your solution fails to detect the CUDA device at cudaGetDevice(&n)?

If you don’t want to post that info here, you can send an email to subt-help@googlegroups.com.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


One more thing. If this problem occurs only when you run your solution image within Docker-Compose, but it works fine when you run it as a standalone docker image; then can you also try to upload and run your solution image in Cloudsim?

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Hi Alfredo Bencomo (bencomo) ,

Is there a way to send the Dockerfile directly to you? I tried uploading my image to Cloudsim on Simple Tunnel 2 and it showed Error: InitializationFailed.

To attach documents here I need to send the email to subt-help@googlegroups.com, correct?

osrf-migration commented 5 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


Edit: Yes, attach your Dockerfile and docker-compose.yml files to the email you send to subt-help@googlegroups.com.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Arthur,

I’m not asking Hector to send the Docker image for his solution, nor his code.

Hector,

Yes, you can attach your Dockerfile and docker-compose.yml file to an email and send them to subt-help@googlegroups.com.

Please read my two previous posts since you didn’t answer some of my questions/requests.

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


I’ll send the email then. Thanks for your help. I have pinpointed that it definitely has something to do with CUDA: if I turn CUDA off, the docker-compose method works fine.

And what I meant by attaching is that I don’t get an option in this forum to attach documents, only images.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


With CUDA on, does it work if you run the docker images (SubT + YourSolution) without using docker-compose?

Please provide as much detail as possible (what commands you run, what output you get, how you turn CUDA on/off, etc.).

osrf-migration commented 5 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


Hector Escobar (hector_escobar), for a long time I also could not find the Attach functionality here, but it is available. Scroll up and you will find an “Attach” button below the “Create issue” button. :)

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Sophisticated Engineering (sopheng) Thanks for the tip of Attach at the top!

Alfredo Bencomo (bencomo), I attached my docker-compose.yml. To run it I use your ./run_docker_compose.sh file. The error I get is shown below.

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


solution1_1  |  * /A1_control/time_limit: 3000.0
solution1_1  |  * /A1_control/total_x1s: 1
solution1_1  |  * /A1_control/total_x2s: 0
solution1_1  |  * /A1_control/total_x3s: 0
solution1_1  |  * /A1_control/total_x4s: 0
solution1_1  |  * /A1_control/use_truth_odom: False
solution1_1  |  * /A1_tf_to_odom_publisher/use_truth_odom: False
solution1_1  |  * /rosdistro: melodic
solution1_1  |  * /rosversion: 1.14.3
solution1_1  | 
solution1_1  | NODES
solution1_1  |   /A1/
solution1_1  | Hello from stereo to 1d er!
solution1_1  | [ERROR] [1569341300.705791828]: Couldn't open joystick /dev/input/js0. Will retry every second.
solution1_1  | layer     filters    size              input                output
solution1_1  |     0 darknet_ros: /home/developer/subt_ws/src/ssci_src/darknet_ros/darknet/src/cuda.c:36: check_error: Assertion `0' failed.
solution1_1  | ================================================================================REQUIRED process [A1/darknet_ros-9] has died!
solution1_1  | process has died [pid 128, exit code -6, cmd /home/developer/subt_ws/install/lib/darknet_ros/darknet_ros __name:=darknet_ros __log:=/home/developer/.ros/log/6c8f0d56-dee5-11e9-8b74-0242ac1c0102/A1-darknet_ros-9.log].
solution1_1  | log file: /home/developer/.ros/log/6c8f0d56-dee5-11e9-8b74-0242ac1c0102/A1-darknet_ros-9*.log
solution1_1  | Initiating shutdown!

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


To answer your

“One more thing. If this problem occurs only when you run your solution image within Docker-Compose, but it works fine when you run it as a standalone docker image; then can you also try to upload and run your solution image in Cloudsim?”

I tried uploading it to the cloudsim and I get: Terminated

Error: InitializationFailed

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


With CUDA on, my solution works if I run the following:

Term 1:

ign launch cloudsim_sim.ign robotName1:=A1 robotConfig1:=X1_SENSOR_CONFIG_1

Term 2:

ign launch cloudsim_bridge.ign robotName1:=A1 robotConfig1:=X1_SENSOR_CONFIG_1

Term 3 (my solution):

./run.bash ssci_unified

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


I just tried it and it also runs without errors.

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Hector, disregard my previous message. Can you edit your docker-compose.yml file and add runtime: nvidia to the section for your solution, as shown below? Then try it again locally using docker-compose ($ ./run_docker_compose.sh).

  # The solution container runs control code for a single robot. This
  # solution container connects to the first bridge, and therefore controls
  # the X1 robot.
  solution1:
    image: ssci_unified:latest
    networks:
      relay_net1:
        ipv4_address: 172.29.1.2
    environment:
      - ROS_MASTER_URI=http://172.29.1.1:11311
    runtime: nvidia
    privileged: true
    security_opt:
      - seccomp=unconfined
    depends_on:
      - "bridge1"

@azeey pointed out that you might also need to add this.

privileged: true
security_opt:
  - seccomp=unconfined

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Alfredo Bencomo (bencomo) ,

That worked! I just added runtime: nvidia and it is OK now.

Would this be a fix you need to make on the actual Cloudsim?

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Hector,

I’m glad your solution now works with docker-compose. Regarding the cloudsim, I’m checking right now.

BTW, did your solution find any artifact when you ran it with docker-compose ?

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


I didn’t let it run that much. I’ll run the simple_tunnel_02 with docker-compose instead and test if it detects anything.

Thanks again!

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


I just confirmed that Cloudsim doesn’t need any fix, so the Error: InitializationFailed is not related to this issue. I’m going to resolve this one since you can now use docker-compose locally with your solution.
