osrf / vorc-events

Repository containing team submissions for VORC events.
Apache License 2.0

Team CCOM submission for VORC 2020/phase2. #20

Closed · rolker closed 3 years ago

seankrag commented 3 years ago

@caguero Team CCOM's Docker image is accessible, but the vorc-competitor-system container crashes during testing. I'm using a non-GPU laptop, and their image is tagged "rolker2000/project11_vorc_nvidia:latest", so I'm not sure whether this is GPU-related. Can you please check?

rolker commented 3 years ago

Apologies for crashing your system. Our solution does include darknet compiled with CUDA support. Since my development machine is currently running Ubuntu 20.04, I used a Docker image for development and for integrating my teammates' components. As this was my first time using Docker, I followed the directions on this page https://github.com/osrf/vrx/wiki/tutorials-buildRunLocalImage and used vrx's Dockerfile as a base to build the Dockerfile for this submission. I did use the -n option when calling build.bash for NVIDIA support. Let me know if I should build the image differently. Oh, and thanks to all who are making this event happen!
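
In rough outline, the build looked something like this (a sketch only; the exact script path and arguments are whatever the wiki tutorial above specifies and may differ from what is shown here):

```bash
# Sketch of the image build described above; the script location and arguments
# follow the vrx wiki tutorial and may not match current versions exactly.
git clone https://github.com/osrf/vrx.git
cd vrx
./docker/build.bash -n .   # -n builds the NVIDIA-enabled variant of the image
```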

mabelzhang commented 3 years ago

I don't have a CUDA-enabled GPU to try it on. I tried on an NVIDIA GPU without CUDA and got an error, but that might just be because I don't have the right hardware:

terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
/opt/ros/melodic/etc/catkin/profile.d/50-rosmon.bash: line 10:    78 Aborted                 (core dumped) rosrun rosmon_core rosmon "$@"

We might wait till Monday to see if someone else on our team can give a hand.

If your team is dying to know and has nothing better to do over the weekend, you could try running the evaluation from the vorc branch of vrx-docker on your CUDA-enabled GPU and see whether it runs without errors. That's what we're using for evaluation (though the competition world configuration will be different). There's considerable setup involved, so unless you have a good half day or a full day to spare, go do something better with your weekend.
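
In outline, that means grabbing the evaluation tooling and switching to the VORC branch (the detailed setup and trial-running steps are in that repository's README and are not repeated here):

```bash
# Fetch the evaluation tooling and switch to the VORC branch; see the repo's
# README for the remaining (substantial) setup steps.
git clone https://github.com/osrf/vrx-docker.git
cd vrx-docker
git checkout vorc
```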

Thank you for the submission!

mabelzhang commented 3 years ago

So I looked again, and I do have a CUDA-capable GPU. Guess I haven't been working my horse hard enough.

It looks like I only need to install the CUDA driver on the host machine to use CUDA in a Docker container. I installed CUDA driver 11.1 on an Ubuntu 18.04 host and am able to run NVIDIA's samples to confirm that it's correctly installed and working.

But I'm still getting the core dump above.

> Let me know if I should build the image differently.

Did you do anything extra in your Dockerfile to get the CUDA toolkit, or did you use the Dockerfile in the VRX repo as-is? The competitor Dockerfile we provide is NVIDIA-enabled but not CUDA-enabled: we just pull in nvidia/opengl:1.0-glvnd-devel-ubuntu18.04 as the base image in this line, but we do not install any CUDA packages.

I have never written a Dockerfile with CUDA support, but judging from the illustration at https://github.com/NVIDIA/nvidia-docker and from the Dockerfiles NVIDIA publishes for nvidia/cuda ( https://hub.docker.com/r/nvidia/cuda/ ), e.g. this one for 18.04: https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/11.1/ubuntu18.04-x86_64/base/Dockerfile , you might need to install CUDA packages in your container.
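
As a rough illustration only, the kind of commands those base images run look like this (the repository URL, key, and package names are assumptions taken from NVIDIA's published Dockerfiles and may have changed):

```bash
# Illustrative only: roughly what NVIDIA's CUDA 11.1 / Ubuntu 18.04 base image
# does in its RUN steps (repo URL, key, and package names are assumptions from
# NVIDIA's published Dockerfiles and may differ).
apt-get update && apt-get install -y --no-install-recommends gnupg2 curl ca-certificates
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add -
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list
apt-get update && apt-get install -y --no-install-recommends cuda-cudart-11-1
rm -rf /var/lib/apt/lists/*
# Those images also set NVIDIA_VISIBLE_DEVICES=all and
# NVIDIA_DRIVER_CAPABILITIES=compute,utility so the NVIDIA container runtime
# exposes the compute stack inside the container.
```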

Other than that, I'm guessing that as long as your container is built with CUDA support, the VORC server container does not need to be changed for us to run it.

I'm signing off for the weekend. We can see if someone else has more experience with getting CUDA in their Dockerfile.

rolker commented 3 years ago

Thank you @mabelzhang for looking into this.

I haven't seen that error before, so I'm not sure what could cause it. I believe rosmon appears in the message only because that's what we use to launch the system, not because it's the cause of the problem. On the chance that rosmon is the problem, I could try submitting a new image that uses roslaunch instead.

To test my submission, I did attempt to set up vrx-docker, but hit an issue because my development machine runs 20.04. I thought that since it runs both parts in Docker containers it might work, but I ran into problems where the worlds needed building and the vorc and vrx environments needed to be present on the host. Not being experienced with Docker, I wasn't sure whether I could run the vrx-docker tooling itself inside a Docker container and have it launch containers within a container, so I did not proceed further. I could run into the office and grab an Ubuntu 18.04 laptop to pursue testing with vrx-docker.

As for adding CUDA support, I started with the vrx Dockerfile and added to it: I added nvidia-cuda-toolkit to the apt install portion. I didn't change the base image, so when calling build.bash with the -n option, it uses nvidia/opengl:1.0-glvnd-devel-ubuntu18.04.
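
Concretely, the change amounts to one extra package in the existing apt-get step (a sketch; the surrounding packages are omitted):

```bash
# Added to the Dockerfile's existing apt-get install list (other packages omitted).
apt-get update && apt-get install -y --no-install-recommends nvidia-cuda-toolkit
```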

Our team got a late start with the challenge, only getting going after Thanksgiving, so we've been at it 12-14 hours a day, 7 days a week since then. I promised my family that they'd have me back after yesterday's submission, so if it's OK with the technical team, I will also sign off for the weekend and look deeper into this Monday morning.

glpuga commented 3 years ago

> To test my submission, I did attempt to set up vrx-docker, but hit an issue because my development machine runs 20.04. I thought that since it runs both parts in Docker containers it might work, but I ran into problems where the worlds needed building and the vorc and vrx environments needed to be present on the host. Not being experienced with Docker, I wasn't sure whether I could run the vrx-docker tooling itself inside a Docker container and have it launch containers within a container, so I did not proceed further. I could run into the office and grab an Ubuntu 18.04 laptop to pursue testing with vrx-docker.

@rolker In case it helps you, I ran into the same issue testing my submission. I managed to solve it by running the "prepare" steps from within the vrx container I used for development, so that it had everything it needed to build the world files, and then running the script that runs the tasks straight from my Ubuntu 20.04 host.
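
Roughly, the flow was as follows (the script names below are placeholders; check the vorc branch of vrx-docker for the real ones):

```bash
# Placeholder script names; see the vorc branch of vrx-docker for the actual ones.

# 1. Inside the 18.04 vrx development container: run the "prepare" steps so the
#    compiled world files end up where the trial script expects them.
./prepare_task_trials.bash stationkeeping      # hypothetical prepare step

# 2. From the Ubuntu 20.04 host: launch the actual trial.
./run_trial.bash my_team stationkeeping 0      # team/task/trial arguments assumed
```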

rolker commented 3 years ago

> > To test my submission, I did attempt to set up vrx-docker, but hit an issue because my development machine runs 20.04. I thought that since it runs both parts in Docker containers it might work, but I ran into problems where the worlds needed building and the vorc and vrx environments needed to be present on the host. Not being experienced with Docker, I wasn't sure whether I could run the vrx-docker tooling itself inside a Docker container and have it launch containers within a container, so I did not proceed further. I could run into the office and grab an Ubuntu 18.04 laptop to pursue testing with vrx-docker.

> @rolker In case it helps you, I ran into the same issue testing my submission. I managed to solve it by running the "prepare" steps from within the vrx container I used for development, so that it had everything it needed to build the world files, and then running the script that runs the tasks straight from my Ubuntu 20.04 host.

Great suggestion @glpuga! Thanks for sharing, I will give it a try.

rolker commented 3 years ago

Update: thanks to glpuga's suggestion, I was able to reproduce @mabelzhang's error. Switching from rosmon to roslaunch did not fix the problem, but it did produce what seems like a more useful error message:

CUDA Error: no CUDA-capable device is detected: Bad file descriptor
darknet_ros: /home/developer/project11/catkin_ws/src/local/yolov4-for-darknet_ros/darknet_ros/darknet/src/utils.c:326: error: Assertion `0' failed.
[darknet_ros-13] process has died [pid 197, exit code -6, cmd /home/developer/project11/catkin_ws/devel/lib/darknet_ros/darknet_ros camera/rgb/image_raw:=/cora/sensors/cameras/front_left_camera/image_raw __name:=darknet_ros __log:=/home/developer/.ros/log/d717bae8-3e1f-11eb-bce5-0242ac100016/darknet_ros-13.log].

I will continue digging into this and provide an update later.

mabelzhang commented 3 years ago

> I ran into problems where the worlds needed building and the vorc and vrx environments needed to be present on the host

This is a very valid issue. Thank you for bringing it up. We have been working on moving different pieces of vrx-docker into the Docker container so that it can eventually run without requiring the environments on the host. I'll ticket that.

> Our team got a late start with the challenge, only getting going after Thanksgiving, so we've been at it 12-14 hours a day, 7 days a week since then.

Whew!! We appreciate your participation!

caguero commented 3 years ago

Thanks for the update, @rolker. Let us know if you manage to fix the issue. I'll wait to run your submission.

rolker commented 3 years ago

Thank you for your patience, @mabelzhang, @caguero, and the rest of the team.

By comparing vrx's docker/run.sh with vrx-docker's run_trial.bash, I narrowed down the difference in the docker run command that allows CUDA to work:

git diff
diff --git a/run_trial.bash b/run_trial.bash
index 42332d5..bd5a656 100755
--- a/run_trial.bash
+++ b/run_trial.bash
@@ -164,6 +164,7 @@ docker run \
     --env ROS_MASTER_URI=${ROS_MASTER_URI} \
     --env ROS_IP=${COMPETITOR_ROS_IP} \
     --ip ${COMPETITOR_ROS_IP} \
+    --privileged \
     ${DOCKERHUB_IMAGE} &

 # Run competition until server is ended

The addition of the --privileged flag is what did the trick.

The ability to test with the run_trial.bash script also helped uncover some launch file and logging issues. One item corrected based on comments in run_trial.bash was to make the ROS logs available under /root/.ros/log.

I have added a new commit to this PR with a new Docker image tag. That image is now being pushed to Docker Hub, and that process could still take an hour or more. (One drawback of working from home is the narrow upload pipe!)

I'll provide a quick update when the upload is complete.

Again, thanks to all for your patience and your great work putting this challenge together.

rolker commented 3 years ago

The docker image has finished uploading.

mabelzhang commented 3 years ago

Thank you for fixing the image!

I was able to run it. I added the --privileged flag to run_trial.bash. I still got the CUDA error you mentioned above, but the vehicle moved and looked reasonable. So is the error expected, and can it be ignored?

CUDA Error: no CUDA-capable device is detected: Bad file descriptor
darknet_ros: /home/developer/project11/catkin_ws/src/local/yolov4-for-darknet_ros/darknet_ros/darknet/src/utils.c:326: error: Assertion `0' failed.
[darknet_ros-13] process has died [pid 200, exit code -6, cmd /home/developer/project11/catkin_ws/devel/lib/darknet_ros/darknet_ros camera/rgb/image_raw:=/cora/sensors/cameras/front_left_camera/image_raw __name:=darknet_ros __log:=/home/developer/.ros/log/1b3437ac-3eb0-11eb-aca4-0242ac100016/darknet_ros-13.log].
log file: /home/developer/.ros/log/1b3437ac-3eb0-11eb-aca4-0242ac100016/darknet_ros-13*.log
the rosdep view is empty: call 'sudo rosdep init' and 'rosdep update'

As part of the submission process, we do a few checks to help you confirm that your submission is correct. Here are the results:

Task | Result | Notes
--- | --- | ---
Docker accessible | 🟢 |
stationkeeping | 🟢 |
wayfinding | 🟢 |
perception | 🔴 | No messages are published to the /vorc/perception/landmark topic
gymkhana | 🟡 |

Legend: 🟢: The behavior looks reasonable. 🟡: The vessel shows activity, but not necessarily reasonable behavior. 🔴: The vessel doesn't show any activity.

Are you attempting the perception and gymkhana tasks?

rolker commented 3 years ago

Thanks for checking again, @mabelzhang. Unfortunately, the error cannot be ignored; adding the --privileged flag on my system is what made it go away. I'm not sure what else I can try on my end to solve this.

caguero commented 3 years ago

> Thanks for checking again, @mabelzhang. Unfortunately, the error cannot be ignored; adding the --privileged flag on my system is what made it go away. I'm not sure what else I can try on my end to solve this.

@rolker, it looks like your code is good to be merged. Although we can't promise anything, we'll do our best to run your solution with CUDA support.

mabelzhang commented 3 years ago

I could use some help to speed up the troubleshooting.

So, it appears I am able to see the CUDA driver in my Docker container, but I'm still getting that CUDA Error: no CUDA-capable device is detected error.

I added the --privileged and --runtime=nvidia flags for the competitor container, as shown here: https://github.com/osrf/vrx-docker/pull/30.

The --runtime=nvidia flag gave me the nvidia-smi command in the vorc-competitor-system Docker container, and it showed CUDA Version: 11.0. (My host machine actually has 11.1; I don't know whether that difference matters.)

The --privileged made all the /dev/nvidia* devices on my host machine show up in the competitor container.
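
Condensed, the competitor container is now started with these extra flags (everything else is as in run_trial.bash and omitted here):

```bash
# Extra flags added to the competitor docker run call in vrx-docker PR 30;
# all other arguments are unchanged and omitted here.
#   --runtime=nvidia : NVIDIA container runtime (driver libraries, nvidia-smi)
#   --privileged     : exposes the host's /dev/nvidia* device nodes
docker run \
    --runtime=nvidia \
    --privileged \
    ${DOCKERHUB_IMAGE} &
```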

In your container, I get:

$ docker exec -it vorc-competitor-system bash
$ ls /dev/nvidia* -1
/dev/nvidia-modeset
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia0
/dev/nvidiactl

/dev/nvidia-caps:
nvidia-cap1
nvidia-cap2

$ nvidia-smi
Thu Dec 17 01:56:54 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P0    N/A /  N/A |   1898MiB /  4042MiB |     61%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

On my host machine, I have:

$ ls /dev/nvidia* -1
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-modeset
/dev/nvidia-uvm
/dev/nvidia-uvm-tools

/dev/nvidia-caps:
nvidia-cap1
nvidia-cap2

$ nvidia-smi
Thu Dec 17 05:02:32 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0    N/A /  N/A |    939MiB /  4042MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/xorg/Xorg                 72MiB |
|    0   N/A  N/A      2527      G   /usr/bin/gnome-shell              116MiB |
|    0   N/A  N/A      4396      G   /usr/lib/xorg/Xorg                469MiB |
|    0   N/A  N/A      4554      G   /usr/bin/gnome-shell               61MiB |
|    0   N/A  N/A      5376      G   ...AAAAAAAAA= --shared-files      215MiB |
+-----------------------------------------------------------------------------+

Does anything look wrong?

I'm also able to run the official nvidia/cuda container, which shows 11.1:

$ docker run --rm --runtime=nvidia -ti nvidia/cuda
# nvidia-smi
Thu Dec 17 10:21:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0    N/A /  N/A |    964MiB /  4042MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Any other command I can run to check the CUDA configuration in the competitor container?
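
For reference, a few generic CUDA diagnostics that could be run inside the competitor container (none of these are specific to this image, and some of the tools may simply not be installed):

```bash
# Generic CUDA checks inside a container; some tools may not be present.
nvcc --version                      # which CUDA toolkit, if any, is installed?
ls -d /usr/local/cuda* 2>/dev/null  # common toolkit install locations
ldconfig -p | grep libcuda          # is the driver's libcuda.so visible to the dynamic linker?
echo "$NVIDIA_DRIVER_CAPABILITIES"  # with the NVIDIA runtime, this must include "compute" for CUDA
```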

rolker commented 3 years ago

Thank you @caguero for accepting our submission and thank you very much @mabelzhang for all your efforts in trying to get this to work.

Unfortunately I don't know what else to try to fix this.

I'll try to put together a plan B that does not require CUDA before the end of the week.

rolker commented 3 years ago

Update: I've tried a few approaches to run the same network we trained without needing CUDA. The first, easier approach takes 10+ seconds per frame, so it's not fast enough for task 3 and probably not usable for task 4 either.

My second attempt is more promising, working at almost a frame per second on my machine. That should be adequate for task 3, and usable for task 4. I still need to finish integrating this approach with our solution and expect to have a revised submission Friday.

Thank you all for your patience.

mabelzhang commented 3 years ago

Thank you for continuing the effort!

Given this and https://github.com/osrf/vorc/issues/38, next time around, we should make it clear in the instructions whether/how we'll support teams using CUDA.

caguero commented 3 years ago

@rolker, I also tested it on my machine and got exactly the same issue that Mabel described. If you can provide a non-CUDA solution, we're happy to rerun your tasks 3 and 4.

M1chaelM commented 3 years ago

Hope this isn't overkill, but I also reproduced the error Mabel is describing. I'm digging in a little to see why the container can't find the device it wants. I'll post again if I make any progress.

rolker commented 3 years ago

Thanks, all, for looking into this. I just opened a new pull request with an updated image.

mabelzhang commented 3 years ago

Thank you for your hard work! The new PR #24 has been approved and merged. We will rerun your tasks 3 and 4.