usnistgov / ARIAC

Repository for ARIAC (Agile Robotics for Industrial Automation Competition), consisting of kit building and assembly in a simulated warehouse
https://pages.nist.gov/ARIAC_docs/en/latest/index.html

Problems with Docker testing #355

Open · Zhangjiyuan1 opened this issue 4 months ago

Zhangjiyuan1 commented 4 months ago

Is GPU acceleration definitely enabled for the competition? Since we use a deep learning algorithm for image recognition, the trial cannot be completed smoothly in Docker without GPU acceleration. We tried Nvidia 4090 acceleration on our own computer, and the trial ran as smoothly as when we debugged it in the original environment.

jaybrecht commented 4 months ago

Yes, the automated evaluation is run on a computer with an Nvidia 3070. The automated evaluation passes the GPU to the Docker container through the NVIDIA Container Toolkit. Can you try following the instructions for running the automated evaluation with GPU acceleration on your machine? If you run into issues, please post them here and we can try to help you debug.
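
For reference, with the toolkit installed, Docker exposes the GPU through its --gpus flag. Conceptually the evaluation does something like this (an illustration only, not the actual evaluation command; <competitor_image> is a placeholder):

# Illustration only: with the NVIDIA Container Toolkit installed on the host,
# Docker can expose the GPU to a container. nvidia-smi should then list the
# host GPU from inside the container.
$ docker run --rm --gpus all <competitor_image> nvidia-smi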

AravindaDP commented 4 months ago

For us, when we try to build the competitor Docker image with the nvidia flag (e.g. ./build_container.sh runaround_robotics nvidia), it fails to install the nvidia-cuda-dev or nvidia-cuda-toolkit packages, whether via apt-get in the pre-build scripts or when we let rosdep install them. Strangely, however, the build succeeds if the nvidia flag is not used in the build_container.sh command.

While we are not 100% sure, apart from that it appears to work OK. It is also possible that, since our models are not that big, they are falling back to running on the CPU in the absence of a GPU.

saahu27 commented 4 months ago

Yes, running without the nvidia tag defaults to the CPU container runtime. To enable the GPU, please install the NVIDIA Container Toolkit from here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html. After installing, you also need to configure it as shown in the link. Please let us know here if you run into any issues.

This installation has to be done on your host system, not in the pre-build scripts. Previously, to expose your GPU to the container, you had to install dependencies in the container through the Dockerfile / build scripts. Since last year, NVIDIA provides a toolkit and configuration to install on your host system, and the Docker engine can then use GPU resources for container runtimes.
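
In short, the host-side setup looks like this (abridged from the linked install guide; see the guide for the apt repository configuration step):

# On the HOST system, not inside the container or pre-build scripts:
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit

# Register the nvidia runtime with Docker and restart the daemon:
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker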

AravindaDP commented 4 months ago

In fact, the container toolkit is installed on our host machine. We used the "Installing with Apt" option and then configured Docker with the nvidia-ctk command. The following is the current state of our host PC.

$ cat /etc/apt/sources.list.d/nvidia-container-toolkit.list 
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /

Note that we did not configure the optional experimental packages. Are they required?

$ sudo apt-get install nvidia-container-toolkit
[sudo] password for aravindadp: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
nvidia-container-toolkit is already the newest version (1.15.0-1).
$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Note that we did not configure rootless mode. Is that also essential? (The docker command can be run without sudo by adding the user to the docker group, as outlined in https://docs.docker.com/engine/install/linux-postinstall/.) The host system has Docker installed using the steps at https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository. Apart from that, our host system has nvidia-driver-535, nvidia-cuda-dev, and nvidia-cuda-toolkit installed for host-based development, and it works fine for that purpose.
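
For completeness, the sample workload from the toolkit install guide can confirm that the runtime itself passes the GPU through:

# Sanity check from the NVIDIA Container Toolkit install guide; if the
# runtime is configured correctly this prints the usual GPU table.
$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi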

If we don't include any nvidia-cuda-* packages in our pre-build script, the competitor container builds fine with the nvidia flag. (But then we can't compile any competitor packages that rely on the CUDA toolkit.)

The following is the error we get when we try to build the competitor container with the nvidia flag:

$ ./build_container.sh nvidia_test nvidia
non-network local connections being added to access control list
1e6e3862b8043c6114c06560003d40d9b06b621e191407e676d4561ce1e7aa0b
Successfully copied 12.8kB to nvidia_test:/
Successfully copied 5.63kB to nvidia_test:/
Successfully copied 31.2kB to nvidia_test:/
Successfully copied 2.05kB to nvidia_test:/container_scripts
running nvidia_test.yaml
Cloning into '/workspace/src/nvidia_test'...
warning: redirecting to https://github.com/usnistgov/nist_competitor.git/
remote: Enumerating objects: 127, done.
remote: Counting objects: 100% (127/127), done.
remote: Compressing objects: 100% (75/75), done.
remote: Total 127 (delta 63), reused 80 (delta 35), pack-reused 0
Receiving objects: 100% (127/127), 109.72 KiB | 148.00 KiB/s, done.
Resolving deltas: 100% (63/63), done.
Note: switching to 'cee3a349bd330e901724c0f424a5302cf0ab0567'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

==== Installing apt dependencies
Get:1 http://packages.ros.org/ros2/ubuntu jammy InRelease [4682 B]             

Skipped some output for brevity

The following NEW packages will be installed:
  libaccinj64-11.5 libcub-dev libcublas11 libcublaslt11 libcudart11.0
  libcufft10 libcufftw10 libcuinj64-11.5 libcupti-dev libcupti-doc
  libcupti11.5 libcurand10 libcusolver11 libcusolvermg11 libcusparse11
  libnppc11 libnppial11 libnppicc11 libnppidei11 libnppif11 libnppig11
  libnppim11 libnppist11 libnppisu11 libnppitc11 libnpps11 libnvblas11
  libnvidia-compute-495 libnvidia-compute-510 libnvidia-compute-535
  libnvidia-ml-dev libnvjpeg11 libnvrtc-builtins11.5 libnvrtc11.2
  libnvtoolsext1 libnvvm4 libthrust-dev libvdpau-dev node-html5shiv
  nvidia-cuda-dev
0 upgraded, 40 newly installed, 0 to remove and 502 not upgraded.
Need to get 1324 MB of archives.
After this operation, 3820 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 libcupti11.5 amd64 11.5.114~11.5.1-1ubuntu1 [7696 kB]

Skipped some output for brevity

Fetched 1324 MB in 27min 47s (794 kB/s)                                        
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libcupti11.5:amd64.
(Reading database ... 145811 files and directories currently installed.)
Preparing to unpack .../00-libcupti11.5_11.5.114~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcupti11.5:amd64 (11.5.114~11.5.1-1ubuntu1) ...
Selecting previously unselected package libaccinj64-11.5:amd64.
Preparing to unpack .../01-libaccinj64-11.5_11.5.114~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libaccinj64-11.5:amd64 (11.5.114~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcub-dev.
Preparing to unpack .../02-libcub-dev_1.15.0-3_all.deb ...
Unpacking libcub-dev (1.15.0-3) ...
Selecting previously unselected package libcublaslt11:amd64.
Preparing to unpack .../03-libcublaslt11_11.7.4.6~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcublaslt11:amd64 (11.7.4.6~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcublas11:amd64.
Preparing to unpack .../04-libcublas11_11.7.4.6~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcublas11:amd64 (11.7.4.6~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcudart11.0:amd64.
Preparing to unpack .../05-libcudart11.0_11.5.117~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcudart11.0:amd64 (11.5.117~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcufft10:amd64.
Preparing to unpack .../06-libcufft10_11.1.1+~10.6.0.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcufft10:amd64 (11.1.1+~10.6.0.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcufftw10:amd64.
Preparing to unpack .../07-libcufftw10_11.1.1+~10.6.0.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcufftw10:amd64 (11.1.1+~10.6.0.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvidia-compute-535:amd64.
Preparing to unpack .../08-libnvidia-compute-535_535.171.04-0ubuntu0.22.04.1_amd64.deb ...
Unpacking libnvidia-compute-535:amd64 (535.171.04-0ubuntu0.22.04.1) ...
dpkg: error processing archive /tmp/apt-dpkg-install-tG2FK0/08-libnvidia-compute-535_535.171.04-0ubuntu0.22.04.1_amd64.deb (--unpack):
 unable to make backup link of './usr/lib/x86_64-linux-gnu/libcuda.so.535.171.04' before installing new version: Invalid cross-device link
dpkg-deb: error: paste subprocess was killed by signal (Broken pipe)
Selecting previously unselected package libnvidia-compute-510:amd64.
Preparing to unpack .../09-libnvidia-compute-510_525.147.05-0ubuntu2.22.04.1_amd64.deb ...
Unpacking libnvidia-compute-510:amd64 (525.147.05-0ubuntu2.22.04.1) ...
Selecting previously unselected package libnvidia-compute-495:amd64.
Preparing to unpack .../10-libnvidia-compute-495_510.108.03-0ubuntu0.22.04.1_amd64.deb ...
Unpacking libnvidia-compute-495:amd64 (510.108.03-0ubuntu0.22.04.1) ...
Selecting previously unselected package libcuinj64-11.5:amd64.
Preparing to unpack .../11-libcuinj64-11.5_11.5.114~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcuinj64-11.5:amd64 (11.5.114~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcurand10:amd64.
Preparing to unpack .../12-libcurand10_11.1.1+~10.2.7.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcurand10:amd64 (11.1.1+~10.2.7.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcusolver11:amd64.
Preparing to unpack .../13-libcusolver11_11.3.2.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcusolver11:amd64 (11.3.2.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcusolvermg11:amd64.
Preparing to unpack .../14-libcusolvermg11_11.3.2.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcusolvermg11:amd64 (11.3.2.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcusparse11:amd64.
Preparing to unpack .../15-libcusparse11_11.7.0.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcusparse11:amd64 (11.7.0.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppc11:amd64.
Preparing to unpack .../16-libnppc11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppc11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppial11:amd64.
Preparing to unpack .../17-libnppial11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppial11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppicc11:amd64.
Preparing to unpack .../18-libnppicc11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppicc11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppidei11:amd64.
Preparing to unpack .../19-libnppidei11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppidei11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppif11:amd64.
Preparing to unpack .../20-libnppif11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppif11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppig11:amd64.
Preparing to unpack .../21-libnppig11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppig11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppim11:amd64.
Preparing to unpack .../22-libnppim11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppim11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppist11:amd64.
Preparing to unpack .../23-libnppist11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppist11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppisu11:amd64.
Preparing to unpack .../24-libnppisu11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppisu11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnppitc11:amd64.
Preparing to unpack .../25-libnppitc11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnppitc11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnpps11:amd64.
Preparing to unpack .../26-libnpps11_11.5.1.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnpps11:amd64 (11.5.1.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvblas11:amd64.
Preparing to unpack .../27-libnvblas11_11.7.4.6~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvblas11:amd64 (11.7.4.6~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvidia-ml-dev:amd64.
Preparing to unpack .../28-libnvidia-ml-dev_11.5.50~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvidia-ml-dev:amd64 (11.5.50~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvjpeg11:amd64.
Preparing to unpack .../29-libnvjpeg11_11.5.4.107~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvjpeg11:amd64 (11.5.4.107~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvrtc-builtins11.5:amd64.
Preparing to unpack .../30-libnvrtc-builtins11.5_11.5.119~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvrtc-builtins11.5:amd64 (11.5.119~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvrtc11.2:amd64.
Preparing to unpack .../31-libnvrtc11.2_11.5.119~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvrtc11.2:amd64 (11.5.119~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvvm4:amd64.
Preparing to unpack .../32-libnvvm4_11.5.119~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvvm4:amd64 (11.5.119~11.5.1-1ubuntu1) ...
Selecting previously unselected package libvdpau-dev:amd64.
Preparing to unpack .../33-libvdpau-dev_1.4-3build2_amd64.deb ...
Unpacking libvdpau-dev:amd64 (1.4-3build2) ...
Selecting previously unselected package node-html5shiv.
Preparing to unpack .../34-node-html5shiv_3.7.3+dfsg-4_all.deb ...
Unpacking node-html5shiv (3.7.3+dfsg-4) ...
Selecting previously unselected package libcupti-dev:amd64.
Preparing to unpack .../35-libcupti-dev_11.5.114~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libcupti-dev:amd64 (11.5.114~11.5.1-1ubuntu1) ...
Selecting previously unselected package libcupti-doc.
Preparing to unpack .../36-libcupti-doc_11.5.114~11.5.1-1ubuntu1_all.deb ...
Unpacking libcupti-doc (11.5.114~11.5.1-1ubuntu1) ...
Selecting previously unselected package libnvtoolsext1:amd64.
Preparing to unpack .../37-libnvtoolsext1_11.5.114~11.5.1-1ubuntu1_amd64.deb ...
Unpacking libnvtoolsext1:amd64 (11.5.114~11.5.1-1ubuntu1) ...
Selecting previously unselected package libthrust-dev.
Preparing to unpack .../38-libthrust-dev_1.15.0-1_all.deb ...
Unpacking libthrust-dev (1.15.0-1) ...
Selecting previously unselected package nvidia-cuda-dev:amd64.
Preparing to unpack .../39-nvidia-cuda-dev_11.5.1-1ubuntu1_amd64.deb ...
Unpacking nvidia-cuda-dev:amd64 (11.5.1-1ubuntu1) ...
Errors were encountered while processing:
 /tmp/apt-dpkg-install-tG2FK0/08-libnvidia-compute-535_535.171.04-0ubuntu0.22.04.1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

Is there any additional configuration needed? Alternatively, is there a way to build without the nvidia flag but run with nvidia enabled? (Assuming that would still work.)

LKmubihei commented 4 months ago

Hi, our team uses ./build_container.sh nist_competitor, which runs on the CPU by default; the real-time factor in Docker is very small, about 0.5, and it's very laggy, causing some tasks not to complete properly. If we instead use ./build_container.sh nist_competitor nvidia, the real-time factor in Docker is about 0.9 and the code runs fine. So we want to confirm: will this command be run with the nvidia tag in the final test?

saahu27 commented 4 months ago

The steps you took seem to be accurate, and I don't see a problem there. However, you would ideally not need any nvidia packages on the container. Can you please remove them from the pre-build scripts, rebuild the container, and see if you are able to build with the nvidia tag? @AravindaDP

saahu27 commented 4 months ago

The final is run on a computer with an Nvidia 3070. Yes, we will make sure to use the nvidia tag when we run the finals. @LKmubihei

AravindaDP commented 4 months ago

> The steps you took seem to be accurate, and I don't see a problem there. However, you would ideally not need any nvidia packages on the container. Can you please remove them from the pre-build scripts, rebuild the container, and see if you are able to build with the nvidia tag? @AravindaDP

As I explained, building the competitor container with the nvidia flag works if our solution does not include any packages in source form that rely on nvidia-cuda-toolkit (libraries and compilers for CUDA development). The problem occurs when our solution includes packages in source form that require nvidia-cuda-toolkit to build.

Our understanding is this:

  1. The base Docker image (nistariac/ariac2024) used for building the competitor container does not contain any compilers or libraries needed for CUDA development. Hence, if those aren't installed inside the competitor container using the pre-build script, you can't build any packages in the competitor solution that rely on the CUDA toolkit. So I think the statement "you would ideally not need any nvidia packages on the container" is not valid for competitors who want to use the GPU for a perception stack etc. in their solution.
  2. For some strange reason, nvidia-cuda-toolkit and related packages can't be installed by the pre-build script while building the competitor container (if the nvidia flag is used). This then breaks subsequent builds of competitor packages that rely on the nvidia compilers. (Not shown in the output above; it generally materializes as CMake failing to find CUDA.)

Having said that, strangely, our qualification submission somehow succeeded despite having nvidia-cuda-toolkit installed by the pre-build script. We are not sure whether this is because the nvidia flag was not used for us during the qualification round, or whether it is a configuration issue on our host machine.

Also note that this is different from the instructions at https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network. That location provides package(s) named cuda-toolkit-*. The issue we have with them is that, out of the box, those packages are not discoverable by CMake, which still complains that CUDA is not installed. (Perhaps additional manual steps not outlined on that page are needed.) So we are using nvidia-cuda-toolkit (we believe from the Ubuntu universe repository), which works out of the box with CMake.

AravindaDP commented 4 months ago

Update: We figured out that cuda-toolkit-* can be installed successfully in the pre-build script; however, /usr/local/cuda/bin needs to be added to the PATH environment variable for CMake to discover it. Since export statements used in the pre-build script do not carry over to the package build, we resort to building the specific CUDA-dependent packages in the pre_build script and skipping them during the usual colcon build of the other competitor packages. (Perhaps, as an alternative, -DCUDAToolkit_ROOT=/some/path could be used instead, as described in https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html.) A rough sketch is below.
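
Sketch of the pre_build workaround (my_cuda_pkg and the toolkit version are placeholders for our actual packages, and the CUDA apt repository is assumed to already be configured per the NVIDIA download page):

# In the pre_build script, install the toolkit and build the CUDA-dependent
# package while /usr/local/cuda/bin is on PATH (the export only lives for
# the duration of this script):
$ apt-get update && apt-get install -y cuda-toolkit-11-8
$ export PATH=/usr/local/cuda/bin:$PATH
$ cd /workspace && colcon build --packages-select my_cuda_pkg

# Alternative: point CMake at the toolkit directly instead of editing PATH:
$ colcon build --packages-select my_cuda_pkg --cmake-args -DCUDAToolkit_ROOT=/usr/local/cuda

# The usual build of the remaining packages then skips it:
$ colcon build --packages-skip my_cuda_pkg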

However, one thing we noticed is that the Docker container crashes at a high rate when running trials, most of the time with: boost thread: trying joining itself: Resource deadlock avoided.

AravindaDP commented 4 months ago

Also, is there a way to update the competitor container with new trial files without rebuilding from scratch? The current approach of forcefully removing the container and running build_container again is quite expensive when we only need to make a small update like this.

jaybrecht commented 4 months ago

Yes, we built a feature to update the trials on the container. Currently that feature is only on the scoring branch. There is also a new way to run multiple trials using a GUI that might be helpful. It stops the container after each trial, which we have found helps address some of the errors we see when running multiple trials in a row. Just run python3 run_trials.py and a GUI will pop up. This includes the feature to update trials without rebuilding.

AravindaDP commented 3 months ago

During recent Docker testing we came across a scenario where our solution fails with the following error. The container was running, with Gazebo launching and running fine.

[start_behavior-29]     sl = self._semlock = _multiprocessing.SemLock(
[start_behavior-29] OSError: [Errno 28] No space left on device

When the df command was run inside the container, it turned out that /dev/shm was completely filled up.
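
For reference, this is how we checked, and the Docker default that seems relevant (whether the ARIAC scripts expose --shm-size is a separate question):

# Inside the container: /dev/shm is a tmpfs that Docker caps at 64MB by
# default, and it shows as 100% used.
$ df -h /dev/shm

# A larger size can be requested when the container is created, e.g.:
$ docker run --shm-size=2g ...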

AravindaDP commented 3 months ago

We also observed a higher rate of failure to complete runs of qualifier_trial_6.yaml used in the qualifications (to the point that it's impossible to get a completed result), with the following error causing a Gazebo crash:

terminate called after throwing an instance of 'boost::wrapexcept<boost::thread_resource_error>'
  what():  boost thread: trying joining itself: Resource deadlock avoided
Aborted (core dumped)
Gazebo not running

The difference we observe from other trials is that this trial has a very large number of parts in the environment, which leads to a very low real-time simulation rate (~0.3x on our system). Perhaps reducing unrelated parts in the environment during final trial design would be worth considering?