AMD64 Docker Container fails to build with -DTorch_DIR flag set in cmake

erik-fauna commented 4 months ago

Note: this is a direct comparison with the arm64 architecture, where this does work. Apart from the architecture, the CMake versions differ slightly: CMake 3.22.1 on amd64 and CMake 3.29.1 on arm64.

In a Dockerfile, after all the dependencies are installed, the following commands are run:

RUN git clone https://github.com/introlab/rtabmap.git /workspace/rtabmap
RUN . /opt/ros/humble/setup.sh && \
    cd rtabmap/build && \
    cmake -DWITH_OPENGV=ON -DWITH_TORCH=ON -DWITH_PYTHON=ON .. && \
    make -j$(nproc) && \
    make install

On both architectures, this fails to provide rtabmap with CUDA-enabled Torch for SuperPoint.

However, adding the Torch_DIR flag as follows works on arm64:

RUN git clone https://github.com/introlab/rtabmap.git /workspace/rtabmap
RUN . /opt/ros/humble/setup.sh && \
    cd rtabmap/build && \
    cmake -DWITH_OPENGV=ON -DWITH_TORCH=ON -DWITH_PYTHON=ON -DTorch_DIR=/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch .. && \
    make -j$(nproc) && \
    make install

Unfortunately, it fails to compile on amd64, even though the Torch installation is in the same location there. The CMake output also differs: in particular, many libraries (maybe 40?) located in /usr/lib/x86_64-linux-gnu are missing from the "Found" output, such as:

PCL_COMMON
PCL_OCTREE
PCL_IO
PCL_KDTREE
PCL_SEARCH
PCL_SURFACE
PCL_FILTERS
etc.

LD_LIBRARY_PATH is the same with and without the flag, so the libraries should be visible either way. Any idea why they would not be found after adding that flag?

The compilation fails mostly because of undefined references to things like rtabmap, uToUpperCase, UFile, grid_map, pcl, PointMatcher, uBool2str, UDirectory, etc.
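One way to see what actually changes when -DTorch_DIR is added is to configure into two separate build directories, one with the flag and one without, and compare the resulting CMakeCache.txt files. The Python sketch below is only a diagnostic aid, not something used in this thread: it assumes the standard NAME:TYPE=VALUE cache line format, and the name prefixes it filters on (PCL, OpenCV, Torch) and the script name diff_cache.py are illustrative choices you may want to change.

```python
import sys
from pathlib import Path


def read_cmake_cache(path):
    """Parse a CMakeCache.txt into {name: value}, skipping comments and blank lines."""
    entries = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue
        # Cache lines look like NAME:TYPE=VALUE; the value itself may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name = key.split(":", 1)[0].strip('"')
        entries[name] = value
    return entries


def diff_caches(before, after, prefixes=("PCL", "OpenCV", "Torch")):
    """Return {name: (before, after)} for entries that differ and start with one of prefixes."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        if not name.startswith(prefixes):
            continue
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: diff_cache.py WITHOUT_FLAG/CMakeCache.txt WITH_FLAG/CMakeCache.txt")
    else:
        changes = diff_caches(read_cmake_cache(sys.argv[1]), read_cmake_cache(sys.argv[2]))
        for name, (old, new) in changes.items():
            print(f"{name}: {old} -> {new}")
```

Entries that flip to a -NOTFOUND value, or package directories that suddenly point somewhere else, would be the first things to check when the PCL libraries drop out of the "Found" output.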

matlabbe commented 4 months ago

What is your Dockerfile, or what is your base image? I have only tested PyTorch on amd64. See this Dockerfile for an example: https://github.com/introlab/rtabmap/blob/master/docker/frontiers2022/Dockerfile

erik-fauna commented 3 months ago

My base image is different from yours, which could be part of the issue.

# Use an NVIDIA CUDA base image that supports Ubuntu 22.04
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

Generally we're doing the same thing for our installs, but I'll switch to an NVIDIA base image that comes with PyTorch preinstalled. That said, when I enter my image and run torch.cuda.is_available(), it returns True. Unfortunately, the image you linked uses Ubuntu 20.04, which makes it harder to install ROS 2 Humble; Humble is required for the tf2_geometry_msgs used by rtabmap_conversions.
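The torch.cuda.is_available() check only covers the Python side. A small Python sketch, assuming torch is importable in the image, can report that check together with the CMake package directory shipped by the same torch installation, which is the directory -DTorch_DIR should point at. torch.utils.cmake_prefix_path and torch.version.cuda are real torch attributes; the function name is just illustrative.

```python
import json
import os


def describe_torch():
    """Return version, CUDA availability and the Torch_DIR candidate, or None if torch won't import."""
    try:
        import torch
    except (ImportError, OSError):
        return None
    return {
        "version": torch.__version__,
        # True only means the Python runtime sees a GPU; it says nothing about the C++ build.
        "cuda_available": torch.cuda.is_available(),
        # CUDA version the installed build targets (None for CPU-only builds).
        "cuda_version": torch.version.cuda,
        # cmake_prefix_path is .../torch/share/cmake; TorchConfig.cmake sits in its Torch/ subdirectory.
        "torch_dir": os.path.join(torch.utils.cmake_prefix_path, "Torch"),
    }


if __name__ == "__main__":
    print(json.dumps(describe_torch(), indent=2))
```

Deriving the path from the interpreter that will later run the SuperPoint tracing keeps the C++ build and the Python side on the same torch installation.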

matlabbe commented 3 months ago

It looks like the 24.07-py3 image is available for both arm64 and amd64, and it is based on Ubuntu Jammy:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04
Codename:   jammy

so ROS 2 Humble would be available. Torch is not installed in the same place as in my Dockerfile above; it is now here:

$ find / -name "TorchConfig.cmake"
/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake
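The same lookup can be scripted when the Torch or OpenCV location differs between images. This minimal Python sketch walks a few root directories (the defaults are assumptions; pass others on the command line) and returns the directories that contain each config file, since -DTorch_DIR and -DOpenCV_DIR expect the directory rather than the file itself.

```python
import os
import sys


def find_config_dirs(roots, filename):
    """Return sorted directories under roots that contain filename (missing roots are skipped)."""
    found = set()
    for root in roots:
        # os.walk yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            if filename in filenames:
                found.add(dirpath)
    return sorted(found)


if __name__ == "__main__":
    # Default search roots are assumptions; override them via command-line arguments.
    roots = sys.argv[1:] or ["/usr/local/lib", "/usr/lib/x86_64-linux-gnu/cmake"]
    for name in ("TorchConfig.cmake", "OpenCVConfig.cmake"):
        for directory in find_config_dirs(roots, name):
            print(f"{name}: {directory}")
```

On this image one would expect it to list the NVIDIA OpenCV under /usr/local/lib/cmake/opencv4 and, once libopencv-dev is installed, the system copy under /usr/lib/x86_64-linux-gnu/cmake/opencv4, making the choice between them explicit.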

It looks like we cannot use the NVIDIA-built OpenCV installed in /usr/local, because the stitching module is missing:

CMake Error at /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find OpenCV (missing: stitching) (found version "4.7.0")
Call Stack (most recent call first):
  /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  /usr/local/lib/cmake/opencv4/OpenCVConfig.cmake:354 (find_package_handle_standard_args)
  CMakeLists.txt:234 (FIND_PACKAGE)

After installing libopencv-dev, we should explicitly point CMake to the system version:

cmake -DTorch_DIR=/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch \
      -DOpenCV_DIR=/usr/lib/x86_64-linux-gnu/cmake/opencv4 \
      -DWITH_TORCH=ON \
      -DWITH_PYTHON=ON ..

It builds without errors on the amd64 image. I didn't try it under ROS 2, but the standalone version is working.
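To make the build fail fast with a clear message when either path is wrong (for example, when libopencv-dev has not been installed yet), the cmake invocation above can be assembled by a small script that first checks for each config file. This is a hedged Python sketch: the default paths are the ones from this thread, and the function name is hypothetical.

```python
import os
import shlex
import sys

TORCH_DIR = "/usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch"
OPENCV_DIR = "/usr/lib/x86_64-linux-gnu/cmake/opencv4"


def rtabmap_cmake_args(torch_dir=TORCH_DIR, opencv_dir=OPENCV_DIR, source_dir=".."):
    """Return the cmake argv, raising FileNotFoundError if a package config file is missing."""
    for directory, config in ((torch_dir, "TorchConfig.cmake"), (opencv_dir, "OpenCVConfig.cmake")):
        if not os.path.isfile(os.path.join(directory, config)):
            raise FileNotFoundError(f"{config} not found in {directory}")
    return [
        "cmake",
        f"-DTorch_DIR={torch_dir}",
        # Point at the system OpenCV, which includes the stitching module.
        f"-DOpenCV_DIR={opencv_dir}",
        "-DWITH_TORCH=ON",
        "-DWITH_PYTHON=ON",
        source_dir,
    ]


if __name__ == "__main__":
    try:
        print(shlex.join(rtabmap_cmake_args()))
    except FileNotFoundError as err:
        print(f"error: {err}", file=sys.stderr)
```

The returned list can be passed straight to subprocess.run(..., check=True) from the build directory, which avoids shell quoting issues with the long paths.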

erik-fauna commented 3 months ago

Thank you for the update. I was running into the same issues, but I didn't know that CMake flags could resolve them.

I'm not sure why, but installing libopencv-dev appears to break cv2 in Python, which is needed to generate the superpoint.pt file with the trace.py script. Therefore, that file should be created before libopencv-dev is installed in the Dockerfile.
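Because the breakage only shows up when cv2 is imported, a small check run right after each install step makes the ordering problem visible in the build log. A minimal Python sketch: it only tests the import in a fresh interpreter, and the trace.py step itself is not reproduced because its arguments are not given in this thread.

```python
import subprocess
import sys


def cv2_status(python=sys.executable):
    """Try to import cv2 in a fresh interpreter; return (ok, version or error line)."""
    result = subprocess.run(
        [python, "-c", "import cv2; print(cv2.__version__)"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return True, result.stdout.strip()
    # The last stderr line usually holds the ImportError message.
    lines = result.stderr.strip().splitlines()
    return False, lines[-1] if lines else f"exit code {result.returncode}"


if __name__ == "__main__":
    ok, detail = cv2_status()
    print(("cv2 OK: " if ok else "cv2 BROKEN: ") + detail)
```

Running it once before and once after installing libopencv-dev shows exactly which step breaks cv2.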

Building rtabmap_ros from source also requires OpenCV with the stitching module, so you will need to add the OpenCV_DIR flag there as well for a successful build:

RUN git clone --branch ros2 https://github.com/introlab/rtabmap_ros.git src/rtabmap_ros \
    && . /opt/ros/humble/setup.sh \
    && colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release -DOpenCV_DIR=/usr/lib/x86_64-linux-gnu/cmake/opencv4

Aside from that, running the ROS 2 node failed with this error: symbol lookup error: /opt/hpcx/ucc/lib/libucc.so.1: undefined symbol: ucs_config_doc_nop. This was resolved by running export LD_LIBRARY_PATH=/opt/hpcx/ucx/lib:$LD_LIBRARY_PATH before the launch file.
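When the node is started from a Python wrapper rather than an interactive shell, the same fix can be applied to the child process environment only. A minimal sketch, assuming /opt/hpcx/ucx/lib is the directory from the error above; the helper name is hypothetical. Unlike a plain export, it avoids leaving a trailing empty entry when LD_LIBRARY_PATH was unset, which the dynamic loader treats as the current directory.

```python
import os

UCX_LIB = "/opt/hpcx/ucx/lib"


def with_prepended_path(var, directory, env=None):
    """Copy env (default: os.environ) and put directory first in the path-list variable var."""
    new_env = dict(os.environ if env is None else env)
    # Drop empty entries and any existing copy of directory so it appears exactly once.
    parts = [p for p in new_env.get(var, "").split(os.pathsep) if p and p != directory]
    new_env[var] = os.pathsep.join([directory] + parts)
    return new_env


if __name__ == "__main__":
    env = with_prepended_path("LD_LIBRARY_PATH", UCX_LIB)
    print(env["LD_LIBRARY_PATH"])
```

The returned dict can then be passed as env= to subprocess.run when starting ros2 launch; the package and launch file names are left out since they depend on your setup.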

With those additions, this now runs on ROS 2 Humble; however, SuperGlue still seems to have issues. Currently I'm seeing this error message:

RegistrationVis.cpp:183::parseParameters() PyMatcher/Path parameter should be set to use Python3 matching (Vis/CorNNType=6), using default 1.

matlabbe commented 3 months ago

You may have to add the parameter "PyMatcher/Path": "/PATH/TO/SuperGluePretrainedNetwork/rtabmap_superglue.py" to the rtabmap node.
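In a ROS 2 Python launch file, these settings are usually supplied through the parameters list of the rtabmap node. A minimal sketch of the parameter dictionary, assuming rtabmap parameters are passed as strings (check how your rtabmap_ros version declares them) and keeping /PATH/TO as a placeholder:

```python
# Placeholder path; replace with the real location of the SuperGlue checkout.
SUPERGLUE_SCRIPT = "/PATH/TO/SuperGluePretrainedNetwork/rtabmap_superglue.py"


def superglue_matcher_params(script_path=SUPERGLUE_SCRIPT):
    """Parameters that select Python3 matching and point it at the SuperGlue wrapper script."""
    return {
        # 6 = Python3 matching, as named in the RegistrationVis warning.
        "Vis/CorNNType": "6",
        "PyMatcher/Path": script_path,
    }


if __name__ == "__main__":
    print(superglue_matcher_params())
```

The dictionary would then go into something like parameters=[superglue_matcher_params(), ...] on the launch_ros Node that starts rtabmap; package and executable names are left out because they differ between rtabmap_ros releases.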

erik-fauna commented 3 months ago

That worked, thanks!

introlab / rtabmap #1316