ros2 / rclpy

rclpy (ROS Client Library for Python)
Apache License 2.0
291 stars 224 forks source link

Humble has higher CPU usage compared to Foxy #1190

Closed rty813 closed 1 week ago

rty813 commented 11 months ago

Bug report

Required Info:

Steps to reproduce issue

ros2 topic pub -r 100 /testt std_msgs/msg/String "{data: hello}"
ros2 topic hz /testt

Expected behavior

Humble has the same cpu usage as Foxy

Actual behavior

Humble has higher CPU usage

Humble: image

Foxy: image

Additional information

If I use C++ to receive, the CPU usage remains largely unchanged, with Humble even slightly lower. Hence, I am more suspicious that the issue lies with rclpy. It is likely that ros2 topic hz is also implemented using Python.

Humble: image

Foxy: image


At the same time, I conducted tests on an X86 computer, and the performance of Humble was decent, with only a slight increase in CPU usage compared to Foxy. image

rty813 commented 11 months ago

But if I test with the Docker image arm64v8/ros:humble-ros-core, the CPU usage is normal. Could it be because there's an issue with the versions of rclpy libraries I compiled? image

fujitatomoya commented 11 months ago

Nvidia Jetson Xavier NX, Ubuntu 18.04

Are you running the container image on top of this host system?

rclpy(Foxy): 1.0.13

with Ubuntu 20.04? (btw, Foxy is E.O.L)

rclpy(Humble): 3.3.9

and with Ubuntu 22.04?

Could it be because there's an issue with the versions of rclpy libraries I compiled?

unlikely, both versions are almost HEAD, I do not think so. there are many difference between them, i am not sure which fix generates this performance difference though.

rty813 commented 11 months ago

Nvidia Jetson Xavier NX, Ubuntu 18.04

Are you running the container image on top of this host system?

container image

rclpy(Foxy): 1.0.13

with Ubuntu 20.04? (btw, Foxy is E.O.L)

Ubuntu 18.04.

rclpy(Humble): 3.3.9

and with Ubuntu 22.04?

Ubuntu 18.04, This image is the same as Foxy's image.This is built based on the Nvidia L4T image.

Could it be because there's an issue with the versions of rclpy libraries I compiled?

unlikely, both versions are almost HEAD, I do not think so. there are many difference between them, i am not sure which fix generates this performance difference though.

rty813 commented 11 months ago

This is my dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.5.0

ENV OPENBLAS_CORETYPE=ARMV8
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

ARG DEBIAN_FRONTEND=noninteractive

ARG TOOLS=" \
    build-essential \
    fish \
    sudo \
    git \
    vim \
    python3-pip \
    gnupg \
    supervisor \
    iputils-ping \
    iproute2 \
    net-tools \
    locales \
    lrzsz \
    curl \
    wget \
    v4l-utils \
    libopenblas-dev \
    nvidia-tensorrt \
    nvidia-cudnn8 \
    nvidia-cuda \
    nvidia-opencv \
    python3-libnvinfer \
    python3-opencv"

ADD assets/cmake-3.20.0-linux-aarch64.tar.gz /usr/local/
RUN sed -i 's@http://.*.ubuntu.com@http://mirrors.tuna.tsinghua.edu.cn@g' /etc/apt/sources.list \
    && localedef -c -f UTF-8 -i zh_CN zh_CN.UTF-8 \
    && ln -s /usr/local/cmake-3.20.0-linux-aarch64/bin/* /usr/local/bin/ \
    && apt-get update \
    && apt-get install --no-install-recommends -y software-properties-common dirmngr gpg-agent \
    && apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc \
    && echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" > /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
    && echo "deb https://repo.download.nvidia.com/jetson/t194 r32.5 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
    && apt-add-repository -y ppa:fish-shell/release-3 \
    && apt-get install --no-install-recommends -y ${TOOLS} \
    && rm -rf /var/lib/apt/lists/*

# Install python package
RUN git config --global http.sslVerify "false" \
    && pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade pip \
    && pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple \
    && cd /tmp \
    && wget -m ftp://...........................:12221/home/Drive/wheels/nano \
    && pip3 install --ignore-requires-python --ignore-installed /tmp/nas.oca-teh.cn:12221/home/Drive/wheels/nano/* \
    && rm -rf /tmp/nas.oca-teh.cn:12221

# ROS2
RUN locale-gen zh_CN zh_CN.UTF-8 \
    && update-locale LC_ALL=zh_CN.UTF-8 LANG=zh_CN.UTF-8
ENV LANG=en_US.UTF-8
RUN add-apt-repository universe \
    && curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key -o /usr/share/keyrings/ros-archive-keyring.gpg \
    && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/ros-archive-keyring.gpg] http://packages.ros.org/ros2/ubuntu $(. /etc/os-release && echo $UBUNTU_CODENAME) main" | tee /etc/apt/sources.list.d/ros2.list > /dev/null \
    && add-apt-repository -y -n ppa:borglab/gtsam-release-4.1 \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
        python3-rosinstall-generator \
        python3-colcon-common-extensions \
        python3-rosdep \
        libpython3-dev \
        libflann-dev \
        libboost-all-dev \
        libgtsam-dev \
        libgtsam-unstable-dev \
        libtbb-dev \
    && pip3 install -U vcstool argcomplete

RUN mkdir -p /opt/ros/foxy/src \
    && cd /opt/ros/foxy \
    && vcs import --input https://........................:9091/-/snippets/855/raw/master/ros2.repo src \
    && rosdep init \
    && rosdep update \
    && rosdep install --from-paths src --ignore-src -y --skip-keys "fastcdr rti-connext-dds-5.3.1 urdfdom_headers libpcl-all-dev" \
    && cd /tmp \
    && wget -nv https://github.com/PointCloudLibrary/pcl/releases/download/pcl-1.13.1/source.tar.gz \
    && tar -xzf source.tar.gz \
    && mkdir pcl/build \
    && cd pcl/build \
    && cmake -DCMAKE_BUILD_TYPE=Release .. \
    && make -j30 \
    && make install \
    && rm -rf /tmp/source.tar.gz \
    && rm -rf /tmp/pcl \
    && cd /opt/ros/foxy \
    && colcon build --merge-install \
    && rm -rf build log src \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /root
rty813 commented 11 months ago

If I compile the ROS Humble on Ubuntu 22.04, the CPU usage returns to the same level as Foxy. But it still remains higher than arm64v8/ros:humble-ros-core.

image

rty813 commented 11 months ago

any advice ?

fujitatomoya commented 11 months ago

unfortunately i do not have any clue on this, you might need to use profiler or process space analyzer which call stack increases the complexity to narrow down, i guess.

rty813 commented 10 months ago

@fujitatomoya Through an open-source analysis tool, VizTracer library, I discovered that within each spin_once iteration, the add_node function in the Humble takes longer compared to Foxy. Subsequently, when I commented out the line self._guard.trigger() within the add_node function of the Humble, CPU usage noticeably decreased by 10%. Therefore, is there any modification in the garbage collection aspect causing this performance degradation?

image image

fujitatomoya commented 10 months ago

add_node function in the Humble takes longer compared to Foxy.

i am not sure what exactly leads to this performance comparison, but from foxy to humble, there were huge changes including WaitSet with pybind11. that might be some performance impact, and i do not think we could roll back these changes to foxy...

Subsequently, when I commented out the line self._guard.trigger() within the add_node function of the Humble

this explains since collecting entities is the work. but i think you do not want to do this in the application. this trigger() will wake up the WaitSet to go to the next loop to collect entities to wait. (note, that is not garbage collection, guard condition.)

rty813 commented 10 months ago

Why is add_node called first in each spin_once iteration, followed by remove_node? Can using StaticSingleThreadedExecutor help avoid this situation?

Barry-Xu-2018 commented 10 months ago

Why is add_node called first in each spin_once iteration, followed by remove_node? Can using StaticSingleThreadedExecutor help avoid this situation?

I'm not sure if rclpy::spin() is you want.
https://github.com/ros2/rclpy/blob/6cd6fd19e8f27c4e55a1fd082db98dc99b1f085d/rclpy/rclpy/__init__.py#L225-L242

rty813 commented 10 months ago

So, this means that ros2 topic hz is essentially looping through calls to rclpy.spin_once(node), and each spin_once involves both add_node and remove_node. Since the performance of add_node is slightly worse in Humble compared to Foxy, is this the cause of the performance difference?

Next, I will test calling spin(node) to avoid redundant add_node calls and observe the differences between Foxy and Humble.

Barry-Xu-2018 commented 10 months ago

So, this means that ros2 topic hz is essentially looping through calls to rclpy.spin_once(node), and each spin_once involves both add_node and remove_node. Since the performance of add_node is slightly worse in Humble compared to Foxy, is this the cause of the performance difference?

Refer to code

https://github.com/ros2/ros2cli/blob/419afea1282ae4daa43ea446a2c0bcc7394f856c/ros2topic/ros2topic/verb/hz.py#L262-L264

As fujitatomoya said, there are huge changes between foxy and Humble. Performance impacts could introduce from the changes from many different modules.

rty813 commented 10 months ago

I continued testing using the spin function and indeed, the repetitive calls to add_node were eliminated. However, there still exists a difference in CPU usage on the ARMv7 platform. During testing on the ARMv7 platform, the difference in CPU usage exceeded two-fold, with Foxy around 35% and Humble around 70%. Upon analysis, it seems that the prolonged execution of the _take_subscription and trigger functions is causing this disparity. I am unsure about how to proceed with further analysis.

image image

fujitatomoya commented 10 months ago

I am not sure either how to proceed further right now. If that was our requirement, i would use traceability to check the each function call including implementation to get to the bottle neck.

rty813 commented 1 week ago

Through further testing, I found that the performance difference might be related to pybind11 on armv7. I added an empty function in the C code for both Foxy and Humble, then called it in Python at 1ms intervals. Humble performed worse. Next, I will further compare Jazzy and Humble.

rty813 commented 1 week ago

I found the issue: the performance of pybind11 was poor because the cmake_build_type was not set to release during compilation.

fujitatomoya commented 1 week ago

@rty813 thanks for sharing the status.

glad to hear that you fixed the problem. the problem was your build environment but ros2 packages at all? just curious that if some fixes need to be applied to the ros2 system or not.

rty813 commented 1 week ago

When using colcon build to build ROS packages, the default cmake_build_type is Debug. On ARM architecture, pybind11 performs poorly. If you specify --cmake-args -DCMAKE_BUILD_TYPE=Release, it can achieve normal performance levels

rty813 commented 1 week ago

When spin_once is called more frequently, the performance issues caused by pybind11 become more noticeable. The Foxy version uses the C extension for CPython instead of pybind11, so this issue does not exist.