ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

Obstacle_layer plugin causes high CPU usage in controller_server and planner_server (sometimes over 120%) #2333

Closed wilsionhuang closed 3 years ago

wilsionhuang commented 3 years ago

Bug report

Obstacle_layer plugin causes high CPU usage in controller_server and planner_server

Required Info:

Steps to reproduce issue

1. Launch the laser node and the navigation2 software stack
2. Keep the robot standing still, or run a task moving it from point A to point B
3. Open htop in a new terminal; the CPU usage of controller_server and planner_server is unstable and high (the peak value sometimes exceeds 120%)

Expected behavior

The CPU usage of controller_server and planner_server is stable and not high.

Actual behavior

The CPU usage of controller_server and planner_server is unstable and abruptly increases to over 120%. Screenshot from 2021-05-11 15-39-34

Additional information

Temporary solution

  1. Transform the scan data (in the laser node) to a pointcloud, then transform the pointcloud from laser_link to map before publishing (see the sketch after this list)
  2. Or transform the scan data from laser_link to map directly
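
A minimal sketch of the relay in option 1 (this is not our exact node; the topic names scan / scan_cloud_map and the map global frame are placeholders). It converts each LaserScan into a PointCloud2 that is already expressed in the map frame before publishing, so the obstacle layer's pointcloud callback only has to buffer it:

// Sketch only: relay a LaserScan as a PointCloud2 already expressed in the
// global frame. Topic and frame names below are assumptions for illustration.
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "sensor_msgs/msg/point_cloud2.hpp"
#include "laser_geometry/laser_geometry.hpp"
#include "tf2/exceptions.h"
#include "tf2_ros/buffer.h"
#include "tf2_ros/transform_listener.h"

class ScanToMapCloud : public rclcpp::Node
{
public:
  ScanToMapCloud()
  : Node("scan_to_map_cloud"), tf_buffer_(get_clock()), tf_listener_(tf_buffer_)
  {
    pub_ = create_publisher<sensor_msgs::msg::PointCloud2>(
      "scan_cloud_map", rclcpp::SensorDataQoS());
    sub_ = create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS(),
      [this](const sensor_msgs::msg::LaserScan::SharedPtr scan) {
        sensor_msgs::msg::PointCloud2 cloud;
        try {
          // Project the scan and transform it into the map frame in one call.
          projector_.transformLaserScanToPointCloud("map", *scan, cloud, tf_buffer_);
        } catch (const tf2::TransformException & ex) {
          RCLCPP_WARN(get_logger(), "Could not transform scan: %s", ex.what());
          return;
        }
        pub_->publish(cloud);
      });
  }

private:
  laser_geometry::LaserProjection projector_;
  tf2_ros::Buffer tf_buffer_;
  tf2_ros::TransformListener tf_listener_;
  rclcpp::Publisher<sensor_msgs::msg::PointCloud2>::SharedPtr pub_;
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr sub_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ScanToMapCloud>());
  rclcpp::shutdown();
  return 0;
}

The obstacle layer observation source is then pointed at scan_cloud_map with data_type: PointCloud2 instead of the raw scan topic.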

The result of temporary solution

The CPU usage of controller_server and planner_server decreases and stays stable (no abrupt increases). Screenshot from 2021-05-11 16-32-07

Three points that may cause this problem, and debugging status

  1. projector_.transformLaserScanToPointCloud(message->header.frame_id, *message, cloud, *tf_) consumes most of the CPU in the callback function -- after transforming the scan to a pointcloud (but not transforming the data from laser_link to map before publishing), the bug is still alive
  2. buffer->bufferCloud(cloud) -- debugging ongoing
  3. The QoS buffer size is larger -- after decreasing the buffer size, the problem still remains

Co-workers

@tal-grossman @Andersdu @duanyunxin

Although the temporary solution can work around this problem, we still could not find what causes it inside the obstacle_layer plugin. If anyone has ideas about this issue, we are looking forward to your reply. Thanks.

SteveMacenski commented 3 years ago

What's your compute platform?

I'm looking at htop on main branch and seeing only 5.4% static CPU load. The only thing over 10% is Rviz and Gazebo on my 7th gen i5 laptop.

Version or commit hash: From source

I'm sure you know better, but you did build it with release flags, correct?

If you've identified the issue with the laser projector, should you file a ticket with laser geometry instead? I don't see, however, any changes in laser geometry in over a year, so I don't think that's probably actually the issue. Have you run GDB over this to see specifically the offending parts of the controller server when you see this spike?

wilsionhuang commented 3 years ago

Thanks for your reply. The additional information is provided below:

Hardware platform: X3399 development board
Nav2 version: Foxy, built from source

SteveMacenski commented 3 years ago

Please answer all the questions & profile where the slowdown is happening so we can potentially address it.

wilsionhuang commented 3 years ago

If we publish the laser scan data directly, obstacle_layer.cpp subscribes to the scan topic and processes the scan data in the laserScanCallback function. There are two main steps.

1. Transform the scan data to a pointcloud with projector_.transformLaserScanToPointCloud(message->header.frame_id, *message, cloud, *tf_)
2. Transform the pointcloud to the global frame and buffer it with buffer->bufferCloud(cloud)

But after we transform the scan data to a pointcloud and transform the pointcloud from laser_link to map before publishing, obstacle_layer.cpp calls the pointCloud2Callback function instead. It has only one step.

1. Transform the pointcloud to the global frame and buffer it with buffer->bufferCloud(cloud)

Note: as we already transformed the pointcloud to map before publishing, bufferCloud(cloud) no longer does any real transformation.

Publishing a pointcloud sends the data to the buffer directly, while publishing laser scan data requires transforming it to a pointcloud and transforming that pointcloud to map first, and only then sending it to the buffer. Since this solution actually decreases the CPU usage of controller_server and planner_server, which both load the obstacle layer plugin (as mentioned above), we have reason to believe **projector_.transformLaserScanToPointCloud and bufferCloud(cloud)** may cause this problem.
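
For reference, here is a rough sketch (not the actual nav2_costmap_2d source) of what the "transform to the global frame" step amounts to for each incoming cloud. When the cloud is already published in the global frame, that transform is essentially an identity, which is consistent with the CPU drop we observed. Header and function names are the Foxy-era tf2 ones and should be treated as assumptions:

#include <string>

#include "sensor_msgs/msg/point_cloud2.hpp"
#include "tf2/time.h"
#include "tf2_ros/buffer.h"
#include "tf2_sensor_msgs/tf2_sensor_msgs.h"  // tf2::doTransform() for PointCloud2

// Re-express a cloud in the global frame -- roughly what the observation
// buffering step does for every message (the real observation buffer also
// filters the points and stamps the observation). Uses the latest available
// transform for brevity; the real code uses the cloud's own timestamp.
sensor_msgs::msg::PointCloud2 toGlobalFrame(
  const sensor_msgs::msg::PointCloud2 & cloud_in,
  tf2_ros::Buffer & tf_buffer,
  const std::string & global_frame)
{
  const auto transform = tf_buffer.lookupTransform(
    global_frame, cloud_in.header.frame_id, tf2::TimePointZero);

  sensor_msgs::msg::PointCloud2 cloud_out;
  tf2::doTransform(cloud_in, cloud_out, transform);
  return cloud_out;
}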

If anyone has a different idea, please post it so we can discuss. Thanks.

SteveMacenski commented 3 years ago

Without the ability to reproduce or profiling to find out where exactly this issue lies, I don't think there's anything we can do to help you. If you're right that it is an issue with the laser geometry package, you should see if you can find where some update was made resulting in an issue, but from my checking, I did not find any changes in over a year that would have impacted you.

Overall, my instinct is to close this issue as wontfix unless we can diagnose it further or others report it as well.

SteveMacenski commented 3 years ago

Have you had a chance to debug this further to see if there's something we can do?

wilsionhuang commented 3 years ago

Yes, I'll keep debugging this problem, and if I make any progress, I'll post it here to discuss.

charlielito commented 3 years ago

> What's your compute platform?
>
> I'm looking at htop on main branch and seeing only 5.4% static CPU load. The only thing over 10% is Rviz and Gazebo on my 7th gen i5 laptop.
>
> Version or commit hash: From source
>
> I'm sure you know better, but you did build it with release flags, correct?
>
> If you've identified the issue with the laser projector, should you file a ticket with laser geometry instead? I don't see, however, any changes in laser geometry in over a year, so I don't think that's probably actually the issue. Have you run GDB over this to see specifically the offending parts of the controller server when you see this spike?

Before seeing this issue I didn't bother to check the performance of nav2 on my machine, but it seems this issue also happens to me on a 10th gen i7 laptop. The controller and planner servers consume about 70-80% each. I am running ROS 2 Foxy from binaries and navigation2 also from binaries. What do you mean by release flags? Should those flags compile a more performant version? I also tested compiling nav2 and got more or less the same consumption (I don't know if it was with release flags).

SteveMacenski commented 3 years ago

Release flags mean building with the CMake flags for a release build rather than a debug build. Please refer to the colcon / CMake docs for that info.

Interesting -- do you have some time to profile this to see where the CPU is being used up? If you have a lot of sensors and costmap layers, it can use up that much CPU, but that shouldn't be the baseline for just a sensor or two.

charlielito commented 3 years ago

Sure, I am keen to dig into that. I tried valgrind to profile the controller server, but the process got so slow that I was not able to navigate because of an action timeout. I am also a little lost on how to interpret the result file with kcachegrind. Do you have a tool you recommend for profiling ROS code?

callgrind.zip

SteveMacenski commented 3 years ago

Unfortunately, I do not have the cycles in the immediate future to analyze this issue. I'd welcome a PR to resolve any concerns but for the time being this is not something I can address in detail

SteveMacenski commented 3 years ago

Did you discover anything actionable?

BriceRenaudeau commented 3 years ago

Hello, we are having similar troubles on Galactic (CycloneDDS) with Nav2 built from source. We are using TEB as the controller. The CPU usage of controller_server is not stable. For no apparent reason it gets high (>120%), and the robot becomes very laggy. The CPU consumption changes even if the robot is not moving and without sending action calls to the navigation stack.

I don't know if it is the cause or the consequence, but we also get a warning from the local_costmap saying a laser message is dropped because the queue is full.

The flame graph of controller_server while it is at 120% on our robot (not moving) looks like this: flame_graph_controller_server

SteveMacenski commented 3 years ago

TEB takes much more CPU than any of the other Nav2 controllers, so be aware that you're also selecting a controller that is known to require more compute than a typical one. This user is reporting this on both the planner and controller servers, so if that is not the case for you as well, this is probably unrelated. We have been debugging down to some PCL / conversion functions. Just because you are not moving the robot doesn't mean sensor data isn't being processed, which crunches CPU.

BriceRenaudeau commented 3 years ago

In normal conditions, TEB runs correctly using 20% CPU.

> Just because you are not moving the robot, doesn't mean sensor data isn't being processed which crunches CPU.

Yes, but it means we are not processing the servoing but only the local_costmap updates.

SteveMacenski commented 3 years ago

I'm not exactly sure what's actionable in your situation. Can you be specific as to your issue? I don't think this is related to the user's issue this ticket was filed under. I'm looking at that flame graph and I don't see anything in there actionable for Nav2. It's all TF and executors. You may want to instead file a ticket in TF2 if you can create a minimal example those maintainers can use to see the problem on their systems.

BriceRenaudeau commented 3 years ago

I posted it here because the controller_server node was the problematic one and the issue looks similar. But I don't know what triggers this issue; it is hard to reproduce consistently. Looking at the flame graph, it seems TF2 is taking up much more of the graph than it should, but I cannot figure out why.

SteveMacenski commented 3 years ago

That's something you should file with the TF2 repo, with a minimal reproduction example, if you want any hope of it being addressed. Though I'd recommend doing a bit more debugging first to get someone on the right track.

charlielito commented 3 years ago

I think I found a way to reproduce this issue to some extent (hopefully). Testing with the turtlebot3 in simulation, it sometimes happens that the CPU load goes up to 40% of one core for the controller server and 40% of one core for the planner server on my beefy 10th-gen i7 laptop while the robot is IDLE. They start out consuming very little CPU, then suddenly start to consume more, up to 40% each, then drop down again, and so on (this is kind of random, not consistent).

My Dockerfile

FROM ros:foxy

ENV DEBIAN_FRONTEND noninteractive
ENV ROS_DISTRO=foxy

# Install deps
RUN apt-get update && apt-get install --no-install-recommends -y \
    ros-${ROS_DISTRO}-navigation2 \
    ros-${ROS_DISTRO}-nav2-bringup \
    ros-${ROS_DISTRO}-turtlebot3* \
    htop \
    && apt autoremove -y && apt clean -y \
    && rm -rf /var/lib/apt/lists/*

ENV TURTLEBOT3_MODEL waffle
RUN rm -rf /opt/ros/${ROS_DISTRO}/share/turtlebot3_simulations \
    && git clone https://github.com/ROBOTIS-GIT/turtlebot3_simulations /opt/ros/${ROS_DISTRO}/share/turtlebot3_simulations

RUN ROS_SOURCE="source /opt/ros/${ROS_DISTRO}/setup.sh" \
    && echo $ROS_SOURCE >> "/root/.bashrc"
RUN GAZEBO_SOURCE="source /usr/share/gazebo-11/setup.sh;export GAZEBO_MODEL_PATH=\$GAZEBO_MODEL_PATH:/opt/ros/${ROS_DISTRO}/share/turtlebot3_simulations/turtlebot3_gazebo/models" \
    && echo $GAZEBO_SOURCE >> "/root/.bashrc"

And I am running

ros2 launch nav2_bringup tb3_simulation_launch.py use_rviz:=False headless:=True
ros2 topic pub -1 /initialpose geometry_msgs/PoseWithCovarianceStamped "{header: {frame_id: 'map'}, pose: {pose: {position: {x: -1.9, y: -0.5, z: 0.0}}, covariance: [0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.069]}}"

Kaetera commented 3 years ago

Hello, we have had similarly high usage on our i5 machine (since the beginning of the year, when we started the project).

The CPU usage reported by htop for controller_server and planner_server is stuck at >105%.

Inspired by the workaround of @wilsionhuang, I just reduced my laser scanner publish rate from 20 Hz to 8 Hz, and the CPU usage of controller_server and planner_server dropped from ~104% to ~80%.

> I'm looking at htop on main branch and seeing only 5.4% static CPU load. The only thing over 10% is Rviz and Gazebo on my 7th gen i5 laptop.

I thought it was just a HW performance issue, but now it seems that this usage is abnormal since we have similar platforms...

Notes:

SteveMacenski commented 3 years ago

How much RAM do you have and is that being saturated?

This is obviously an annoying problem, but we kind of need a bit more to go off of here. It seems like it may be related to DDS/network vendors (did you try Cyclone?) or TF2. If I were to guess in the dark, I'd say it has to do with Fast-DDS, since you are both reporting this issue with it and I know many Nav2 users have had better experiences with Cyclone. But with the flame graph, it could be TF -- though it's certainly more than possible we're just seeing the TF stuff as a result of it being the highest network-traffic tool in use, displaying a given DDS issue.

80% static load with no commands, even when processing a laser scan at 20 Hz, is ridiculously large. Could you also compile with debug flags and take a look at where the compute time is going? It would be good to know if this traces back to anything internal to Nav2 (costmaps, servers, etc.) or external (TF, DDS, the ROS 2 core).

Kaetera commented 3 years ago

RAM: Our Intel NUC i5 has 8 GB of RAM, with <1.1 GB used for our whole system (same usage for 8 Hz and 20 Hz scans)

DDS: With different DDS implementations, both with 20 Hz laser scans:

Much nicer with CycloneDDS indeed!

SteveMacenski commented 3 years ago

@JaimeMartin @EduPonz please take note of the discussion above: Fast-DDS in Foxy and Galactic is unable to run Nav2 effectively, and users are being forced to change to Cyclone to get reasonable performance. Flame graph / callgrind output can be seen above to potentially help in debugging.

SteveMacenski commented 3 years ago

@Kaetera @wilsionhuang @charlielito @BriceRenaudeau this is a lot of people running into this issue. I'm racking my brain for a solution here to help guide people in the right direction. I would appreciate your input on how we can make it more immediately evident that Nav2 is currently better serviced by Cyclone.

I would like to balance being clear that there are currently issues with Fast-DDS resulting in this problem with making it clear that we're not taking sides. I don't want to make some 'official declaration' that Cyclone-DDS is 'the official DDS of Nav2', but it would be nice to have a note that, across all currently supported ROS distributions, users have reported issues and that Cyclone's performance is better, at least for now.

JaimeMartin commented 3 years ago

@SteveMacenski how do you explain that @BriceRenaudeau is having exactly the same issue on Galactic (CycloneDDS)?

Are you able to create a simple reproducer demonstrating where the problem is?

SteveMacenski commented 3 years ago

This is a good question, I missed that. @BriceRenaudeau can you chime in potentially and confirm? Are your issues consistent with the callgrind output @charlielito provided (or do you think these are 2 separate issues)? Your statement "I don't know if it is the cause or the consequence, but we also get a warning from the local_costmap saying a laser message is dropped because the queue is full", given that the graph shows a lot of TF activity, makes sense -- clearing things aren't getting through. Tell me more about your sensor input rates / odometry publish rate / TF static transform rates.

JaimeMartin commented 3 years ago

@BriceRenaudeau could you just switch to Fast DDS in Galactic and run again your test?

@SteveMacenski it seems to be a complex problem, happening with two different versions of ROS 2 and two different DDS implementations.

BriceRenaudeau commented 3 years ago

> @BriceRenaudeau could you just switch to Fast DDS in Galactic and run again your test?

Sorry but FastDDS is incompatible with our system.

> @BriceRenaudeau can you chime in potentially and confirm?

We found out that a use_sim_time: true was hidden in the controller_server YAML file, creating issues in the progress_checker and TF because /clock wasn't published on the real robots. Since removing this error (10 days ago), we haven't had new occurrences of the high CPU usage on controller_server.

SteveMacenski commented 3 years ago

Ok. That seems to plausibly explain your issues, wrt Cyclone. Others in the ticket: can you verify this issue isn’t also what’s causing your issues wrt your DDS vendor?

SteveMacenski commented 3 years ago

@Kaetera @wilsionhuang @charlielito can you test if either of these work?

Kaetera commented 3 years ago

On Foxy (we only have this distribution):

The original issue of this thread may be a little different: instead of having the CPU constantly stuck at an abnormally high value (what several of us experienced), the original issuer had the CPU load varying and increasing abruptly. Maybe these are two different problems?

doisyg commented 3 years ago

What we experience with @BriceRenaudeau seems possibly a separate issue from the original post. It still happens, very rarely (once a week on a constantly navigating system), and I haven't found the root cause yet (controller_server or tf? or something outside cascading); see https://github.com/ros2/geometry2/issues/366 for somebody who seems to have the same issue as ours. I will open a separate ticket once I know a bit more.

doisyg commented 3 years ago

After a painful investigation, our issue (again, not sure if it is the same as the original) can be easily reproduced by setting a very small transform_tolerance in AMCL and stressing the CPU. What I believe happens is that the tf2_ros::MessageFilter gets filled with messages that cannot be processed due to the extrapolation error, and treating all these messages all the time causes very high CPU usage. Even after stopping the tool used to stress the CPU (and used to trigger the issue), the tf2_ros::MessageFilter queue stays full and continues to eat all the power available on a mobile i7 core. There is probably a bug in tf2_ros::MessageFilter that I cannot pinpoint. It also seems that rcutils_snprintf (used to convert a timestamp to a string) is quite inefficient. However, the problem can be managed by using a timeout on the message filter. See the PR here: https://github.com/ros-planning/navigation2/pull/2518 It solves the issue for us.
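
For anyone who wants to experiment with the idea outside of Nav2, below is a minimal sketch of giving a tf2_ros::MessageFilter a bounded wait. This is the concept behind the PR, not its actual diff; the node/topic/frame names, the 300 ms value, and the availability of the buffer_timeout constructor argument in your tf2_ros version are assumptions:

#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"
#include "message_filters/subscriber.h"
#include "tf2_ros/buffer.h"
#include "tf2_ros/create_timer_ros.h"
#include "tf2_ros/message_filter.h"
#include "tf2_ros/transform_listener.h"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("scan_filter_demo");

  // The buffer needs a timer interface so the message filter can wait for
  // transforms asynchronously, with a timeout.
  auto tf_buffer = std::make_shared<tf2_ros::Buffer>(node->get_clock());
  tf_buffer->setCreateTimerInterface(
    std::make_shared<tf2_ros::CreateTimerROS>(
      node->get_node_base_interface(), node->get_node_timers_interface()));
  tf2_ros::TransformListener tf_listener(*tf_buffer);

  message_filters::Subscriber<sensor_msgs::msg::LaserScan> scan_sub(node, "scan");

  // Queue of 50 scans waiting to become transformable into "map". The last
  // argument bounds how long a scan may wait for its transform; without it,
  // untransformable messages can pile up and be re-checked over and over.
  tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan> scan_filter(
    scan_sub, *tf_buffer, "map", 50,
    node->get_node_logging_interface(), node->get_node_clock_interface(),
    std::chrono::milliseconds(300));

  scan_filter.registerCallback(
    [node](const std::shared_ptr<const sensor_msgs::msg::LaserScan> & scan) {
      // At this point the scan is known to be transformable into the target
      // frame; a costmap layer would project and buffer it here.
      RCLCPP_INFO(node->get_logger(), "Scan from %s is transformable into map",
        scan->header.frame_id.c_str());
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}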

SteveMacenski commented 3 years ago

That sounds about right given what everyone else is reporting. That could plausibly explain why changing a wrongly set use_sim_time could impact things (the TF buffer is really having a bad time) and why changing DDS vendors could impact things (slightly more efficient handling of these messages, enough to stop a blockage from occurring in the first place).

I see that same rcutils_snprintf in the flame graph above. I think we mistakenly thought it had to do with our ability to transform the pointcloud using transformLaserScanToPointCloud, but in fact the issue arose before we even tried to transform: in the message filters, before the callbacks were even triggered to do that transformation.

From reading over this entire thread, that solution is consistent across all reported instances that I can see. I can plausibly conclude that that is indeed the core problem! It would be great if @wilsionhuang @charlielito @Kaetera were to test these PRs or the resulting merge into Nav2 and confirm they no longer experience this issue.

SteveMacenski commented 3 years ago

I'm linking the PR, I think this is the solution. Once merged, this ticket will autoclose, but I am happy (er um unhappy?) to reopen if folks still experience it

doisyg commented 3 years ago

FYI, I believe this issue also affects rviz