peci1 opened this issue 6 years ago (status: Open)
This sounds promising. I'll wait for others to chime in, but if there are some sort of tests or another method to prove that the C++ node performs equivalently to the Python node under the same usage, I would be in support of adding a C++ target with the name tf_remap and moving the Python script to tf_remap_legacy or similar. We could add a rosout warning for one distro to make users aware of the change. The only breaking use case I can think of off the top of my head is if a user launches the script with python /opt/ros/.../tf_remap. rosrun, roslaunch, and ./ should all work.
I would prefer bringing the C++ version upstream for wider distribution in either case. I should be available to help this process.
Certainly if we can get better performance from the C++ version it's definitely worth considering switching.
As this kind of timed out, I released the package into Melodic separately. However, I'm still open to the possibility of adding it to ros/geometry.
Should I still start a PR replacing the Python version?
I was wondering what's eating my CPU when replaying bagfiles, and I found out it's tf_remap. In the bagfile, I have TF messages at about 1-2 kHz, and the code I'm running spawns like 10 TF listeners. This is enough to saturate 2 CPU cores on my machine just by running the tf_remap node.
I verified that the remapping itself is not the bottleneck - the bottleneck really is `rospy.Publisher.publish()`, which is probably not able to publish at such a high rate from Python to that many subscribers.

So I've written an experimental C++ tf_remap, and the CPU load is much lower (about 30-40% of one CPU core). This package also solves a few more pain points of tf_remap, like the inability to work with tf_static, or the inability to remove TF frames.
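For context, the job such a remapper has to do is small; a simplified Python sketch (not the actual `tf/tf_remap` source - the real node takes its frame mappings from a ROS parameter, and the names here are just placeholders) looks roughly like this:

```python
#!/usr/bin/env python
# Simplified sketch of what a TF frame remapper does (not the real
# tf/tf_remap source): take TF messages from /tf_old, rename frame_ids
# according to a mapping, and republish them on /tf.
import rospy
from tf2_msgs.msg import TFMessage

# Placeholder mapping; the real node reads this from a ROS parameter.
MAPPINGS = {'old_frame': 'new_frame'}

rospy.init_node('tf_remap_sketch')
pub = rospy.Publisher('/tf', TFMessage, queue_size=100)


def callback(msg):
    for t in msg.transforms:
        t.header.frame_id = MAPPINGS.get(t.header.frame_id, t.header.frame_id)
        t.child_frame_id = MAPPINGS.get(t.child_frame_id, t.child_frame_id)
    # With TF coming in at 1-2 kHz and ~10 subscribers on /tf, this
    # publish() call is where the Python node spends most of its time.
    pub.publish(msg)


sub = rospy.Subscriber('/tf_old', TFMessage, callback)
rospy.spin()
```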
Now the question is: do we want to merge this package into geometry2 as a tool (not into geometry, since it's TF2-native)? Or should I rather release it as a standalone package? I assume that completely substituting the current Python node with the C++ one is not an option, for many reasons...
Another thing, once the previous question is solved: should a warning be added to `tf/tf_remap` that it's known to be inefficient, with a link to the C++ node in case the user observes performance problems?

Here's the code that can easily achieve the high-load situation I'm describing (note that `tf/static_transform_publisher` does not use `/tf_static`, but periodically publishes to `/tf`, in this case at 1 kHz):
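In outline, it amounts to the following (a rough, illustrative Python stand-in rather than the exact snippet; the frame names and listener count are arbitrary, and it assumes a `tf_remap` node is running separately to republish `/tf_old` onto `/tf`):

```python
#!/usr/bin/env python
# Illustrative load generator: publish a transform to /tf_old at 1 kHz
# (standing in for tf/static_transform_publisher with /tf remapped to
# /tf_old) and create ~10 TF listeners, all of which subscribe to the
# /tf topic republished by a separately running tf_remap node.
import rospy
import tf
from geometry_msgs.msg import TransformStamped
from tf2_msgs.msg import TFMessage

rospy.init_node('tf_remap_load_test')

# Roughly 10 listeners, i.e. ~10 subscribers on /tf.
listeners = [tf.TransformListener() for _ in range(10)]

pub = rospy.Publisher('/tf_old', TFMessage, queue_size=100)
t = TransformStamped()
t.header.frame_id = 'map_old'   # arbitrary frame names
t.child_frame_id = 'base_link'
t.transform.rotation.w = 1.0

rate = rospy.Rate(1000)  # 1 kHz, as in the bag file
while not rospy.is_shutdown():
    t.header.stamp = rospy.Time.now()
    pub.publish(TFMessage([t]))
    rate.sleep()
```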
For completeness, I attach profiler outputs from profiling the original Python node and the C++ node: profilers.zip. In Python, `thread.lock.acquire` takes almost all the self time, seconded by `write_data` with 1.5%. In C++, `pthread_mutex_lock` is also the most time-consuming call, but it still takes only 17% of the self time.