peci1 opened this issue 6 years ago (status: Open)
This sounds promising. I'll wait for others to chime in, but if there are some sort of tests or another method to prove that the C++ node performs equivalently to the Python node under the same usage, I would be in support of adding a C++ target with the name tf_remap and moving the Python script to tf_remap_legacy or similar. We could add a rosout warning for one distro to make users aware of the change. The only breaking use case I can think of off the top of my head is if a user launches the script with python /opt/ros/.../tf_remap. rosrun, roslaunch, and ./ should all work.
I would prefer bringing the C++ version upstream for wider distribution in either case. I should be available to help this process.
Certainly if we can get better performance from the C++ version it's definitely worth considering switching.
As this kind of timed out, I released the package into Melodic separately. However, I'm still open to the possibility of adding it to ros/geometry.
Should I still start a PR replacing the Python version?
I was wondering what's eating my CPU when replaying bagfiles, and I found out it's tf_remap. In the bagfile, I have TF messages at about 1-2 kHz, and the code I'm running spawns like 10 TF listeners. This is enough to saturate 2 CPU cores on my machine just by running the tf_remap node.
I verified that the remapping itself is not the bottleneck - the bottleneck really is `rospy.Publisher.publish()`, which is probably not able to publish at such a high rate from Python to that many subscribers.

So I've written an experimental C++ tf_remap, and the CPU load is much lower (about 30-40% of one CPU core). This package also solves a few more pain points of tf_remap, like the inability to work with tf_static, or the inability to remove TF frames.
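For context, the job such a remapper has to do is small; a simplified Python sketch (not the actual `tf/tf_remap` source - the real node takes its frame mappings from a ROS parameter, and the names here are just placeholders) looks roughly like this:

```python
#!/usr/bin/env python
# Simplified sketch of what a TF frame remapper does (not the real
# tf/tf_remap source): take TF messages from /tf_old, rename frame_ids
# according to a mapping, and republish them on /tf.
import rospy
from tf2_msgs.msg import TFMessage

# Placeholder mapping; the real node reads this from a ROS parameter.
MAPPINGS = {'old_frame': 'new_frame'}

rospy.init_node('tf_remap_sketch')
pub = rospy.Publisher('/tf', TFMessage, queue_size=100)


def callback(msg):
    for t in msg.transforms:
        t.header.frame_id = MAPPINGS.get(t.header.frame_id, t.header.frame_id)
        t.child_frame_id = MAPPINGS.get(t.child_frame_id, t.child_frame_id)
    # With TF coming in at 1-2 kHz and ~10 subscribers on /tf, this
    # publish() call is where the Python node spends most of its time.
    pub.publish(msg)


sub = rospy.Subscriber('/tf_old', TFMessage, callback)
rospy.spin()
```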
Now the question is: do we want to merge this package into geometry2 as a tool (not into geometry, since it's TF2-native)? Or should I rather release it as a standalone package? I assume that completely substituting the current Python node with the C++ one is not an option, for many reasons...
Another thing, once the previous question is solved: should a warning be added to `tf/tf_remap` that it's known to be inefficient, with a link to the C++ node in case the user observes performance problems?

Here's the code that can easily achieve the high-load situation I'm describing (note that `tf/static_transform_publisher` does not use `/tf_static`, but periodically publishes to `/tf`, in this case at 1 kHz):
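In outline, it amounts to the following (a rough, illustrative Python stand-in rather than the exact snippet; the frame names and listener count are arbitrary, and it assumes a `tf_remap` node is running separately to republish `/tf_old` onto `/tf`):

```python
#!/usr/bin/env python
# Illustrative load generator: publish a transform to /tf_old at 1 kHz
# (standing in for tf/static_transform_publisher with /tf remapped to
# /tf_old) and create ~10 TF listeners, all of which subscribe to the
# /tf topic republished by a separately running tf_remap node.
import rospy
import tf
from geometry_msgs.msg import TransformStamped
from tf2_msgs.msg import TFMessage

rospy.init_node('tf_remap_load_test')

# Roughly 10 listeners, i.e. ~10 subscribers on /tf.
listeners = [tf.TransformListener() for _ in range(10)]

pub = rospy.Publisher('/tf_old', TFMessage, queue_size=100)
t = TransformStamped()
t.header.frame_id = 'map_old'   # arbitrary frame names
t.child_frame_id = 'base_link'
t.transform.rotation.w = 1.0

rate = rospy.Rate(1000)  # 1 kHz, as in the bag file
while not rospy.is_shutdown():
    t.header.stamp = rospy.Time.now()
    pub.publish(TFMessage([t]))
    rate.sleep()
```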
For completeness, I attach profiler outputs from profiling the original Python node and the C++ node: profilers.zip. In Python, `thread.lock.acquire` takes almost all the self time, seconded by `write_data` with 1.5%. In C++, `pthread_mutex_lock` is also the most time-consuming call, but it still takes only 17% of the self time.