Closed: EricCousineau-TRI closed this 1 month ago
From an initial glance at `::insertData` and `::pruneList`, I assume `storage_` is sorted, so perhaps that can simply be changed to
Yep, it is sorted, so your suggestion is probably the first thing to try. If you'd like to give that a whirl and open a PR, I'd be happy to review it. Please target it at rolling
first, and we can backport from there.
P.S. I did look at trying a different data structure here at one point, but it turns out that given the very dynamic nature of tf2, it is actually not clear that another structure would be more performant than a linked list.
I confirmed that #677 appears to fix the performance regression on our end-to-end test, based off of `humble`.
However, I then tried to rebase on `rolling`, and didn't immediately understand the changes there. Will have to look a bit more closely.
Any chance you know what may have changed?
Ah, I see it's the refactorings from #658. Rereading it, that makes sense; it's just confusing that #658 wasn't backported.
Hm... #658 is nice in that it changed to using algorithms, but given the nuance of sorting + having an iterator + possibly duplicate timestamps, I'm struggling to fully reason about it.
Additionally, `getLatestTimestamp()` is dependent on `storage_.empty()`, but that logic gets a bit obscured.
My suggestion is to revert that refactoring change to `TimeCache::insertData()` back to relatively straightforward iterators on `rolling`.
Then (a) the logic about insertion is perhaps clearer, and (b) no need to backport #658.
Any objections / other suggestions?
Just a quick test: https://github.com/EricCousineau-TRI/repro/commit/f12a994eb29c733f258730216cfbb451ae1f4e49

Yeah, I think #658 may have other implications, e.g. possibly changing the sorting order.
Is it OK if we revert that portion of it?

Is it OK if we revert that portion of it?
I think we should hold off on that. What we should do is make that piece of code look like what we want on the `rolling` branch. Once we are happy with it from a performance and functionality point-of-view, we can figure out how to backport things.
In particular, that data structure has to hold the following properties:
Instead of reverting things, let's just make the code look like we want. I'm also going to suggest that we add in tests for all of these properties, because while we have some in https://github.com/ros2/geometry2/blob/rolling/tf2/test/cache_unittest.cpp , it looks like we could probably use more.
That sounds like a much better path, thanks!
The above properties seem like a good testable contract, but perhaps best to keep that relegated to internal components. I will try out either using something like `friend class` for testing, or just implementing a smaller container like `SortedCacheList` that can be more directly tested.
Any preference on either of those? (or another alternative?)
Proposal for implementation-specific testing: #678. This may go away if a more encapsulated / performant implementation arises.
Dumb q: are there existing performance tests for `tf2`?
Added a basic benchmark in #679
Not exactly sure of the right commands to run benchmarks correctly within colcon / ament. Just guessing around:

```sh
cd ros2_ws
# clone rolling ros2.repos into src/, do rosdep, checkout this branch on src/ros2/geometry
colcon build --packages-up-to tf2
source ./install/setup.bash
cd build/tf2
make -j cache_benchmark install && ./cache_benchmark --benchmark_time_unit=ms
```
With #679, I see the following:
```
--------------------------------------------------------------
Benchmark                 Time             CPU   Iterations
--------------------------------------------------------------
benchmark_insertion     921 ms          921 ms            1
```
If I revert the commits related to #676, i.e. `git revert --no-commit 875252656 1621942bc2` in the source tree:
```
--------------------------------------------------------------
Benchmark                 Time             CPU   Iterations
--------------------------------------------------------------
benchmark_insertion   0.395 ms        0.395 ms         1393
```
Note the significant performance impact due to 875252656 and 1621942bc2, which change both `insertData` and `pruneList`.
With #680, I now see timing that is only slightly slower than reverting:
```
--------------------------------------------------------------
Benchmark                 Time             CPU   Iterations
--------------------------------------------------------------
benchmark_insertion   0.502 ms        0.502 ms         1285
```
@EricCousineau-TRI, fantastic work! Thanks for fixing this!
I just had some minutes to look into this. Could you summarize what happened? I see many PRs, issues, and comments, and I'm a bit lost with the information flow (sorry, I'm quite dumb)
Additionally, can I do a benchmark before/after for our robots? We haven't seen any visible performance degradation so far, and 1 second looks like a lot of delay
EDIT: Why does the first benchmark only run for 1 iteration and the other for 1393?
Thanks!
Could you summarize what happened?
Done here: https://github.com/ros2/geometry2/pull/680#discussion_r1601922131
Additionally, can I do a benchmark before/after for our robots?
Certainly. It depends on what you may have for your robots, and how dense your tf2 graph is. You may not see the impact, which is perhaps why, per your #636 PR, you stated that it worked for y'all.
In our case, we may have a denser tf2 graph, and perhaps publish at a higher rate.
This was observed in a camera image subscriber that just so happened to have a `tf2_ros::Buffer` that was active during this, and starved a very high-rate loop that I was running to ensure I had good metrics on delays for image publish -> image receipt.
I wanted minimal delay + minimum-variance delay, as well as minimum variance in frequency given a target frequency, which I can achieve if I revert the related PRs, shorten our `tf2_ros::Buffer` duration to 500ms, and/or use a fixed version of `tf2` using #680.
[...] and 1 second looks like a lot of delay
This isn't strictly the delay you may see. Instead, per the benchmark, you can see it as "how expensive is it to insert 10s of data", and the result is "we spend 10% of time on `TimeCache::insertData()`" (921 ms spent inserting 10s worth of data is roughly 10% of that window), which is a very large amount for what should be a highly optimized operation.
Why does the first benchmark only run for 1 iteration and the other for 1393?

This is a facet of `google/benchmark`. From their documentation:
https://github.com/google/benchmark/blob/main/docs/user_guide.md#runtime-and-reporting-considerations
The number of iterations to run is determined dynamically by running the benchmark a few times and measuring the time taken and ensuring that the ultimate result will be statistically stable. As such, faster benchmark functions will be run for more iterations than slower benchmark functions, and the number of iterations is thus reported.
Bug report
Required Info:
- Version: `0.25.3-1jammy.20230721.203549` to `0.25.6-1jammy.20240217.053651` (Debian)
- DDS implementation: `rmw_cyclonedds_cpp`
- Client library: `rclcpp`
Steps to reproduce issue
This is an observed performance issue involving sim, control, and camera publishers / subscribers. We (@calderpg-tri and I) observed this by looking at delays for image receipt, and noticed that #636 (specifically the backport in bda8c499cc4dc3eb8ca8daccebeb10ae2e566c7d) causes an observable (significant?) delay from when an image is published to when it is received, while tf is concurrently being used.
If I use `0.25.6` but revert the given commit, I see good performance / "expected behavior" in terms of timing.

~~I do not yet have a minimal reproduction case.~~ See benchmark in #679, and results posted below in this issue.
Expected behavior
Negligible performance impact
Actual behavior
Observable (significant?) performance impact
Additional information
We should consider using something more performant than `std::find()` on a `std::list`. Not immediately sure what that'll be. From an initial glance at `::insertData` and `::pruneList`, I assume `storage_` is sorted, so perhaps that can simply be changed to

fyi @nachovizzo (\cc @clalancette @ahcorde)