ros2 / geometry2

A set of ROS packages for keeping track of coordinate transforms.
BSD 3-Clause "New" or "Revised" License

Performance regression due to #636 #676

Closed EricCousineau-TRI closed 1 month ago

EricCousineau-TRI commented 2 months ago

Bug report

Steps to reproduce issue

This is observed performance involving sim, control, and camera publishers / subscribers. We (@calderpg-tri and I) observed this by looking at delays for image receipt, and noticed that #636 (specifically the backport in bda8c499cc4dc3eb8ca8daccebeb10ae2e566c7d) causes an observable (significant?) delay from when an image is published to when it is received, while tf is concurrently being used.

If I use 0.25.6 but revert the given commit, I see good performance "expected behavior" in terms of timing.

~I do not yet have a minimal reproduction case.~ See benchmark in #679, and results posted below in this issue.

Expected behavior

Negligible performance impact

Actual behavior

Observable (significant?) performance impact

Additional information

We should consider using something more performant than std::find() on a std::list. Not immediately sure what that'll be. From an initial glance at ::insertData and ::pruneList, I assume storage_ is sorted, so perhaps that can simply be changed to:

bool should_insert = true;
// Scan all entries with the same timestamp (the list is sorted), and only
// insert if we did not find the exact same data.
while (storage_it != storage_.end() && storage_it->stamp_ == new_data.stamp_) {
  if (*storage_it == new_data) {
    should_insert = false;
    break;
  }
  ++storage_it;
}
if (should_insert) {
  storage_.insert(storage_it, new_data);
}

fyi @nachovizzo (\cc @clalancette @ahcorde)

clalancette commented 2 months ago

From initial glance at ::insertData and ::pruneList, I assume storage_ is sorted, so perhaps that can simple be changed to

Yep, it is sorted, so your suggestion is probably the first thing to try. If you'd like to give that a whirl and open a PR, I'd be happy to review it. Please target it at rolling first, and we can backport from there.

P.S. I did look at trying a different data structure here at one point, but it turns out that given the very dynamic nature of tf2, it is actually not clear that another structure would be more performant than a linked list.

EricCousineau-TRI commented 2 months ago

I confirmed that #677 appears to fix the performance regression on our end-to-end test, based off of humble. However, I tried to then rebase on rolling, and didn't immediately understand the changes there. Will have to look a bit more closely.

Any chance you know what may have changed?

EricCousineau-TRI commented 2 months ago

Ah, I see it's the refactorings from #658. Rereading, makes sense, just confusing that #658 wasn't backported.

EricCousineau-TRI commented 2 months ago

hm... it's nice that #658 changed to using algorithms, but given the nuance of sorting + having an iterator + possibly duplicate timestamps, I'm struggling to fully reason about it. Additionally, getLatestTimestamp() depends on storage_.empty(), but that logic gets a bit obscured.

My suggestion is to revert that refactoring change to TimeCache::insertData() back to relatively straightforward iterators on rolling. Then (a) the logic about insertion is perhaps clearer, and (b) no need to backport #658.

Any objections / other suggestions?

EricCousineau-TRI commented 2 months ago

Just a quick test: https://github.com/EricCousineau-TRI/repro/commit/f12a994eb29c733f258730216cfbb451ae1f4e49 Yeah, I think #658 may have other implications, e.g. changing the sorting order (possibly).

Is it OK if we revert that portion of it?

clalancette commented 2 months ago

Is it OK if we revert that portion of it?

I think we should hold off on that. What we should do is to make that piece of code look like what we want on the rolling branch. Once we are happy with it from a performance and functionality point-of-view, we can figure out how to backport things.

In particular, that data structure has to hold the following properties:

  1. It needs to hold an arbitrary number of entries, because we don't know the rate at which we will get data.
  2. The items need to be in sorted order by timestamp, from the newest entry to the oldest.
  3. New items need to be inserted into arbitrary positions, as timestamps may come in out-of-order.
  4. Duplicates shouldn't be allowed.
  5. Old items need to be pruned out.

Instead of reverting things, let's just make the code look like we want. I'm also going to suggest that we add in tests for all of these properties, because while we have some in https://github.com/ros2/geometry2/blob/rolling/tf2/test/cache_unittest.cpp , it looks like we could probably use more.
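As one way to see the five properties as a testable contract, here is a hypothetical sketch of a small container enforcing them (the name SortedCache and its members are invented for illustration; they are not tf2 code, and real entries would carry transform data, not just a stamp):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// Minimal cache keyed on timestamp, enforcing the five properties above.
class SortedCache {
public:
  // Properties 3 and 4: insert at the correct sorted position for
  // out-of-order arrivals, rejecting duplicates (keyed on stamp for brevity).
  bool insert(uint64_t stamp) {
    auto it = entries_.begin();
    while (it != entries_.end() && *it > stamp) {
      ++it;  // Property 2: newest-to-oldest order.
    }
    if (it != entries_.end() && *it == stamp) {
      return false;  // Property 4: no duplicates.
    }
    entries_.insert(it, stamp);  // Property 1: std::list grows arbitrarily.
    return true;
  }

  // Property 5: drop everything older than the cutoff.
  void prune(uint64_t oldest_allowed) {
    while (!entries_.empty() && entries_.back() < oldest_allowed) {
      entries_.pop_back();
    }
  }

  std::size_t size() const { return entries_.size(); }
  uint64_t latest() const { return entries_.front(); }

private:
  std::list<uint64_t> entries_;
};
```

Each property then maps directly to a unit test: out-of-order insertion, duplicate rejection, ordering after insertion, and pruning can all be asserted against this interface without reaching into the internals.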

EricCousineau-TRI commented 1 month ago

That sounds like a much better path, thanks!

The above properties seem like a good testable contract, but it's perhaps best to keep that relegated to internal components. I will try out either using something like friend class for testing, or just implementing a smaller container like SortedCacheList that can be more directly tested.

Any preference on either of those? (or another alternative?)
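For reference, here is one way the friend-class option could look (all names here are hypothetical stand-ins, not tf2 code): the cache keeps its storage private, and a single test-only "peer" class is granted access to it.

```cpp
#include <cassert>
#include <list>

class TimeCacheLike {
public:
  void insert(int stamp) { storage_.push_front(stamp); }

private:
  friend class TimeCacheTestPeer;  // test-only access to internals
  std::list<int> storage_;
};

// Defined alongside the tests; production code never instantiates it.
class TimeCacheTestPeer {
public:
  static const std::list<int> & storage(const TimeCacheLike & cache) {
    return cache.storage_;
  }
};
```

The trade-off versus a separate SortedCacheList-style container is that the friend approach tests the real class but couples the test to its internals, while the extracted container is testable through its public interface at the cost of an extra abstraction.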

EricCousineau-TRI commented 1 month ago

Proposal for implementation-specific testing: #678. This may go away if a more encapsulated / performant implementation arises.

EricCousineau-TRI commented 1 month ago

Dumb q: Are there existing performance tests for tf2?

EricCousineau-TRI commented 1 month ago

Added a basic benchmark in #679

Not exactly sure of the right commands to run benchmarks correctly within colcon / ament. Just guessing around:

cd ros2_ws
# clone rolling ros2.repos into src/, run rosdep, check out this branch in src/ros2/geometry2
colcon build --packages-up-to tf2
source ./install/setup.bash
cd build/tf2

make -j cache_benchmark install && ./cache_benchmark --benchmark_time_unit=ms

With #679, I see the following:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
benchmark_insertion        921 ms          921 ms            1

If I revert the commits related to #676, i.e. git revert --no-commit 875252656 1621942bc2 in the source tree:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
benchmark_insertion      0.395 ms        0.395 ms         1393

Note the significant performance impact due to 875252656 and 1621942bc2, which changed both insertData and pruneList.
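To illustrate why the whole-list std::find scan hurts so much, here is a small self-contained micro-benchmark (this is not the #679 benchmark, just a sketch of the mechanism): with in-order stamps arriving into a newest-first list, a full std::find per insert is O(N^2) overall, while stopping at the sorted position is O(N).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <list>

// Regression-style insert: scan the entire list for a duplicate every time.
void find_then_insert(std::list<int> & s, int stamp) {
  if (std::find(s.begin(), s.end(), stamp) == s.end()) {
    s.push_front(stamp);
  }
}

// Sorted-aware insert: for in-order stamps the scan stops at the front.
void sorted_insert(std::list<int> & s, int stamp) {
  auto it = s.begin();
  while (it != s.end() && *it > stamp) {
    ++it;
  }
  if (it == s.end() || *it != stamp) {
    s.insert(it, stamp);
  }
}

// Time n in-order insertions with the given insert strategy.
double time_inserts(int n, void (*insert)(std::list<int> &, int)) {
  std::list<int> storage;
  auto start = std::chrono::steady_clock::now();
  for (int stamp = 0; stamp < n; ++stamp) {
    insert(storage, stamp);
  }
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Even at a modest list size, the quadratic variant dominates, which matches the ~1000x gap between the two benchmark tables above.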

EricCousineau-TRI commented 1 month ago

With #680, I now see timing that is only slightly slower than reverting:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
benchmark_insertion      0.502 ms        0.502 ms         1285

nachovizzo commented 1 month ago

@EricCousineau-TRI, fantastic work! Thanks for fixing this!

I just had some minutes to look into this. Could you summarize what happened? I see many PRs, issues, and comments, and I'm a bit lost with the information flow (sorry, I'm quite dumb)

Additionally, can I do a benchmark before/after for our robots? We haven't seen any visible performance degradation so far, and 1 second looks like a lot of delay.

EDIT: Why does the first benchmark only run for 1 iteration and the other for 1393 ???

Thanks!

EricCousineau-TRI commented 1 month ago

Could you summarize what happened?

Done here: https://github.com/ros2/geometry2/pull/680#discussion_r1601922131

Additionally, can I do a benchmark before/after for our robots?

Certainly. It depends on what you may have for your robots, and how dense your tf2 graph is. You may not see the impact, which is perhaps why, per your #636 PR, you stated that it worked for y'all.

In our case, we may have a denser tf2 graph, and perhaps publish at a higher rate. This was observed in a camera image subscriber that happened to have an active tf2_ros::Buffer, which starved a very high-rate loop I was running to ensure I had good metrics on delays from image publish to image receipt. I wanted minimal delay + minimal variance in delay, as well as minimal variance in frequency given a target frequency, which I can achieve if I revert the related PRs, shorten our tf2_ros::Buffer duration to 500ms, and/or use a fixed version of tf2 via #680.

[...] and 1 second looks like a lot of delay

This isn't strictly the delay you may see. Instead, per the benchmark, you can see it as "how expensive is it to insert 10s of data", and the result is "we spend 10% of time on TimeCache::insertData()", which is a very large amount for what should be a highly optimized operation.

Why does the first benchmark only run for 1 iteration and the other for 1393

This is a facet of google/benchmark. From their documentation: https://github.com/google/benchmark/blob/main/docs/user_guide.md#runtime-and-reporting-considerations

The number of iterations to run is determined dynamically by running the benchmark a few times and measuring the time taken and ensuring that the ultimate result will be statistically stable. As such, faster benchmark functions will be run for more iterations than slower benchmark functions, and the number of iterations is thus reported.
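The adaptive behavior quoted above can be sketched as follows (this is an invented illustration of the idea, not google/benchmark's actual implementation): grow the iteration count until a run takes at least some minimum measurement time, then report the per-iteration average.

```cpp
#include <chrono>
#include <functional>

struct Result {
  long iterations;
  double seconds;  // total wall time of the final run
};

// Repeatedly run fn, multiplying the iteration count, until one run takes
// at least min_time_s. Fast functions therefore end up with many iterations,
// slow functions with few (possibly just one).
Result run_until_stable(const std::function<void()> & fn, double min_time_s) {
  long iters = 1;
  while (true) {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
      fn();
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    if (elapsed.count() >= min_time_s) {
      return {iters, elapsed.count()};
    }
    iters *= 10;
  }
}
```

This is why the 921 ms benchmark ran for 1 iteration while the 0.395 ms one ran for 1393: the slow run already exceeded the minimum measurement time on its first pass.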