ros / geometry2

A set of ROS packages for keeping track of coordinate transforms.

Fix wrong TF_REPEATED_DATA warnings when resetting sim time #542

Closed rhaschke closed 1 year ago

rhaschke commented 1 year ago

When using sim time and /clock has been reset, in-flight messages stamped on the old timeline (i.e. from before the reset) may still be inserted into the buffer. Once the restarted clock reaches those stamps again, newly arriving messages with the same timestamps will trigger TF_REPEATED_DATA warnings. Even worse, those new messages will be ignored. Thus, after a reset, messages with a time stamp in the future should be ignored.

This PR detects a reset within the tf2::BufferCore via an empty TimeCache and then rejects messages with a stamp 1s or more into the future compared to the current ROS time.

rhaschke commented 1 year ago

Looks like there are some failing tests now. Need to check later... I'm also not yet satisfied with the approach: handling "future messages" in BufferCore might have undesired effects long after the actual reset. Maybe I should rather handle this in the TransformListener and only discard "future messages" shortly (max 1s) after a time reset. This would require remembering the wall(?) time of the last reset.

rhaschke commented 1 year ago

The latter approach is much better: there is a 0.5s time window after a time reset during which incoming messages with a stamp in the future are discarded. Ready for review. When merging, the 3 commits should be squashed; they are only here to show the history of my different approaches.

tfoote commented 1 year ago

Can you explain a little bit more about your use case and concern? As I see it when you reset your system there's still potentially some transform messages in flight that might arrive after the TransformCache is reset. To that end you're adding a heuristic that will try to reject transforms received from the original timeline and not the reset timeline.

Repeated data would be correct in this case, but that's a smaller concern: if you get data from before the time jump, your cache time will potentially be way in the future and beyond the cache buffer size, and all older messages will be dropped until you get within the cache buffer size of the end of the loop (default 10 seconds). So a 5-minute loop would drop all messages until 4:50 into the loop, and then potentially get repeated data at the tail where the data overlaps.

Why did you pick Walltime 0.5 seconds as a backoff?

And future dated transforms are valid and regularly used so that's not going to be a good threshold.

It seems like it might be simpler to provide some downtime in your system to allow the old messages to be delivered before executing the time reset. This would be much simpler and would not inject magic-number heuristics into the algorithm that might surprise other people with different applications.

Alternatively you could just call reset on the listener 0.5 seconds after you've reset your system to make sure that there's nothing old in the cache.

rhaschke commented 1 year ago

When you reset your system, there are still potentially some transform messages in flight that might arrive after the TransformCache is reset. To that end you're adding a heuristic that will try to reject transforms received from the original timeline and not the reset timeline.

Exactly.

So a 5-minute loop would drop all messages until 4:50 into the loop.

Indeed, that's even worse! Hence, it is even more important not to insert those pending messages from the original timeline into the cache. While debugging the problem, I only considered loop sizes of 1s...5s, but I have seen TF_OLD_DATA errors for longer loops as well. Now I know their origin :wink:

Why did you pick WallTime of 0.5 seconds as a backoff?

I was assuming that 0.5s after a reset is large enough to accommodate all pending messages from the old timeline and small enough to not suppress future-dated transforms in "normal operation".

It seems like it might be simpler to provide some downtime in your system to allow the old messages to be delivered.

Yes, indeed, it might be a solution to just pause the simulated system a few milliseconds before publishing the reset time. The problem with this approach is that all sim-time providers need to be adapted in this fashion. Do Gazebo and rosbag already use such a mechanism? If so, yes, I should follow the same approach. If not, it would be better to implement the handling on the TransformCache side.

Alternatively you could just call reset on the listener 0.5 seconds after you've reset your system

How should I do this from another ROS node? The wiki mentions the topic /reset_time, but I couldn't find any mention of this topic in the sources!

rhaschke commented 1 year ago

We have implemented @tfoote's suggestion to pause our simulation a few milliseconds before publishing the reset time. Closing here.