ros / geometry2

A set of ROS packages for keeping track of coordinate transforms.

New warning message for repeated transforms #467

Closed ayrton04 closed 3 years ago

ayrton04 commented 4 years ago

I just released robot_localization for Noetic, and I'm noting that while the tests pass, we are getting spammed with an endless stream of this message:

Warning: TF_REPEATED_DATA ignoring data with redundant timestamp for frame base_link at time 1432235782.066014 according to authority unknown_publisher
         at line 278 in /tmp/binarydeb/ros-noetic-tf2-0.7.1/src/buffer_core.cpp

For background, this is a common scenario with r_l:

Cycle 1: receive a measurement from a sensor with time stamp t1. Fuse the measurement. Publish the state (and transform) at time t1.

Cycle 2: receive a measurement from a sensor with time stamp t2 < t1. Rewind the filter, insert the measurement from t2, fuse the measurement from t2 and then the measurement from t1. Publish the corrected state at time t1.
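The two cycles can be sketched in plain, non-ROS Python. Everything here is an illustrative stand-in (the class name, the averaging fusion step, the numbers), not robot_localization's actual implementation; the point is only that both cycles end up publishing with stamp t1:

```python
# Hypothetical sketch of the rewind-and-replay cycle described above:
# measurements are kept sorted by stamp, and an out-of-order arrival
# triggers a replay, after which the state is re-published carrying the
# stamp of the newest measurement.
class ReplayFilter:
    def __init__(self):
        self.measurements = []  # list of (stamp, value), kept sorted by stamp
        self.state = 0.0

    def _fuse(self, value):
        # stand-in for the real fusion step (e.g. an EKF measurement update)
        self.state = 0.5 * self.state + 0.5 * value

    def insert(self, stamp, value):
        """Insert a measurement, replay the filter, and return the stamp
        that the re-published state (and transform) would carry."""
        self.measurements.append((stamp, value))
        self.measurements.sort(key=lambda m: m[0])
        self.state = 0.0
        for _, v in self.measurements:
            self._fuse(v)
        return self.measurements[-1][0]
```

Note that inserting a late measurement (t2 < t1) still returns stamp t1, i.e. the corrected state is re-published with a timestamp that was already broadcast, which is exactly what now triggers TF_REPEATED_DATA.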

From the way things look, it seems that tf2 used to re-insert the transform, until this change:

https://github.com/ros/geometry2/commit/eefb50935bfd28223c87b3d708e916f7bcc4b4ca

That would have been helpful for the use case in question, as overwriting the first transform is what we'd want to do.

However, it was later changed:

https://github.com/ros/geometry2/commit/04625380bdff3f3e9e860fc0e85f71674ddd1587#diff-c25683741e1f7b4e50eb6c5cdcad9661R275

Now insertData returns false and prints out an error.

Either of the previous behaviors would be preferable to the new one. What are the ramifications of restoring the behavior from this commit?

SteveMacenski commented 4 years ago

+1 for reverting this or at minimum silencing that warning. There are plenty of legitimate reasons to update a transform at a given identical timestamp due to uncertainty in the system.

ayrton04 commented 4 years ago

Sorry for the direct mention, but do you have any thoughts on this, @tfoote?

doisyg commented 4 years ago

Same here, in favor of reverting this warning. I am getting flooded with it when using Cartographer.

ayrton04 commented 4 years ago

My first choice would be to get rid of the warning and overwrite the existing transform, if that's an option.

Hugal31 commented 4 years ago

I created PR #475 to fully restore the old behavior. Feel free to merge or close.

tfoote commented 4 years ago

The warning is there to tell you not to use it because it's not doing what you thought it was doing in the past.

Please see the discussions in #414 and #426 for why it was changed to drop rather than replace. The system is designed to propagate final estimates of positions. If you expect to revise an estimate, don't publish it until you're confident you don't need to revise it. The tf system is designed to be latency tolerant; I believe that most use cases would be better off waiting for the extra cycle than getting earlier, less accurate data. tf users will have to wait for the union of all their links to become available. If there are very tight latency requirements, they shouldn't go through tf but use one of the other channels, like the odom topic, which gives the latest information directly.

We can design an extension to the system to support mutable data and make sure there's consistency. But just trying to overwrite every timestamp by using timestamp alignment over a potentially lossy transport will lead to data inconsistencies and potentially very invalid values. Out-of-order delivery also makes this problematic, as the latest message "wins". So in most cases it might just work, but it's not a safe solution.

ayrton04 commented 4 years ago

Thanks for that. I have a better understanding of the change now.

While I can see the reason for not continually inserting the same transform, I have two concerns about not permitting the overwrite:

  1. IMO, there may be perfectly valid cases where less accurate data is better than old data. I'd be willing to bet there are a fair number of fielded systems relying on this very concept. Granted, if you just ask for the most recent transform, you'll still get a usable-if-slightly-inaccurate transform, so perhaps that doesn't change anything, except that the user now has to change code to request the most recent transform instead of the one at the correct time.

  2. This represents a pretty big change from every other ROS 1 release, and it's affecting more than one package. My obvious interest in robot_localization aside, it's also (according to @doisyg) affecting packages like Cartographer. I know that new releases are good times to make such changes, but if I had to guess, this change is going to lead to a fair number of downstream code changes. For example, I'll need to add a feature to r_l (which I should have added ages ago, to be fair) to lag the generation of the state estimate.

Also potentially out of order delivery makes this problematic as the latest one "wins".

This would be an ideal use for sequence numbers, but I know those were deprecated (and I understand why).

I like the idea of designing an extension to allow for mutable transforms. If that is something for which nobody has extra cycles, would you consider changing this behavior so that, at least in this instance, we change this to only issue a ROS_WARN_ONCE or something?

SteveMacenski commented 4 years ago

Also, there's a bit of an issue in assuming that we have knowledge of / control over when we're ready to publish a transform for a timestamp. If you have a constantly cycling thread updating a filter, you don't know at time t whether a new measurement will come in to update the filter to a new timestamp t + 1, or to refine the filter from a higher-latency data source for t. Waiting until t + 1 to publish t due to that uncertainty would inject latency throughout the entire system. I suppose it could be updated so that when the filter is rolled back to include the new information, we just don't update tf, but that doesn't seem like the best outcome.

ROS_WARN_ONCE -> or my suggestion is to just keep as is but silence it to debug throttled to 1 minute or so such that all the context is given to a user debugging a problem

doisyg commented 4 years ago

ROS_WARN_ONCE -> or my suggestion is to just keep as is but silence it to debug throttled to 1 minute or so such that all the context is given to a user debugging a problem

+1

tfoote commented 3 years ago

I suppose it could be updated so that when the filter is rolled back to include the new information, we just don't update tf but that doesn't seem like the best outcome.

Repeating my comment from https://github.com/ros/geometry2/pull/475#issuecomment-673688812

For comparison we've had a similar issue with extrapolation in the past. It's "easy" to mathematically allow extrapolation so you don't have to wait for all data to come in. But then it turns out that you amplify any errors in your system, and consequently all control algorithms just get worse performance. Nothing seemed wrong and we had much better latency, but after analyzing the performance with and without extrapolation what seemed like extra power was actually worse in every use case we tested.

The system is designed to handle latency. In a distributed system we know we have latency. You can choose to publish the refined pose for t at t+1, or you can publish the original for t and then publish t+1, which includes the internally revised value from t. Your example here of revising one time stamp is relatively simple, but that's just the tip of what might be considered. Looking at the bigger picture, if you start thinking about something like moving horizon estimation, you might be revising N timesteps in the past on each update. Now you could publish N timestamps back in history, not just 1. The estimate back N timesteps will be "more accurate", but what's the latency? And how can any algorithm rely on the history? For localization we have the solution of having your incremental updates come out (a la odometry) and then a correction which fixes any mistakes that the incremental, but unrevisable, odometry produced.

I'm open to adding throttling to the warning. But I want it to be loud. It should not be ignored. The insidious silent failures of the past are as bad or worse.

As I've said we can consider creating another better mutable transform. And analyze it for all the different use cases, and make sure that it's consistent across time and robust to data loss etc.

SteveMacenski commented 3 years ago

I think just warning once the first occurrence is sufficiently loud and it seems @ayrton04, @doisyg, and I like that option the best as an immediate patch.

While I understand your point, a change this substantial should have been opened up for discussion, it looks like it was merged day-of (https://github.com/ros/geometry2/pull/459) and now there's 6 users here commenting/emoji-ing (there must be a verb for that now...) that this is a real problem for us. There should be an option to turn this off if you want this protection by default. There are legitimate needs for not having this, as Tom points out.

You can choose to publish the refined pose for t at t+1 or you can publish the original for t and then publish t+1 which includes the internally revised value from t.

For instance, let's imagine you have a global positioning data source that gives you hyper-accurate corrections at irregular intervals, based on when it has visibility of a target over cellular in the forest. This data source is being fused into a program that is doing event-based fusion (new data comes in over a topic, triggers an update to TF). This filter has this positioning data coming in and also some odometry source at 1 Hz. At time 17, the filter updates from a new odometry message. At time 17.2, you get a message from your super-accurate infrastructure system that you crossed into line of sight momentarily, so you know exactly where you are, but you did so at time 17. Now, you want to update your view of the world to account for this drift correction, not just in local state, but for controllers currently running. You re-wind the filter, process that measurement, and now want to re-publish at time 17. That would fail. Then for the next 0.8 s your entire system has a degraded view of the world, which could be prevented. Eventually things would be caught up once the next odometry signal comes in to trigger an update at time 18. However, perhaps this is a robot in a sensitive application and that 800 ms could result in some severe failure modes. Obviously this is an extreme example and not a very well designed system, but it's the first example that came to mind where this could be especially helpful to have.

I agree for most systems this should be fine over a longer time horizon (after reducing logging), however if you have very sensitive equipment, like that in which ROS2 is supposed to be able to support now, something like this could be a show stopper.

AustinDeric commented 3 years ago

I am seeing this issue and it seems to be isolated to the melodic-to-noetic upgrade. Maybe the issue should be addressed in robot_localization and not geometry2. However, robot_localization is saying geometry2 should be updated: https://github.com/cra-ros-pkg/robot_localization/issues/574. There seems to be an impasse which I hope gets resolved soon.

tfoote commented 3 years ago

While I understand your point, a change this substantial should have been opened up for discussion, it looks like it was merged day-of (#459) and now there's 6 users here commenting/emoji-ing (there must be a verb for that now...) that this is a real problem for us. There should be an option to turn this off if you want this protection by default. There are legitimate needs for not having this, as Tom points out.

This is not helpful. The specific PR you point to calls out in its description that it's just a rebase of the long-standing PR https://github.com/ros/geometry2/pull/426. It was specifically targeted to land only in the new rosdistro release, and it was given a strong warning to make it clear to downstream developers/users that they should pay attention here and fix the broken usage.

For instance, let's imagine you have a global positioning data source that gives you hyper-accurate corrections at irregular intervals, based on when it has visibility of a target over cellular in the forest. This data source is being fused into a program that is doing event-based fusion (new data comes in over a topic, triggers an update to TF). This filter has this positioning data coming in and also some odometry source at 1 Hz. At time 17, the filter updates from a new odometry message. At time 17.2, you get a message from your super-accurate infrastructure system that you crossed into line of sight momentarily, so you know exactly where you are, but you did so at time 17. Now, you want to update your view of the world to account for this drift correction, not just in local state, but for controllers currently running. You re-wind the filter, process that measurement, and now want to re-publish at time 17. That would fail. Then for the next 0.8 s your entire system has a degraded view of the world, which could be prevented. Eventually things would be caught up once the next odometry signal comes in to trigger an update at time 18. However, perhaps this is a robot in a sensitive application and that 800 ms could result in some severe failure modes. Obviously this is an extreme example and not a very well designed system, but it's the first example that came to mind where this could be especially helpful to have.

This is a bit of a contrived example to hit precisely the use case you call out. Your example of just updating one point is very unlikely if you have a high-precision, high-latency system, but it's still coming in within one cycle of the odometry. Odometry is typically published much faster to allow for any of these issues. Similarly, following REP 105, odometry should never jump. And really this is the standard setup that we use all the time to differentiate odometric updates from localization updates.

The more powerful use case is for something like loop closure for SLAM where the full localization history will want to be revised and I would be quite happy to brainstorm a design to support that. I've got some good ideas after doing some thought experiments. However, that's well out of the scope of this discussion.

The robot_state_publisher and other implementations should be changed to not try to overwrite previously published data. What they thought was happening was not happening. Instead it was producing memory growth in all tf listeners on the system, causing discontinuities in the data, and creating potential race conditions across the distributed system. All of this silently passed and was likely degrading the experience of all users. This warning has shown the places where this misuse of the API was causing problems, and I'm fine with throttling the warning. But the warning is there for a reason: to let you know that it's not doing what you thought it was doing. It never did, and now it's warning about it.

As I've said, I'm ok with throttling the warning if the volume is very high. We definitely shouldn't warn only once, because this may be an ongoing issue triggered by a remote node's activity, and it may be triggered by more than one node at different times during runtime. And if you're debugging the behavior, the warning should be displayed at the same time as the behavior is happening.

SteveMacenski commented 3 years ago

Hi, Tully and I got on the phone and we decided on the following compromise:

* Throttled debug logs about this error (every 10 seconds?) so that if looking through debug logs, you can clearly see these errors, but they are hidden from terminal spam for most users.

* A one-time error with a link to this ticket on the first instance.

* Work on adding a patch to R_L, if possible, to turn off this behavior. This may not be possible due to the architecture, and frankly neither Tom nor I have the time to re-architect R_L for this change.

Anyone have an issue with that compromise?

ros-discourse commented 3 years ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-agenda-2020-09-17/16464/1

peci1 commented 3 years ago

I'll just add our bit of view. We've got a mobile robot with 50 Hz odometry and 0.3 Hz SLAM. These pretty slow SLAM updates constantly forced us to really think things through. What we ended up with is a TF tree like this:

map ---- odom ---- base_link
              \___ map_fast

The structure is not ideal (map_fast would better be visualized as another parent of odom, but we need to keep the tree structure). map_fast is a frame created by taking the last known map->odom transform and composing it with the last known odom->base_link transform, and it gets the timestamp of the faster odom->base_link TF. It is a kind of jumping odometry which, however, should still be more or less globally consistent.
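For illustration, here is a hypothetical 2D sketch of that composition; real tf transforms are 3D with quaternions, and the frame values and the compose helper below are invented for the example:

```python
import math

# Compose the last known map->odom transform (slow SLAM, 0.3 Hz) with the
# latest odom->base_link transform (fast odometry, 50 Hz) to get a globally
# consistent pose stamped at the odometry rate, as the map_fast frame does.
def compose(a, b):
    """Compose two 2D transforms (x, y, yaw): returns a applied before b."""
    ax, ay, ath = a
    bx, by, bth = b
    return (ax + bx * math.cos(ath) - by * math.sin(ath),
            ay + bx * math.sin(ath) + by * math.cos(ath),
            ath + bth)

last_map_to_odom = (10.0, 5.0, 0.0)        # last slow SLAM correction
latest_odom_to_base = (1.0, 0.0, math.pi)  # newest fast odometry pose

# map->base via map_fast: stamped with the newer odometry timestamp
map_fast_pose = compose(last_map_to_odom, latest_odom_to_base)
```

Because map_fast reuses the stale map->odom until the next SLAM update, it can jump when that update arrives, which is the "jumping odometry" behavior described above.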

This allows nodes to select whether they are interested in precise transforms (by transforming into map), or need something that's fast but may be slightly inaccurate for a short time (by transforming into map_fast).

I know this doesn't solve all the possible use-cases, but it seems to me it is in our case a more "correct" approach than overwriting the history. And it actually forced us into thinking which nodes need fast and which need precise data. With overwriting history, we'd never know which we actually got...

SteveMacenski commented 3 years ago

That structure doesn't make a whole lot of sense to me; however, I don't want to digress into that on this issue.

Thanks for the datapoint from another user. Do you have any objections with our proposal though?

peci1 commented 3 years ago

The proposal seems okay to me.

Maybe this issue points to some places in TF tutorials that could be improved to better explain what is and is not possible to do within the framework...

ayrton04 commented 3 years ago

I think the proposal sounds really solid, thanks @tfoote and @SteveMacenski. Meanwhile, I have a PR that prevents r_l from re-publishing data with duplicate timestamps.

SteveMacenski commented 3 years ago

I'm not precisely sure of the mechanics to make that happen. https://github.com/ros/geometry2/pull/459/files#diff-23f1780342d49d752e5fabba0fa7de1cR278 It looks like that logError is shared by other error cases we should probably keep. Unless @tfoote you think we can change that to a CONSOLE_BRIDGE_logDebug?

doisyg commented 3 years ago

Hi, Tully and I got on the phone and we decided on the following compromise:

* Throttled debug logs about this error (every 10 seconds?) so that if looking through debug logs, you can clearly see these errors, but they are hidden from terminal spam for most users.

* A one-time error with a link to this ticket on the first instance.

* Work on adding a patch to R_L, if possible, to turn off this behavior. This may not be possible due to the architecture, and frankly neither Tom nor I have the time to re-architect R_L for this change.

Anyone have an issue with that compromise?

I think it is a good compromise. Throttling to 10 s is fine and will not flood the console anymore. However, if we want users to notice it, maybe leave it at ROS_INFO or ROS_WARN?

machinekoder commented 3 years ago

I'm seeing the same error with pristine robot_state_publisher after recompiling moveit. Pretty weird.

machinekoder commented 3 years ago

Might this be the result of multiple TransformStamped in the transforms message with the same timestamp (but different frames)? If so this seems like a bug to me.

machinekoder commented 3 years ago

Okay, looks like the error is coming from my moveit branch and its internal use of tf2.

RyanDraves commented 3 years ago

If the error is appearing in your own code, the fix I found was to store the previous timestamp published by a TransformBroadcaster and only publish if the new timestamp is different. I understand this isn't a solution for a lot of software/packages using TF2, but I had a hard time finding a simple fix for simple code, so here it is.

If you are publishing your own transforms at a sufficiently fast rate, say from a Gazebo plugin, ros::Time::now() will produce these matching timestamps extremely quickly.
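That guard can be sketched as a small wrapper. The names are illustrative, and the broadcaster argument stands in for a real tf2_ros.TransformBroadcaster so the sketch runs without ROS:

```python
# Sketch of the fix described above: remember the last stamp sent per child
# frame and skip exact duplicates, so the same (frame, stamp) pair is never
# broadcast twice.
class DedupBroadcaster:
    def __init__(self, broadcaster):
        self._broadcaster = broadcaster  # e.g. tf2_ros.TransformBroadcaster()
        self._last_stamp = {}            # child_frame_id -> last stamp sent

    def send(self, child_frame_id, stamp, transform):
        if self._last_stamp.get(child_frame_id) == stamp:
            return False                 # duplicate: would trigger TF_REPEATED_DATA
        self._last_stamp[child_frame_id] = stamp
        self._broadcaster.sendTransform(transform)
        return True
```

This is essentially the approach robot_localization adopted as well.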

ayrton04 commented 3 years ago

Yes, that was also how we solved it in robot_localization.

arunumd commented 3 years ago

I have the same issue. In my case, I do not use the robot_localization package; instead I use my own custom localization node based on a particle filter. When I manually build the transform and try to visualize it in Rviz, I get warnings very similar to those seen by people who use the robot_localization package. The repeated warnings quickly overwhelm my terminal.

[ WARN] [1612376984.688112923]: TF_REPEATED_DATA ignoring data with redundant timestamp for frame lidar_base at time 365481490.000000 according to authority unknown_publisher
[ WARN] [1612376984.688172776]: TF_REPEATED_DATA ignoring data with redundant timestamp for frame base_link at time 365481490.000000 according to authority unknown_publisher
dmorris0 commented 3 years ago

Just wondering, as a ROS user, how do I turn down the warning rate in my terminal?

I am getting the repeated warnings:

Warning: TF_REPEATED_DATA ignoring data with redundant timestamp for frame wheel_right_link at time 3539.636000 according to authority unknown_publisher
         at line 278 in /tmp/binarydeb/ros-noetic-tf2-0.7.5/src/buffer_core.cpp
peci1 commented 3 years ago

how do I turn down the warning rate

You cannot change the rate. You can turn on/off whole sets of messages from different nodes. See http://wiki.ros.org/rosconsole#Configuration-1 .

dmorris0 commented 3 years ago

That helps a bit, although from that wiki it is not clear how to change the config file when my ROS_ROOT is /opt/ros/noetic/share/ros.
I was also not successful updating the launch file with something like this:

 <env name="ROSCONSOLE_FORMAT" value="[${severity}] [${time}] [${node}]: ${message}"/>

It did not change the warning.

Any thoughts?

RyanDraves commented 3 years ago

The documentation above specifies that the ROSCONSOLE_CONFIG_FILE variable will let you set your own configuration file. Regardless, you can also resolve the repeated printouts with any of the solutions above. Once I understood that rebroadcasting a transform under the same timestamp caused the issue, it was not too difficult to fix.

dmorris0 commented 3 years ago

Here is what I did to attempt to stop the warnings, which I see are coming from line 278 of buffer_core.cpp in the tf2 package: (1) created a .conf file with this:

log4j.logger.ros=INFO
log4j.logger.ros.tf2=ERROR

(2) Set ROSCONSOLE_CONFIG_FILE to the full pathname of this .conf file. (3) Ran my package.

But as you see I still get repeated warnings

Warning: TF_REPEATED_DATA ignoring data with redundant timestamp for frame wheel_right_link at time 1996.807000 according to authority /robot_state_publisher
         at line 278 in /tmp/binarydeb/ros-noetic-tf2-0.7.5/src/buffer_core.cpp

Any thoughts on what I did incorrectly?

peci1 commented 3 years ago

It's possible that the messages are published under the namespace of the node where you actually use the TF buffer. Launch rqt_console to display the messages, and when you double-click one of them, you should see the name of the logger.

dmorris0 commented 3 years ago

Ah, good point. Those messages don't appear in the rqt_console, just on the terminal. Not sure why. I took a look at the source code: http://docs.ros.org/en/jade/api/tf2/html/buffer__core_8cpp_source.html, and it is calling logWarn(...) at line 277.

The problem for me is that I want to use the terminal for other IO, and it is being swamped by these messages.

peci1 commented 3 years ago

1>/dev/null then?

dmorris0 commented 3 years ago

Ah: 2>/dev/null solves it! I still get the loginfo messages, but the warning messages are killed!

Thanks very much @peci1

ayrton04 commented 3 years ago

I'm closing this issue, because we have a solution, which is to ensure that nodes do not re-broadcast a given transform with the same timestamp as a previously broadcast transform. IIRC, tf2 isn't using the second transform anyway, so there's no point in re-broadcasting it.

cosmicog commented 3 years ago

This can be really annoying since it's coming from the core and doesn't respect ROSCONSOLE_FORMAT='[${severity}] [${node}]: ${message}' and the publisher is unknown_publisher:

...
Warning: TF_REPEATED_DATA ignoring data with redundant timestamp for frame r0/odom at time 22.829000 according to authority unknown_publisher
         at line 278 in /tmp/binarydeb/ros-noetic-tf2-0.7.5/src/buffer_core.cpp
[ WARN] [/r0/rviz]: TF_REPEATED_DATA ignoring data with redundant timestamp for frame r0/odom at time 22,829000 according to authority unknown_publisher
Warning: TF_REPEATED_DATA ignoring data with redundant timestamp for frame r0/odom at time 22.829000 according to authority unknown_publisher
...

Disabling the rviz node's output would help, but I see this both in Gazebo and when working on the real robot.

Temporarily, I did this on my fork to keep it from invading my terminal, since the error strings are hardcoded:

if (error_string[3] != 'R') // "TF_REPEATED_DATA ignorin..."
{
  CONSOLE_BRIDGE_logWarn((error_string+" for frame %s at time %lf according to authority %s").c_str(), stripped.child_frame_id.c_str(), stripped.header.stamp.toSec(), authority.c_str());
}
return false;

TBH, I'm lost. I thought this issue was related to diff_drive_controller's realtime publisher, since it is publishing the robot_id/odom frame (above, in the terminal output), but I couldn't spot the actual reason, so I don't know where to look.

tpet commented 3 years ago

I've settled on this "solution": 2> >(grep -v TF_REPEATED_DATA)

tpet commented 2 years ago

I decided to add a few thoughts to this as I encounter this same problem again and again. I think that these recent changes, that is, (1) not being able to update the transform for a given timestamp and (2) issuing these TF_REPEATED_DATA warnings, which cannot be disabled, somewhat contradict the general ROS/tf design (as advertised). As far as I know, tf should work in a distributed system where message delivery is not guaranteed. It seems unfortunate not to be able to re-transmit a message in such a setting without getting warnings.

Also, in general all past transforms can be modified when a new message arrives, due to the interpolation employed in lookup; the only exception is the transform at the exact timestamps previously set. It seems a little counter-intuitive for this case to go against that general rule. And currently there seems to be no interface to check whether there is a transform at exactly a given time, or is there?

Another use case where this becomes a problem is when I want to set a transform into the buffer before sending it. When I compute a transform and want to use it immediately as part of the tree, I set it into the buffer manually. Then I publish it, and a warning follows... I'm not sure if I'm alone in this, but I would like this to be reconsidered.

eborghi10 commented 2 years ago

I personally recommend this way of fixing it:

# Initialize last_published_timestamp, which tracks the last time a TF was broadcast
last_published_timestamp = rospy.Time(0)
# Fill in the tf_message stamp
tf_message.header.stamp = rospy.Time.now()  # or choose another time
# Only publish if the current stamp differs from the last one
if last_published_timestamp < tf_message.header.stamp:
    tf_broadcaster.sendTransform(tf_message)
    # Update the tracked stamp
    last_published_timestamp = tf_message.header.stamp
tkazik commented 2 years ago

Sorry for keeping this closed issue alive: while playing back a bag containing transforms (with the -l option) and visualizing the transforms in rviz, we also get this warning. After one loop, these warnings show up, rviz then disregards the tfs, and the frames fade away. Any ideas on how to deal with that? What is the correct way to do it?

tfoote commented 2 years ago

When the bag loops all instances of the tf buffer need to be reset. Make sure that you're publishing simulated time from the bag too. Most things like rviz should be detecting the published time going backwards and do this automatically.
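A sketch of that reset logic, assuming a listener that watches the incoming (simulated) clock; clear_buffer stands in for the actual buffer reset (tf2::BufferCore::clear() in C++), and the class name is invented for the example:

```python
# Detect simulated time jumping backwards (a bag loop) and reset the tf
# buffer so stale transforms from the previous pass are dropped.
class LoopDetector:
    def __init__(self, clear_buffer):
        self._clear = clear_buffer  # callable that clears the tf buffer
        self._last = None           # last clock stamp seen

    def on_clock(self, stamp):
        # time going backwards means the bag looped: clear cached transforms
        if self._last is not None and stamp < self._last:
            self._clear()
        self._last = stamp
```

Tools like rviz are expected to do this automatically when they detect the published time going backwards, as noted above.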

tkazik commented 2 years ago

I still get this warning, even if I play back the bags using rosbag play my_bag.bag -l --clock

When the bag loops all instances of the tf buffer need to be reset

How can I do that explicitly? This might also be related. Thanks for your insight!

kvnptl commented 1 year ago

2> >(grep -v TF_REPEATED_DATA buffer_core)

This one hides the entire warning message. Tested on ROS Noetic.

broughtong commented 1 year ago

2> >(grep -v TF_REPEATED_DATA buffer_core)

This one hides the entire warning message. Tested on ROS Noetic.

I would like to point out that solutions like this aren't really great, as they mess with terminal formatting (e.g. tqdm progress bars). Although, so do the original warnings being spammed to stderr.

Gunz-katomaran commented 1 year ago

Is there any other way to resolve this problem?

raabuchanan commented 1 year ago

2> >(grep -v TF_REPEATED_DATA buffer_core)

This one hides the entire warning message. Tested on ROS Noetic.

Can you explain how to use this? Typing roslaunch my_package my_launch_file.launch 2> >(grep -v TF_REPEATED_DATA buffer_core) does not work

souravsanyal06 commented 1 year ago

I have the same question as @raabuchanan .

kvnptl commented 1 year ago

The usage is correct. Are you getting the same warning message as mentioned in the original issue? If not, modify the command to match your repeated warning message.

Note: As mentioned above, be careful, it might mess up the terminal formatting.
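For anyone puzzled by the idiom itself: 2> redirects stderr, and >(...) is bash process substitution, so the pattern should normally be grep's only argument; a trailing word like buffer_core is treated by grep as a file name to read rather than part of the pattern. A stand-in demonstration (noisy is a fake command invented for this example):

```shell
#!/usr/bin/env bash
# Stand-in for a noisy ROS process: normal output on stdout, one
# TF_REPEATED_DATA warning plus one real error on stderr.
noisy() {
  echo "normal output"
  echo "Warning: TF_REPEATED_DATA ignoring data" >&2
  echo "some other error" >&2
}

# Send only stderr through grep, dropping the TF_REPEATED_DATA lines;
# stdout and the remaining stderr lines still reach the terminal.
noisy 2> >(grep -v TF_REPEATED_DATA)
```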