rotu opened 5 years ago
See https://github.com/ros-planning/navigation2/issues/1094 for a concrete example of what (I think) this design choice is causing.
So the design choice for /tf to be reliable is for two major reasons:
If this is a bug in how OpenSplice is handling things, then I think it is worthwhile to connect with ADLink and see if they have any thoughts here. If you'd like to make the /tf QoS configurable (still defaulting to reliable), we'd certainly entertain a PR that does that. Changing the /tf default to be best effort could also be considered, but we'd need a good rationale to do that (and some thought about the fallout from doing it). Just let us know how you want to proceed.
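For reference, here is a minimal sketch of what a configurable /tf QoS could look like on the publisher side. The parameter name (`tf_qos_reliability`) is made up for illustration, and it publishes `tf2_msgs/TFMessage` on /tf directly rather than going through `tf2_ros.TransformBroadcaster`, so the QoS is fully under the node's control:

```python
# Sketch only: a broadcaster whose /tf QoS is selected by a parameter.
# The parameter name "tf_qos_reliability" is hypothetical.
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, QoSReliabilityPolicy, QoSDurabilityPolicy
from tf2_msgs.msg import TFMessage


class ConfigurableTfBroadcaster(Node):
    def __init__(self):
        super().__init__('configurable_tf_broadcaster')
        # Default stays "reliable" to preserve the current behaviour.
        reliability = self.declare_parameter('tf_qos_reliability', 'reliable').value
        qos = QoSProfile(
            depth=100,
            reliability=QoSReliabilityPolicy.BEST_EFFORT
            if reliability == 'best_effort'
            else QoSReliabilityPolicy.RELIABLE,
            durability=QoSDurabilityPolicy.VOLATILE,
        )
        # Publish tf2_msgs/TFMessage on /tf directly so we choose the QoS.
        self.tf_pub = self.create_publisher(TFMessage, '/tf', qos)


def main():
    rclpy.init()
    rclpy.spin(ConfigurableTfBroadcaster())
```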
I don't think it's a bug in how OpenSplice handles things; rather, the problem just becomes obvious in the rather extreme case of the OpenSplice traffic jams that came up.
In many applications, it does make sense that all transforms do get delivered; you don't necessarily want to drop some on the floor.
That may be so, but this design choice prevents meaningful usage of SensorDataQoS with sensor data spread across multiple TF frames, since data that is delivered via best effort must wait until the transforms are delivered before it can be reconciled with them. The question is: is the occasional dropped TF a problem, given that geometry is going to interpolate over that missing data anyway? Is it a bigger problem than delaying the handling of every piece of spatial sensor data until the TFs are retransmitted and a backlog of up to 100 messages is worked through?
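To make that coupling concrete, here is a minimal sketch (the frame names and topic are placeholders) of a best-effort sensor subscription whose callback still depends on the reliable /tf stream having caught up before the data can be used:

```python
# Sketch: sensor data arrives best effort, but the callback still has to wait
# for (or drop on) the reliable /tf stream. Frame names are placeholders.
import rclpy
from rclpy.node import Node
from rclpy.qos import qos_profile_sensor_data
from rclpy.time import Time
from sensor_msgs.msg import LaserScan
from tf2_ros import Buffer, TransformListener


class ScanTransformer(Node):
    def __init__(self):
        super().__init__('scan_transformer')
        self.buffer = Buffer()
        self.listener = TransformListener(self.buffer, self)
        # The scan itself is delivered best effort ...
        self.create_subscription(LaserScan, 'scan', self.on_scan,
                                 qos_profile_sensor_data)

    def on_scan(self, scan):
        stamp = Time.from_msg(scan.header.stamp)
        # ... but it cannot be used until the reliable /tf stream has caught up
        # to the scan's timestamp; until then the scan must be queued or dropped.
        if not self.buffer.can_transform('base_link', scan.header.frame_id, stamp):
            self.get_logger().warn('transform not yet available, dropping scan')
            return
        tf = self.buffer.lookup_transform('base_link', scan.header.frame_id, stamp)
        # ... apply tf to the scan here.


def main():
    rclpy.init()
    rclpy.spin(ScanTransformer())
```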
If reliable is indeed the correct default, what is the proper workaround for transforming sensor data that is supposed to be processed in real time?
The general idea behind tf was that it would be robust to dropped packets thanks to interpolation, but we never really fleshed that out, since the ROS transport never really exercised that case. And there are some corner cases that might need an improved API if significant dropping is going on. For example, a maximum interpolation distance might be in order: two data samples 10 seconds apart do not necessarily tell you much about where an arm was in the middle, 5 seconds in.
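tf2 has no such maximum-interpolation-distance API today, so purely as an illustration of the idea, here is a sketch of an application-level guard that tracks /tf stamps itself and rejects queries whose bracketing samples are further apart than a threshold. The threshold, class name, and /tf-only subscription are all made up for the example:

```python
# Sketch of a "max interpolation distance" guard; not part of the tf2 API.
import bisect
from collections import defaultdict

import rclpy
from rclpy.node import Node
from tf2_msgs.msg import TFMessage

MAX_GAP = 0.5  # seconds; beyond this, interpolation is considered meaningless


class GapTracker(Node):
    def __init__(self):
        super().__init__('tf_gap_tracker')
        self.stamps = defaultdict(list)  # (parent, child) -> sorted stamps [s]
        self.create_subscription(TFMessage, '/tf', self.on_tf, 100)

    def on_tf(self, msg):
        for t in msg.transforms:
            key = (t.header.frame_id, t.child_frame_id)
            stamp = t.header.stamp.sec + t.header.stamp.nanosec * 1e-9
            bisect.insort(self.stamps[key], stamp)

    def interpolation_ok(self, parent, child, query_time):
        """True if the samples bracketing query_time are close enough together."""
        stamps = self.stamps[(parent, child)]
        i = bisect.bisect_left(stamps, query_time)
        if i == 0 or i == len(stamps):
            return False  # outside the recorded samples: extrapolation, not interpolation
        return (stamps[i] - stamps[i - 1]) <= MAX_GAP


def main():
    rclpy.init()
    rclpy.spin(GapTracker())
```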
For anything realtime, I don't think that running through the tf system is likely the right thing to do. There's a large amount of overhead and many moving parts whose timings you cannot guarantee, and you're already hitting the network, etc. To that end, if you have a backlog of 100 tf messages that would otherwise have been dropped, your performance is already significantly degraded to the point that you probably can't trust interpolation either, and you're likely completely saturating the bandwidth of the link. You're also assuming that the tf messages will back up while the sensor data gets through much faster. Best effort makes a much bigger difference for large data types such as images than for a tf message, which easily fits inside your average datagram and only takes one extra cycle to retransmit.
Stepping back to a higher level, if you're reaching the saturation point on your WiFi, it's hard to believe that anything is going to perform well. And by default I think it's much cleaner if tf always returns its most accurate result possible rather than a best effort. If an image is dropped, it's obvious when computing that you're only seeing the messages that have arrived, even if they arrive every 10 seconds. Whereas if you then compute the heading to a marker in the image and the tf frames are only coming through every 10 seconds, suddenly you'll be computing information that's actually completely wrong. For example, think of a robot spinning in place with a camera on its head looking at QR codes. If it spins once every 10 seconds and samples only come in every 10 seconds, the tf pose interpolation at any time will point in the same direction, because every sample caught the robot at the same point in the rotation. This is actually a completely wrong result, and should not be something that tf would be willing to return.
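A tiny numeric illustration of that aliasing, assuming exactly one /tf sample per 10-second revolution gets through:

```python
# Illustration of the aliasing described above: one surviving sample per revolution.
import math

period = 10.0                      # seconds per full revolution
sample_times = [0.0, 10.0, 20.0]   # the only samples that got through
sample_yaws = [(2 * math.pi * t / period) % (2 * math.pi) for t in sample_times]
print(sample_yaws)                 # [0.0, 0.0, 0.0] -- every sample looks identical

t_query = 5.0
# Linear interpolation between two identical bracketing samples gives 0.0 ...
interpolated = sample_yaws[0] + (sample_yaws[1] - sample_yaws[0]) * (t_query / period)
true_yaw = (2 * math.pi * t_query / period) % (2 * math.pi)
print(interpolated, true_yaw)      # 0.0 vs. pi: completely wrong, as argued above
```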
And all of that because it's "best effort" and it would save you some number of milliseconds when looking up a transform on an image that came in over a best-effort link. If you are saturating a link with tf messages, there are better solutions, such as rebroadcasting a downsampled stream in a smart and reasonable way, with logic for maximum displacements and time between samples, instead of just letting things randomly drop based on network congestion.
Note that in my example we're only talking about one transform getting sent. If you're getting messages from the whole system, one subset of the network is more congested, and we use best effort for tf messages, it's quite possible that some frames would come through most of the time while others would be completely starved. Thus you could actually get to the point where you have images but cannot transform them at all.
In the past we've experimented with having tf try to help extrapolate into the future using the history of the transforms, and this only led to very hard-to-debug errors. I think that this is another case where we could play tricks to try to highly optimize for a low-latency system, but it's not worth it at the cost of accuracy.
So to that end I'd strongly recommend that we stick with the reliable transport and, if tf latency is an issue, use an application-specific throttling mechanism that can take the semantics into account to provide a better tradeoff between bandwidth and accuracy.
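As one possible shape for such an application-specific throttle, here is a rough sketch of a relay that republishes /tf onto a hypothetical /tf_throttled topic only when a frame has moved more than a minimum displacement or a maximum interval has elapsed. The thresholds and output topic name are made up for the example:

```python
# Sketch: a /tf throttling relay with max-displacement / max-interval logic.
# Thresholds and the /tf_throttled topic name are illustrative only.
import math

import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, QoSReliabilityPolicy
from tf2_msgs.msg import TFMessage

MIN_DISPLACEMENT = 0.01  # metres a frame must move before being rebroadcast
MAX_INTERVAL = 0.5       # seconds after which a frame is rebroadcast regardless


class TfThrottler(Node):
    def __init__(self):
        super().__init__('tf_throttler')
        self.last_sent = {}  # (parent, child) -> (stamp [s], translation)
        self.create_subscription(TFMessage, '/tf', self.on_tf, 100)
        best_effort = QoSProfile(depth=10,
                                 reliability=QoSReliabilityPolicy.BEST_EFFORT)
        self.pub = self.create_publisher(TFMessage, '/tf_throttled', best_effort)

    def on_tf(self, msg):
        out = TFMessage()
        for t in msg.transforms:
            key = (t.header.frame_id, t.child_frame_id)
            stamp = t.header.stamp.sec + t.header.stamp.nanosec * 1e-9
            trans = (t.transform.translation.x,
                     t.transform.translation.y,
                     t.transform.translation.z)
            prev = self.last_sent.get(key)
            moved = prev is None or math.dist(trans, prev[1]) >= MIN_DISPLACEMENT
            stale = prev is None or stamp - prev[0] >= MAX_INTERVAL
            if moved or stale:
                self.last_sent[key] = (stamp, trans)
                out.transforms.append(t)
        if out.transforms:
            self.pub.publish(out)


def main():
    rclpy.init()
    rclpy.spin(TfThrottler())
```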
I would like to weigh in on this issue. Since a reasonable humanoid robot control system will publish TFs at around 250-1000 Hz, the overhead of resends can become significant and even take the network down if the QoS depth is big. The reliability will come from the frequency instead of from resends.
With reliable as the default, all tooling breaks when the publisher is made best_effort, because tools like rviz2, tf2_monitor, etc. do not expose the QoS settings (and a reliable subscription will not match a best-effort publisher). I think that both reliable and best_effort have their place. However, by choosing reliable as the default, it is not possible to use tooling like rviz2 and tf2_monitor with a best_effort publisher.
I notice a lot of cases (e.g. spinning up nodes, transform listeners without their own thread, network latency due to WiFi multicast) where tf throughput will slow to a crawl. Would it make sense to change the /tf QoS from reliable to best effort so it is not subject to this backpressure?