Improve latching/transient_local support in ROS 2

wjwwood commented 6 years ago

This have come up several times in the past, but in summary DDS's durability setting doesn't behave exactly like latching in ROS 1. So, the question that needs a decision is "do we just use DDS's notion of durability, or implement our own latching mechanism, or both"?

ROS 1 Latching vs DDS durability

There two main differences between the latching mechanism in ROS 1 and the durability QoS in DDS.

The first is that in DDS, whether or not old messages are sent through a connection is determined by a combination of properties on both the publisher and the subscriber. By contrast, in ROS 1 a subscriber does not express a preference for latched or unlatched publishers, latching is only expressed on the publishing side. Furthermore in DDS there's no "best available" option for durability: if a datareader requests durable data, it will only connect to datawriters that offer durable data. The matrix breaks down as such:

ROS 1:

Publisher	Subscriber	Connection	Result
Not Latched	--	Yes	New Messages
Latched	--	Yes	New and Old Messages

ROS 2:

Publisher	Subscriber	Connection	Result
Volatile	Volatile	Yes	New Messages
Volatile	Transient local	No	No Communication
Transient local	Volatile	Yes	New Messages
Transient local	Transient local	Yes	New and Old Messages

So, to mimic ROS 1's latching, what we'd ideally have in ROS 2 is a durability setting for datareaders (subscribers) which is like "best available". But that doesn't exist. I brought this up with RTI last time we met, and they said that it might makes sense to have a setting for that in DDS, but it would need to be part of the standard and/or implemented by all our middlewares as an extension in the meantime. They also pointed out they have something like this for reliable/best effort, but I can't find that in their docs at the moment so I'll have to ask them about that.

So, if we choose to just use durability from DDS as-is then there's no direct equivalent to how latching works in ROS 1. (see below for a possible workaround)

The other main difference is that the number of backlogged messages delivered to "late subscribers". In ROS 1, it's always the last message only. In DDS, the number is based on the history setting (basically anything in the history cache for the datawriter is delivered), which is primarily affected by the history setting (keep all vs keep last) and the history cache depth. As far as I know, there's no way in DDS to have a different "durability depth" from the "history depth", so in that case there's no direct analog to a publisher in ROS 1 with a queue size of 1000, but latched (implicit latching depth of 1).

So, (again) if we choose to use durability from DDS as-is, there'd be no way to directly emulate, for example, a ROS 1 topic with queue size of 10 (to account for bursty data) that is latched (with an implicit latching size of 1).

Possible work around

One work around that @dhood mentioned (not necessarily proposing it, but she described it first) was to make all publishers transient_local, that way all subscribers would always talk to it, and such that volatile subscribers would get only new messages, and transient_local subscribers would get both old and new messages. This basically inverts the way it is expressed in ROS 1, so that all publishers are latched, but the subscribers are responsible for expressing a desire to receive latched messages or not.

So while this provides a behavior that's more flexible, it still does not provide a direct mapping to ROS 1's behavior. Also, always providing durability of transient local on publishers would be potentially very expensive, e.g. consider a image publisher with keep all or keep last of 1000. It might be then required to keep a large number of image messages around all the time, using a lot of memory, even if no subscriber asks for old data.

Proposal for our own latching mechanism

It's been tossed around in the past that we could provide our own latching mechanism in ROS 2 (either in stead of durability or in addition to it), which would act like the mechanism in ROS 1, and would:

only opt-in on the publisher side and opt-out on the subscriber side
- e.g. publisher would a new parameter like latch_n_messages=N, where N=0 would disable
- e.g. subscription would get new parameter like ignore_latched_messages=false
- unless explicitly ignored, subscribers would take latched messages if available, but not complain (and still talk to the publisher normally) if not available
would deliver a configurable number of backlogged messages to late subscribers
- number of messages backlogged would be separate from the history setting or history depth

This could be implemented as a service call which shadows each latched topic, e.g. rostopic:///chatter and rosservice:///chatter/_request_latched_messages, and subscribers could do this on creation:

if service exists topic_name + '_request_latched_messages'
- then call service and deliver (or schedule) array of backlogged messages to subscriber
finish creating subscription

I think there are lots of little details to work out here (node specific service addressing, how to schedule message delivery before data from normal path, is the service definition type specific?, etc...), but the idea would work in theory. There are probably other ways it could implemented as well. I think it could be considered an implementation detail (and could even not be based on topics and/or services).

The down side is that this has a lot of overlap with an existing feature of DDS in durability and it would add some considerable overhead for this feature compares to ROS 1's latching implementation and compared to DDS's durability.

Questions

So, I think it's fair to say there's no direct mapping from ROS 1's latching to DDS and so the first question I have is:

do we implement something that is more like ROS 1's latching?

And I think the answer to that is based on whether or not we believe the durability in DDS to be insufficient for our users, e.g. is the lack of a "best available" for datareaders a deal breaker?

If we say: yes, we need our own mechanism, then the follow up question is:

how do we implement custom latching and is that really feasible and scalable?

But if we say: no, DDS's durability is good enough, then we have to ask the next question:

how do we map ROS 1's latched topics (ros1_bridge, rosbag v2 playback, and ported nodes) to ROS 2?
- is the constraint that history settings/depth control durabilities scope ok?

Who is affected

This is blocking decisions about how to map latched topics over the bridge, with notable examples like /tf_static (https://github.com/ros2/geometry2/issues/12) and /map (https://github.com/ros2/rmw/pull/111).

For a topic like /map that is used in a lot of different packages, making a publisher use latching is about the same amount of work as the publisher using Transient Local. Is the cost of making every user set Transient Local when subscribing to that kind of topic greater than the cost of implementing our own latching mechanism?

paulbovbel commented 6 years ago

Stumbled on this, so I thought I could offer up a user perspective. While it was a little surprising that latching had to be set sub-side as well as pub-side in ROS2, I don't think it's particularly cumbersome given proper documentation.

DDS durability feels 'good enough' in terms of matching ROS1 latch functionality:

for history vs queue length, in a latched scenario, I can imagine that any messages not important enough to be durable, is also unimportant enough to be dropped off the publisher queue.
- the durability contract being a two-sided issue also makes sense, since while the publisher must offer up the data, the subscriber must also opt-in to handle not just future data, but whatever past data the publisher has in history.

I do like the idea of setting all publishers to be 'Transient Local' by default and avoid impedance mismatch, but I imagine some publishers would still want to opt-into 'Volatile' avoid storing history, due to any number of application-specific concerns.

ros-discourse commented 3 years ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/about-qos-of-images/18744/13

clalancette commented 3 years ago

Since the time that this issue was opened, we have clearly stuck with the DDS concept of "latched" topics; the publisher and subscriber must agree on this setting in order for true latching to work.

That being said, it would still be nice to make this work more automatically (and thus make it easier for users). @Karsten1987 said that there is some code in rosbag2 that kind of does this negotiation of settings, and that that code could be made more generic. https://github.com/ros2/ros2cli/issues/594 hints at using the same infrastructure to do that for the CLI tools.

So I'm going to leave this old issue open in the hopes of improving the latching story in ROS 2. That being said, we have no immediate plans to work on it so I'm going to mark it as backlog for now.

emersonknapp commented 3 years ago

That being said, it would still be nice to make this work more automatically (and thus make it easier for users). @Karsten1987 said that there is some code in rosbag2 that kind of does this negotiation of settings, and that that code could be made more generic. ros2/ros2cli#594 hints at using the same infrastructure to do that for the CLI tools.

Yes, the rosbag2 subscriptions detect all existing publishers and subscribe with a QoS that is guaranteed to match them all. Generalizing that functionality to the ROS 2 core is tracked as https://github.com/ros2/rosbag2/issues/601 - not sure if we want a ros2/ros2 ticket for it instead/additionally

clalancette commented 3 years ago

Generalizing that functionality to the ROS 2 core is tracked as ros2/rosbag2#601 - not sure if we want a ros2/ros2 ticket for it instead/additionally

Ha, I had forgotten about that issue, thanks for reminding me.

I still think it makes sense to leave this one open. The rosbag2 issue is about generalizing the code, but there would still be follow-up work elsewhere to use that code as appropriate. So I'm going to leave this one open for now.

dcconner commented 1 year ago

Can I request that we add this discussion to https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Quality-of-Service-Settings.html

Especially the results column in table above. The current table of "compatible" is misleading, and we need a discussion of latching front and center.

clalancette commented 1 year ago

Please feel free to open a PR against the documentation at https://github.com/ros2/ros2_documentation, and we can iterate there.

dcconner commented 1 year ago

@clalancette fair enough. https://github.com/ros2/ros2_documentation/pull/3804

ros2 / ros2