ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
514 stars 410 forks source link

Throw on subscriber with incompatible QoS by default. #2499

Closed Zyrin closed 1 month ago

Zyrin commented 2 months ago

Printing a warning here is not enough as subscribers that want to get "latched" messages will never get ANY messages.

Zyrin commented 2 months ago

Actually why do subscribers with transient local never get any data from a volatile publisher? Wouldn't it have be fine to print a warning that there are no stored messages on the publisher but still receive new messages?

clalancette commented 2 months ago

Actually why do subscribers with transient local never get any data from a volatile publisher? Wouldn't it have be fine to print a warning that there are no stored messages on the publisher but still receive new messages?

Yeah, and I thought that was the case. But in a simple test it doesn't work here either:

Terminal 1:

ros2 topic pub -r 1 /chatter --qos-durability volatile std_msgs/msg/String '{data: hello}'

Terminal 2:

ros2 topic echo --qos-durability transient_local /chatter std_msgs/msg/String
[WARN] [1712937664.106230183] [_ros2cli_160175]: New publisher discovered on topic '/chatter', offering incompatible QoS. No messages will be received from it. Last incompatible policy: DURABILITY

@eboasson @MiguelCompany @eduponz could you maybe comment here to let us know if it is expected that volatile publishers can publish to transient_local subscribers?

alsora commented 2 months ago

We just discussed this in the Client Library Working Group (with @mjcarroll @jmachowinski)

The behavior proposed in the PR (throwing an exception if there's a qos mismatch) seems dangerous as it would allow to "crash" robots remotely, by creating a bad publisher. A temporary solution consists of overriding the qos mismatch event callback to throw the exception if some users want that.

Having said that, the current behavior is very user unfriendly. From the QoS compatibility page we indeed see that a volatile publisher won't match with a transient local subscriber. Publishers offer "volatile", but subscribers request "transient local".

The desired behavior would be to still match, but not deliver the old messages.

We should investigate why the DDS QoS works this way. If the DDS QoS can't support this use-case, we may need to create a "ROS QoS" at the RMW layer to define this policy

wjwwood commented 2 months ago

From the QoS compatibility page we indeed see that a volatile publisher won't match with a transient local subscriber. Publishers offer "volatile", but subscribers request "transient local".

The reason it's this way is because DDS does it this way.

I would prefer that the subscriptions were more "promiscuous" and match with any publishers, taking transient local if available, but communicating anyway if not. However, this is not trivial to implement in the ROS layer and DDS doesn't support it.

As to why, I believe (but I don't know for sure, I'm just speculating) the reason is a cultural difference in DDS and ROS, where DDS prefers precise matching rules that avoid unintended interactions, whereas ROS preferred an approach that allowed for maximum flexibility on matching components with little to zero configuration changes.

The oversight (in my opinion) comes in when you have multiple subscribers and a single publisher, with different requirements. I think DDS folks would argue the system is designed wrong and you should have two publishers or something like that, but it's nice if you can have say an algorithm ask for transient local, but occasionally have the recorder or some visualizer subscribe too but without transient local.

See this issue: https://github.com/ros2/ros2/issues/464


As for this issue, I agree we simply cannot throw when the matching doesn't work, for the above mentioned security concerns. The warning is the best we can do.

fujitatomoya commented 2 months ago

@wjwwood thanks for the comments.

As to why, I believe (but I don't know for sure, I'm just speculating) the reason is a cultural difference in DDS and ROS, where DDS prefers precise matching rules that avoid unintended interactions, whereas ROS preferred an approach that allowed for maximum flexibility on matching components with little to zero configuration changes.

i think both reasonable. i believe the use cases for robotics and autonomous system, avoid unexpected behavior to stop communication would be better, instead of printing warning (user could oversight) to keep the communication. (probably later on, we find the problem with this behavior...) just idea, there could be some opt-in configuration flag for user to allow incompatible QoS to keep the communication explicitly, in that case user must be able to know what they are doing.

Ryanf55 commented 2 months ago

As a developer in full control of my system, where ROS_LOCALHOST_ONLY is enabled, and I use SROS, there is no security concern. Why not allow this this opt-in. In some of our more complicated test frameworks, it's hard to catch things like these in logs because ROSOUT isn't captured as part of the test. Having it throw would be way better in case I forgot to update the QOS in my test code somewhere, instead of manually chasing down and looking at the ROS logs.

There's plenty of other ways to crash ROS if you are on the network with malformed DDS messages.

alsora commented 2 months ago

Why not allow this this opt-in

The "opt-in" would consist of a subscription option, e.g. bool throw_on_qos_incompatible. IMO that's redundant. You already can set a custom callback for handling qos events, and this gives you the freedom to do whatever you want (e.g. throwing an exception) without need for having flags for each possible mode.

There's plenty of other ways to crash ROS if you are on the network with malformed DDS messages.

That's something we should fix, and not introduce more ways to cause problems. I'm not necessarily concerned about malicious actors, but more about developers mistakes and bad user experience.

fujitatomoya commented 2 months ago

just idea, there could be some opt-in configuration flag for user to allow incompatible QoS to keep the communication explicitly, in that case user must be able to know what they are doing.

btw, i am not even sure if this opt-in policy can be supported with current rmw implementations. as far as i see the docs, DDS cannot allow this incompatible cases at all as specification... CC: @MiguelCompany @eboasson @asorbini

EduPonz commented 2 months ago

Throwing in my two cents, I'd argue that a Volatile DataWriter only exists if the application explicitly sets the DurabilityQosPolicy to Volatile. Furthermore, a Transient Local DataReader only exists if the application explicitly changes the policy to Transient Local. Seeing that this need to be intended changes, I'd not allow Volatile writers to match with Transient Local readers. In fact, the very reason why the default is Transient Local in writers and Volatile in readers is so that applications can control the latching behaviour by simply setting the policy on the reader side.

Perhaps with Durability is less clear, as a Transient Local reader not receiving historical data from a Volatile writer could be interpreted upon reception as that there was no historical data. However, if you take the analogous case of ReliabilityQosPolicy, a Reliable reader is explicitly expressing that it wants the writer to resend data in case of loss. I don't think we'd want that reader to match with a Best Effort writer that cannot resend any loss sample. IMO having discrepancies in the Qos incompatibility would be even more confusing.

One thing we could do is not to expose those policies for publishers (leaving the default values which are designed to matched with everyone), and let the QosPolicies of the communication be defined on the subscription side only.

jmachowinski commented 2 months ago

In fact, the very reason why the default is Transient Local in writers

This is NOT the default for writers. The default is volatile.

Edit: Do you mean publishers when talking about writers ?

eboasson commented 2 months ago

@wjwwood

From the QoS compatibility page we indeed see that a volatile publisher won't match with a transient local subscriber. [...] I would prefer that the subscriptions were more "promiscuous" and match with any publishers, taking transient local if available, but communicating anyway if not. However, this is not trivial to implement in the ROS layer and DDS doesn't support it. As to why, I believe (but I don't know for sure, I'm just speculating) the reason is a cultural difference in DDS and ROS, where DDS prefers precise matching rules that avoid unintended interactions, whereas ROS preferred an approach that allowed for maximum flexibility on matching components with little to zero configuration changes.

I want to make sure I don't get lumped in with the DDS folks here 😀 From the day I learnt that DDS had this "request-offered" matching of QoS settings, I have been convinced that it was a bad decision. That was some 15 years ago, and I have not seen a single reason since that led me to reconsider that position. I did learn of various new ways in which it is a bad thing. The trouble is that it is set in stone in DDS ...

@fujitatomoya

just idea, there could be some opt-in configuration flag for user to allow incompatible QoS to keep the communication explicitly, in that case user must be able to know what they are doing.

btw, i am not even sure if this opt-in policy can be supported with current rmw implementations. as far as i see the docs, DDS cannot allow this incompatible cases at all as specification...

There is a way of sneaking it into a DDS implementation without causing too much trouble (I've done it in the past), and that is by adding a QoS for configuring applicable matching rules, and then distribute that setting using a vendor-specific parameter code in the discovery information with the "must understand" flag set, indicating that any implementation that receives that the discovery data but doesn't understand this particular setting has to ignore it. Then you only need to fiddle a bit with feature flags/version numbers in discovery so that you can handle the case where you're interoperating with the same DDS implementation but an older version.

I doubt the OMG would go for it, and in any case, that process is so slow that it might as well be ignored. That makes it necessarily very vendor-specific.

@jmachowinski

In fact, the very reason why the default is Transient Local in writers

This is NOT the default for writers. The default is volatile.

Indeed.

Edit: Do you mean publishers when talking about writers ?

I have no desire to speak for someone else, but I might be able to provide some clarification. In DDS the basic entities are called "(data) writers" and "(data) readers", and the ROS 2 "publisher" is equivalent to the DDS "writer".

To make things more fun, in DDS you also have "publishers" and "subscribers". Each data writer is owned by exactly one publisher, but a publisher can own multiple data writers (analogous for readers/subscribers). So it is easy for confusion to arise.

EduPonz commented 2 months ago

In fact, the very reason why the default is Transient Local in writers

This is NOT the default for writers. The default is volatile.

Edit: Do you mean publishers when talking about writers ?

Sorry for not been clear. When I wrote writers I meant DDS DataWriters, same with readers (DDS DataReaders). I always found strange that the DDS speck allows configuring these policies on both sides, instead of having the receiver deciding on them. It is my understanding that this was done to allow for optimizations in some cases, but it is something that ROS 2 may not expose, as it is somehow error prone. In that sense, changing durability or reliability on the DataWriter side should be presented in a only-touch-this-if-you-know-what-you-are-doing basis.

On the meantime, we could change the Publisher defaults for ROS 2 to map those of DDS so that setting the volatility or reliability on only on the subscriptions side does not lead to this issue. Furthermore, I'd always advocate for applications to implement all the callbacks and not just the data ones. We could either force them to do it (possible a bit harsh but it's what you would do in a safety context for instance), or we could have a default implementation that at the very least logs something (I agree that throwing is indeed a dangerous default). It is true that the error may still go unnoticed for some applications, but I guess it'd better than nothing.

sloretz commented 1 month ago

@fujitatomoya would you be willing to summarize this PR, close it, and open a discussion issue on ros2/ros2? 🧇

fujitatomoya commented 1 month ago

As a conclusion to this PR, we are going to close this. According to the discussion above, it seems no objections but feel free to post the comments and reopen.

The reason is that this PR allows user to crash (generating exception) remotely on the subscriber by creating publisher on the topic. This could be security issue, and https://github.com/ros2/rclcpp/pull/2499#issuecomment-2052049179 describes the reason as well.

For another discussion on How strict QoS compatibility should be?, i created the the another issue to track the discussion and feedback, please visit https://github.com/ros2/ros2/issues/1562