ros2 / design

Design documentation for ROS 2.0 effort
http://design.ros2.org/
Apache License 2.0
215 stars 194 forks source link

Unique network flows #304

Closed anamud closed 3 years ago

anamud commented 3 years ago

This is a proposal for unique network flow identifiers for publishers and subscribers in communicating nodes. Unique network flow identifiers enable publishers and subscribers to obtain required QoS unambiguously from machine-machine networks such as 5G.

Signed-off-by: Ananya Muddukrishna ananya.x.muddukrishna@ericsson.com

anamud commented 3 years ago

@wjwwood Please have a look at this proposal. I would like to present it to the Middleware working group on a suitable occasion.

anamud commented 3 years ago

This PR was presented to the middleware working group on 04-11-2020 (minutes).

ros-discourse commented 3 years ago

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2020-11-19/17570/1

ros-discourse commented 3 years ago

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-native-5g-support-from-ericsson-and-eprosima/17705/2

anamud commented 3 years ago

Implementation pull-requests are now submitted to relevant repositories.

Initial contributions stem from Ericsson and eProsima.

eboasson commented 3 years ago

@anamud I'm sorry I'm responding so late, that seems to happen more often than I'd like ...

What is the envisioned behaviour of a system where I have one publisher and two subscribers, and both subscribers request a unique network flow? Does that imply a multicast must be transformed into two unicast messages? And if those two subscribers happen to be in the same node (or DDS domain participant, so the same RMW context), is it then the intent that the same data is sent twice to the single subscribing process?

The implication from the design document seems to be that this is the case, but I don't think that's wise: in a pub/sub system, it is perfectly feasible to identify "unique flows" on the publisher side, but doing so at the subscriber side gets you into trouble.

I would say that assigning a unique flow id at the publisher side is meaningful. I would equally say that requesting separate data path on the both sides (i.e., socket, threads, buffers, the works) for processing data is also reasonable, but those are different concepts.

Furthermore, as discussed during the discussion in the MW working group some time back, I am not generally in favour of sprinkling such low-level details in application code at the level of readers and writers: the QoS needs are really determined at system level, and mostly bound to topics. So side-loading it would generally be better in my view, and I would then like being able to set them on topics.

Also mentioned in that call was that the point of abstracting the network with a pub/sub system is allowing you to avoid dealing with network protocols, addresses, port numbers, how many bits are available for uniquely identifying flows, &c. Here you explicitly introduce the IP model and protocol suite in the interface, but that means any other networking technology will have to fake things. For example, the LightFleet system introduced recently in a WG meeting gives you hardware reliable pub/sub, without IP addresses ...

The cleaner (to me at least) but admittedly much harder model is to let the middleware map transport-independent QoS settings in whichever way it sees fit to provide the requested QoS. I.e., DDS natively has QoS settings for transport priority and latency budget, and one could conceivably add something about bit error rate (or likelihood of message loss). How exactly that should be mapped to IP QoS settings/network flows is a difficult but interesting subject. I wouldn't be surprised if that requires side-loading mapping tables, and that, incidentally, fits well with how TSN is designed to operate.

anamud commented 3 years ago

The implication from the design document seems to be that this is the case, but I don't think that's wise: in a pub/sub system, it is perfectly feasible to identify "unique flows" on the publisher side, but doing so at the subscriber side gets you into trouble.

@eboasson, thanks for the comment. I would like to clarify that the design proposes interfaces to create unique network flows for subscriptions and publishers. RMW implementations can decide to support either or both. Specifically, RMW implementations can decide not to support subscription-side unique flows if it complicates their design. Similarly, the rather odd case that two or more subscriptions to the same topic in the same node requesting for unique flows can be easily recognized and refused by RMW implementations.

anamud commented 3 years ago

Also mentioned in that call was that the point of abstracting the network with a pub/sub system is allowing you to avoid dealing with network protocols, addresses, port numbers, how many bits are available for uniquely identifying flows, &c. Here you explicitly introduce the IP model and protocol suite in the interface, but that means any other networking technology will have to fake things. For example, the LightFleet system introduced recently in a WG meeting gives you hardware reliable pub/sub, without IP addresses ...

@eboasson, I agree that low-level concepts should not bleed into the upper layers and upset the separation of concerns. In our case, however, the introduction of low-level information is harmless and even sought after by ROS users. Just to clarify, according to our proposal, the only low-level information coming up to the RCLCPP layer are read-only details about network flows i.e., IP addresses, port numbers, flow labels, and protocol details. These are details that ROS programmers generally care about while debugging. For example, there are hundreds of questions on ROS Answers just about IP addresses.

Also, an RMW that uses special networking (say an RMW based on LightFleet) can easily extend our proposed implementation for get_network_flow() with its own specific flow details. A flow is essentially a tuple with source and destination addresses and is fundamental to every communication mechanism.

gavanderhoorn commented 3 years ago

These are details that ROS programmers generally care about while debugging. For example, there are hundreds of questions on ROS Answers just about IP addresses.

as someone who was answered many of those "hundreds of questions" about IP addresses on ROS Answers: I don't think those are relevant here.

@eboasson is concerned about breaking the abstraction by leaking implementation details upwards into higher level layers. That concern seems justified.

Providing infrastructure to facilitate debugging is always good. Doing it by punching holes in abstractions must be done only after careful consideration.

A flow is essentially a tuple with source and destination addresses and is fundamental to every communication mechanism.

isn't the point raised by @eboasson that you're introducing that assumption here but it may not actually be true? Communication may have sources and destinations, but those are not always called addresses, nor does there necessarily have to be a 1-to-1 mapping of those to network concepts.

anamud commented 3 years ago

Furthermore, as discussed during the discussion in the MW working group some time back, I am not generally in favour of sprinkling such low-level details in application code at the level of readers and writers: the QoS needs are really determined at system level, and mostly bound to topics. So side-loading it would generally be better in my view, and I would then like being able to set them on topics.

The cleaner (to me at least) but admittedly much harder model is to let the middleware map transport-independent QoS settings in whichever way it sees fit to provide the requested QoS. I.e., DDS natively has QoS settings for transport priority and latency budget, and one could conceivably add something about bit error rate (or likelihood of message loss). How exactly that should be mapped to IP QoS settings/network flows is a difficult but interesting subject. I wouldn't be surprised if that requires side-loading mapping tables, and that, incidentally, fits well with how TSN is designed to operate.

@eboasson I like the ideas of building middleware that map DDS QoS to network QoS and accept mapping tables. That QoS should be not an application programmer concern but a system concern is a principle I take to heart as well. However, these thread of thoughts are a bit out of scope of our proposal. Our proposal is motivated by (5G) network QoS configuration and concretely suggests a minimal side-loading-friendly interface that enables ROS2 users to request for unique network flows and understand network flows. There is no suggestion in our proposal about how ROS should configure (5G) network QoS.

anamud commented 3 years ago

@anamud I'm sorry I'm responding so late, that seems to happen more often than I'd like ...

@eboasson No problem! I understand the delay. Your comments helped me understand that the design proposal text is long-winded and does not describe the implementation choice clearly. I have now updated the design proposal to reflect the latest implementation. I believe this update improves readability and clears up confusion to a good degree. Please have a look and let me know if things are unclear.

anamud commented 3 years ago

Providing infrastructure to facilitate debugging is always good. Doing it by punching holes in abstractions must be done only after careful consideration.

I agree. This is why the implementation accompanying the design proposal is submitted to the scrutiny of the middleware working group and the general ROS community on GitHub. Please provide comments on the implementation if you have time.

anamud commented 3 years ago

isn't the point raised by @eboasson that you're introducing that assumption here but it may not actually be true? Communication may have sources and destinations, but those are not always called addresses, nor does there necessarily have to be a 1-to-1 mapping of those to network concepts.

This is indeed a valid concern. I understand that the idea of a flow can be different across networking stacks. Our proposal provides RMWs with the flexibility to define their own idea of a flow. They need not invent the concept of an address if they do not have such.

anamud commented 3 years ago

@eboasson and @gavanderhoorn: Thanks for raising the concern about leaking abstractions in the proposed design. We have now updated the proposal to reduce the leak. Before, we had introduced IP and transport layer specific data structures into the RMW. Now these are removed and replaced by a byte-array/string that RMW implementations can populate in their own custom manner, similar to the existing get_gid() method in rclcpp. Please have a look and let us know your thoughts.

eboasson commented 3 years ago

@anamud thanks for clarifying various points and updating the PR to address these concerns about leaky abstractions. I am unconvinced by the solution, but that touches upon a larger point that perhaps needs a discussion in a wider audience.

This proposal introduces a "unique flow" on subscriptions and publishers in the ROS 2 API, but in my current understanding, the meaning of both "unique" and "flow" are left unspecified, allowing them to be different between the two sides, not requiring them to do anything at all, and with any configuration information returned in an unspecified format in the operations for querying them.

When I write it down like this, I have a feeling that I may be coming across as a bit negative about it. There is an aspect that I am indeed negative about (the extent to which things are unspecified), but at the same time I really do think that what you are trying to do (being able to set network-level QoS) is very important.

I know there is a trade-off between trying to specify things tightly with the risk of being unable to express what you want or being unable to map it to the lower layers, and giving so much freedom in the interface that in order to use it effectively, you have to have knowledge of the specific lower layer (e.g., DDS over 5G) in the application for it to be useful. In my current view, this sits firmly on the side of too much freedom.

From your comment:

I like the ideas of building middleware that map DDS QoS to network QoS and accept mapping tables. That QoS should be not an application programmer concern but a system concern is a principle I take to heart as well. However, these thread of thoughts are a bit out of scope of our proposal. Our proposal is motivated by (5G) network QoS configuration and concretely suggests a minimal side-loading-friendly interface that enables ROS2 users to request for unique network flows and understand network flows. There is no suggestion in our proposal about how ROS should configure (5G) network QoS.

I take it that we're not really far apart but just have a different idea of the right way to get there.

I see adding this proposed interface as polluting the ROS 2 API with low-level details. Everything else that's there today tries to define clean abstractions and tries to limit itself to what is understood well enough to be confident that it will stand the test of time. (No doubt with some mistakes.) For example the pub/sub model with these QoS are (a subset of) an established model for abstracting communications; and executors that define how and when callbacks will fire avoids locking oneself into a single model that then turns out to be a bad choice (an area where things are changing right now).

To stick with the way the ROS interface is defined, I think one must try to express the applications needs in terms of fairly generic QoS settings, and then map those to the underlying network. The current QoS of ROS are insufficient, but if one adds priority and latency to them, you are much closer to being able to define a mapping to flows. Perhaps those two alone will be enough, perhaps not.

It will make things harder for the RMW implementations and the middleware, without a doubt. So be it.

anamud commented 3 years ago

@eboasson Thanks for a thorough review. You have pointed to deep shortcomings in the proposal.

This proposal introduces a "unique flow" on subscriptions and publishers in the ROS 2 API, but in my current understanding, the meaning of both "unique" and "flow" are left unspecified, allowing them to be different between the two sides, not requiring them to do anything at all, and with any configuration information returned in an unspecified format in the operations for querying them.

I know there is a trade-off between trying to specify things tightly with the risk of being unable to express what you want or being unable to map it to the lower layers, and giving so much freedom in the interface that in order to use it effectively, you have to have knowledge of the specific lower layer (e.g., DDS over 5G) in the application for it to be useful. In my current view, this sits firmly on the side of too much freedom.

This was greatly helpful. The unspecific terms in the proposal were a result of my own sloppiness in some places and a deliberate attempt at increasing implementation freedom in others.

I have now made the proposal much more concrete. Network flows, their endpoints, and their uniqueness are defined with greater precision. Please help me make them more rigorous if you find them lacking. I have removed the freedom of specification of network flows and their endpoints in the updated proposal. RMW implementations are now required to return specific structures defined in the rmw layer. The freedom to support the creation of either publisher- or subscriber-side unique network flow endpoints or both is retained because I believe RMW implementations would feel coerced otherwise.

An important point in the concrete definitions of network flows and their endpoints is that they are specific to the Internet Protocol Suite and focus on resources from UDP, TCP in the transport layer and IPV4, IPV6 in the internet layer. The rationale is that UDP, TCP, IPV4, and IPV6 are the majority protocols in use today, more so in RMW implementations. The proposed definitions can be extended later on to support relevant non-IP based protocols such at deterministic ethernet. RMW implementations that use resources outside the scope of these definitions (I cannot think of a single one) will not be able to support unique network flow endpoints or return network flow endpoints and I am willing to stand by this position.

I see adding this proposed interface as polluting the ROS 2 API with low-level details.

That such low-level details are harmless and useful to have in the application layer is a position I stand for. I believe that users who care about performance, debugging and visualization actively look for network flow endpoint details (as per our proposed definition) already. ROS applications are rarely stand-alone and are surrounded by other systems on the side. Integrating with these side-systems in an automated way is highly productive for ROS. One such side-system is the 5GS Network Exposure Function (5GS NEF). Our proposed interface provides the application layer with information required to integrate with the 5GS NEF. The automation benefits that such integration brings about far outweighs the harmless abstraction leakage in the application layer.

To stick with the way the ROS interface is defined, I think one must try to express the applications needs in terms of fairly generic QoS settings, and then map those to the underlying network. The current QoS of ROS are insufficient, but if one adds priority and latency to them, you are much closer to being able to define a mapping to flows. Perhaps those two alone will be enough, perhaps not. ... It will make things harder for the RMW implementations and the middleware, without a doubt. So be it.

Your alternative design proposal is visionary and indeed cleaner than ours. I too have thought about such a design before and reasoned that the current abstracted QoS parameters in ROS are largely inadequate to automatically infer network QoS configuration from. DDS-standard QoS parameters are inadequate too. To come up with a minimal set of generic, network-agnostic QoS parameters for ROS that somehow enable complete and automatic inference of required network (not just 5GS but WiFi, TSN, and others) QoS configuration is greatly challenging and disruptive, as you have identified. I want to work on this challenge, but as a second step. My first priority is to enable ROS applications to integrate with 5G networks as smoothly as possible without greatly upsetting the existing state of things, hence our proposal.

wjwwood commented 3 years ago

Ok, I've created a .repos file in a gist:

https://gist.githubusercontent.com/wjwwood/cbe7ce0fd4927019acfff3bb7ad7bb12/raw/d17929bf702507750e80070c55c539d8d0713df9/unique_network_flows.repos

And I'll run some initial CI before I review the pr's, hopefully this afternoon:

(edit: messed up the arguments)

wjwwood commented 3 years ago

New CI:

wjwwood commented 3 years ago

CI is failing on rmw_connextdds for some reason, @anamud is this something you can investigate in the next 24 hours or so? If not, I can try, but I'm not sure I'll have time.

For now, CI without rmw_connextdds:

wjwwood commented 3 years ago

Looks like there are some linter issues (uncrustify and cpplint) too.

anamud commented 3 years ago

CI is failing on rmw_connextdds for some reason, @anamud is this something you can investigate in the next 24 hours or so? If not, I can try, but I'm not sure I'll have time.

@wjwwood I have solved the error and updated the PR. Please see here. I would appreciate it if you start a new CI flow.

anamud commented 3 years ago

Looks like there are some linter issues (uncrustify and cpplint) too.

Will fix soon. Thanks!

EduPonz commented 3 years ago

Hi @anamud @wjwwood , I think I've taken care of all the uncrustify and cpplint issues in https://github.com/ros2/rmw_fastrtps/pull/502/commits/f8d4dab0e2b5aff971fd73106b916a352b7ac596

wjwwood commented 3 years ago

Here's new CI:

The linkage issues on Windows look real though (different from the connextdds linkage issues). I think that will likely come up in the above CI again:

23:10:53 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(96,6): error C2375: 'rclcpp::operator ==': redefinition; different linkage (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

23:10:53 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(77): message : see declaration of 'rclcpp::operator ==' (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

23:10:53 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(100,6): error C2375: 'rclcpp::operator !=': redefinition; different linkage (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

23:10:53 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(78): message : see declaration of 'rclcpp::operator !=' (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

23:10:53 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(104,16): error C2375: 'rclcpp::operator <<': redefinition; different linkage (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

23:10:53 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(81): message : see declaration of 'rclcpp::operator <<' (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]
wjwwood commented 3 years ago

I'm still seeing quite a few failures from ament_cpplint and ament_uncrustify.

This job did test the commit from @EduPonz:

16:14:54 === src/ros2/rmw_fastrtps (git) === 16:14:54 commit f8d4dab0e2b5aff971fd73106b916a352b7ac596 (HEAD -> feature/unique_network_flows, origin/feature/unique_network_flows) 16:14:54 Author: EduPonz eduardoponz@eprosima.com 16:14:54 16:14:54 Fix linting

https://ci.ros2.org/job/ci_linux/14224/consoleFull

But still had test failures:

https://ci.ros2.org/job/ci_linux/14224/testReport/(root)/projectroot/cpplint_2/ https://ci.ros2.org/job/ci_linux/14224/testReport/(root)/projectroot/uncrustify/

There were also failures in rmw_connextdds_common:

https://ci.ros2.org/job/ci_linux/14224/testReport/junit/(root)/projectroot/cpplint/

And cyclonedds:

https://ci.ros2.org/job/ci_linux/14224/testReport/junit/(root)/projectroot/cpplint_3/ https://ci.ros2.org/job/ci_linux/14224/testReport/junit/(root)/projectroot/uncrustify_2/


Also, the windows compiler error I expected is still there:

16:59:21 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(96,6): error C2375: 'rclcpp::operator ==': redefinition; different linkage (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

16:59:21 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(77): message : see declaration of 'rclcpp::operator ==' (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

16:59:21 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(100,6): error C2375: 'rclcpp::operator !=': redefinition; different linkage (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

16:59:21 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(78): message : see declaration of 'rclcpp::operator !=' (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

16:59:21 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(104,16): error C2375: 'rclcpp::operator <<': redefinition; different linkage (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]

16:59:21 C:\ci\ws\src\ros2\rclcpp\rclcpp\include\rclcpp/network_flow_endpoint.hpp(81): message : see declaration of 'rclcpp::operator <<' (compiling source file C:\ci\ws\src\ros2\rclcpp\rclcpp\src\rclcpp\time_source.cpp) [C:\ci\ws\build\rclcpp\rclcpp.vcxproj]
asorbini commented 3 years ago

I spawned some more CI to validate these changes based on @anamud's request in rmw_connextdds#13, because I failed to notice that @wjwwood had already done so.

Linking them here for reference, feel free to stop them if they won't provide any new information:

EduPonz commented 3 years ago

I've gone over the longer issues related with rwm_fastrtps again. Unfortunately, that's all that I can do, I think the rest is up to @anamud , who has write access to the affected respositories. Thanks guys again!

anamud commented 3 years ago

@wjwwood @EduPonz @asorbini I have fixed the latest errors pointed by the CI system and addressed review comments. Please have a look and start a new CI flow if feasible. Appreciate your help with CI runs.

wjwwood commented 3 years ago

CI:

anamud commented 3 years ago

@wjwwood The Windows build in the last CI flow failed due to errors rooted at missing forward declarations and incorrect scoping of friend functions. I have fixed the errors now and tested for correct working on a local Windows compilation setup.

Please start a new CI flow.

wjwwood commented 3 years ago

New CI:

wjwwood commented 3 years ago

There are some compiler warnings from Windows related to strncpy:

https://ci.ros2.org/job/ci_windows/14335/msbuild/new/

I think you could use https://github.com/ros2/rcutils/blob/5d3cecc8af601c10561f7c17d76eab740bb540ce/include/rcutils/snprintf.h#L59 instead of strncpy to avoid this warning without doing your own "if win32" kind of logic.

MiguelCompany commented 3 years ago

@wjwwood @anamud Just fixed the two warnings on rmw_fastrtps

anamud commented 3 years ago

@wjwwood @MiguelCompany I handled the remaining three warnings (all in rmw). See here. Thanks for the suggestion to use rcutils_snprintf().

Please start a new CI flow.

wjwwood commented 3 years ago

Thanks! New CI:

anamud commented 3 years ago

Thanks for the new CI run.

@wjwwood @MiguelCompany @EduPonz Please let me know how to proceed. I am unable to resolve the two new unstable markers. They are unrelated to unique network flows.

One of them is being discussed as a flaky test here.

The other is a CMake warning about policy CMP0115 in src/eclipse-cyclonedds/cyclonedds/src/idl/CMakeLists.txt at line 41.

wjwwood commented 3 years ago

I think both of those are known, so I believe this CI is good!

Thanks for iterating and for making such a neat and organized set of pull requests!

wjwwood commented 3 years ago

I've merged everything but this design document. I'll try to pick that up sometime soon, but would love someone else to review and merge it if they have time.

anamud commented 3 years ago

Excellent work, everyone! Thanks for the timely support, that too during the long Easter weekend.

wjwwood commented 3 years ago

Thanks to everyone for the discussion and thanks @anamud for the contributions!