nasa / astrobee

NASA Astrobee Robot Software
https://www.nasa.gov/astrobee
Apache License 2.0
1.03k stars 311 forks source link

Figure out how to migrate ROS1 latching topics to ROS2 #628

Open trey0 opened 1 year ago

trey0 commented 1 year ago

Migrating ROS1 latching topics to ROS2 was raised as an issue at the FSW meeting on 1/9. The concern is that there is no ROS2 feature that exactly corresponds to ROS1 latching topics, and there are various possible approaches to achieve similar functionality. Creating this thread as one place to discuss.

To over-simplify a bit, the basic issue is that the closest ROS2 feature corresponding to ROS1 latching topics is using "transient local" DDS QoS. However, whereas in ROS1 you only need to declare that a topic is latching on the publisher side, in ROS2 you need to tell both sides to use transient local QoS. (This summary glosses over some other important issues!)

Here are some great background references about the topic:

trey0 commented 1 year ago

Here's a solution approach I haven't heard offered yet:

The new convention would be to declare the QoS in ff_names.h, where topic names are already declared. For example, in the following snippet, we would add the second line:

#define TOPIC_COMMAND                      "command"
#define TOPIC_COMMAND_LATCHING             true

Along with this convention, we could potentially define abstraction methods for creating publishers and subscribers that directly take the TOPIC_*_LATCHING macro as an argument. The same abstraction methods could potentially be available in both ROS1 and ROS2. In ROS2, both publisher and subscriber creation would use the latching argument, but in ROS1 the subscriber creation would ignore it.

Pros:

Cons:

bcoltin commented 1 year ago

From my understanding, much of the reason we need latching topics in our code is because the startup order is non-deterministic and a node will publish an initial message before its subscribers are loaded. I know nothing about the ros2 lifecycle control, but can we use that to avoid the need for this use case?

I'd also ask how many latching topics we have (and how many of them actually need to be latching) which could affect how much effort we want to put into this vs. just using timers and repeating.

trey0 commented 1 year ago

Another solution approach motivated by the discussion threads above:

This thread https://github.com/ros2/rclcpp/issues/1868 points to implementations of adaptive QoS in rosbag2 and domain_bridge. The idea would be to provide our own adaptive QoS subscriber creation method that mimics their approach.

Pros:

Cons:

trey0 commented 1 year ago

Solution approach #1 from Brian's comment above:

Pros:

Cons:

trey0 commented 1 year ago

Solution approach 2 from Brian's comment above:

Pros:

Cons:

kbrowne15 commented 1 year ago

We should discuss on Tuesday. In the meantime, I went through the fsw and found all the latched topics to help with the discussion. I have tried to sort them by why they are latched and have included the node/file they are in.

Startup/State:

Possibly Runtime

Simulation

Tools

Ground, testing, display:

kbrowne15 commented 1 year ago

Marina and I went through the topics to see which ones did not need to be latched. There are still some we need to look into but the rest have been labeled with whether or not they are needed and whether they need to have a QoS setting or be replaced with a timer. Most topics are latched due to startup or communication with the HLP. If we end up having issues with the QoS settings, Marina and I discussed replacing them with timers and in the case of latching only for startup, using the system monitor state to stop the timers once the system is successfully running. The check boxes should be crossed off once the code is changed to reflect our decisions.

Startup/State:

Possibly Runtime

Simulation

Tools

Ground, testing, display:

trey0 commented 1 year ago

For anybody who has not been involved in meetings, I wanted to clarify my understanding of our intended approach:

Please chime in if you notice any errors there...

trey0 commented 1 year ago

Here's a bit more thinking on switching some topics from latching to non-latching.

Above I noted that making a mistake will probably have greater negative impact if the mistake moves in the latching to non-latching direction (vs. non-latching to latching). The impact might be reduced reliability especially at startup, with the problems being hard to reproduce for debugging.

One other point to consider is that ideally we want to minimize behavior differences between the ROS1 and ROS2 versions. If we really think it's important to make a message topic non-latching, maybe we should initiate that change on the develop branch (where we can test it more thoroughly) then port it to the ros2 branch. The latching switch doesn't have to be part of the ROS2 transition.

This reasoning pushes in the direction of mechanically turning all ROS1 latching topics to ROS2 latching topics, rather than taking time to do a bunch of case-by-case investigations to see if some topics could become non-latching.

Now, if our selected QoS approach for simulating ROS1 latching in ROS2 doesn't work reliably or has other downsides we haven't discovered yet, we probably would need more case-by-case study to figure out the right way to implement the required functionality for each topic. But if the QoS approach does work well, I would suggest mechanically keeping exactly the same topics latching across the ROS2 transition.