Figure out how to migrate ROS1 latching topics to ROS2

Migrating ROS1 latching topics to ROS2 was raised as an issue at the FSW meeting on 1/9. The concern is that there is no ROS2 feature that exactly corresponds to ROS1 latching topics, and there are various possible approaches to achieve similar functionality. Creating this thread as one place to discuss.

To over-simplify a bit, the basic issue is that the closest ROS2 feature corresponding to ROS1 latching topics is using "transient local" DDS QoS. However, whereas in ROS1 you only need to declare that a topic is latching on the publisher side, in ROS2 you need to tell both sides to use transient local QoS. (This summary glosses over some other important issues!)

Here are some great background references about the topic:

https://github.com/ros2/ros2/issues/464 @wjwwood explains why this is a thorny issue. In short, DDS provides transient local QoS which is vaguely similar to ROS1 latching topics, but also has important differences, and there are good reasons to use DDS-native features rather than craft complex ROS2 abstractions on top of them to more precisely mimic older ROS1 functionality.
https://github.com/ros2/rosbag2/issues/601 One possible approach briefly discussed during the FSW meeting is enabling a subscriber to a topic to use an adaptive QoS that matches the QoS profile used by the publisher. This thread points to three existing implementations of adaptive QoS for subscribers and discusses feasibility of making that functionality more broadly available in ROS2.
https://github.com/ros2/rclcpp/issues/1868 This thread suggests that adaptive QoS for subscribers will eventually be a core ROS2 feature, but since it hasn't been completed yet, it's unlikely to become available in the ROS2 Humble distro we expect to use for our initial Astrobee ISS deployment of ROS2. In other words, it doesn't make sense to wait for the core feature to be available before tackling this issue.

Here's a solution approach I haven't heard offered yet:

Continue using latching topics (migrating to DDS QoS)
Bite the bullet on explicitly specifying the QoS in both publisher and subscriber
Guarantee QoS synchronization between publisher and subscriber by following a new Astrobee convention.

The new convention would be to declare the QoS in ff_names.h, where topic names are already declared. For example, in the following snippet, we would add the second line:

#define TOPIC_COMMAND                      "command"
#define TOPIC_COMMAND_LATCHING             true

Along with this convention, we could potentially define abstraction methods for creating publishers and subscribers that directly take the TOPIC_*_LATCHING macro as an argument. The same abstraction methods could potentially be available in both ROS1 and ROS2. In ROS2, both publisher and subscriber creation would use the latching argument, but in ROS1 the subscriber creation would ignore it.

Pros:

Simple and uses ROS2 in a standard way, not requiring bleeding-edge features or complex adaptive QoS logic.
Enables changing latching behavior for a topic with a code change in one place (ff_names.h) that automatically propagates to all publishers and subscribers.
If you read about adaptive QoS for subscribers, a lot of the complexity comes from worrying about what to do if there are multiple publishers on the same topic with different QoS, especially if late-joining publishers are possible. If we can use our conventions to guarantee only one QoS on each topic, we are just opting out of a whole lot of issues. (On the other hand, if we currently have topics where there are multiple publishers with some latching and some not and we really need to support that, it would be good to understand that. But I doubt it!)

Cons:

Requires a lot of boilerplate code changes (wherever subscribers are created).
Still doesn't precisely mimic ROS1 latching topic behavior. For example, a late-joining subscriber to a latched topic will receive N old messages (N = the publisher's history window) instead of 1 old message. It's possible to configure N=1 but that could affect performance in other ways. Need to think through whether this is a problem.

From my understanding, much of the reason we need latching topics in our code is because the startup order is non-deterministic and a node will publish an initial message before its subscribers are loaded. I know nothing about the ros2 lifecycle control, but can we use that to avoid the need for this use case?

I'd also ask how many latching topics we have (and how many of them actually need to be latching) which could affect how much effort we want to put into this vs. just using timers and repeating.

Another solution approach motivated by the discussion threads above:

Continue using latching topics (migrating to DDS QoS)
Use adaptive QoS for subscribers to choose their QoS to match the publishers

This thread https://github.com/ros2/rclcpp/issues/1868 points to implementations of adaptive QoS in rosbag2 and domain_bridge. The idea would be to provide our own adaptive QoS subscriber creation method that mimics their approach.

Pros:

Don't need to add latching information to ff_names.h. I was going to say it would save us from having to change each subscriber, but I'm not sure it does.

Cons:

This adaptive QoS stuff is pretty complicated and we would definitely want to do some testing to make sure we understand the behavior.
This seems like an evolving area of ROS2, suggesting the adaptive QoS implementation might break in future distros (not an urgent issue).
We could expect weird behavior if there are multiple publishers with some latching and some not. Hopefully that's not a real problem with the current system.

Solution approach #1 from Brian's comment above:

To the extent that we need latching topics because of non-deterministic startup order of nodes, avoid that need by using ROS2 lifecycle management features.

Pros:

Better use of ROS2 lifecycle management to make startup more deterministic seems like a great idea anyway. For example, we do observe failures to start up that can tentatively be blamed on race conditions.

Cons:

Needs further study to understand if this covers all current uses of latching topics.
The particular topics in question might or might not be a great fit for this management technique. ROS2 defines a small fixed number of lifecycle phases with conventional usage patterns, and we'd need to establish a convention of which phase is used to create subscribers vs. which phase is used to publish one-time messages which hopefully aligns with their standard conventions.
ROS2 launch files seem complicated and hard to debug. Might take us a while to get this working.

Solution approach 2 from Brian's comment above:

To the extent we need latching topics so that we can publish a message only once at startup without worrying that late-joining subscribers will miss it, avoid that need by repeatedly publishing the message on a timer.

Pros:

Timers seem to be better-understood than QoS profiles that simulate latching.

Cons:

Hides some complexity. If the message is now published on a timer, how fast does the timer need to repeat? Does the timer delay become one more delay in the total system startup time?
How much overhead is incurred by otherwise unnecessary repeated publishing (and potentially repeated logging)?
Are there cases where this approach would require the subscriber to add a state flag so it will only process the first message and ignore subsequent messages?

We should discuss on Tuesday. In the meantime, I went through the fsw and found all the latched topics to help with the discussion. I have tried to sort them by why they are latched and have included the node/file they are in.

Startup/State:

"heartbeat" (TOPIC_HEARTBEAT) - ff_nodelet(all), graph_localizer_nodelet, imu_augmentor_nodelet, ground_truth_localizer_nodelet
"mgt/sys_monitor/heartbeat" (TOPIC_MANAGEMENT_SYS_MONITOR_HEARTBEAT) - sys_monitor
"mgt/sys_monitor/config" (TOPIC_MANAGEMENT_SYS_MONITOR_CONFIG) - sys_monitor
"mgt/sys_monitor/state" (TOPIC_MANAGEMENT_SYS_MONITOR_STATE) - sys_monitor
"mgt/camera_state" (TOPIC_MANAGEMENT_CAMERA_STATE) - image_sampler
"loc/manager/state" (TOPIC_LOCALIZATION_MANAGER_STATE) - localization_manager_nodelet
"beh/perch/state" (TOPIC_BEHAVIORS_PERCHING_STATE) - perch_nodelet
"joint_states" (TOPIC_JOINT_STATES) - perching_arm_node, gazebo_model_plugin_perching_arm
"hw/pmc/state" (TOPIC_HARDWARE_PMC_STATE) - pmc_actuator_nodelet, gazebo_model_plugin_pmc
"beh/arm/state" (TOPIC_BEHAVIORS_ARM_STATE) - arm_nodelet
"mgt/access_control/state" (TOPIC_MANAGEMENT_ACCESS_CONTROL_STATE) - access_control, ground_dds_ros_bridge
"mgt/cpu_monitor/state" (TOPIC_MANAGEMENT_CPU_MONITOR_STATE) - cpu_mem_monitor
"mgt/mem_monitor/state" (TOPIC_MANAGEMENT_MEM_MONITOR_STATE) - cpu_mem_monitor
"mgt/disk_monitor/state" (TOPIC_MANAGEMENT_DISK_MONITOR_STATE) - disk_monitor_node
"mgt/data_bagger/state" (TOPIC_MANAGEMENT_DATA_BAGGER_STATE) - data_bagger
"mgt/executive/agent_state" (TOPIC_MANAGEMENT_EXEC_AGENT_STATE) - executive
"mob/state" (TOPIC_MOBILITY_MOTION_STATE) - choreographer_nodelet
"beh/dock/state" (TOPIC_BEHAVIORS_DOCKING_STATE) - dock_nodelet

Possibly Runtime

"parameter_descriptions" - config_server
"parameter_updates" - config_server
"signals" (TOPIC_SIGNALS) - states_nodelet
"mob/planner_qp/safe_flight_cooridor" - planner_qp
"mob/planner_qp/trajectory" - planner_qp
"mob/planner_qp/solver_info" - planner_qp
"joint_goals" (TOPIC_JOINT_GOALS) - arm_nodelet
"mgt/access_control/cmd" (TOPIC_MANAGEMENT_ACCESS_CONTROL_CMD) - access_control
"snapshot_trigger" - astrobee_recorder
"mgt/data_bagger/topics" (TOPIC_MANAGEMENT_DATA_BAGGER_TOPICS) - data_bagger
"mgt/executive/plan_status" (TOPIC_MANAGEMENT_EXEC_PLAN_STATUS) - executive
"mob/flight_mode" (TOPIC_MOBILITY_FLIGHT_MODE) - choreographer_nodelet
"mob/choreographer/segment" (TOPIC_MOBILITY_SEGMENT) - choreographer_nodelet
"mob/inertia" (TOPIC_MOBILITY_INERTIA) - choreographer_nodelet
"mob/mapper/zones" (TOPIC_MOBILITY_ZONES) - validator
"gnc/ctl/segment" (TOPIC_GNC_CTL_SEGMENT) - ctl
"robot_name" (TOPIC_ROBOT_NAME) - dds_ros_bridge
"mob/mapper/free_space_cloud" (TOPIC_MAPPER_OCTOMAP_FREE_CLOUD) - mapper_nodelet
"mob/mapper/obstacle_cloud" (TOPIC_MAPPER_OCTOMAP_CLOUD) - mapper_nodelet
"mob/mapper/inflated_free_space_cloud" (TOPIC_MAPPER_OCTOMAP_INFLATED_FREE_CLOUD) - mapper_nodelet
"mob/mapper/inflated_obstacle_cloud" (TOPIC_MAPPER_OCTOMAP_INFLATED_CLOUD) - mapper_nodelet

Simulation

"hw/laser/rviz" (TOPIC_HARDWARE_LASER_RVIZ) - gazebo_model_plugin_laser
"hw/lights/rviz" (TOPIC_HARDWARE_LIGHTS_RVIZ) - gazebo_model_plugin_flashlight

Tools

"command" (TOPIC_COMMAND) - interactive_marker_teleop, zones_pub, simple_move, plan_pub, teleop_tool, data_to_disk_pub

Ground, testing, display:

"vive" - vive_offine
"ekf" - vive_offline
"hw/vive/lighthouses" (TOPIC_HARDWARE_VIVE_LIGHTHOUSES) - vive_nodelet
"hw/vive/trackers" (TOPIC_HARDWARE_VIVE_TRACKERS) - vive_nodelet
"gnc/ekf" (TOPIC_GNC_EKF) - ground_dds_ros_bridge
"gs/data" (TOPIC_GUEST_SCIENCE_DATA) - ground_dds_ros_bridge
"sparse_mapping/map_cloud" - sparse_mapping_display
"mob/planner_qp/trajectory_debug" - planner_qp

Marina and I went through the topics to see which ones did not need to be latched. There are still some we need to look into but the rest have been labeled with whether or not they are needed and whether they need to have a QoS setting or be replaced with a timer. Most topics are latched due to startup or communication with the HLP. If we end up having issues with the QoS settings, Marina and I discussed replacing them with timers and in the case of latching only for startup, using the system monitor state to stop the timers once the system is successfully running. The check boxes should be crossed off once the code is changed to reflect our decisions.

Startup/State:

[x] "heartbeat" (TOPIC_HEARTBEAT) - ff_nodelet(all), graph_localizer_nodelet, imu_augmentor_nodelet, ground_truth_localizer_nodelet | Timer (Change ff component to start heartbeat even if there is an initialization fault)
[x] "mgt/sys_monitor/heartbeat" (TOPIC_MANAGEMENT_SYS_MONITOR_HEARTBEAT) - sys_monitor | QoS
[x] "mgt/sys_monitor/config" (TOPIC_MANAGEMENT_SYS_MONITOR_CONFIG) - sys_monitor | QoS
[x] "mgt/sys_monitor/state" (TOPIC_MANAGEMENT_SYS_MONITOR_STATE) - sys_monitor | QoS
[x] "mgt/camera_state" (TOPIC_MANAGEMENT_CAMERA_STATE) - image_sampler | QoS
[x] "loc/manager/state" (TOPIC_LOCALIZATION_MANAGER_STATE) - localization_manager_nodelet QoS
[x] "beh/perch/state" (TOPIC_BEHAVIORS_PERCHING_STATE) - perch_nodelet | QoS
[x] "joint_states" (TOPIC_JOINT_STATES) - perching_arm_node, gazebo_model_plugin_perching_arm | Not needed - published frequently
[x] "hw/pmc/state" (TOPIC_HARDWARE_PMC_STATE) - pmc_actuator_nodelet, gazebo_model_plugin_pmc | Not needed - published frequently
[x] "beh/arm/state" (TOPIC_BEHAVIORS_ARM_STATE) - arm_nodelet | QoS
[x] "mgt/access_control/state" (TOPIC_MANAGEMENT_ACCESS_CONTROL_STATE) - access_control, ground_dds_ros_bridge | QoS
[x] "mgt/cpu_monitor/state" (TOPIC_MANAGEMENT_CPU_MONITOR_STATE) - cpu_mem_monitor | Not needed - published frequently
[x] "mgt/mem_monitor/state" (TOPIC_MANAGEMENT_MEM_MONITOR_STATE) - cpu_mem_monitor | Not needed - published frequently
[x] "mgt/disk_monitor/state" (TOPIC_MANAGEMENT_DISK_MONITOR_STATE) - disk_monitor_node | Not needed - published frequently
[x] "mgt/data_bagger/state" (TOPIC_MANAGEMENT_DATA_BAGGER_STATE) - data_bagger | QoS
[x] "mgt/executive/agent_state" (TOPIC_MANAGEMENT_EXEC_AGENT_STATE) - executive | QoS
[x] "mob/state" (TOPIC_MOBILITY_MOTION_STATE) - choreographer_nodelet | QoS
[x] "beh/dock/state" (TOPIC_BEHAVIORS_DOCKING_STATE) - dock_nodelet | QoS

Possibly Runtime

[x] "parameter_descriptions" - config_server | Qos
[x] "parameter_updates" - config_server | QoS
[x] "signals" (TOPIC_SIGNALS) - states_nodelet | QoS
[x] "mob/planner_qp/safe_flight_cooridor" - planner_qp | Not needed
[x] "mob/planner_qp/trajectory" - planner_qp | Not needed
[x] "mob/planner_qp/solver_info" - planner_qp | Not needed
[x] "joint_goals" (TOPIC_JOINT_GOALS) - arm_nodelet | Not needed - only published when goal cmds are issued, so perch beh subscriber should be up
[x] "mgt/access_control/cmd" (TOPIC_MANAGEMENT_ACCESS_CONTROL_CMD) - access_control | Not needed
[x] "snapshot_trigger" - astrobee_recorder | Not needed - publisher and subscriber in same node, empty service
[x] "mgt/data_bagger/topics" (TOPIC_MANAGEMENT_DATA_BAGGER_TOPICS) - data_bagger | QoS
[x] "mgt/executive/plan_status" (TOPIC_MANAGEMENT_EXEC_PLAN_STATUS) - executive | QoS
[x] "mob/flight_mode" (TOPIC_MOBILITY_FLIGHT_MODE) - choreographer_nodelet | QoS
[x] "mob/choreographer/segment" (TOPIC_MOBILITY_SEGMENT) - choreographer_nodelet | QoS
[x] "mob/inertia" (TOPIC_MOBILITY_INERTIA) - choreographer_nodelet | QoS
[x] "mob/mapper/zones" (TOPIC_MOBILITY_ZONES) - validator | QoS
[x] "gnc/ctl/segment" (TOPIC_GNC_CTL_SEGMENT) - ctl | QoS
[x] "robot_name" (TOPIC_ROBOT_NAME) - dds_ros_bridge | QoS
[x] "mob/mapper/free_space_cloud" (TOPIC_MAPPER_OCTOMAP_FREE_CLOUD) - mapper_nodelet | Not needed - no subscribers
[x] "mob/mapper/obstacle_cloud" (TOPIC_MAPPER_OCTOMAP_CLOUD) - mapper_nodelet | Not needed - no subscribers
[x] "mob/mapper/inflated_free_space_cloud" (TOPIC_MAPPER_OCTOMAP_INFLATED_FREE_CLOUD) - mapper_nodelet | Not needed - no subscribers , info obtained by qp planner through service
[x] "mob/mapper/inflated_obstacle_cloud" (TOPIC_MAPPER_OCTOMAP_INFLATED_CLOUD) - mapper_nodelet | Not needed - no subscribers, info obtained by qp planner through service

Simulation

[x] "hw/laser/rviz" (TOPIC_HARDWARE_LASER_RVIZ) - gazebo_model_plugin_laser | QoS - used by rviz
[x] "hw/lights/rviz" (TOPIC_HARDWARE_LIGHTS_RVIZ) - gazebo_model_plugin_flashlight | QoS - used by rviz

Tools

[x] "command" (TOPIC_COMMAND) - interactive_marker_teleop, zones_pub, simple_move, plan_pub, teleop_tool, data_to_disk_pub | not needed (make sure there is subscriber before publishing)

Ground, testing, display:

[ ] "vive" - vive_offine | republish
[ ] "ekf" - vive_offline | republish
[ ] "hw/vive/lighthouses" (TOPIC_HARDWARE_VIVE_LIGHTHOUSES) - vive_nodelet | republish
[ ] "hw/vive/trackers" (TOPIC_HARDWARE_VIVE_TRACKERS) - vive_nodelet | republish
[x] "gnc/ekf" (TOPIC_GNC_EKF) - ground_dds_ros_bridge | not needed
[x] "gs/data" (TOPIC_GUEST_SCIENCE_DATA) - ground_dds_ros_bridge | not needed (start all ground nodes at same time)
[x] "sparse_mapping/map_cloud" - sparse_mapping_display | not needed
[x] "mob/planner_qp/trajectory_debug" - planner_qp | not needed

For anybody who has not been involved in meetings, I wanted to clarify my understanding of our intended approach:

We plan to follow more or less the first approach presented in this thread. We'll implement latched topics using DDS transient local QoS, and use a project-level convention to ensure that there is a consistent QoS for each topic that is used by all publishers and subscribers. We will write custom wrapper code that implements the convention (creates a publisher or subscriber with the right QoS).
New: Since only a small minority of topics are latching, to avoid the overhead of requiring a LATCHING declaration for each topic in ff_names.h, we'll use a "sparse declaration" approach where basically we'll provide a list of latching topic substrings. Any topic containing one of the substrings in the (short) latching list will be considered to be latching; otherwise the default is non-latching.
Katie's audit of latching message topics also found some that Marina thought should always have been non-latching, so we should probably discuss further whether or not to switch those over. (If a topic should be non-latching but is latching, typically the main impact is a modest performance overhead. On the other hand, if a topic should be latching but is non-latching, typically that might cause hard-to-reproduce reliability problems when messages are dropped due to race conditions. So we may want to be careful here.)
And part of Katie's most recent post on this thread is discussing a fallback plan in case the QoS latching approach doesn't work as expected. As of now, plan A is to use QoS and her proposed plan B is to repeatedly publish messages using a timer but in a clever way that minimizes overhead.

Please chime in if you notice any errors there...

Here's a bit more thinking on switching some topics from latching to non-latching.

Above I noted that making a mistake will probably have greater negative impact if the mistake moves in the latching to non-latching direction (vs. non-latching to latching). The impact might be reduced reliability especially at startup, with the problems being hard to reproduce for debugging.

One other point to consider is that ideally we want to minimize behavior differences between the ROS1 and ROS2 versions. If we really think it's important to make a message topic non-latching, maybe we should initiate that change on the develop branch (where we can test it more thoroughly) then port it to the ros2 branch. The latching switch doesn't have to be part of the ROS2 transition.

This reasoning pushes in the direction of mechanically turning all ROS1 latching topics to ROS2 latching topics, rather than taking time to do a bunch of case-by-case investigations to see if some topics could become non-latching.

Now, if our selected QoS approach for simulating ROS1 latching in ROS2 doesn't work reliably or has other downsides we haven't discovered yet, we probably would need more case-by-case study to figure out the right way to implement the required functionality for each topic. But if the QoS approach does work well, I would suggest mechanically keeping exactly the same topics latching across the ROS2 transition.

nasa / astrobee

Figure out how to migrate ROS1 latching topics to ROS2 #628