ros2 / rcl

Library to support implementation of language specific ROS Client Libraries.
Apache License 2.0

Lifecycle node failed to make transition #1166

Open kjjpc opened 2 months ago

kjjpc commented 2 months ago

Bug report

Steps to reproduce issue

When I launch multiple lifecycle nodes using LaunchService, some nodes fail to be activated. The following script reproduces the issue.

import os
import signal

from launch import LaunchDescription
from launch.actions import EmitEvent
from launch.actions import RegisterEventHandler
from launch.event_handlers import OnProcessStart
import launch.events
from launch.launch_description_entity import LaunchDescriptionEntity
from launch.launch_service import LaunchService
import launch_ros
from launch_ros.actions import LifecycleNode
from launch_ros.events.lifecycle import ChangeState
from lifecycle_msgs.msg import Transition
from multiprocessing import Process
from typing import List

def up_launch(launch_description_list: List[LaunchDescriptionEntity]):
    ls = LaunchService()
    ls.include_launch_description(LaunchDescription(launch_description_list))
    ls.run()

def generate_launch_description(ns):
    lc_node = LifecycleNode(
        package='lifecycle_py',
        executable='lifecycle_talker',
        name='lc_talker',
        namespace=ns,
        output='screen',
    )
    emit_configure = RegisterEventHandler(
        OnProcessStart(
            target_action=lc_node,
            on_start=[
                EmitEvent(event=ChangeState(
                    lifecycle_node_matcher=launch.events.matches_action(lc_node),
                    transition_id=Transition.TRANSITION_CONFIGURE,
                )),
            ]))

    #  send activate when on_configure finished
    emit_active = RegisterEventHandler(
        launch_ros.event_handlers.OnStateTransition(
            target_lifecycle_node=lc_node,
            goal_state='inactive',
            # start_state='configuring',
            entities=[
                EmitEvent(event=ChangeState(
                    lifecycle_node_matcher=launch.events.matches_action(lc_node),
                    transition_id=Transition.TRANSITION_ACTIVATE,
                )),
            ],
        ))

    return [lc_node, emit_configure, emit_active]

if __name__ == '__main__':
    os.setpgrp()
    try:
        for i in range(12):
            ls = generate_launch_description('room'+str(i))
            launch_proc = Process(target=up_launch,
                                  args=(ls,),
                                  )
            launch_proc.start()
    except KeyboardInterrupt:
        print('group kill')
        os.killpg(0, signal.SIGINT)

Expected behavior

All lifecycle nodes become active.

Actual behavior

Occasionally, some nodes remain in the configured (inactive) state and never become active.

Additional information

I investigated the behavior and found that the lifecycle_node action of launch_ros fails to receive the message on the transition_event topic. The lifecycle_node action creates its subscriber just before calling change_state, so the subscription is not yet fully registered when the transition_event message is published.
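
To illustrate the race described above, here is a minimal toy model in plain Python. It does not use rcl, rclpy, or launch_ros; `ToyPublisher` and `late_subscriber_receives` are hypothetical names invented for this sketch. It only shows why a subscriber that is matched after the publish misses a volatile sample, while a transient-local publisher with a retained history would replay it.

```python
from collections import deque
from typing import Callable, Deque, List


class ToyPublisher:
    """Toy stand-in for a topic publisher; NOT the rcl or rclpy API."""

    def __init__(self, transient_local: bool, depth: int = 1) -> None:
        self.transient_local = transient_local
        # Samples retained for late joiners (only filled when transient_local).
        self.history: Deque[str] = deque(maxlen=depth)
        self.subscribers: List[Callable[[str], None]] = []

    def publish(self, msg: str) -> None:
        if self.transient_local:
            self.history.append(msg)
        # Only subscribers that are already matched receive the live sample.
        for callback in self.subscribers:
            callback(msg)

    def subscribe(self, callback: Callable[[str], None]) -> None:
        # A transient-local publisher replays retained samples to a late joiner.
        for msg in self.history:
            callback(msg)
        self.subscribers.append(callback)


def late_subscriber_receives(transient_local: bool) -> List[str]:
    """Publish first, subscribe afterwards, and return what was received."""
    received: List[str] = []
    pub = ToyPublisher(transient_local=transient_local)
    pub.publish('configuring -> inactive')  # event is published first
    pub.subscribe(received.append)          # subscription is matched too late
    return received


print(late_subscriber_receives(transient_local=False))  # []
print(late_subscriber_receives(transient_local=True))   # ['configuring -> inactive']
```

In the volatile case the event is lost, which corresponds to the OnStateTransition handler never firing and the node staying inactive.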

One solution is to make the transition_event topic reliable and transient local. The following changes resolve the issue.

Add the following lines to com_interface.c.

   rcl_publisher_options_t publisher_options = rcl_publisher_get_default_options();
+  publisher_options.qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;
+  publisher_options.qos.durability = RMW_QOS_POLICY_DURABILITY_TRANSIENT_LOCAL;
+  publisher_options.qos.depth = 1;
   rcl_ret_t ret = rcl_publisher_init(

Change the QoS setting in lifecycle_node.py of launch_ros.

         self.__rclpy_subscription = node.create_subscription(
             lifecycle_msgs.msg.TransitionEvent,
             '{}/transition_event'.format(self.node_name),
             functools.partial(self._on_transition_event, context),
-            10)
+            QoSProfile(depth=10, reliability=ReliabilityPolicy.RELIABLE,
+                       durability=DurabilityPolicy.TRANSIENT_LOCAL))

fujitatomoya commented 1 month ago

@kjjpc thanks for posting the issue and detailed report.

return [lc_node, emit_configure, emit_active]

as a possible work-around, changing this line to return [lc_node, emit_active, emit_configure] mitigates the issue in my local environment. (this registers the OnStateTransition event handler before the state is changed to configure.)

i am not sure whether this LaunchDescription order is a requirement for user applications, but i can see that the examples are written in this order, e.g. https://github.com/ros2/launch_ros/blob/dbc2bbc80d31ae932bdd7060669f58d0b6e39305/launch_ros/examples/lifecycle_pub_sub_launch.py#L79-L84

if this is the intended design for LaunchDescription, we would probably want to add a doc section about it. @wjwwood @adityapande-1995 what do you think?

Change the QoS setting in lifecycle_node.py of launch_ros.

we have to be careful with this. if this change is applied alone, it can break downstream launch applications because of incompatible QoS.
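
The incompatibility concern can be sketched with a toy request-vs-offered check in plain Python. This is not the rmw or rclpy API; `ToyQoS` and `is_compatible` are hypothetical names, and the rule modeled here is the usual DDS-style one: a subscription only matches a publisher that offers at least the reliability and durability it requests. The sketch assumes the unpatched rcl publisher uses the default profile (reliable, volatile).

```python
from dataclasses import dataclass
from enum import IntEnum


class Reliability(IntEnum):
    BEST_EFFORT = 0
    RELIABLE = 1


class Durability(IntEnum):
    VOLATILE = 0
    TRANSIENT_LOCAL = 1


@dataclass(frozen=True)
class ToyQoS:
    """Toy QoS profile; NOT rclpy.qos.QoSProfile."""
    reliability: Reliability
    durability: Durability


def is_compatible(offered: ToyQoS, requested: ToyQoS) -> bool:
    """The publisher must offer at least what the subscription requests."""
    return (offered.reliability >= requested.reliability
            and offered.durability >= requested.durability)


# Assumed default publisher profile (reliable, volatile), i.e. rcl unpatched.
unpatched_pub = ToyQoS(Reliability.RELIABLE, Durability.VOLATILE)
patched_pub = ToyQoS(Reliability.RELIABLE, Durability.TRANSIENT_LOCAL)

# Subscription before and after the proposed launch_ros change.
old_sub = ToyQoS(Reliability.RELIABLE, Durability.VOLATILE)
patched_sub = ToyQoS(Reliability.RELIABLE, Durability.TRANSIENT_LOCAL)

print(is_compatible(unpatched_pub, patched_sub))  # False: launch_ros change alone
print(is_compatible(patched_pub, old_sub))        # True: rcl change alone
print(is_compatible(patched_pub, patched_sub))    # True: both changes together
```

Under these assumptions, applying only the launch_ros subscription change would leave the subscription unmatched, so no transition events would arrive at all.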