micro-ROS / rmw_microxrcedds

RMW implementation using Micro XRCE-DDS middleware.
Apache License 2.0
35 stars 27 forks source link

Reconnecting to agent after multiple disconnects #241

Closed ted-miller closed 2 years ago

ted-miller commented 2 years ago

Issue template

Industrial robots are generally left on (without rebooting) for very long periods of time. So, I need to be able to detect if the Agent application is terminated on the PC. In that scenario, I shut down the node on the robot and wait for the Agent to come back online. When it does, I'll restart the node with a new connection. (I'm using your ping function to detect the status of the Agent.)

I found that I could do this process twice. But, then it would fail on the third attempt. One of the init functions would return a general error (RMW_RET_ERROR).

Ultimately, I found that get_memory() in memory.c was returning a NULL. To solve the problem, I simply rebuilt rmw-microxrcedds using RMW_UXRCE_ALLOW_DYNAMIC_ALLOCATIONS=ON.

So, I'm running just fine. But, it makes me suspicious that there is a memory leak.

The issue was discovered on the Motoman robot. But, I was able to reproduce on the PC using the micro-ros-demos-rclc.

I modified the ping_uros_agent so that it sits in a loop and tries to continually reconnect. modified_ping_uros_agent.zip

Steps to reproduce the issue

Setup fresh uros workspace by following the basic tutorial.

<source galactic>
ted@ted-Ubuntu:~/microros_ws_fresh$ source install/local_setup.bash
ted@ted-Ubuntu:~/microros_ws_fresh$ export RMW_IMPLEMENTATION=rmw_microxrcedds
ted@ted-Ubuntu:~/microros_ws_fresh$ ros2 run micro_ros_demos_rclc ping_uros_agent interactive

In another terminal, start the Agent. The first terminal will indicate that it is connected. After a couple seconds, ctrl-c the agent to stop it. Wait for the first terminal to indicate that it is looking for the agent. Then restart the agent in the other terminal. Repeat this process multiple times.

In this case, it failed on the fifth attempt.

Actual behavior

Console output

``` ted@ted-Ubuntu:~/microros_ws_fresh$ ros2 run micro_ros_demos_rclc ping_uros_agent interactive Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648217900.259890517] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 155: 1. Continuing. Failed status on line 156: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648217918.390578284] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 155: 1. Continuing. Failed status on line 156: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648217934.513469052] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 155: 1. Continuing. Failed status on line 156: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648217947.634014704] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 155: 1. Continuing. Failed status on line 156: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [rcutils|error_handling.c:65] an error string (message, file name, or formatted message) will be truncated [ERROR] [1648217962.759251750] [rclc]: [rclc_timer_init_default] Error in rcl_timer_init: error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:3, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/guard_condition.c:91 Failed status on line 130: 1. Aborting. ted@ted-Ubuntu:~/microros_ws_fresh$ ```

Additional information

In the console output above, I do see that two of the fini functions failed each cycle.

Failed status on line 155: 1. Continuing.
Failed status on line 156: 1. Continuing.

This corresponds to:

            RCSOFTCHECK(rcl_publisher_fini(&publisher, &node));
            RCSOFTCHECK(rcl_node_fini(&node));

Perhaps this is the source of the problem. I have not yet traced down those errors. So, I'll start with those functions to determine why they're returning error codes. But, if you have any advice on what else could be the cause, I would love some guidance.

Additionally, if there is something wrong with my approach to "shutting down" after the agent is gone, please let me know.

Thanks!

pablogs9 commented 2 years ago

I've fixed the example: https://github.com/micro-ROS/micro-ROS-demos/pull/58

Please let us know if rcl_timer_fini(&timer); solves this.

ted-miller commented 2 years ago

I updated the above code with the changes in https://github.com/micro-ROS/micro-ROS-demos/pull/58

UpdateCodeWithPr58.zip

Running it the same as described above produced mixed results. After killing the agent for the fifth time, a new error appears: [rcutils|error_handling.c:65] an error string (message, file name, or formatted message) will be truncated. However, the node is able to initialize during the next iteration of the loop. So, I don't know if this is related or not. (The new error message persists for each iteration after that.)

Full Output

``` ted@ted-Ubuntu:~/microros_ws_fresh$ ros2 run micro_ros_demos_rclc ping_uros_agent interactive 2>&1 Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648566660.823875181] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 150: 1. Continuing. Failed status on line 153: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648566675.845733235] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 150: 1. Continuing. Failed status on line 153: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648566691.867579349] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 150: 1. Continuing. Failed status on line 153: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648566707.886723162] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Failed status on line 150: 1. Continuing. Failed status on line 153: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648566722.909127476] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... [rcutils|error_handling.c:65] an error string (message, file name, or formatted message) will be truncated Failed status on line 150: 1. Continuing. [rcutils|error_handling.c:65] an error string (message, file name, or formatted message) will be truncated Failed status on line 153: 1. Continuing. Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first ```

Another thing that I noticed is that you changed the order of your fini commands in the updated example. Was this intentional?

I seem to recall running into an issue in the past with destroying stuff out of order. (This is why I'm suspicious of this change.) Currently, in my robot code, I'm killing stuff in this order:

Does it matter that the executor is getting destroyed first?

ted-miller commented 2 years ago

I enabled RMW_UROS_ERROR_HANDLING=ON and setup a callback with rmw_uros_set_error_handling_callback to print any error message to stdout.

Full output

``` ted@ted-Ubuntu:~/microros_ws_fresh$ ros2 run micro_ros_demos_rclc ping_uros_agent interactive Please, start your micro-ROS Agent first Please, start your micro-ROS Agent first [INFO] [1648584288.562769630] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Failed status on line 162: 1. Continuing. >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Failed status on line 165: 1. Continuing. Please, start your micro-ROS Agent first [INFO] [1648584299.588277535] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Failed status on line 162: 1. Continuing. >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Failed status on line 165: 1. Continuing. [INFO] [1648584308.607342778] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Failed status on line 162: 1. Continuing. >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Failed status on line 165: 1. Continuing. [INFO] [1648584318.630913240] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Failed status on line 162: 1. Continuing. >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Failed status on line 165: 1. Continuing. [INFO] [1648584328.650820682] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Failed status on line 162: 1. Continuing. >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Failed status on line 165: 1. Continuing. Please, start your micro-ROS Agent first [INFO] [1648584341.674771770] []: Created a timer with period 1000 ms. Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... Agent is still up! Sent: 'This is a message sent after pinging micro-ROS Agent' Pinging agent again... micro-ROS agent has stopped. Exiting... >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/node.c:384' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Issues running micro XRCE-DDS session Failed status on line 162: 1. Continuing. >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at /home/jenkins-agent/workspace/ci_packaging_linux/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:185' with this new error message: 'Issues running micro XRCE-DDS session, at /home/ted/microros_ws_fresh/src/uros/rmw_microxrcedds/rmw_microxrcedds_c/src/utils.c:43' rcutils_reset_error() should be called after error handling to avoid this. <<< Issues running micro XRCE-DDS session Failed status on line 165: 1. Continuing. ```

pablogs9 commented 2 years ago

@Acuadros95 can you take a look at that?

Acuadros95 commented 2 years ago

Ultimately, I found that get_memory() in memory.c was returning a NULL.

This should not happen, did rcl_timer_fini(&timer); solve this? Please set RMW_UXRCE_ALLOW_DYNAMIC_ALLOCATIONS=OFF again and try to replicate this error, as the fini methods should release all the memory blocks.

But, from the output, it seems the fini failures are just general errors because the agent is down. Nothing obviously unexpected.

Exactly, rcl_publisher_fini and rcl_node_fini return RCL_RET_ERROR after the destroy operation on the agent timeouts. We should modify this and return RCL_RET_TIMEOUT and use the error return for actual memory problems.

You can also set RMW_UXRCE_ENTITY_DESTROY_TIMEOUT=0 to use best effort communication on the fini methods, which will send the destroy message to the agent, release local resources and return immediately. But its probably better to just ignore this error if blocking time on this methods is not relevant.

(I am still curious about whether the order of destruction matters.)

It does, but only this limitations apply:

  • Destroy a node after all its related entities (Publishers, subscribers, services, …) have been destroyed.
  • Destroy the support object after all the related nodes are destroyed.
ted-miller commented 2 years ago

Ultimately, I found that get_memory() in memory.c was returning a NULL.

This should not happen, did rcl_timer_fini(&timer); solve this?

I just ran the tests again and can confirm that adding rcl_timer_fini does resolve the get_memory() problem. (I added a print command to get_memory for confirmation.) So... I guess that means I'm not disposing of something properly in my robot code.

(I am still curious about whether the order of destruction matters.)

It does, but only this limitations apply: Destroy a node after all its related entities (Publishers, subscribers, services, …) have been destroyed. Destroy the support object after all the related nodes are destroyed.

Good to know.

Thanks again for the support!

asasine commented 1 year ago

But, from the output, it seems the fini failures are just general errors because the agent is down. Nothing obviously unexpected.

Exactly, rcl_publisher_fini and rcl_node_fini return RCL_RET_ERROR after the destroy operation on the agent timeouts. We should modify this and return RCL_RET_TIMEOUT and use the error return for actual memory problems.

@Acuadros95 was this change ever made?

Acuadros95 commented 1 year ago

@asasine I am afraid that its not applied.

The timeout stiuation is correctly handle internally at the rmw level. But on the rcl layer, rcl_publisher_fini and similar methods will still return RCL_RET_ERROR on both cases (Timeout, or actual error).

asasine commented 1 year ago

Thanks for getting back to me. Is there any possibility of this change being applied in the future?