Problem with _register_node_api()

ros / ros_comm

ROS communications-related packages, including core client libraries (roscpp, rospy, roslisp) and graph introspection tools (rostopic, rosnode, rosservice, rosparam).

http://wiki.ros.org/ros_comm

752 stars 911 forks source link

Problem with _register_node_api() #2241

Open EdAtHooked opened 2 years ago

EdAtHooked commented 2 years ago

I'm seeing an issue where some of my nodes are being shut down with the following message:

<node>_sim-31.log:[rospy.impl.masterslave][INFO] 2022-05-10 17:05:42,835: external shutdown request from [/master]: [<other_node>] Reason: new node registered with same name

I believe I've tracked down the issue to the following sequence:

Node A is started, and given a port (lets call that port A)
Node A stops
Node B is started, and is given port A (which is fine, since there is currently not a running node with that port)
Node A is started again.

At step 4, _register_node_api() sends a shutdown request to port A, which is now in use by Node B, which results in node B being shut down. It does this because it still has an entry in the nodes dict for node B even thought that node has stopped.

This happens with some frequency in our system, any suggestions for how to fix it?

EdAtHooked commented 2 years ago

Looking at the code a bit more:

it appears that the XMLRPC port assignments are handled by the underlying linux layer, which means that the master doesn't know that the port has been reassigned, so there's no way to clean up the old node mappings. i suspect that the issue above will only occur if the node isn't shut down cleanly
using rosnode cleanup, or calling cleanup_master_whitelist doesn't help. i suspect this is because the param_subscribers are added to the nodes map in the registration code but are not unregistered by the cleanup code. this means that node_ref.is_empty() check is never true and the node entry is never deleted

AlexReimann commented 1 year ago

We got following curious log output:

WARN|1668152473.387696692|/velocity_controller|ros.roscpp: Reason given for shutdown: [[/front_cam/image_to_rtsp_roi_stream] Reason: new node registered with same name]

As can be seen the node is shutdown because of "new node registered with same name", but the name is actually different (velocity_controller vs image_to_rtsp_roi_stream).

I suspect there's some race condition with the self.nodes of rosmaster/RegistrationManager and some mutexing is missing.

AlexReimann commented 1 year ago

I'd say there is a lock missing here:

https://github.com/ros/ros_comm/blob/dd78ac8af128bb8eb992d6431bb9f994658ea6ab/tools/rosmaster/src/rosmaster/paramserver.py#L245-L257

Also the ParamDictionary uses the RegistrationManager at the same time as the ROSMasterHandler, but they use different locks. I guess the locks rather should move downwards into the RegistrationManager or they should be shared between the two.

Edit: There's a lock in the ROSMasterHandler properly locking all instances where ParamDictionary potentially modifies the RegistrationManager

AlexReimann commented 1 year ago

Looking at the code a bit more:

* it appears that the XMLRPC port assignments are handled by the underlying linux layer, which means that the master doesn't know that the port has been reassigned, so there's no way to clean up the old node mappings. i suspect that the issue above will only occur if the node isn't shut down cleanly

* using rosnode cleanup, or calling cleanup_master_whitelist doesn't help. i suspect this is because the param_subscribers are added to the nodes map in the registration code but are not unregistered by the cleanup code. this means that node_ref.is_empty() check is never true and the node entry is never deleted

So, yeah I came to the same conclusion now as you.

What do you think about checking if any other element of self.nodes has a different id but the api we want to shutdown? If so we don't send the shutdown signal, but just issue a warning?

EdAtHooked commented 1 year ago

My solution was to fix rosnode cleanup, so that it correctly unsubscribes the parameter subscribers. This appears to allow the node entry to properly be deleted in _unsubscribe (rosmaster/registrations.py). You still have to detect that your node died and call one of the cleanup functions to avoid the issue, but that wasn't too much of a problem for us. My suggestion would be that you check for dead nodes and clean them up before launching any new nodes. What we do is:

use rosnode.rosnode_ping_all() to find any nodes that don't respond to the ping request, they're probably dead
use my modified cleanup_master_whitelist to clean up the rosmaster internals
launch any new node(s)

Here is the code I added to cleanup_master_whitelist, it should probably be added to cleanup_master_blacklist as well, and rolled into the main branch, but I didn't have the time to create a PR, etc.

    psubs = master.getParamSubscribers()
    for k, l in psubs:
        for n in l:
            if n not in whitelist:
                node_api = rosnode.get_api_uri(master, n)
                master_n = rosgraph.Master(n)
                master_n.unsubscribeParam(node_api, k)

Ideally, rosmaster would handle all of this stuff internally every time a new node is launched. I believe this is the only way to avoid the underlying issue.

AlexReimann commented 1 year ago

Running with this now, will see if it solves the issue:

diff --git a/tools/rosmaster/src/rosmaster/registrations.py b/tools/rosmaster/src/rosmaster/registrations.py
index f7e8d124d..aadff915f 100644
--- a/tools/rosmaster/src/rosmaster/registrations.py
+++ b/tools/rosmaster/src/rosmaster/registrations.py
@@ -32,10 +32,14 @@
 #
 # Revision $Id$

+import logging
 from rosmaster.util import remove_server_proxy
 from rosmaster.util import xmlrpcapi
 import rosmaster.exceptions

+
+_logger = logging.getLogger("rosmaster.master")
+
 """Data structures for representing registration data in the Master"""

 class NodeRef(object):
@@ -461,6 +465,18 @@ class RegistrationManager(object):
         if node_ref is not None:
             if node_ref.api == caller_api:
                 return node_ref, False
+
+            other_node_with_api = None
+            for node in self.nodes:
+                if node.id != caller_api and node.api == caller_api:
+                    other_node_with_api = node.id
+                    break
+
+            if other_node_with_api:
+                log_warn = "Got registration for already on {} registered api. Removing old registration of Node {}, as it probably is a zombie".format(other_node_with_api, caller_id)
+                _logger.warn(log_warn)
+                print(log_warn)
+                del self.nodes[caller_id]
             else:
                 bumped_api = node_ref.api
                 self.thread_pool.queue_task(bumped_api, shutdown_node_task,

AlexReimann commented 8 months ago

Here is the code I added to cleanup_master_whitelist, it should probably be added to cleanup_master_blacklist as well, and rolled into the main branch, but I didn't have the time to create a PR, etc.
    psubs = master.getParamSubscribers()
    for k, l in psubs:
        for n in l:
            if n not in whitelist:
                node_api = rosnode.get_api_uri(master, n)
                master_n = rosgraph.Master(n)
                master_n.unsubscribeParam(node_api, k)

@EdAtHooked Where do you get the psubs = master.getParamSubscribers() from? There is no getParamSubscribers in the noetic-devel branch?
It seems like there is no existing api to get the information which node is subscribed to which parameters?