EdAtHooked opened this issue 2 years ago
We got the following curious log output:
WARN|1668152473.387696692|/velocity_controller|ros.roscpp: Reason given for shutdown: [[/front_cam/image_to_rtsp_roi_stream] Reason: new node registered with same name]
As can be seen, the node is shut down because of "new node registered with same name", but the name is actually different (velocity_controller vs image_to_rtsp_roi_stream).
I suspect there's some race condition with the self.nodes of rosmaster/RegistrationManager and some mutexing is missing. I'd say there is a lock missing here:
Also, the ParamDictionary uses the RegistrationManager at the same time as the ROSMasterHandler, but they use different locks. I guess the locks should rather be moved down into the RegistrationManager, or they should be shared between the two.
Edit: There's a lock in the ROSMasterHandler properly locking all instances where the ParamDictionary potentially modifies the RegistrationManager.
Looking at the code a bit more:
* It appears that the XMLRPC port assignments are handled by the underlying Linux layer, which means that the master doesn't know that the port has been reassigned, so there's no way to clean up the old node mappings. I suspect that the issue above will only occur if the node isn't shut down cleanly.
* Using rosnode cleanup, or calling cleanup_master_whitelist, doesn't help. I suspect this is because the param_subscribers are added to the nodes map in the registration code but are not unregistered by the cleanup code. This means that the node_ref.is_empty() check is never true and the node entry is never deleted.
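The param-subscriber leak described in the second point can be sketched with a toy model. Note that the class and field names below are hypothetical simplifications; the real bookkeeping lives in NodeRef in rosmaster/registrations.py and tracks more registration types:

```python
# Toy model of rosmaster's per-node bookkeeping (hypothetical names,
# not the real NodeRef class).
class ToyNodeRef:
    def __init__(self):
        self.topic_pubs = set()
        self.topic_subs = set()
        self.param_subs = set()  # parameter subscriptions

    def is_empty(self):
        # The master only deletes a node entry once every registration
        # is gone, so a leaked param subscription keeps it alive.
        return not (self.topic_pubs or self.topic_subs or self.param_subs)


ref = ToyNodeRef()
ref.topic_pubs.add("/front_cam/image")
ref.param_subs.add("/front_cam/exposure")

# A cleanup that only unregisters topics (as described above) leaves
# the parameter subscription behind:
ref.topic_pubs.clear()
ref.topic_subs.clear()

print(ref.is_empty())  # False: the stale node entry is never deleted
```

This is why the node entry survives rosnode cleanup: one leaked param subscription is enough to keep is_empty() false forever.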
So, yeah I came to the same conclusion now as you.
What do you think about checking if any other element of self.nodes has a different id but the same api we want to shut down? If so, we don't send the shutdown signal but just issue a warning?
My solution was to fix rosnode cleanup so that it correctly unsubscribes the parameter subscribers. This appears to allow the node entry to be properly deleted in _unsubscribe (rosmaster/registrations.py). You still have to detect that your node died and call one of the cleanup functions to avoid the issue, but that wasn't too much of a problem for us. My suggestion would be that you check for dead nodes and clean them up before launching any new nodes. What we do is:
call cleanup_master_whitelist to clean up the rosmaster internals.

Here is the code I added to cleanup_master_whitelist; it should probably be added to cleanup_master_blacklist as well, and rolled into the main branch, but I didn't have the time to create a PR, etc.
psubs = master.getParamSubscribers()
for k, l in psubs:          # k: parameter key, l: list of subscriber node names
    for n in l:
        if n not in whitelist:
            node_api = rosnode.get_api_uri(master, n)
            master_n = rosgraph.Master(n)
            master_n.unsubscribeParam(node_api, k)
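The "check for dead nodes and clean them up before launching" workflow can be sketched generically. Everything below is a hypothetical stand-in: `registry` models the master's node map and `is_alive` stands in for an XML-RPC ping of each node's api:

```python
# Toy pre-launch cleanup: probe each registered node and drop dead
# entries before starting new ones. `registry` and `is_alive` are
# hypothetical stand-ins for the master state and a node ping.
def cleanup_dead_nodes(registry, is_alive):
    dead = [name for name, api in registry.items() if not is_alive(api)]
    for name in dead:
        del registry[name]
    return dead


registry = {
    "/front_cam/image_to_rtsp_roi_stream": "http://host:40001/",
    "/velocity_controller": "http://host:40002/",
}
alive_apis = {"http://host:40002/"}  # only one node still answers

removed = cleanup_dead_nodes(registry, lambda api: api in alive_apis)
print(removed)  # ['/front_cam/image_to_rtsp_roi_stream']
```

In a real system the probe would be something like rosnode's ping, and the removal would be the whitelist cleanup above; the point is simply to run it before every launch so stale entries never accumulate.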
Ideally, rosmaster would handle all of this stuff internally every time a new node is launched. I believe this is the only way to avoid the underlying issue.
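The failure sequence discussed in this thread can be reproduced in miniature with a self-contained simulation. All class names, node names, and ports below are made up for illustration; only the re-registration logic mirrors what rosmaster does:

```python
# Simplified model of the rosmaster registration race (hypothetical
# names; not the real rosmaster code).
class ToyMaster:
    def __init__(self):
        self.nodes = {}      # node name -> api URI
        self.shutdowns = []  # (api, reason) requests we would have sent

    def register(self, name, api):
        old_api = self.nodes.get(name)
        if old_api is not None and old_api != api:
            # rosmaster sends the shutdown to the *old* api, trusting
            # that it still belongs to the re-registering node.
            self.shutdowns.append((old_api, "new node registered with same name"))
        self.nodes[name] = api


master = ToyMaster()
master.register("/front_cam/image_to_rtsp_roi_stream", "http://host:40001/")
# The camera node dies uncleanly: its entry stays in master.nodes.
# The OS hands port 40001 to the next process, /velocity_controller:
master.register("/velocity_controller", "http://host:40001/")
# The camera node restarts on a fresh port and re-registers:
master.register("/front_cam/image_to_rtsp_roi_stream", "http://host:40002/")
# The shutdown request goes to port 40001 -- which now belongs to
# /velocity_controller, killing the wrong node:
print(master.shutdowns)
# [('http://host:40001/', 'new node registered with same name')]
```

This matches the log at the top of the issue: the reason names the re-registering node, but the request lands on an unrelated live node that inherited the port.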
Running with this now, will see if it solves the issue:
diff --git a/tools/rosmaster/src/rosmaster/registrations.py b/tools/rosmaster/src/rosmaster/registrations.py
index f7e8d124d..aadff915f 100644
--- a/tools/rosmaster/src/rosmaster/registrations.py
+++ b/tools/rosmaster/src/rosmaster/registrations.py
@@ -32,10 +32,14 @@
 #
 # Revision $Id$
 
+import logging
 from rosmaster.util import remove_server_proxy
 from rosmaster.util import xmlrpcapi
 import rosmaster.exceptions
+
+_logger = logging.getLogger("rosmaster.master")
+
 
 """Data structures for representing registration data in the Master"""
 
 class NodeRef(object):
@@ -461,6 +465,18 @@ class RegistrationManager(object):
         if node_ref is not None:
             if node_ref.api == caller_api:
                 return node_ref, False
+
+            other_node_with_api = None
+            for node in self.nodes.values():
+                if node.id != caller_id and node.api == node_ref.api:
+                    other_node_with_api = node.id
+                    break
+
+            if other_node_with_api:
+                log_warn = "Stale api of node {} is now registered by {}; dropping the old registration instead of sending a shutdown, as it was probably a zombie".format(caller_id, other_node_with_api)
+                _logger.warning(log_warn)
+                print(log_warn)
+                del self.nodes[caller_id]
             else:
                 bumped_api = node_ref.api
                 self.thread_pool.queue_task(bumped_api, shutdown_node_task,
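The effect of the guard in that hunk can be sketched with a toy model (illustrative only; function and variable names are made up, not the real _register_node_api): before sending the shutdown to the stale api, check whether another live node already owns it.

```python
# Toy sketch of the patched registration logic (hypothetical names).
def register(nodes, shutdowns, caller_id, caller_api):
    """nodes: dict name -> api; shutdowns collects (api, reason) tuples."""
    old_api = nodes.get(caller_id)
    if old_api is not None and old_api != caller_api:
        # Guard from the patch: if another node now owns the stale api,
        # it is a live process that inherited the port -- don't kill it.
        owned_by_other = any(
            name != caller_id and api == old_api
            for name, api in nodes.items()
        )
        if owned_by_other:
            print("warning: stale registration for %s, skipping shutdown" % caller_id)
        else:
            shutdowns.append((old_api, "new node registered with same name"))
    nodes[caller_id] = caller_api


nodes, shutdowns = {}, []
register(nodes, shutdowns, "/front_cam/image_to_rtsp_roi_stream", "http://host:40001/")
register(nodes, shutdowns, "/velocity_controller", "http://host:40001/")  # port reuse
register(nodes, shutdowns, "/front_cam/image_to_rtsp_roi_stream", "http://host:40002/")
print(shutdowns)  # [] -- no shutdown is sent to the reused port
```

With the guard in place, the re-registration of the camera node no longer kills /velocity_controller; the stale entry is simply overwritten.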
@EdAtHooked
Where do you get the psubs = master.getParamSubscribers() from? There is no getParamSubscribers in the noetic-devel branch. It seems like there is no existing API to get the information on which node is subscribed to which parameters?
I'm seeing an issue where some of my nodes are being shut down with the following message:
<node>_sim-31.log:[rospy.impl.masterslave][INFO] 2022-05-10 17:05:42,835: external shutdown request from [/master]: [<other_node>] Reason: new node registered with same name
I believe I've tracked down the issue to the following sequence:
At step 4, _register_node_api() sends a shutdown request to port A, which is now in use by Node B, which results in Node B being shut down. It does this because it still has an entry in the nodes dict for Node B even though that node has stopped.
This happens with some frequency in our system; any suggestions for how to fix it?