Summary
This PR fixes a deadlock that occurs during concurrent shutdown of the mongo_server and config_manager nodes. I came across this issue while running pre-release tests.
This solution is still up for debate (see the Question section at the bottom).
Steps to Reproduce
Clone and build the mongodb_store package on ROS Noetic or Melodic.
Run the config_manager.test a couple of times: rostest mongodb_store config_manager.test --text
In some cases, the mongo_server node hangs during shutdown and requires SIGKILL to exit (after ~20 seconds). This behavior causes pre-release tests to fail due to timeout.
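Since the hang only shows up in some runs, it can help to repeat the test and compare run times. The following is a minimal sketch using only the Python standard library; the helper name time_runs and the run counts are made up for illustration and are not part of mongodb_store.

```python
import subprocess
import sys
import time


def time_runs(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        # The exit code is ignored on purpose; only the timing matters here.
        subprocess.run(cmd, check=False)
        durations.append(time.monotonic() - start)
    return durations


# The command from the steps above. It is only defined here, not executed.
ROSTEST_CMD = ["rostest", "mongodb_store", "config_manager.test", "--text"]

# Demonstration with a harmless command so the sketch runs without ROS.
demo = time_runs([sys.executable, "-c", "pass"], runs=2)
print(["%.2f s" % d for d in demo])
```

On a machine where the package is built, time_runs(ROSTEST_CMD, runs=10) would run the actual test. Runs that take noticeably longer than the rest (on the order of the ~20 seconds mentioned above) are the likely candidates for the hang.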
Cause
During shutdown, the mongo_server issues a shutdown command to its mongod subprocess. At the same time, the config_manager attempts to close its MongoClient, which sends some cleanup commands to the mongod server. This somehow causes a deadlock and prevents the mongo_server node from exiting cleanly. In fact, any concurrent command to the mongod process during shutdown seems to cause the deadlock.
Current Solution
This was solved by controlling the node shutdown sequence through the ready flag in mongodb_server.py (see commit).
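To illustrate the general idea of sequencing shutdown through a flag, here is a rough single-process sketch. It is not the code from the commit: the class ShutdownGate and its methods are hypothetical, and the real nodes run as separate processes, so the actual mechanism in mongodb_server.py may look quite different.

```python
import threading


class ShutdownGate:
    """Hypothetical stand-in for a 'ready' flag shared by a server and its clients."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = True  # True while the server still accepts commands

    def client_cleanup(self, cleanup):
        """Run cleanup only if the server is still ready; return whether it ran."""
        with self._lock:
            if not self._ready:
                return False  # server is shutting down, so send nothing
            cleanup()
            return True

    def server_shutdown(self, shutdown):
        """Clear the flag, then shut down; never overlaps with a client cleanup."""
        with self._lock:
            self._ready = False
            shutdown()


log = []
gate = ShutdownGate()
ran_early = gate.client_cleanup(lambda: log.append("cleanup"))
gate.server_shutdown(lambda: log.append("shutdown"))
ran_late = gate.client_cleanup(lambda: log.append("late cleanup"))
print(log)
```

The lock keeps a client cleanup and the server shutdown from running at the same time, and once the flag is cleared, later cleanup attempts are skipped instead of being sent to a server that is going down.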
Question
Several other nodes create MongoClient instances and do not close them (mongodb_store_node, replicator_node, etc.).
So here we have two options:
Add the MongoClient closing/cleanup to all the other nodes that instantiate it, through rospy.on_shutdown (as we now have in the config_manager node).
Remove the MongoClient closing/cleanup from the config_manager node, as is the case in other nodes. Resources in the node are freed anyway when it is shut down, and the daemon should periodically clean up expired sessions.
What do you think?
Sure, that makes sense - I updated the PR to follow option 2.
If we decide on option 1 at some point in the future, the initial commit will still be available for reference.
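For reference, option 1 would amount to something like the sketch below in each node. The helper register_client_cleanup and the FakeClient stand-in are hypothetical and exist only so the example runs without ROS or MongoDB; this is not the code from the initial commit.

```python
def register_client_cleanup(client, on_shutdown):
    """Arrange for client.close() to be called when the node shuts down.

    client: any object with a close() method (e.g. a pymongo MongoClient).
    on_shutdown: a callable that registers a zero-argument hook
                 (in a ROS node, rospy.on_shutdown).
    """
    def _close_client():
        client.close()

    on_shutdown(_close_client)
    return _close_client


# Stand-in client so the sketch runs without ROS or MongoDB installed.
class FakeClient:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


shutdown_hooks = []
fake_client = FakeClient()
register_client_cleanup(fake_client, shutdown_hooks.append)

# Simulate node shutdown by running the registered hooks.
for hook in shutdown_hooks:
    hook()
print(fake_client.closed)
```

In a real node the call would pass the node's MongoClient instance and rospy.on_shutdown instead of the stand-ins. Given the cause described above, those close calls would still have to complete before the mongo_server issues its shutdown command to mongod.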