strands-project / mongodb_store

MongoDB tools for storing and analysing runs of ROS systems.
BSD 3-Clause "New" or "Revised" License
49 stars 72 forks source link

Avoid deadlock on server shutdown #279

Closed ggorjup closed 2 years ago

ggorjup commented 2 years ago

Summary

This PR solves the issue of deadlock during concurrent shutdown of mongo_server and config_manager nodes. I came across this issue while running pre-release tests.

This solution is still up for debate (see Question section at the bottom).

Steps to Reproduce

  1. Clone and build the mongodb_store package on ROS Noetic or Melodic.
  2. Run the config_manager.test a couple of times:
    rostest mongodb_store config_manager.test --text

In some cases, the mongo_server node hangs during shutdown and requires SIGKILL to exit (after ~20 seconds). This behavior causes pre-release tests to fail due to timeout.

Cause

During shutdown, the mongo_server issues a shutdown command to its mongod subprocess. At the same time, the config_manager attempts to close its MongoClient, which sends some cleanup commands to the mongod server. This somehow causes a deadlock and prevents the mongo_server node to exit cleanly. In fact, any concurrent command to the mongod process during shutdown seems to cause the deadlock.

Current Solution

This was solved by controlling the node shutdown sequence through the ready flag in the mongodb_server.py (see commit).

Question

Several other nodes create MongoClient instances and do not close them (mongodb_store_node, replicator_node, etc.). So here we have two options:

  1. Include the MongoClient closing/cleanup into all the other nodes instantiating it, through rospy.on_shutdown (like we now have in the config_manager node).
  2. Remove the MongoClient closing/cleanup from the config_manager node, as is the case in other nodes. Resources in the node are freed anyway when it is shut down, and the daemon should periodically clean up expired sessions.

What do you think?

hawesie commented 2 years ago
  1. seems like the simpler option to me.
ggorjup commented 2 years ago

Sure, that makes sense - I updated the PR with respect to option 2.
If we decide for option 1 at some point in the future, the initial commit will still be available for reference.