ros / bond_core

Maintains a bond (i.e. heart beat ) between ROS nodes and provides feedback when the bond is broken
http://www.ros.org/wiki/bond_core
35 stars 64 forks source link

Use the first sister message as a Heartbeat, too #93

Open peci1 opened 1 year ago

peci1 commented 1 year ago

I've noticed that when the sister is killed quickly after being started, there is (very much repeatable) chance to deadlock the other Bond. Consider this sequence:

  1. Sister sends her first status message. My bond executes AwaitingSister->SisterAlive() which calls Connect(). Connect() stops the connection timer.
  2. Sister dies before sending her second message.
  3. My bond is now in Alive state. However, Heartbeat() has not yet been called (that would be done by the second message which did not come), so heartbeat_timer_ is still off and does not trigger the timeout event.

The fix I sent fixes it on the C++ side. I'm not sure about the Python side, and not sure about the SM source code. The Heartbeat() needs to be called already in the Alive state (it is undefined in AwaitingSister). Quickly skimming through the SM syntax, it doesn't seem to me it would support calling some functions before the transition and some after it.

If that would be the case, I'd suggest calling Heartbeat in the AwaitingSister state (right after calling Connected()) and defining it for AwaitingSister state to do the same as for the Alive state. That would require no additional changes to the SM definition, so it might be preferred.

I'll let the maintainers decide which approach would be better.

peci1 commented 1 year ago

A practical example where this can happen is when a nodelet is unloaded very shortly after being loaded.

peci1 commented 4 months ago

Could this please get a review?