Generic Monitor Node and Support Libraries

arjo129 commented 3 years ago

Feature request

Description

Currently we have a number of issues that require fail over capabilities within nodes. Here are a few:

102
100

We also occasionally need to write new nodes which will inevitably require fail over support:
103

The schedule node already has very strong fail over support through the use of a specialized monitor node. This is a good starting point for this new feature. There is also the stubborn_buddies library, which enables basic fail over support through link time composition.

Implementation considerations

These are just some quick mental notes about this feature:

Do we want a specialized monitor node for each application or a single grand monitor node?
How do we encode the status of nodes that are actively running?
- One alternative if we choose the "specialized monitor node per node type" approach is to have the current node publish on a status topic and the monitor read this status topic. When the node fails, we call some type of factory method to instantiate a new version of the node.
  - The pros of this method are that it is simple.
  - One con is that the status topic may get very large. In practice, I doubt that is the case with most of our use cases. However, it's still prudent to consider this.
  - Another con is that every node may have a different internal state leading to a proliferation of different message types. This can become quite cumbersome to manage.
    - We could potentially just use a generic message and encode YAML strings in the message, but this would be gross.
    - We could ignore this problem altogether and template<our<way<out<of<here>>>> .
- Another alternative is to serialize the state to disk and pass the file path around.
  - The pros of this method are that it can handle large data and solves the problem of needing a common ROS message.
  - One con is that this isn't suited for distributed systems over the network. We may want our monitor node to be split between two machines.
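For illustration, here is a rough sketch of the first alternative (status topic plus factory method). It deliberately avoids any ROS 2 API so it can stand alone: `Status`, `Worker`, `Monitor`, `on_status`, and `check` are all hypothetical names, and the timeout-based failure detection is an assumption, since the notes above do not say how a failure would be detected. In a real node, `on_status` would presumably be a subscription callback and `check` would be driven by a timer.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical names throughout; nothing here is an existing rmf_ros2 or ROS 2 API.
using Clock = std::chrono::steady_clock;

// Stand-in for the message a node would publish on its status topic.
struct Status
{
  std::string serialized_state;  // whatever the node needs in order to resume
};

// Stand-in for the node being monitored.
struct Worker
{
  explicit Worker(Status initial) : state(std::move(initial)) {}
  Status state;
};

class Monitor
{
public:
  // Factory method that builds a replacement node from the last known status.
  using Factory = std::function<std::unique_ptr<Worker>(const Status&)>;

  Monitor(Factory factory, std::chrono::milliseconds timeout)
  : _factory(std::move(factory)), _timeout(timeout)
  {
    assert(_factory);
  }

  // Records the most recent status and when it arrived.
  void on_status(const Status& status, Clock::time_point now)
  {
    _last_status = status;
    _last_seen = now;
  }

  // Returns a replacement node if the active node has been silent for longer
  // than the timeout (assumed failure criterion), otherwise nullptr.
  std::unique_ptr<Worker> check(Clock::time_point now)
  {
    if (!_last_status.has_value() || now - _last_seen <= _timeout)
      return nullptr;

    auto replacement = _factory(*_last_status);
    _last_seen = now;  // give the replacement a full timeout window
    return replacement;
  }

private:
  Factory _factory;
  std::chrono::milliseconds _timeout;
  std::optional<Status> _last_status;
  Clock::time_point _last_seen{};
};
```

Passing the time in explicitly keeps the sketch deterministic and easy to test. The same shape would work for the serialize-to-disk alternative if `serialized_state` held a file path instead of the state itself.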
Fail over capability of the monitor node itself. I think this should be fairly trivial to do with stubborn_buddies.
How do we launch the node across different machines? (Probably something we don't need to worry about just yet)
What will the API look like?
- Should we have an abstract RestorableNode class which the node will inherit from, or should we use some other architecture?
- Alternatively we could instantiate a MonitorNode<T> class where T refers to the status topic message type.
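To make the two API options a bit more concrete, here is a minimal sketch of both side by side. Only the names `RestorableNode` and `MonitorNode<T>` come from the notes above. The member functions (`save_state`, `restore_state`, `on_status`, `fail_over`), the `Restart` callback, and the `CounterNode` example are hypothetical, and no ROS 2 types are used so that the sketch stays self-contained.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Option A: nodes inherit from an abstract interface and hand out an opaque
// state blob, so a single monitor could work with one generic message type.
class RestorableNode
{
public:
  virtual ~RestorableNode() = default;
  virtual std::string save_state() const = 0;
  virtual void restore_state(const std::string& blob) = 0;
};

// Hypothetical node used only to exercise Option A.
class CounterNode : public RestorableNode
{
public:
  void tick() { ++_count; }
  int count() const { return _count; }

  std::string save_state() const override { return std::to_string(_count); }
  void restore_state(const std::string& blob) override
  {
    _count = std::stoi(blob);
  }

private:
  int _count = 0;
};

// Option B: the monitor is templated on the status type T, so each node type
// keeps a strongly typed status at the cost of one instantiation per type.
template<typename T>
class MonitorNode
{
public:
  // Called with the last known status when a fail over is triggered.
  using Restart = std::function<void(const T&)>;

  explicit MonitorNode(Restart restart)
  : _restart(std::move(restart))
  {
    assert(_restart);
  }

  // Remembers the most recent status received from the active node.
  void on_status(const T& status) { _last = status; }

  // Returns false if no status has been seen yet, since there is nothing to
  // restore from; otherwise invokes the restart callback and returns true.
  bool fail_over() const
  {
    if (!_last.has_value())
      return false;

    _restart(*_last);
    return true;
  }

private:
  Restart _restart;
  std::optional<T> _last;
};
```

Option A pushes the serialization burden onto each node but keeps the monitor generic. Option B keeps types explicit but means one monitor instantiation per status type, which is where the message type proliferation concern would show up.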

mxgrey commented 3 years ago

> Another con is that every ros2 topic requires us to use a different Message type.

Is this sentence backwards? Every message type needs to be on its own topic, but we don't need each topic to use a different message type.

arjo129 commented 3 years ago

Hmm, I think I meant "every node may have a different internal state leading to a proliferation of different message types."

mxgrey commented 3 years ago

Personally I'd like us to finish ad hoc fail-over implementations for at least three very different kinds of nodes before we start worrying too much about how to encapsulate all fail-over into one implementation. I have a pretty strong feeling that the needs for doing efficient and reliable fail-over will be so different between the different types of nodes that trying to encapsulate it all into one implementation may actually add complexity overall with little tangible benefit.

These questions are certainly good to keep in mind as we go forward, but I would avoid putting this goal on the roadmap until we have some more concrete fail-over implementations to account for besides just the schedule node.

open-rmf / rmf_ros2