lasp-lang / partisan

High-performance, high-scalability distributed computing for the BEAM.
https://partisan.dev
Apache License 2.0

add remote monitoring #45

Closed evanmcc closed 1 year ago

evanmcc commented 7 years ago

In order for the call emulation in #44 to work, and more generally for Partisan to act as a full-featured disterl replacement (see #42), we'll need to add remote monitoring. A good design for this doesn't spring to mind right away, so I am looking for feedback here.

My initial thought was just to add some monitoring metadata on top of the existing node-to-node data handling (it would work like the hello messages, I guess?). But that can interact with remote node failures in complicated ways, so I need to read more code before I have any better fleshed-out ideas.

benoitc commented 6 years ago

I guess monitoring could be done by connected nodes and gossiped to others when it happens?

ankhers commented 4 years ago

I think the simplest thing to do would be to have a monitor function that works either with a pid or with a {partisan_remote_reference, Node, {partisan_process_reference, PidAsList}}. If it is a regular pid, we could just use erlang:monitor/2. If it is a remote reference, we could cast a message to the given Node containing the calling process's remote reference. On that Node, a process (or group of processes) would be responsible for monitoring the local pid. Since we have the remote reference for the calling process, the down message can be forwarded to the original caller when needed.
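A rough sketch of that dispatch, just to make the idea concrete (this is not Partisan's actual API; the forwarding call and helper names below are placeholders):

```erlang
%% Sketch only: monitor/1 dispatching on a local pid vs. a remote reference.
-module(remote_monitor_sketch).
-export([monitor/1]).

monitor(Pid) when is_pid(Pid) ->
    %% Local process: plain Erlang monitor.
    erlang:monitor(process, Pid);
monitor({partisan_remote_reference, Node,
         {partisan_process_reference, _PidAsList}} = RemoteRef) ->
    %% Remote process: ask the partisan_monitor process on Node to watch it,
    %% passing our own remote reference so the down message can be routed back.
    CallerRef = self_remote_reference(),
    forward_to_node(Node, partisan_monitor,
                    {monitor_request, RemoteRef, CallerRef}),
    RemoteRef.

%% Placeholder helpers, named here only for the sketch.
self_remote_reference() ->
    {partisan_remote_reference, node(),
     {partisan_process_reference, pid_to_list(self())}}.

forward_to_node(_Node, _ServerRef, _Msg) ->
    ok.  %% would go over Partisan's channels in a real implementation
```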

I am not sure how clear that is, so I will write it down step by step.

  1. Process 0.0.100 on Node A has a remote reference for process 0.0.200 on Node B
  2. Process 0.0.100 on Node A calls monitor on the remote reference
  3. A remote reference for 0.0.100 on Node A is generated and sent to the partisan_monitor process on Node B
  4. The partisan_monitor process starts monitoring 0.0.200 on Node B
  5. If/When 0.0.200 on Node B goes down, it will forward the down message using the remote reference of 0.0.100 on Node A
  6. 0.0.100 on Node A receives the down message and can react accordingly.

Hopefully that is a little more clear.
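For illustration, here is a minimal sketch of what the partisan_monitor process on Node B could look like, assuming the message shapes above; the call that routes the down message back to Node A is again a placeholder:

```erlang
%% Sketch of the monitoring process on the remote node (Node B in the steps above).
-module(partisan_monitor_sketch).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, partisan_monitor}, ?MODULE, [], []).

init([]) ->
    %% Map: local monitor ref => {caller's remote ref, monitored remote ref}.
    {ok, #{}}.

handle_cast({monitor_request,
             {partisan_remote_reference, _Node,
              {partisan_process_reference, PidAsList}} = MonitoredRef,
             CallerRef}, State) ->
    %% Assumes the pid string can be converted back on the node that owns it.
    Pid = list_to_pid(PidAsList),
    MRef = erlang:monitor(process, Pid),
    {noreply, State#{MRef => {CallerRef, MonitoredRef}}};
handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({'DOWN', MRef, process, _Pid, Reason}, State) ->
    case maps:take(MRef, State) of
        {{CallerRef, MonitoredRef}, Rest} ->
            %% Route the down notification back to the original caller on Node A.
            forward_to_caller(CallerRef,
                              {'DOWN', MRef, process, MonitoredRef, Reason}),
            {noreply, Rest};
        error ->
            {noreply, State}
    end;
handle_info(_Msg, State) ->
    {noreply, State}.

handle_call(_Req, _From, State) ->
    {reply, ok, State}.

forward_to_caller(_CallerRef, _Msg) ->
    ok.  %% placeholder: would be sent over Partisan's channels in practice
```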

If this makes some sense, I would gladly take some time to get something working.

aramallo commented 1 year ago

In v5.0.0-beta I have re-implemented monitoring, leveraging the new connection handling (which offers fast liveness checks) and a new implementation of the on_up and on_down peer service callbacks.
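Usage is roughly along these lines; the exact module, function names, and arities shown here are illustrative rather than definitive, so check the v5.0.0-beta docs:

```erlang
%% Illustrative sketch: register callbacks for peer up/down events.
%% Function names/arities are assumptions about the v5.0.0-beta API.
ok = partisan_peer_service:on_up('b@127.0.0.1',
        fun(Node) -> io:format("peer ~p is up~n", [Node]) end),
ok = partisan_peer_service:on_down('b@127.0.0.1',
        fun(Node) -> io:format("peer ~p is down~n", [Node]) end).
```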

It only works with the pluggable peer service manager, i.e. full mesh. I would love to come up with a design that works for HyParView soon.

So I will close this issue.