[Feature] DynamicSupervisor please

sclee15 commented 1 year ago

Hello.

I think Gleam's typed OTP and its concept of subject is great.

But, It would be better to have some types of DynamicSupervisor that allow me to spawn workers as I need.

lpil commented 1 year ago

Sounds great!

arnarg commented 1 year ago

I'm also interested in this.

In terms of API design, do you see any value in having a separate supervisor and dynamic supervisor (afaict this is the case in Elixir) or should there just be a single supervisor type that happens to be dynamic, if you want it to be static you don't add more children to it. I'm personally leaning towards the option of just having a single type that is dynamic.

Also, should the supervisor still have an init function to setup initial children or do you just create a supervisor and start adding children to it dynamically?

Option 1

pub fn main() {
  let assert Ok(sup) = supervisor.start(fn(children) {
    children
    |> add(worker(database.start))
    |> add(worker(monitoring.start))
    |> add(worker(web.start))
  })

  // Something happens in between

  let assert Ok(runner) = supervisor.add_child(sup, worker(runner.start))
}

Option 2

pub fn main() {
  let assert Ok(sup) = supervisor.start()
  let assert Ok(db) = supervisor.add_child(sup, worker(database.start))
  let assert Ok(mon) = supervisor.add_child(sup, worker(monitoring.start))
  let assert Ok(web) = supervisor.add_child(sup, worker(web.start))

  // Something happens in between

  let assert Ok(runner) = supervisor.add_child(sup, worker(runner.start))
}

I see the value in not breaking the existing API but I also find option 2 to be a bit simpler API.

lpil commented 1 year ago

With option 2 how does it restart the children when one dies?

arnarg commented 1 year ago

I haven't really looked into the current implementation but I'm assuming it will need to keep some kind of list of children.

Does it currently call the init function every time a child dies?

lpil commented 1 year ago

Nope, the current supervisor implements the rest_for_one strategy. I think we likely need to supervisors that implement all the different strategies, and possibly some other patterns that may be useful given Gleam OTP's lack of process naming.

arnarg commented 1 year ago

Oh I see, to be honest I wasn't too familiar with all the different strategies.

I guess rest_for_one was the most logical to start with so you can control the arguments down the chain (so you don't have an old reference to a subject belonging to a process that already died)?

So some thoughts:

rest_for_one
- The init function will be kept (no matter the strategy picked). This is the only place to set up the arguments.
- All dynamic children added afterwards will be put at the back of the "chain" (as if they were just last in the init function). So they're always restarted if any child from the init function dies.
- Possibly the dynamic children alone can be configured as one_for_one or one_for_all. So if any of the dynamic children die either only that one or all of them are restarted but the initial ones are unaffected.
one_for_all
- This one can pretty much be treated as rest_for_all except everything is started from scratch, right?
one_for_one
- This one is simple except for argument passing.
- Should there be a function to set the argument for the supervisor to pass to children that can be set externally (i.e. supervisor.set_arguments(sup, MyArguments(...)))?
- How does this affect already running children?
- Should the supervisor maintain a registry that children can (optionally) use to register itself by name and will be passed to children?
- This will need to send an update to all children when updated, not optimal.

Or would it be preferred to have distinct static and dynamic supervisors?

lpil commented 1 year ago

I don't think we could safely restart any dynamically added children as the initial state that was used to create them is not controlled by the supervisor.

Take a web server that does some background processing as an example. It could have a web server process, a database, connection process, and dynamically, added worker processes.

If there was to be a failure, which caused them all to be restarted, the web application and the database, connection processes would be initialise correctly, but if any of the work processes were restarted using their original initial state, they would have references to the no longer existing database connection process, and such would always fail. This would eventually result in there being too much restart intensity and the entire supervisor would fail.

arnarg commented 1 year ago

I see. Given my limited experience using supervisors I might not be the best person to come up with designs for this 😅.

Would a more typical use case add the dynamic supervisor as a child of a static one so that the whole dynamic supervisor is restarted if any of its dependencies crash (web server, database, ...)?

If so, then two distinct types of supervisors might make more sense.

guillheu commented 1 week ago

I definitely could make use of a dynamic supervisor.

From what I've read, my understanding is that the rest-for-one strategy makes sense for the current supervisor since the init state of each worker depends on the state returned by the worker initialized right before it. In other words, each child worker starting returns the init state for its immediate "younger sibling". Since that init state for the "younger sibling" might therefor be different, the "younger sibling" should also be restarted, for example, if each worker needs the subject of its immediate "older sibling".

For a dynamic supervisor though, I feel like it'd be easier to use a one-for-one strategy. This would have the limitation of not allowing workers to pass state to each other. Obviously it wouldn't work in the example @lpil gave, where multiple workers depend on each other, but I feel like it'd be better than nothing.

Am I understanding the situation correctly ?

lpil commented 1 week ago

Yup that's right. "Dynamic supervisor" is what Elixir folks call a supervisor with the one for one strategy.

guillheu commented 1 week ago

I've been thinking about how to implement a dynamic supervisor. I'm trying to start from the current otp supervisor and changing things as I go. Feel free to correct my misunderstandings and point me in the right direction.

The init function for the dynamic supervisor should not start any children. Instead, it's the loop that should initialize children through a new message variant, something like ChildAdded.
I understand that actor.start_spec automatically links the process of a new actor to its calling process, which is why process.traps_exits and process.selecting_trapped_exits can be used in the init function without explicitly linking the child process to the supervisor.
The returning type argument of the ChildSpec type is no longer useful as previously intended, since the "next" child does not need its return. However, I think having a function that executes after initialization is still useful. For instance, I've been using the supervisor ChildSpec.returning function to register the new child subject to a subject registry (using Chip). I think having a configurable function that runs after initialization and gives access to the new worker subject is a good thing, but should be called something else. Maybe something like finalize instead of returning. Also, it would return Nil instead of the returning type argument, which can be removed from the ChildSpec.
Similarily, the Spec for the dynamic supervisor does not need a return type argument
I don't think the Children or Starter types are necessary. AFAICT they're needed for the one-to-rest strategy, but don't serve a purpose for a one-to-one.

Here's an overview:

pub opaque type Message(argument, children_message) {
  ChildAdded(child: ChildSpec(children_message, argument))
  Exit(process.ExitMessage)
  RetryRestart(Pid)
}

type State {
  State(
    restarts: IntensityTracker,
    // starter: Starter(a),
    retry_restarts: Subject(Pid),
  )
}

pub type Spec(argument) {
  Spec(
    argument: argument,
    max_frequency: Int,
    frequency_period: Int,
    init: fn(argument) -> Nil,
  )
}

pub opaque type ChildSpec(msg, argument) {
  ChildSpec(
    start: fn(argument) -> Result(Subject(msg), StartError),
    finalize: fn(argument, Subject(msg)) -> Nil,
  )
}

fn loop(
  message: Message(argument, children_message),
  state: State,
) -> actor.Next(Message(argument, children_message), State) {
  case message {
    Exit(exit_message) -> todo  //handle_exit(exit_message.pid, state)
    RetryRestart(pid) -> todo   //handle_exit(pid, state)
    ChildAdded(child) -> todo
  }
}

fn init(
  spec: Spec(argument),
) -> actor.InitResult(State, Message(argument, children_message)) {
  // Create a subject so that we can asynchronously retry restarting when we
  // fail to bring an exited child
  let retry = process.new_subject()

  // Trap exits so that we get a message when a child crashes
  process.trap_exits(True)

  // Combine selectors
  let selector =
    process.new_selector()
    |> process.selecting(retry, RetryRestart)
    |> process.selecting_trapped_exits(Exit)

  todo
}

pub fn worker(
  start: fn(argument) -> Result(Subject(msg), StartError),
  finalize: fn(argument, Subject(msg)) -> Nil,
) -> ChildSpec(msg, argument) {
  ChildSpec(start: start, finalize: finalize)
}

One issue with this design is all the workers under the same dynamic supervisor must all return a subject with the same message type. IDK if this is the same in Elixir, but I was forced to specify the ChildSpec message type in the dynamic supervisor Message because of the ChildAdded message.

lpil commented 1 week ago

I think the correct way to go would be to use the Erlang supervisor module as a base as it is well tested and known to work. That's what static_supervisor does, which is recommended over supervisor, which has several bugs presently.

That said, I've not looked into this much so I don't know what the API might be or what limitations might arise.

guillheu commented 1 week ago

I see. That seems feasible at first glance. simple-one-to-one fits my use case at least. I think I'm gonna need a bit more time playing with the Erlang supervisor module before I can confidently contribute to a dynamic_supervisor for gleam_otp.

guillheu commented 1 week ago

I started working on the simple-one-for-one Erlang supervisor, but realized I was mostly just reusing static_supervisor, which makes sense. I decided to add the SimpleOneForOne strategy to the static_supervisor. Because of how the simple-one-for-one strategy works, I also had to add a binding for the Erlang supervisor:start_child function. This means the static_supervisor isn't so static anymore. One thing led to another, and I ended up binding to what I think are the most important Erlang dynamic supervisor functions for all strategies:

start_child/2
terminate_child/2
restart_child/2
delete_child/2
count_children/1

I essentially have a functional (but messy) dynamic supervisor using the Erlang supervisor module. You can compare my work with the current otp main branch here https://github.com/gleam-lang/otp/compare/main...guillheu:otp:simple-one-for-one-erlang

I haven't opened a PR yet because I think my work is very, very messy, and I'd like some feedback first. Sorry if this is a big code review to ask for, feel free to ask for clarifications. I'll try to be responsive.

The simple-one-for-one behaves so differently from the other 3 strategies I'm thinking it might be preferable to have it in its own separate module.

gleam-lang / otp

[Feature] DynamicSupervisor please #42