gleam-lang / otp

📫 Fault tolerant multicore programs with actors
https://hexdocs.pm/gleam_otp/
Apache License 2.0

[Feature] DynamicSupervisor please #42

Open sclee15 opened 1 year ago

sclee15 commented 1 year ago

Hello.

I think Gleam's typed OTP and its concept of subject is great.

But it would be better to have some type of DynamicSupervisor that allows me to spawn workers as I need them.

lpil commented 1 year ago

Sounds great!

arnarg commented 1 year ago

I'm also interested in this.

In terms of API design, do you see any value in having a separate supervisor and dynamic supervisor (afaict this is the case in Elixir), or should there just be a single supervisor type that happens to be dynamic? If you want it to be static, you just don't add more children to it. I'm personally leaning towards the option of having a single type that is dynamic.

Also, should the supervisor still have an init function to set up initial children, or do you just create a supervisor and start adding children to it dynamically?

Option 1

pub fn main() {
  let assert Ok(sup) = supervisor.start(fn(children) {
    children
    |> add(worker(database.start))
    |> add(worker(monitoring.start))
    |> add(worker(web.start))
  })

  // Something happens in between

  let assert Ok(runner) = supervisor.add_child(sup, worker(runner.start))
}

Option 2

pub fn main() {
  let assert Ok(sup) = supervisor.start()
  let assert Ok(db) = supervisor.add_child(sup, worker(database.start))
  let assert Ok(mon) = supervisor.add_child(sup, worker(monitoring.start))
  let assert Ok(web) = supervisor.add_child(sup, worker(web.start))

  // Something happens in between

  let assert Ok(runner) = supervisor.add_child(sup, worker(runner.start))
}

I see the value in not breaking the existing API, but I also find option 2 to be a slightly simpler API.

lpil commented 1 year ago

With option 2 how does it restart the children when one dies?

arnarg commented 1 year ago

I haven't really looked into the current implementation but I'm assuming it will need to keep some kind of list of children.

Does it currently call the init function every time a child dies?

lpil commented 1 year ago

Nope, the current supervisor implements the rest_for_one strategy. I think we likely need supervisors that implement all the different strategies, and possibly some other patterns that may be useful given Gleam OTP's lack of process naming.

arnarg commented 1 year ago

Oh I see, to be honest I wasn't too familiar with all the different strategies.

I guess rest_for_one was the most logical to start with so you can control the arguments down the chain (so you don't have an old reference to a subject belonging to a process that already died)?

So some thoughts:

Or would it be preferred to have distinct static and dynamic supervisors?

lpil commented 1 year ago

I don't think we could safely restart any dynamically added children as the initial state that was used to create them is not controlled by the supervisor.

Take a web server that does some background processing as an example. It could have a web server process, a database connection process, and dynamically added worker processes.

If a failure caused them all to be restarted, the web server and database connection processes would be initialised correctly, but any worker processes restarted with their original initial state would hold references to the database connection process that no longer exists, and so would always fail. This would eventually exceed the supervisor's restart intensity and the entire supervisor would fail.
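A minimal sketch of the restart intensity check described above, written as a pure function so it can be read without any OTP dependency. The `Intensity` type, its field names, and `record_restart` are hypothetical names for illustration, not part of gleam_otp. The rule assumed here is the one Erlang supervisors use: the supervisor gives up once more than `max_restarts` restarts happen within `period_seconds`.

```gleam
import gleam/list

/// Hypothetical stand-in for a supervisor's restart bookkeeping.
/// `events` holds the timestamps (in seconds) of recent restarts.
pub type Intensity {
  Intensity(max_restarts: Int, period_seconds: Int, events: List(Int))
}

/// Record a restart at time `now`. Returns the updated tracker, or
/// Error(Nil) when more than `max_restarts` fall within the period,
/// which is the point where the supervisor itself would fail.
pub fn record_restart(tracker: Intensity, now: Int) -> Result(Intensity, Nil) {
  // Forget restarts that are older than the period window
  let cutoff = now - tracker.period_seconds
  let recent = list.filter(tracker.events, fn(at) { at > cutoff })
  let events = [now, ..recent]
  case list.length(events) > tracker.max_restarts {
    True -> Error(Nil)
    False -> Ok(Intensity(..tracker, events: events))
  }
}

pub fn main() {
  let tracker = Intensity(max_restarts: 2, period_seconds: 5, events: [])
  // Two quick restarts are tolerated
  let assert Ok(tracker) = record_restart(tracker, 0)
  let assert Ok(tracker) = record_restart(tracker, 1)
  // A worker that always fails trips the limit on the third restart
  let assert Error(Nil) = record_restart(tracker, 2)
  Nil
}
```

A worker holding a stale reference fails on every restart, so it reaches the `Error(Nil)` branch quickly no matter how generous the limits are.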

arnarg commented 1 year ago

I see. Given my limited experience using supervisors I might not be the best person to come up with designs for this 😅.

Would a more typical use case add the dynamic supervisor as a child of a static one so that the whole dynamic supervisor is restarted if any of its dependencies crash (web server, database, ...)?

If so, then two distinct types of supervisors might make more sense.

guillheu commented 1 week ago

I definitely could make use of a dynamic supervisor.

From what I've read, my understanding is that the rest-for-one strategy makes sense for the current supervisor since the init state of each worker depends on the state returned by the worker initialized right before it. In other words, each child worker returns, on starting, the init state for its immediate "younger sibling". Since that init state might therefore be different after a restart, the "younger sibling" should also be restarted, for example if each worker needs the subject of its immediate "older sibling".

For a dynamic supervisor though, I feel like it'd be easier to use a one-for-one strategy. This would have the limitation of not allowing workers to pass state to each other. Obviously it wouldn't work in the example @lpil gave, where multiple workers depend on each other, but I feel like it'd be better than nothing.
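To make the difference between the two strategies concrete, here is a small sketch that models only which children get restarted when one exits. It is a pure model over child names, not supervisor code; `Strategy` and `to_restart` are hypothetical names, and children are assumed to be listed in start order, oldest first.

```gleam
import gleam/list

pub type Strategy {
  OneForOne
  RestForOne
}

/// Given the children in start order (oldest first) and the name of the
/// child that exited, return the children the supervisor would restart.
pub fn to_restart(
  strategy: Strategy,
  children: List(String),
  exited: String,
) -> List(String) {
  case strategy {
    // Only the child that exited is restarted
    OneForOne -> list.filter(children, fn(name) { name == exited })
    // The exited child and every "younger sibling" started after it
    RestForOne -> list.drop_while(children, fn(name) { name != exited })
  }
}

pub fn main() {
  let children = ["database", "monitoring", "web"]
  let assert ["monitoring", "web"] =
    to_restart(RestForOne, children, "monitoring")
  let assert ["monitoring"] = to_restart(OneForOne, children, "monitoring")
  Nil
}
```

Under rest-for-one the younger siblings are restarted too, so they can pick up fresh state from the restarted child. Under one-for-one they keep whatever they were started with, which is why workers cannot safely pass state to each other.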

Am I understanding the situation correctly?

lpil commented 1 week ago

Yup, that's right. "Dynamic supervisor" is what Elixir folks call a supervisor with the one-for-one strategy.

guillheu commented 1 week ago

I've been thinking about how to implement a dynamic supervisor. I'm trying to start from the current otp supervisor and changing things as I go. Feel free to correct my misunderstandings and point me in the right direction.

Here's an overview:

pub opaque type Message(argument, children_message) {
  ChildAdded(child: ChildSpec(children_message, argument))
  Exit(process.ExitMessage)
  RetryRestart(Pid)
}

type State {
  State(
    restarts: IntensityTracker,
    // starter: Starter(a),
    retry_restarts: Subject(Pid),
  )
}

pub type Spec(argument) {
  Spec(
    argument: argument,
    max_frequency: Int,
    frequency_period: Int,
    init: fn(argument) -> Nil,
  )
}

pub opaque type ChildSpec(msg, argument) {
  ChildSpec(
    start: fn(argument) -> Result(Subject(msg), StartError),
    finalize: fn(argument, Subject(msg)) -> Nil,
  )
}

fn loop(
  message: Message(argument, children_message),
  state: State,
) -> actor.Next(Message(argument, children_message), State) {
  case message {
    Exit(exit_message) -> todo  //handle_exit(exit_message.pid, state)
    RetryRestart(pid) -> todo   //handle_exit(pid, state)
    ChildAdded(child) -> todo
  }
}

fn init(
  spec: Spec(argument),
) -> actor.InitResult(State, Message(argument, children_message)) {
  // Create a subject so that we can asynchronously retry restarting when we
  // fail to bring back an exited child
  let retry = process.new_subject()

  // Trap exits so that we get a message when a child crashes
  process.trap_exits(True)

  // Combine selectors
  let selector =
    process.new_selector()
    |> process.selecting(retry, RetryRestart)
    |> process.selecting_trapped_exits(Exit)

  todo
}

pub fn worker(
  start: fn(argument) -> Result(Subject(msg), StartError),
  finalize: fn(argument, Subject(msg)) -> Nil,
) -> ChildSpec(msg, argument) {
  ChildSpec(start: start, finalize: finalize)
}

One issue with this design is that all the workers under the same dynamic supervisor must return a subject with the same message type. IDK if this is the same in Elixir, but I was forced to specify the ChildSpec message type in the dynamic supervisor Message because of the ChildAdded message.
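One possible way around the shared message type, sketched here purely as an assumption and not as something proposed in this thread, is to erase the message type at the point where a child is handed to the supervisor, keeping only a process identifier. Everything below is hypothetical: `Pid` is a stand-in type (a real implementation would use the one from gleam_erlang), and `ErasedChild`, `erase`, and `start_all` are illustrative names.

```gleam
import gleam/list

/// Stand-in for a process identifier, used only for this illustration.
pub type Pid {
  Pid(id: Int)
}

/// A child spec that has forgotten its message type: only the Pid is kept,
/// so children with different message types fit in the same list.
pub type ErasedChild(argument) {
  ErasedChild(start: fn(argument) -> Result(Pid, String))
}

/// Wrap a typed start function, discarding the typed handle it returns.
pub fn erase(
  start: fn(argument) -> Result(#(Pid, handle), String),
) -> ErasedChild(argument) {
  ErasedChild(start: fn(argument) {
    case start(argument) {
      Ok(#(pid, _handle)) -> Ok(pid)
      Error(reason) -> Error(reason)
    }
  })
}

/// Start every child with the same argument, as a supervisor might.
pub fn start_all(
  children: List(ErasedChild(argument)),
  argument: argument,
) -> List(Result(Pid, String)) {
  list.map(children, fn(child) { child.start(argument) })
}

pub fn main() {
  // Two children whose typed handles differ (String and Int here)
  let children = [
    erase(fn(_arg: Nil) { Ok(#(Pid(1), "text handle")) }),
    erase(fn(_arg: Nil) { Ok(#(Pid(2), 42)) }),
  ]
  let assert [Ok(Pid(1)), Ok(Pid(2))] = start_all(children, Nil)
  Nil
}
```

The trade-off is that the supervisor can no longer hand a typed subject back to the caller, so each worker would need another way to publish its subject.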

lpil commented 1 week ago

I think the correct way to go would be to use the Erlang supervisor module as a base as it is well tested and known to work. That's what static_supervisor does, which is recommended over supervisor, which has several bugs presently.

That said, I've not looked into this much so I don't know what the API might be or what limitations might arise.

guillheu commented 1 week ago

I see. That seems feasible at first glance. simple-one-for-one fits my use case at least. I think I'm gonna need a bit more time playing with the Erlang supervisor module before I can confidently contribute to a dynamic_supervisor for gleam_otp.

guillheu commented 1 week ago

I started working on the simple-one-for-one Erlang supervisor, but realized I was mostly just reusing static_supervisor, which makes sense. I decided to add the SimpleOneForOne strategy to the static_supervisor. Because of how the simple-one-for-one strategy works, I also had to add a binding for the Erlang supervisor:start_child function. This means the static_supervisor isn't so static anymore. One thing led to another, and I ended up binding to what I think are the most important Erlang dynamic supervisor functions for all strategies:

I essentially have a functional (but messy) dynamic supervisor using the Erlang supervisor module. You can compare my work with the current otp main branch here https://github.com/gleam-lang/otp/compare/main...guillheu:otp:simple-one-for-one-erlang

I haven't opened a PR yet because I think my work is very, very messy, and I'd like some feedback first. Sorry if this is a big code review to ask for, feel free to ask for clarifications. I'll try to be responsive.

The simple-one-for-one strategy behaves so differently from the other three strategies that I'm thinking it might be preferable to have it in its own separate module.