Open sclee15 opened 1 year ago
Sounds great!
I'm also interested in this.
In terms of API design, do you see any value in having a separate supervisor and dynamic supervisor (afaict this is the case in Elixir) or should there just be a single supervisor type that happens to be dynamic, if you want it to be static you don't add more children to it. I'm personally leaning towards the option of just having a single type that is dynamic.
Also, should the supervisor still have an init function to setup initial children or do you just create a supervisor and start adding children to it dynamically?
Option 1
pub fn main() {
let assert Ok(sup) = supervisor.start(fn(children) {
children
|> add(worker(database.start))
|> add(worker(monitoring.start))
|> add(worker(web.start))
})
// Something happens in between
let assert Ok(runner) = supervisor.add_child(sup, worker(runner.start))
}
Option 2
pub fn main() {
let assert Ok(sup) = supervisor.start()
let assert Ok(db) = supervisor.add_child(sup, worker(database.start))
let assert Ok(mon) = supervisor.add_child(sup, worker(monitoring.start))
let assert Ok(web) = supervisor.add_child(sup, worker(web.start))
// Something happens in between
let assert Ok(runner) = supervisor.add_child(sup, worker(runner.start))
}
I see the value in not breaking the existing API but I also find option 2 to be a bit simpler API.
With option 2 how does it restart the children when one dies?
I haven't really looked into the current implementation but I'm assuming it will need to keep some kind of list of children.
Does it currently call the init function every time a child dies?
Nope, the current supervisor implements the rest_for_one
strategy. I think we likely need to supervisors that implement all the different strategies, and possibly some other patterns that may be useful given Gleam OTP's lack of process naming.
Oh I see, to be honest I wasn't too familiar with all the different strategies.
I guess rest_for_one
was the most logical to start with so you can control the arguments down the chain (so you don't have an old reference to a subject belonging to a process that already died)?
So some thoughts:
rest_for_one
one_for_one
or one_for_all
. So if any of the dynamic children die either only that one or all of them are restarted but the initial ones are unaffected.one_for_all
rest_for_all
except everything is started from scratch, right?one_for_one
supervisor.set_arguments(sup, MyArguments(...))
)?Or would it be preferred to have distinct static and dynamic supervisors?
I don't think we could safely restart any dynamically added children as the initial state that was used to create them is not controlled by the supervisor.
Take a web server that does some background processing as an example. It could have a web server process, a database, connection process, and dynamically, added worker processes.
If there was to be a failure, which caused them all to be restarted, the web application and the database, connection processes would be initialise correctly, but if any of the work processes were restarted using their original initial state, they would have references to the no longer existing database connection process, and such would always fail. This would eventually result in there being too much restart intensity and the entire supervisor would fail.
I see. Given my limited experience using supervisors I might not be the best person to come up with designs for this 😅.
Would a more typical use case add the dynamic supervisor as a child of a static one so that the whole dynamic supervisor is restarted if any of its dependencies crash (web server, database, ...)?
If so, then two distinct types of supervisors might make more sense.
I definitely could make use of a dynamic supervisor.
From what I've read, my understanding is that the rest-for-one
strategy makes sense for the current supervisor
since the init state of each worker depends on the state returned by the worker initialized right before it. In other words, each child worker starting returns the init state for its immediate "younger sibling". Since that init state for the "younger sibling" might therefor be different, the "younger sibling" should also be restarted, for example, if each worker needs the subject of its immediate "older sibling".
For a dynamic supervisor though, I feel like it'd be easier to use a one-for-one
strategy. This would have the limitation of not allowing workers to pass state to each other. Obviously it wouldn't work in the example @lpil gave, where multiple workers depend on each other, but I feel like it'd be better than nothing.
Am I understanding the situation correctly ?
Yup that's right. "Dynamic supervisor" is what Elixir folks call a supervisor with the one for one strategy.
I've been thinking about how to implement a dynamic supervisor. I'm trying to start from the current otp supervisor
and changing things as I go. Feel free to correct my misunderstandings and point me in the right direction.
init
function for the dynamic supervisor should not start any children. Instead, it's the loop
that should initialize children through a new message variant, something like ChildAdded
.actor.start_spec
automatically links the process of a new actor to its calling process, which is why process.traps_exits
and process.selecting_trapped_exits
can be used in the init
function without explicitly linking the child process to the supervisor.returning
type argument of the ChildSpec
type is no longer useful as previously intended, since the "next" child does not need its return. However, I think having a function that executes after initialization is still useful. For instance, I've been using the supervisor ChildSpec.returning
function to register the new child subject to a subject registry (using Chip). I think having a configurable function that runs after initialization and gives access to the new worker subject is a good thing, but should be called something else. Maybe something like finalize
instead of returning
. Also, it would return Nil
instead of the returning
type argument, which can be removed from the ChildSpec
.Spec
for the dynamic supervisor does not need a return
type argumentChildren
or Starter
types are necessary. AFAICT they're needed for the one-to-rest
strategy, but don't serve a purpose for a one-to-one
.Here's an overview:
pub opaque type Message(argument, children_message) {
ChildAdded(child: ChildSpec(children_message, argument))
Exit(process.ExitMessage)
RetryRestart(Pid)
}
type State {
State(
restarts: IntensityTracker,
// starter: Starter(a),
retry_restarts: Subject(Pid),
)
}
pub type Spec(argument) {
Spec(
argument: argument,
max_frequency: Int,
frequency_period: Int,
init: fn(argument) -> Nil,
)
}
pub opaque type ChildSpec(msg, argument) {
ChildSpec(
start: fn(argument) -> Result(Subject(msg), StartError),
finalize: fn(argument, Subject(msg)) -> Nil,
)
}
fn loop(
message: Message(argument, children_message),
state: State,
) -> actor.Next(Message(argument, children_message), State) {
case message {
Exit(exit_message) -> todo //handle_exit(exit_message.pid, state)
RetryRestart(pid) -> todo //handle_exit(pid, state)
ChildAdded(child) -> todo
}
}
fn init(
spec: Spec(argument),
) -> actor.InitResult(State, Message(argument, children_message)) {
// Create a subject so that we can asynchronously retry restarting when we
// fail to bring an exited child
let retry = process.new_subject()
// Trap exits so that we get a message when a child crashes
process.trap_exits(True)
// Combine selectors
let selector =
process.new_selector()
|> process.selecting(retry, RetryRestart)
|> process.selecting_trapped_exits(Exit)
todo
}
pub fn worker(
start: fn(argument) -> Result(Subject(msg), StartError),
finalize: fn(argument, Subject(msg)) -> Nil,
) -> ChildSpec(msg, argument) {
ChildSpec(start: start, finalize: finalize)
}
One issue with this design is all the workers under the same dynamic supervisor must all return a subject with the same message type. IDK if this is the same in Elixir, but I was forced to specify the ChildSpec
message type in the dynamic supervisor Message
because of the ChildAdded
message.
I think the correct way to go would be to use the Erlang supervisor module as a base as it is well tested and known to work. That's what static_supervisor
does, which is recommended over supervisor
, which has several bugs presently.
That said, I've not looked into this much so I don't know what the API might be or what limitations might arise.
I see.
That seems feasible at first glance. simple-one-to-one
fits my use case at least.
I think I'm gonna need a bit more time playing with the Erlang supervisor
module before I can confidently contribute to a dynamic_supervisor
for gleam_otp.
I started working on the simple-one-for-one
Erlang supervisor, but realized I was mostly just reusing static_supervisor
, which makes sense. I decided to add the SimpleOneForOne
strategy to the static_supervisor
.
Because of how the simple-one-for-one
strategy works, I also had to add a binding for the Erlang supervisor:start_child
function. This means the static_supervisor
isn't so static anymore. One thing led to another, and I ended up binding to what I think are the most important Erlang dynamic supervisor functions for all strategies:
start_child/2
terminate_child/2
restart_child/2
delete_child/2
count_children/1
I essentially have a functional (but messy) dynamic supervisor using the Erlang supervisor
module.
You can compare my work with the current otp main branch here https://github.com/gleam-lang/otp/compare/main...guillheu:otp:simple-one-for-one-erlang
I haven't opened a PR yet because I think my work is very, very messy, and I'd like some feedback first. Sorry if this is a big code review to ask for, feel free to ask for clarifications. I'll try to be responsive.
The simple-one-for-one
behaves so differently from the other 3 strategies I'm thinking it might be preferable to have it in its own separate module.
Hello.
I think Gleam's typed OTP and its concept of subject is great.
But, It would be better to have some types of DynamicSupervisor that allow me to spawn workers as I need.