How to implement one_for_all restart strategy?

l2dy commented 3 years ago

ErrTerminateTree combined with DontPropagateTermination would just kill a tree and not restart it. Is there a way to terminate and restart all child services when one of them fails?

https://github.com/thejerf/suture/blob/28d70c1169e7048214486c64ac0f40e04b595cb1/v4/supervisor.go#L365-L371

thejerf commented 3 years ago

In my opinion, you really can't implement this in Go safely, which is why suture is structured in a way that basically assumes you'll never want this. There is no safe way to stop a service and be sure it has stopped, in a situation where there may be error conditions preventing it from stopping.

Erlang is the only language I know where it is safe to reach out to another thread and kill it, to know you've killed it, and to know the system will be able to continue on safely. (Haskell is the only other language I know of with safe asynchronous exceptions, but even then I think the remote thread can capture them and I'm not sure there's an "unconditionally kill the remote thread" as there is in Erlang. Don't recall, not looking it up just for this post.)

Without this primitive at the language level, it's really dangerous to try to claim that a Go library can offer this functionality. This is partially because we are speaking about trying to do this precisely at the time that something has gone wrong, so while in normal code you may be able to assume that a crashing process isn't necessarily that big a deal, suture must be written with the assumption that bad things are happening right now and that services that didn't actually "stop" when they were asked to is a realistic concern. (Which is, after all, precisely why we have code that does just that in there already, and I've seen it fire plenty of times just in my own code, let alone all the experiences of everyone using this.)

However, I am very interested in what use case you have which leads you to want this. I never found one even when I was programming in Erlang for ~6-7 years. If we dig down to the next level of why you're interested in this functionality, we may be able to work out something that we can do in idiomatic Go to meet the use case. (Or some use of existing functionality that works for you.)

l2dy commented 3 years ago

If the supervised services depend on each other, one_for_all resets more state, and therefore could recover from certain edge cases not covered by other restart strategies. This of course may not be practical in Go, because you can't forcibly kill a goroutine.

My actual use case is more close to the rest_for_one strategy. The dependent services capture a reference of the terminated service on start, so if that service fails we have to reload it, and restarting would be a thread-safe option.

https://github.com/halturin/ergo has a supervision tree implementation, and it somehow supports the one_for_all strategy, which is why I asked if it's possible to implement this in suture.

thejerf commented 3 years ago

somehow supports the one_for_all strategy, which is why I asked if it's possible to implement this in suture.

Well... does it, though? Or does it only support it when things are working correctly and telling services to shut down works correctly?

Since what I'm saying is about Go and not about a particular library, I would rather suspect that if you write services that hang in ergo, one_for_all will not be quite as effective as it is in Erlang and will fail one way or another, not because of any particular defect in ergo, but because the Go language + runtime simply don't have the primitives necessary.

It looks like ergo wants to directly integrate into Erlang as an Erlang node, which is really cool, but means they have to make certain promises whether or not they can actually fulfill them in Go. suture is more about being idiomatic in Go and not making promises it can't fully keep.

The dependent services capture a reference of the terminated service on start, so if that service fails we have to reload it, and restarting would be a thread-safe option.

I continue to be interested in the use case you are encountering, though, because again, we may be able to come up with something more idiomatic and a promise suture can keep, rather than blindly copying Erlang.

This confuses me somewhat, though; by a "reference" do you mean a pointer? If service X starts up, and dependent services get an *X, when the service is restarted by suture that pointer will still be valid because we restart the same struct up again. If there's a channel, as long as the startup process doesn't create a new channel (i.e., it should be created at initialization time) it would still be valid. I'm not sure what is getting invalidated by a service getting restarted.

l2dy commented 3 years ago

This confuses me somewhat, though; by a "reference" do you mean a pointer? If service X starts up, and dependent services get an *X, when the service is restarted by suture that pointer will still be valid because we restart the same struct up again. If there's a channel, as long as the startup process doesn't create a new channel (i.e., it should be created at initialization time) it would still be valid. I'm not sure what is getting invalidated by a service getting restarted.

It's a buffered channel, and on restart I want to discard any data buffered to cleanup the state.

l2dy commented 3 years ago

And "killing" the rest of the services also ensures that data sent to this newly created channel would not be based on their old state.

thejerf commented 3 years ago

I had to give this one a bit of thought time. The channel adds a constraint that there is no equivalent for in Erlang, because you must create a clear channel while none of the children are trying to use it. (In Erlang, messages are spec'd as being able to just drop, but channels do not work that way.)

I think in your position I would use composition with a Supervisor, something like this:

type BuffChannelService struct {
    c        chan int
    superCtx context.Context
    *suture.Supervisor
}

func (bcs *BuffChannelService) Serve(ctx context.Context) {
    if bcs.super != nil {
        // You may do other things here to capture the results,
        // & pre-v4 needs to do something else.
        go bcs.superCtx.Cancel()
    }

    bcs.superCtx = ctx
    bcs.super = suture.New(...)
    bcs.c = make(chan int, BufferSize)

    // add children here, passing the buffered channel in to them as needed

    go bcs.super.Serve(ctx)

    for {
        // server loop here
    }
}

This is just a sketch, some logging around the cancellation or whatever may be useful, putting the child creation into a method may be nice, etc.

This should keep everything ordered correctly.

I think trying to put this in suture itself as a strategy or something would be a bit rough because of those very particular ordering needs. It starts getting into an inner-platform effect if we go too far down that road.

l2dy commented 3 years ago

Thanks, composition is really useful here. Not sure if I need a mutex to protect c, but this should work for me and decouple its children from Serve().

type BuffChanService struct {
    c         chan Action
    supCancel context.CancelFunc
    supErr    <-chan error
    *suture.Supervisor
}

func New() *BuffChanService {
    return &BuffChanService{
        Supervisor: suture.NewSimple("bcs"),
    }
}

func (bcs *BuffChanService) GetChan() chan Action {
    return bcs.c
}

func (bcs *BuffChanService) Serve(ctx context.Context) error {
    if bcs.supErr != nil {
        bcs.supCancel()
        <-bcs.supErr
    }

    ctx, bcs.supCancel = context.WithCancel(ctx)
    defer bcs.supCancel()
    bcs.c = make(chan Action, BufferSize)
    bcs.supErr = bcs.Supervisor.ServeBackground(ctx)

    for {
        // server loop
    }
}

func Run() {
    bcs := New()

    // Construct children, pass the BuffChanService pointer and let them call GetChan on (re-)initialization.
    bcs.Add(...)
}

thejerf commented 3 years ago

If you pass the channel to the children, you shouldn't need the mutex. If you pass a reference, you should probably still pass a copy of the channel to them directly to avoid the race conditions.

Glad I could help.

thejerf / suture

How to implement one_for_all restart strategy? #55