team-telnyx / pogo

Distributed supervisor for clustered Elixir applications
GNU Lesser General Public License v3.0

Supervisor crashes if starting a child fails #4

Open flaviogrossi opened 1 year ago

flaviogrossi commented 1 year ago

Hi, I noticed that when starting or redistributing child processes, the case of a child failing to start is never considered.

This can happen for various reasons: the process is globally registered and its name is (temporarily) already taken, the process does some extra work in the init callback that fails, or there is simply a bug in the init logic.

In this case, the whole supervisor will crash because of those two lines above, tearing down all other children with it.

I was wondering what the right approach is here. In the redistribution case, ignoring failed processes should be enough, since they will eventually be started if the failure is temporary. For the initial start, is it enough to ignore the error and not complete the request, so that it will be retried later? Or should the error be final, with the caller notified in some way, for example through an optional callback argument passed to start_children? Should the concept of restart intensity be introduced to distinguish between temporary and permanent failures (this sounds too complex to me)?
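
To make the "ignore the failure and notify the caller" option concrete, here is a minimal sketch. It does not use pogo's API or internals: `StartSketch`, `start_all/3` and the `on_error` callback are hypothetical names, and a plain local `DynamicSupervisor` stands in for the distributed supervisor. The only point is that every return value of `DynamicSupervisor.start_child/2` is matched, instead of asserting `{:ok, pid}`, so one failing child does not crash the process doing the starting.

```elixir
defmodule StartSketch do
  # Hypothetical helper, not part of pogo.
  # Starts each child spec under `sup`, collecting failures instead of
  # crashing on the first one. `on_error` is an optional callback that is
  # invoked with the spec and the failure reason.
  def start_all(sup, child_specs, on_error \\ fn _spec, _reason -> :ok end) do
    Enum.reduce(child_specs, %{started: [], failed: []}, fn spec, acc ->
      case DynamicSupervisor.start_child(sup, spec) do
        {:ok, pid} ->
          %{acc | started: [pid | acc.started]}

        {:ok, pid, _info} ->
          %{acc | started: [pid | acc.started]}

        :ignore ->
          acc

        {:error, reason} ->
          # Report the failure and keep going; the caller decides whether
          # to retry later or treat the error as final.
          on_error.(spec, reason)
          %{acc | failed: [{spec, reason} | acc.failed]}
      end
    end)
  end
end

# Usage: the second start fails because the name is already registered,
# one of the failure cases mentioned above.
{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

spec = %{
  id: :demo,
  start: {Agent, :start_link, [fn -> 0 end, [name: :pogo_issue_demo]]}
}

result = StartSketch.start_all(sup, [spec, spec])
```

With this shape, a retry policy or restart intensity could be layered on top of the `failed` list later, without changing how children are started.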

What is the correct approach? I can try to propose a pull request if needed.

jaybe78 commented 2 weeks ago

Hello @flaviogrossi

Have you been using this lib in production and, if so, have you been happy with it?

flaviogrossi commented 2 weeks ago

> Have you been using this lib in production and, if so, have you been happy with it?

No, we ended up creating our own library for various reasons.

We are not using it in production yet, but it seems to be working for our use case.