haskell-distributed / distributed-process-platform

DEPRECATED (Cloud Haskell Platform) in favor of distributed-process-extras, distributed-process-async, distributed-process-client-server, distributed-process-registry, distributed-process-supervisor, distributed-process-task and distributed-process-execution
http://haskell-distributed.github.com
BSD 3-Clause "New" or "Revised" License
47 stars 17 forks source link

add missing monitor on CreateHandle variant of ChildStart in Supervisor #74

Closed tavisrudd closed 10 years ago

tavisrudd commented 10 years ago

Without this the child process is unsupervised.

tavisrudd commented 10 years ago

@hyperthunk btw, how is ChildRunningExtra intended to be used? The CreateHandle variant fits our needs from the perspective of having a starter that a) is passed the SupervisorPid and b) can be used to define a ChildSpec from pure code, but I feel we're misusing. For our use case, AnotherChildStartVariant (Closure (SupervisorPid → Process (ProcessId))) might be a better fit. We're using it to pass the SupervisorPid into an api endpoint worker inside the tree. In response to our api calls, this worker asks its own supervisor to start/stop siblings dynamically.

hyperthunk commented 10 years ago

Whoops - nicely spotted @tavisrudd !!! I'll get this merged asap, though it'll take a day for me to get to it since I'm still fiddling around with other PRs and as you can see, the travis build has bombed out. I haven't bothered to look at why (as it's very late here) but it's probably just environment fu.

@hyperthunk btw, how is ChildRunningExtra intended to be used? The CreateHandle variant fits our needs from the perspective of having a starter that a) is passed the SupervisorPid and b) can be used to define a ChildSpec from pure code, but I feel we're misusing. For our use case, AnotherChildStartVariant (Closure (SupervisorPid → Process (ProcessId))) might be a better fit. We're using it to pass the SupervisorPid into an api endpoint worker inside the tree. In response to our api calls, this worker asks its own supervisor to start/stop siblings dynamically.

Argh, be very very careful doing that @tavisrudd - you can easily deadlock the supervision tree that way!!! Imagine the scenario: your worker is responding to some external event (from the network, from another process, another node, etc) and in doing so, it calls into the supervisor asking to start a new child. The supervisor however, has already started shutting down, so the call to start/stop a child never returns - it just blocks indefinitely. Now you supervisor is blocked waiting for the child to stop and the child is blocked waiting for the supervisor to respond. Of course the supervisor will deal with this if the child termination policy provides a timeout, by unceremoniously killing the child, but that's probably not what you really wanted to happen (since the child won't get to cleanup nicely if it's a managed process).

This (above) is a very common cause of deadlock in OTP supervision trees. The usual workaround is to introduce another supervisor, which is a child of the parent (and therefore a sibling of the worker) so that you don't deadlock.

Now, back to your question....

ChildRunningExtra is quite simply just a way to return some extra information about a child, which is usually some kind of opaque handle that is used to communicate with the child instead of using its ProcessId directly. I'd suggest reading http://haskell-distributed-next.github.io/tutorials/6ch.html to get a feel for what I mean by "opaque handles" here. The point is that if you want to call a Control.Distributed.Process.Platform.Service.Registry, you cannot use its ProcessId, since the registry API calls expect a Registry k v type instead. The ChildRunningExtra API provides a way to return that additional information (i.e., the Registry k v handle) along with the pid, so that you can still launch supervised registries without loosing the ability to interact with them. The same goes for any API that uses handles in this fashion, which means quite a lot of d-p-platform in practise (see Exchange and various others).

So in answer to your question, you are not really misusing the API, just using it in a way I hadn't thought of. This is really the same sort of thing as when an Erlang/OTP supervisor starts a child process and the function it is given (to start the child) returns {ok, Child, Info} instead of just {ok, Child} - it's nothing more than a way to pass back a bit of additional data to clients that locate and interact with the child via its supervisor.

I don't think AnotherChildStartVariant is necessary here. What you're doing seems fine to me, though obviously your bugfix is pretty important. ;)

I'll get this merged as soon as I get time to re-run the test suite, hopefully tomorrow afternoon or evening.

tavisrudd commented 10 years ago

On 14-03-16 5:08 PM, Tim Watson wrote:

Argh, be very very careful doing that @tavisrudd https://github.com/tavisrudd - you can easily deadlock the supervision tree that way!!! ...

This (above) is a very common cause of deadlock in OTP supervision trees. The usual workaround is to introduce another supervisor, which is a child of the parent (and therefore a sibling of the worker) so that you don't deadlock.

Thanks for pointing that out. I assume doing lookupChild parentSupPid aSiblingKey from within the api worker would be prone to the same sort of deadlock (thinking of the impossible Nothing/terminate branch in UnsafeClient.call). If so, how would you safely get a reference to the sibling supervisor you suggest?

tavisrudd commented 10 years ago

Btw, @hyperthunk I'm working on a test case or two for CreateHandle

hyperthunk commented 10 years ago

The canonical approach in erlang is to have the sibling supervisors process id passed to the worker's unit or spawn function. This could prove awkward in CH due to the rather different API shape we've adopted though. Performing the lookup immediately (I.e., durin the worker's init function) should be safe, since the top level supervisor won't start a controlled shutdown whilst it is initialising a child. Another approach would be to use name registration or the service registry. Yet another would be to perform the lookup outside of the supervisor altogether, in the child start spawn wrapper you provide. This isn't supervised and can safely block, though the effect will be the same as performing the lookup in your init function.

There may actually be a race here, that we need to plug. If the supervisor has begun spawning the child and is instructed to shut down prior to having updated its internal state, we could leak a process! We need to make sure that the supervisor atomically spawns, monitors and updates it's internal state without interruption. So I suspect we need a call to 'mask' somewhere and only unblock asynchronous exceptions once we are in a position to guarantee we know about all our children, should an exit signal or other Async exception cause a shutdown.

hyperthunk commented 10 years ago

Oh, yes please - more test cases would be lovely! :D

hyperthunk commented 10 years ago

NB: I've filed https://cloud-haskell.atlassian.net/browse/DPP-103 to look at the race I described.

tavisrudd commented 10 years ago

Btw, @hyperthunk what's the preferred method for filing PRs? Should I bother filing tickets on jira in addition to the PR here.

hyperthunk commented 10 years ago

No don't worry about that - I'll file a Jira ticket if there's significant work to do in order to integrate a patch, otherwise I just merge 'em and be done with it. :)