haskell-distributed / distributed-process-platform

DEPRECATED (Cloud Haskell Platform) in favor of distributed-process-extras, distributed-process-async, distributed-process-client-server, distributed-process-registry, distributed-process-supervisor, distributed-process-task and distributed-process-execution
http://haskell-distributed.github.com
BSD 3-Clause "New" or "Revised" License

Fix process leak on Supervisor with StarterProcess (DPP-84) #77

Open tavisrudd opened 10 years ago

tavisrudd commented 10 years ago

Fix https://cloud-haskell.atlassian.net/browse/DPP-84

When the StarterProcess ChildStart variant is used with dynamic supervision, each cycle of startNewChild + terminateChild/deleteChild leaks a process.

Note, this required converting handleDeleteChild from RestrictedProcess to Process.
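For reference, the leaking cycle looks roughly like this. This is a hedged sketch only: `someWorker` and `myChildSpec` are hypothetical helpers, and the Supervisor signatures (`startNewChild`, `terminateChild`, `deleteChild`) are approximations of the d-p-p API, not verbatim:

```haskell
-- Sketch of the leak; `sup` is a running supervisor's pid.
leakCycle :: ProcessId -> Process ()
leakCycle sup = do
  -- The StarterProcess ToChildStart instance spawns a long-lived
  -- "re-starter" process behind the scenes.
  cs <- toChildStart (someWorker :: Process ())
  let spec = myChildSpec "worker-1" cs   -- hypothetical ChildSpec builder
  _ <- startNewChild sup spec
  _ <- terminateChild sup "worker-1"
  _ <- deleteChild sup "worker-1"
  -- The ChildSpec is gone from the supervisor, but the re-starter
  -- process spawned by toChildStart is still alive: each iteration
  -- of this cycle leaks one process.
  return ()
```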

hyperthunk commented 10 years ago

So it is safe for us to kill this ProcessId from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living). But what if the code that created this re-starter then tries to send it to another supervisor? The ProcessId given to that supervisor will be dead/stale, and the supervisor will reject it (when trying to start the child spec) with StartFailureDied DiedUnknownId, which is confusing.

At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the ChildSpec becomes unreachable (and is gc'ed). However, the System.Mem.Weak documentation isn't clear on whether or not this is guaranteed for an ADT such as ProcessId. In particular, this comment:

WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.

The alternative to this would be to document this behaviour of ChildSpec, explaining that once deleted from a supervisor, the ChildSpec becomes invalid. But I really really dislike this idea. The problem is that the ChildSpec is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the ChildSpec and send it to another node, if the same ChildSpec (or even another ChildSpec that shares the same ToChildStart thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the ProcessId is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.

One approach that might work here would be to convert the StarterProcess constructor to take a Weak ProcessId and in the ToChildStart instance create a weak reference to the process id and create a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., the ChildStart datum becomes garbage. Problem solved? Well no.....
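The finalizer idea can be illustrated with plain System.Mem.Weak from base (here a String stands in for the ProcessId, and the printed messages are illustrative; as discussed below, whether and when the finalizer fires for an ordinary Haskell value is exactly the uncertain part):

```haskell
import Control.Concurrent (threadDelay)
import System.Mem (performGC)
import System.Mem.Weak (mkWeakPtr, deRefWeak)

main :: IO ()
main = do
  let pid = "pid://restarter" :: String   -- stand-in for a ProcessId
  -- Attach a finalizer that would kill the re-starter process.
  w <- mkWeakPtr pid (Just (putStrLn "finalizer: kill re-starter here"))
  performGC
  threadDelay 100000   -- give the finalizer thread a chance to run
  mv <- deRefWeak w
  case mv of
    Just _  -> putStrLn "key still alive"
    Nothing -> putStrLn "key collected"
```

Note that the GHC docs quoted above warn that the compiler may optimise or duplicate the underlying value, so the finalizer's timing (or firing at all) is not something this code can rely on.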

Firstly, we'd need to test that this finalization business works properly for a Weak ProcessId. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the System.Mem.Weak infrastructure time to do the cleanup - maybe even forcing a GC at some point.

Secondly, and this is a serious problem: finalization is useless if/when the ChildStart datum escapes the local node. Which means, in practice, that you can't serialize and send it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the ChildStart at all (i.e., we will lose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a ProcessId works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(

You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced StarterProcess. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.

I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:

  1. remove StarterProcess and make people use the Closure Process constructor instead
  2. re-work StarterProcess to reference count its clients/supervisors
  3. remove the ToChildStart instance for Process () and export StarterProcess then make people implement this themselves

The idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.

The idea behind (2) is that the re-starter process will keep some internal state to track each time a new supervisor tries to use it. Each time a supervisor deletes a ChildSpec that uses this re-starter pid, instead of kill restarterPid ... they should exit restarterPid Shutdown and the re-starter process should catchExit expecting Shutdown and decrement its supervisor/reference count each time. Once that ref-count reaches zero, it terminates. This is still vulnerable to the problem I outlined above though, so I think it's not really viable. We cannot provide an API to the general population that is open to abuse like this.
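Option (2) could be sketched roughly as follows, assuming distributed-process's `catchExit` and d-p-p's `Shutdown` exit reason. The `Attach`/`RestartRequest` protocol is hypothetical (the real re-starter protocol differs), Serializable instances are elided, and the partial `Shutdown` match is tolerated for brevity:

```haskell
-- Hypothetical client protocol for the re-starter (not the real d-p-p
-- one); assume the usual Binary/Typeable (Serializable) instances.
data StarterMsg = Attach | RestartRequest ProcessId

refCountedStarter :: Process () -> Process ()
refCountedStarter child = go (1 :: Int)   -- spawned with one client
  where
    go 0 = return ()   -- last supervisor detached: terminate
    go n = serve n `catchExit` \_from Shutdown -> go (n - 1)
    serve n = do
      msg <- expect
      case msg of
        Attach              -> go (n + 1)   -- a new supervisor registers
        RestartRequest from -> do           -- (re)start the child
          pid <- spawnLocal child
          send from pid
          go n
```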

The idea of (3) is that we either find a way to expose that re-starter pid or just make people implement the ToChildStart themselves. The point is, if you've written your code so that you know which re-starter is associated with which ChildSpec and you know (in your own code) that you're no longer using the ChildSpec or the StarterProcess then you - the one who knows what is going on are best placed to terminate that process yourself when you're sure it is no longer needed. This puts the responsibility where it really IMHO belongs - in the hands of the application developers.

Thoughts? ......

hyperthunk commented 10 years ago

And of course, it'd be really nice if we came up with a way to do "distributed finalization", but that's really beyond the scope of this ticket. :)

tavisrudd commented 10 years ago

(1) isn't an option for our use case and (2) sounds seriously hard to get right. I think (3) makes the most sense. It would still require cleanup of the restarterPid somewhere, but as you say, the developer using dynamic supervision is best positioned to do that correctly. They'd do something like terminateChild ... >> deleteChild ... >> cleanupStarterProcess. It may be worth keeping the code from the related ToChildStart instances in d-p-p but as plain functions and in a separate module accompanied by documentation and warnings specific to that use case.

hyperthunk commented 10 years ago

It may be worth keeping the code from the related ToChildStart instances in d-p-p but as plain functions and in a separate module accompanied by documentation and warnings specific to that use case.

Yes I agree that's probably the right thing to do. We could leave it in the Supervisor API, but as a plain top-level function that returns the process id and then provide a mapping from pid to ChildStart:

```haskell
-- | Spawns a process that can be used to (re)start child processes
-- when they fail. The `ChildStart` is suitable for use in a `ChildSpec`.
-- Note that once the starter process is no longer used, the caller
-- (of this function) is responsible for cleaning up (i.e., stopping) it.
spawnChildStarter :: Process () -> Process (ProcessId, ChildStart)
spawnChildStarter proc = ... etc
```

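One possible body for that sketch, purely as an illustration: it assumes the `StarterProcess` constructor carries the re-starter's pid, and defers to a hypothetical `restarterLoop` implementing the existing re-starter behaviour:

```haskell
-- Sketch only; `restarterLoop` is a hypothetical stand-in for the
-- re-starter logic currently buried in the StarterProcess instance.
spawnChildStarter :: Process () -> Process (ProcessId, ChildStart)
spawnChildStarter proc = do
  pid <- spawnLocal (restarterLoop proc)
  return (pid, StarterProcess pid)
```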
tavisrudd commented 10 years ago

@hyperthunk I've been experimenting with spawnChildStarter but don't have a patch ready yet. I'd like to send you a few refactoring patches first. The first just makes the implementation details of the StarterProcess variants of ToChildStart private via where clauses. The second moves all the public class definitions, data types and associated derived instances from Supervisor.hs to Control.Distributed.Process.Platform.Supervisor.Types. Does that sound ok to you? If so, I'll open a new PR for each.

hyperthunk commented 10 years ago

Does that sound ok to you? If so, I'll open a new PR for each.

Yeah, that sounds good to me. Thanks for all the work you've been putting in on this - it's much appreciated! I'll try and get all the outstanding PRs QA'd and merged this week.

hyperthunk commented 10 years ago

So.... where are we with this pull request now @tavisrudd ?

hyperthunk commented 10 years ago

I'm going to have to release without this fix I'm afraid, as I'm on holiday next week and this is the last chance I'm going to have to actually make the release. Let's get this merged when I'm back online (next Friday) and do a patch 0.1.1 release with it in.

hyperthunk commented 10 years ago

Caveats - unless you can get me a fix in the next few hours.... :) ?

tavisrudd commented 10 years ago

I won't have time till next week unfortunately. The last few weeks of parenthood and work have been a bit intense and I'm just catching a breath now. Enjoy your holiday!

Tavis


hyperthunk commented 9 years ago

Hey @tavisrudd - how're you? Do you want me to pick up the fix for this?

tavisrudd commented 9 years ago

I'm well, but busier than I've ever been with a new kid and new job. I've been meaning to get back to this for the past two months but haven't had time. If you want to take this one over, I won't be insulted.


hyperthunk commented 9 years ago

Ok @tavisrudd - I'm going to pick this one up now. As you can tell, I've been a bit on the busy side myself!