Open tavisrudd opened 10 years ago
So it is safe for us to kill this ProcessId
from our point of view, because we're deleting the child spec (and therefore won't require this "re-starter process" to continue living. But what if the code that created this re-starter tries to then send it to another supervisor? The ProcessId
given to that supervisor will be dead/stale and the supervisor will reject it (when trying to start the child spec) with StartFailureDied DiedUnknownId
which is confusing.
At first glance, I thought what we actually wanted here is a finalizer that is guaranteed to kill off the re-starter process "at some point" after the ChildSpec
becomes unreachable (and is gc'ed). However, the System.Mem.Weak
documentation isn't clear on whether or not this is guaranteed for an ADT such as ProcessId
. In particular, this comment:
WARNING: weak pointers to ordinary non-primitive Haskell types are particularly fragile, because the compiler is free to optimise away or duplicate the underlying data structure. Therefore attempting to place a finalizer on an ordinary Haskell type may well result in the finalizer running earlier than you expected. This is not a problem for caches and memo tables where early finalization is benign.
The alternative to this would be to document this behaviour of ChildSpec
, explaining that once deleted from a supervisor, the ChildSpec
becomes invalid. But I really really dislike this idea. The problem is that the ChildSpec
is now behaving like a shared, mutable (unsafe) data structure. Even if you serialize the ChildSpec
and send it to another node, if the same ChildSpec
(or even another ChildSpec
that shares the same ToChildStart
thunk!) is removed from any supervisor in the system, we've just invalidated that same data structure across all nodes, because the ProcessId
is no longer valid. That just seems wrong to me, no matter how much documentation we throw at it.
One approach that might work here would be to convert the StarterProcess
constructor to take a Weak ProcessId
and in the ToChildStart
instance create a weak reference to the process id and create a finalizer that kills the process once all its clients - the supervisors using it - go away, i.e., the ChildStart
datum becomes garbage. Problem solved? Well no.....
Firstly, we'd need to test that this finalization business works properly for a Weak ProcessId
. You've already written profiling code to detect the leak, so that shouldn't be too difficult, though relying on finalization will probably mean having to increase the time bounds of the tests to give the System.Mem.Weak
infrastructure time to do the cleanup - maybe even forcing a GC at some point.
Secondly, and this is a serious problem: finalisation is useless if/when the ChildStart
datum escapes the local node. Which means, in practice, that you can't serialize and send
it remotely, otherwise this whole finalization thing will go wrong - we'll never detect that a remote peer is using the ChildStart
at all (i.e., we will loose track of it as soon as it's gone over the wire) and therefore - assuming that finalizing a ProcessId
works at all - we'll end up killing the re-starter process whilst remote supervisors are still using it. Nasty. Confusing. Bug. :(
You can see now why I was a fussy little kitty when @roman and I were debating how to support local children in the original ticket that introduced StarterProcess
. This issue is fraught with edge cases and we're changing a piece of critical fault-tolerance infrastructure, so we can't afford to screw it up.
I think we have three choices here, as I see it - feel free to suggest alternatives though, as always:
StarterProcess
and make people use the Closure Process
constructor insteadStarterProcess
to reference count its clients/supervisorsToChildStart
instance for Process ()
and export StarterProcess
then make people implement this themselvesThe idea of (1) is that we're avoiding the issue by making the API less friendly. Not a great option IMO.
The idea behind (2) is that the re-starter process will keep some internal state to track each time a new supervisor tries to use it. Each time a supervisor deletes a ChildSpec
that uses this re-starter pid, instead of kill restarterPid ...
they should exit restarterPid Shutdown
and the re-starter process should catchExit
expecting Shutdown
and decrement its supervisor/reference count each time. Once that ref-count reaches zero, it terminates. This is still vulnerable to the problem I outlined above though, so I think it's not really viable. We cannot provide an API to the general population that is open to abuse like this.
The idea of (3) is that we either find a way to expose that re-starter pid or just make people implement the ToChildStart
themselves. The point is, if you've written your code so that you know which re-starter is associated with which ChildSpec
and you know (in your own code) that you're no longer using the ChildSpec
or the StarterProcess
then you - the one who knows what is going on are best placed to terminate that process yourself when you're sure it is no longer needed. This puts the responsibility where it really IMHO belongs - in the hands of the application developers.
Thoughts? ......
And of course, it'd be really nice if we came up with a way to do "distributed finalization", but that's really beyond the scope of this ticket. :)
(1) isn't an option for our use case and (2) sounds seriously hard to get right. I think (3) makes most sense. It would still require cleanup of the restarterPid somewhere, but as you say, the developer using dynamic supervision is best positioned to do that correctly. They'd do something like terminateChild ... >> deleteChild ... >> cleanupStarterProcess
. It may be worth keeping the code from the related ToChildStart
instances in d-p-p but as plain functions and in a separate module accompanied by documentation and warnings specific to that use case.
It may be worth keeping the code from the related ToChildStart instances in d-p-p but as plain functions and in a separate module accompanied by documentation and warnings specific to that use case.
Yes I agree that's probably the right thing to do. We could leave it in the Supervisor
API, but as a plain top-level function that returns the process id and then provide a mapping from pid to ChildStart
:
-- | Spawns a process that can be used to (re)start child processes
-- when they fail. The `ChildStart` is suitable for use in a `ChildSpec`.
-- Note that once the starter process is no longer used, the caller
-- (of this function) is responsible for cleaning up (i.e., stopping) it.
spawnChildStarter :: Process () -> Process (ProcessId, ChildStart)
spawnChildStarter proc = ... etc
@hyperthunk I've been experimenting with spawnChildStarter
but don't have a patch ready yet. I'd like to send you a few refactoring patches first. The first just makes the implementation details of the StarterProcess
variants of ToChildStart
private via where clauses. The second moves all the public class definitions, data types and associated derived instances from Supervisor.hs
to Control.Distributed.Process.Platform.Supervisor.Types
. Does that sound ok to you? If so, I'll open a new PR for each.
Does that sound ok to you? If so, I'll open a new PR for each.
Yeah that sounds good to me. Thanks for all the work you've been putting around this - it's much appreciated! I'll try and get all the outstanding PR's QA'd and merged this week.
So.... where are we with this pull request now @tavisrudd ?
I'm going to have to release without this fix I'm afraid, as I'm on holiday next week and this is the last chance I'm going to have to actually make the release. Let's get this merged when I'm back online (next Friday) and do a patch 0.1.1 release with it in.
Caveats - unless you can get me a fix in the next few hours.... :) ?
I won't have time till next week unfortunately. The last few weeks of parenthood and work have been a bit intense and I'm just catching a breath now. Enjoy your holiday!
Tavis
On May 30, 2014, at 5:12 AM, Tim Watson notifications@github.com wrote:
Caveats - unless you can get me a fix in the next few hours.... :) ?
— Reply to this email directly or view it on GitHub.
Hey @tavisrudd - how're you? Do you want me to pick up the fix for this?
I'm well, but busier than I've ever been with a new kid and new job. I've been meaning to get back to this for the past two months but haven't had time. If you want to take this one over, I won't be insulted.
On Wed, 13 Aug 2014, Tim Watson wrote:
Hey @tavisrudd - how're you? Do you want me to pick up the fix for this?
Reply to this email directly or view it on GitHub: https://github.com/haskell-distributed/distributed-process-platform/pull/77#issuecomment-52023004
Ok @tavisrudd - I'm going to pick this one up now. As you can tell, I've been a bit on the busy side myself!
Fix https://cloud-haskell.atlassian.net/browse/DPP-84
When the StarterProcess ChildStart variant is used with dynamic supervision, each cycle of startNewChild + terminateChild/deleteChild leaks a process.
Note, this required converting
handleDeleteChild
fromRestrictedProcess
toProcess
.