gproc_pool:claim doesn't return worker when calling process is killed - Githubissues

uwiger / gproc

Extended process registry for Erlang

Apache License 2.0

1.07k stars 232 forks source link

gproc_pool:claim doesn't return worker when calling process is killed #109

Open netDalek opened 8 years ago

netDalek commented 8 years ago

I've found gproc_pool very flexible vs other pool libraries, but found this behaviour

48>  gproc_pool:claim(pool, fun(_, P) ->  P end).
{true,<0.185.0>}
49> Pid = spawn_link(fun() -> gproc_pool:claim(pool, fun(_, P) -> timer:sleep(10000), P en
d) end).
<0.223.0>
50> exit(Pid, kill).
** exception exit: killed
51>  gproc_pool:claim(pool, fun(_, P) ->  P end).
false
69> gproc_pool:active_workers(pool).
[{a,<0.185.0>}]

After killing the process, worker stay busy forever. I look through the code and found that it is expected behaviour. Am I right? Are there any plans to improve it?

uwiger commented 8 years ago

It's true that the code doesn't handle the case of a process inside a claim operation gets killed from the outside. Strictly speaking, this is a bug.

Fixing it will cause a general slowdown, but I think I found a way that is still cheap enough. See PR #110. Let me know what you think.

netDalek commented 8 years ago

Looks good and simple, can't find disadvantages

uwiger commented 8 years ago

Describing it to my son, I came up with some failure scenarios that seem very hard to address.

Basically, the approach is inherently vulnerable to the calling process being murdered while running inside the claim. Starting from the (correct) assumption that the process could be killed after any given call, at least the following scenarios, however unlikely, come to mind:

Line 509, just after getting the lock, but before spawning the monitoring process. A solution could be to start monitoring before trying the lock, but this seems to forfeit the purpose of the approach.
Line 551, if the calling process is killed just after killing the monitoring process, but before resetting the counter, the counter will never be reset.
If lines 551 & 552 are switched, and the calling process resets the counter then dies before killing the monitoring process, the monitor will reset the counter again, possibly after another process has claimed it.

Considering the above, one could either conclude that the claim() operation is broken and shouldn't be used, or decide to live with the probability that it sometimes breaks, and then decide which types of breakage are least desirable (I'd wager that the counter never getting reset is the most serious).

Alternatively, figure out a different approach that's safe and not so much slower that the function becomes unfit for its purpose.

uwiger commented 8 years ago

(This would be a good time to play around with QuickCheck/Pulse ...)

uwiger commented 8 years ago

Since it certainly is an improvement over the previous implementation, I decided to merge. We can try to eliminate the (very slight) holes in the algorithm later.

netDalek commented 8 years ago

Line 551, if the calling process is killed just after killing the monitoring process, but before resetting the counter, the counter will never be reset.

May be it's better in place of kill + reset_counter send a message to spawned process.

execute_claim(F, K, Pid) ->
     Parent = self(),
     Mon = spawn(
             fun() ->
                     Ref = erlang:monitor(process, Parent),
                     receive
                         {'DOWN', Ref, _, _, _} ->
                             gproc:reset_counter(K)
                         exit ->
                             gproc:reset_counter(K)
                     end
             end),
     try begin
             Res = F(K, Pid),
             {true, Res}
         end
     after
         Pid ! exit
     end.

uwiger commented 8 years ago

I thought about that, but it complicates things, is not inherently safe, and also introduces significant latency (assuming the monitoring process always does the reset).

It would probably at least be a good idea to switch lines 551 and 552, though. I forgot to do that.