uwiger / gproc

Extended process registry for Erlang
Apache License 2.0

Unresponsive leader and lost DOWNs #30

Closed msadkov closed 10 years ago

msadkov commented 11 years ago

Hello,

I ran into an issue where a globally registered name is not automatically cleaned up by gproc if the name's owner process crashes while the leader is unresponsive. My guess is that this happens because of a lost DOWN notification. Steps to reproduce:

start two nodes dev1 and dev2 (dev1 is the leader):

(dev1@127.0.0.1)1> application:start(gproc). nodes().
ok
(dev1@127.0.0.1)2> nodes().
['dev2@127.0.0.1']
(dev1@127.0.0.1)3> 

(dev2@127.0.0.1)1> application:start(gproc). nodes().
ok
(dev2@127.0.0.1)2> nodes().
['dev1@127.0.0.1']
(dev2@127.0.0.1)3> 
(dev2@127.0.0.1)4> gproc_dist:get_leader().
'dev1@127.0.0.1'

start a process on dev2 that registers a global name:

(dev2@127.0.0.1)5> spawn(fun() -> gproc:add_global_name(foobar), timer:sleep(10000) end).
<0.66.0>

within the 10-second timeout, send dev1 to the background by hitting CTRL+Z. The registration is there:

(dev2@127.0.0.1)6> gproc:select({g,n}, [{'_', [], ['$$']}]).
[[{n,g,foobar},<0.66.0>,undefined]]

wait for dev1 to disappear from nodes:

(dev2@127.0.0.1)10> nodes().
[]

registration is still there:

(dev2@127.0.0.1)11> gproc:select({g,n}, [{'_', [], ['$$']}]).
[[{n,g,foobar},<0.66.0>,undefined]]

gproc refuses to register it:

(dev2@127.0.0.1)12> gproc:add_global_name(foobar).
** exception error: bad argument
     in function  gproc:add_global_name/1
        called as gproc:add_global_name(foobar)

where/1 filters it out since it's a local pid:

(dev2@127.0.0.1)16> gproc:where({n,g,foobar}).
undefined

Is this behavior a bug or a feature? Is there a good way to cope with it?
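
For readers hitting the same symptom, a stale entry of this kind can at least be detected from the node that owns the pid. The following is only a sketch under stated assumptions: the module name `stale_names` and the function `stale/1` are hypothetical, not part of gproc, and the input is assumed to have the `[Key, Pid, Value]` row shape returned by the `gproc:select/2` call shown above. It reports stale names but does not remove them.

```erlang
-module(stale_names).
-export([stale/1]).

%% Hypothetical helper, not part of gproc.
%% Entries is assumed to be a list of [Key, Pid, Value] rows, as returned by
%% gproc:select({g,n}, [{'_', [], ['$$']}]) in the transcript above.
%% Returns the keys whose owner is a local pid that is no longer alive.
stale(Entries) ->
    [Key || [Key, Pid, _Value] <- Entries,
            is_pid(Pid),
            %% is_process_alive/1 only accepts local pids, so check locality first.
            node(Pid) =:= node(),
            not is_process_alive(Pid)].
```

It could be called as `stale_names:stale(gproc:select({g,n}, [{'_', [], ['$$']}])).` on dev2. Entries owned by pids on other nodes are skipped, since their liveness cannot be checked locally this way.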

Thank you!

uwiger commented 11 years ago

It's certainly not a feature! :)

I'm looking into it.

uwiger commented 11 years ago

I have some ideas, but the trickiest part of the problem is that a netsplit occurs. Only a few of the gen_leader versions (e.g. garret-smith/gen_leader_revival) have some support for netsplits, and at least when I try this scenario with garret-smith's version, it doesn't seem to do the right thing.

However, a few things come to mind:

norton commented 11 years ago

@msadkov There exists an application to detect network splits for mnesia and hibari. It would need some customisation for gproc. Nevertheless, it might be of help to you.

The application is here => https://github.com/hibari/partition-detector

The admin documentation is here => http://hibari.github.com/hibari-doc/hibari-sysadmin-guide.en.html#partition-detector

msadkov commented 11 years ago

@uwiger @norton thank you for your replies! I'm aware that gen_leader/gproc cannot handle netsplits, so -kernel dist_auto_connect was set to once in this case (I should have mentioned this in my first post, sorry). That said, this situation doesn't look like a netsplit, but rather like a node going down (not immediately, but after a timeout), right? And once the unresponsive node disappears from the nodes list (which means the connection was terminated, AFAIU), no timeout is involved anymore, since I get an immediate DOWN message when calling erlang:monitor with the dead pid sitting in gproc's ets.
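
The observation about erlang:monitor can be reproduced in isolation. This is a minimal sketch, independent of gproc; the module name `dead_monitor_demo` is hypothetical. It monitors a pid that has already exited and shows that the 'DOWN' message arrives right away, with reason noproc for a local process that no longer exists.

```erlang
-module(dead_monitor_demo).
-export([run/0]).

%% Spawn a short-lived process, wait until it is gone, then monitor the
%% dead pid. Returns the reason carried by the second 'DOWN' message.
run() ->
    Pid = spawn(fun() -> ok end),
    %% First monitor: used only to make sure the process has terminated.
    Ref0 = erlang:monitor(process, Pid),
    receive {'DOWN', Ref0, process, Pid, _} -> ok end,
    %% Second monitor on the now-dead pid: 'DOWN' is delivered immediately.
    Ref1 = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref1, process, Pid, Reason} -> Reason
    after 1000 ->
        timeout
    end.
```

Calling `dead_monitor_demo:run().` returns `noproc`, which is why a fresh monitor on the pid stored in gproc's table would reveal that the owner is gone.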

uwiger commented 11 years ago

I'm making some progress getting gproc to heal after netsplits, as well as doing proper monitoring. I don't have a solution for handling conflicts yet, and need to fix some regression bugs. I'll keep you informed.

msadkov commented 11 years ago

Thank you!

uwiger commented 11 years ago

Going through the issues list, closing out issues. This one is not yet resolved, so I'll leave it open. Sorry about the delay.

uwiger commented 10 years ago

Closing this issue. Feel free to try out the locks_leader branch which should handle netsplits more robustly.