uwiger / gproc

Extended process registry for Erlang
Apache License 2.0

pid weirdness with gproc_dist #141

Open uwiger opened 7 years ago

uwiger commented 7 years ago

I came across this. I won't claim that it's a bug in Erlang - I assume there was a hiccup in the networking on my Mac - but I must say that I can't remember ever having seen this before.

(I believe Hans Svensson talked about something similar in Freiburg 2007).

Running some gproc tests with two local nodes. OTP 18, for no particular reason.

Node A:

Erlang/OTP 18 [erts-7.3.1] [source] [64-bit] [smp:8:8] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3.1  (abort with ^G)
(a@uwpro-2)1> application:ensure_all_started(gproc).
{ok,[gproc]}

Node B:

Eshell V7.3.1  (abort with ^G)
(b@uwpro-2)1> net:ping('a@uwpro-2').
pong
(b@uwpro-2)2> application:ensure_all_started(gproc).
{ok,[gproc]}
(b@uwpro-2)3> gproc:monitor({n,g,a},follow).
#Ref<6830.0.4.170>
(b@uwpro-2)4> flush().
Shell got {gproc,unreg,#Ref<6830.0.4.170>,{n,g,a}}
ok

Node A:

(a@uwpro-2)2> ets:tab2list(gproc).
[{{<6946.39.0>,{n,g,a}},[]},
 {{<6946.39.0>,{n,g,a}},[]},
 {{{n,g,a},n},[{<6946.39.0>,#Ref<0.0.4.170>,follow}]}]
(a@uwpro-2)4> ets:lookup(gproc, {pid(6946,39,0),{n,g,a}}).
[]
(a@uwpro-2)8> [A,B,C] = v(2).                   
[{{<6946.39.0>,{n,g,a}},[]},
 {{<6946.39.0>,{n,g,a}},[]},
 {{{n,g,a},n},[{<6946.39.0>,#Ref<0.0.4.170>,follow}]}]
(a@uwpro-2)9> ets:lookup(gproc,element(1,A)).
[{{<6946.39.0>,{n,g,a}},[]}]
(a@uwpro-2)10> ets:lookup(gproc,element(1,B)).
[{{<6946.39.0>,{n,g,a}},[]}]
(a@uwpro-2)11> A == B.
false

(a@uwpro-2)13> Pa = element(1,element(1,A)).
<6946.39.0>
(a@uwpro-2)14> Pb = element(1,element(1,B)).
<6946.39.0>
(a@uwpro-2)15> Pa == Pb.
false
(a@uwpro-2)18> term_to_binary(Pa).
<<131,103,100,0,9,98,64,117,119,112,114,111,45,50,0,0,0,
  39,0,0,0,0,1>>
(a@uwpro-2)19> term_to_binary(Pb).
<<131,103,100,0,9,98,64,117,119,112,114,111,45,50,0,0,0,
  39,0,0,0,0,2>>
(a@uwpro-2)20> Pa ! hi_a.
hi_a
(a@uwpro-2)21> Pb ! hi_b.
hi_b
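The two binaries above differ only in the last byte. Per the Erlang external term format, `PID_EXT` (tag 103) encodes the node name followed by a 4-byte ID, a 4-byte serial and (in OTP 18's encoding) a 1-byte creation, and the creation is bumped when a node restarts. So `Pa` and `Pb` print identically as `<6946.39.0>` yet compare unequal. A small sketch that unpacks those fields (the module and function names are made up for illustration):

```erlang
%% Sketch: unpack the PID_EXT external term format (tag 103) as produced
%% by term_to_binary/1 on OTP 18: node name as ATOM_EXT (tag 100 with a
%% 2-byte length), then a 4-byte ID, a 4-byte Serial and a 1-byte
%% Creation. The creation is what differs between Pa and Pb above.
-module(pid_fields).
-export([decode/1]).

decode(Pid) when is_pid(Pid) ->
    %% Assumes the old OTP 18-era encoding; newer OTP releases emit
    %% NEW_PID_EXT (tag 88) with a 4-byte creation instead.
    decode(term_to_binary(Pid));
decode(<<131, 103, 100, Len:16, Node:Len/binary,
         Id:32, Serial:32, Creation:8>>) ->
    #{node => binary_to_atom(Node, latin1),
      id => Id, serial => Serial, creation => Creation}.
```

Applied to the two binaries above, both decode to id 39 and serial 0, but creation 1 versus 2: the node behind `Pb` had been restarted.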

Node B:

(b@uwpro-2)5> flush().
Shell got {gproc,unreg,#Ref<6830.0.4.170>,{n,g,a}}
Shell got hi_b
ok

A bit interesting to try to reproduce, I guess.

jlouis commented 7 years ago

Two immediate observations:

dszoboszlay commented 7 years ago

> The pids have different serials. Presumably this means that the nodes have disconnected and reconnected (and gproc should have cleaned up, but didn't)

No, it means the node was stopped and restarted. It is actually quite easy to generate identical-looking pairs of pids:

erl -sname receiver
Erlang/OTP 18 Klarna-g16e0e6a [erts-7.3.1.3] [source-16e0e6a] [64-bit] [smp:4:4] [async-threads:10] [kernel-poll:false]

Eshell V7.3.1.3  (abort with ^G)
(receiver@dszoboszlay)1> register(receiver, self()).
true
(receiver@dszoboszlay)2> os:cmd("erl -sname sender -noinput -eval '{receiver, receiver@dszoboszlay} ! self(), init:stop().'").
[]
(receiver@dszoboszlay)3> os:cmd("erl -sname sender -noinput -eval '{receiver, receiver@dszoboszlay} ! self(), init:stop().'").
[]
(receiver@dszoboszlay)4> receive P1 -> P1 end.
<7411.2.0>
(receiver@dszoboszlay)5> receive P2 -> P2 end.
<7411.2.0>
(receiver@dszoboszlay)6> P1 == P2.
false

I don't know how gproc (and you) could miss the node's restart, however.
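For what it's worth, a registry process can at least observe disconnects and restarts by subscribing to node status messages with `net_kernel:monitor_nodes/1` (a standard kernel API; this is only an illustration, not a claim about how gproc is implemented internally):

```erlang
%% Sketch: reacting to nodedown/nodeup messages. handle/1 is a
%% hypothetical pure helper that decides what to do for each event;
%% loop/0 would run inside a registry-like process on a distributed
%% node. Illustrative only, not gproc's actual mechanism.
-module(restart_watch).
-export([start/0, handle/1]).

start() ->
    ok = net_kernel:monitor_nodes(true),
    loop().

loop() ->
    receive
        Msg ->
            _ = handle(Msg),
            loop()
    end.

%% A nodedown should trigger cleanup of entries owned by that node's
%% pids; after a nodeup, freshly handed-out pids may collide visually
%% with pre-restart ones, so stale entries must be gone by then.
handle({nodedown, Node}) -> {cleanup, Node};
handle({nodeup, Node})   -> {resync, Node};
handle(_Other)           -> ignore.
```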

uwiger commented 7 years ago

On 2017-09-07 16:30 GMT+02:00, Dániel Szoboszlay (notifications@github.com) wrote:

> I don't know how gproc (and you) could miss the node's restart, however.

Yeah. I can't guarantee that I didn't do something embarrassingly stupid while thinking of something else entirely. :)

A slight twist, of course, is that the printed representation gives no clue that the pids are different. Had they not formed the unique part of the ets table keys, I probably wouldn't have noticed that something was amiss.

BR, Ulf

hanssv commented 6 years ago

I wrote a simple QuickCheck property: basically, it starts a bunch of nodes (each with a placeholder process that makes sure gproc is started) and issues commands that randomly call monitor(..., follow) and ets:tab2list(gproc) on the nodes.

It could not provoke the behaviour Ulf observed. It ran for an hour, so some 100k node starts were probably made.

Then I added stopping nodes to the mix, and could (not too surprisingly) mimic the behaviour:

gproc_eqc:start_node(b, []) -> <22171.68.0>
gproc_eqc:start_node(a, [b]) -> <22070.72.0>
gproc_eqc:monitor(#node{ id = a, worker = <22070.72.0>, monitors = []}, a) ->
  #Ref<22171.852298202.3371433987.64237>
gproc_eqc:stop_node(a) -> ok
gproc_eqc:start_node(a, [b]) -> <22070.72.0>
gproc_eqc:monitor(#node{ id = a, worker = <22070.72.0>, monitors = []}, a) ->
  #Ref<22171.852298202.3371433987.64255>
gproc_eqc:check_node(#node{ id = b, worker = <22171.68.0>, monitors = []}) ->
  {ok,
     [{{<22070.72.0>, {n, g, a}}, []},
      {{<22070.72.0>, {n, g, a}}, []},
      {{{n, g, a}, n},
       [{<22070.72.0>, #Ref<22171.852298202.3371433987.64255>, follow},
        {<22070.72.0>, #Ref<22171.852298202.3371433987.64237>,  follow}]}]}

Reason:
  Post-condition failed:
  [{["<22070.72.0>"], [<22070.72.0>, <22070.72.0>]}] /= []

The difference is that I have two entries in the last element of the list, so it is not exactly the same. However, a node restart will lead to pid reuse, as already observed in Freiburg just a hair over ten years ago :-) (Time flies!)
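The failing postcondition above boils down to grouping the pids found in the table snapshot by their printed form and flagging any group that contains more than one distinct pid term. A rough, hypothetical version of such a check (not taken from the actual EQC model) could look like:

```erlang
%% Sketch: flag pids in a gproc-style table snapshot that print
%% identically (same pid_to_list/1 string) but are different terms,
%% e.g. because they differ only in their creation. Hypothetical
%% helper for illustration, not from the actual QuickCheck model.
-module(pid_reuse_check).
-export([suspicious/1]).

suspicious(Entries) ->
    %% Collect the distinct owner pids from {{Pid, Key}, Attrs} rows;
    %% reverse-index rows like {{Key, n}, ...} are filtered out by the
    %% is_pid/1 guard.
    Pids = lists:usort([P || {{P, _Key}, _} <- Entries, is_pid(P)]),
    Grouped = lists:foldl(
                fun(P, Acc) ->
                        Str = pid_to_list(P),
                        maps:update_with(Str, fun(Ps) -> [P | Ps] end,
                                         [P], Acc)
                end, #{}, Pids),
    %% Keep only groups where one printed form maps to several
    %% distinct pid terms.
    maps:filter(fun(_Str, Ps) -> length(Ps) > 1 end, Grouped).
```

With a healthy table, every printed form maps to exactly one pid term and the result is the empty map; the counterexample above would yield one group with two pids.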

uwiger commented 6 years ago

Ah, very good! :)

I will concede that I must have forgotten to stop both nodes before making a new attempt. I see an opportunity for some more tests here, and it seems likely that gproc isn't doing what it's supposed to. The locks_leader branch, OTOH, has support for split-brain healing; I'll have to check how it behaves in the same situation.

hanssv commented 6 years ago

I'll see if I can find time to clean up that model into a non-embarrassing state; if so, I will share it so it can be extended to do something useful :-)

hanssv commented 6 years ago

Made a pull request (#144)

For me the existing QuickCheck property failed horribly (I made some changes in the pull request), but I guess that is expected?