Open uwiger opened 7 years ago
Two immediate observations:
These often happen when the internal representation of an object differs from its printed representation. We had 0.0 and -0.0 in ETS earlier, and also -2^60 +/- 1 (I can't remember exactly which it was). Transferring a Pid could mean that its internal bit structure differs, and thus the compare function on the ordered set runs into trouble.
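A minimal sketch of this class of problem, using the 0.0 / -0.0 case mentioned above (the module and function names are made up, and the exact result depends on the OTP version, since the comparison semantics of 0.0 and -0.0 have changed over time):

```erlang
-module(ets_key_demo).
-export([demo/0]).

%% ordered_set tables order keys by ==, so two terms that are
%% arithmetically equal but internally distinct (like 0.0 and -0.0)
%% may collide on the same key slot, even though they print
%% differently in the shell.
demo() ->
    T = ets:new(demo, [ordered_set]),
    true = ets:insert(T, {0.0, pos}),
    true = ets:insert(T, {-0.0, neg}),
    ets:tab2list(T).
```

Depending on the OTP release, the second insert either overwrites the first entry or creates a separate one, which is exactly the kind of divergence between internal and printed representation being described.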
This sounds like a job for QuickCheck. It is hard to reproduce, but chances are that generating lots of tests will find the bug and shrink it. Once you have it shrunk, it should be possible to reproduce it.
The pids have different serials. Presumably this means that the nodes have disconnected and reconnected (and gproc should have cleaned up, but didn't)
No, it means the node was stopped and restarted. It is actually quite easy to generate identical looking pairs of pids:
erl -sname receiver
Erlang/OTP 18 Klarna-g16e0e6a [erts-7.3.1.3] [source-16e0e6a] [64-bit] [smp:4:4] [async-threads:10] [kernel-poll:false]
Eshell V7.3.1.3 (abort with ^G)
(receiver@dszoboszlay)1> register(receiver, self()).
true
(receiver@dszoboszlay)2> os:cmd("erl -sname sender -noinput -eval '{receiver, receiver@dszoboszlay} ! self(), init:stop().'").
[]
(receiver@dszoboszlay)3> os:cmd("erl -sname sender -noinput -eval '{receiver, receiver@dszoboszlay} ! self(), init:stop().'").
[]
(receiver@dszoboszlay)4> receive P1 -> P1 end.
<7411.2.0>
(receiver@dszoboszlay)5> receive P2 -> P2 end.
<7411.2.0>
(receiver@dszoboszlay)6> P1 == P2.
false
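For what it's worth, the difference is invisible in the printed form but shows up if you serialise the pids: the external term format encodes the node's "creation" value, which the shell omits when printing. Continuing the session above (a sketch, not part of the original transcript):

```erlang
%% P1 and P2 print identically, yet their encodings differ, because
%% the second sender node got a new creation value when it restarted.
(receiver@dszoboszlay)7> term_to_binary(P1) =:= term_to_binary(P2).
false
```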
I don't know how gproc (and you) could miss the node's restart, however.
2017-09-07 16:30 GMT+02:00 Dániel Szoboszlay notifications@github.com:
I don't know how gproc (and you) could miss the node's restart, however.
Yeah. I can't guarantee that I didn't do something embarrassingly stupid while thinking of something else entirely. :)
A slight twist, of course, is that the visual representation gives no clue that the pids are different. Had they not been the unique part of ets table keys, I probably wouldn't have noticed that something was amiss.
BR, Ulf
I wrote a simple QuickCheck property, basically starting a bunch of nodes (with a placeholder process that makes sure gproc is started) and commands that randomly run monitor(..., follow) and ets:tab2list(gproc) on the nodes.
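Those two commands boil down to remote calls of roughly this shape (illustrative wrappers, not the actual model code; gproc:monitor/2 with the follow option is what gives the follow behaviour):

```erlang
-module(gproc_cmds).
-export([monitor_cmd/2, tab2list_cmd/1]).

%% Hypothetical command wrappers of the kind the QuickCheck model
%% would run against a remote node; the names here are made up.
monitor_cmd(Node, Name) ->
    %% gproc:monitor(Key, follow) keeps monitoring the name across
    %% re-registrations, which is what the property exercises.
    rpc:call(Node, gproc, monitor, [{n, g, Name}, follow]).

tab2list_cmd(Node) ->
    %% Dump gproc's internal ets table for the post-condition check.
    rpc:call(Node, ets, tab2list, [gproc]).
```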
I could not provoke the behaviour Ulf observed. It ran for an hour, so some 100k node starts were probably made.
Then I added stopping nodes to the mix, and could (not too surprisingly) mimic the behaviour:
gproc_eqc:start_node(b, []) -> <22171.68.0>
gproc_eqc:start_node(a, [b]) -> <22070.72.0>
gproc_eqc:monitor(#node{ id = a, worker = <22070.72.0>, monitors = []}, a) ->
#Ref<22171.852298202.3371433987.64237>
gproc_eqc:stop_node(a) -> ok
gproc_eqc:start_node(a, [b]) -> <22070.72.0>
gproc_eqc:monitor(#node{ id = a, worker = <22070.72.0>, monitors = []}, a) ->
#Ref<22171.852298202.3371433987.64255>
gproc_eqc:check_node(#node{ id = b, worker = <22171.68.0>, monitors = []}) ->
{ok,
[{{<22070.72.0>, {n, g, a}}, []},
{{<22070.72.0>, {n, g, a}}, []},
{{{n, g, a}, n},
[{<22070.72.0>, #Ref<22171.852298202.3371433987.64255>, follow},
{<22070.72.0>, #Ref<22171.852298202.3371433987.64237>, follow}]}]}
Reason:
Post-condition failed:
[{["<22070.72.0>"], [<22070.72.0>, <22070.72.0>]}] /= []
With the difference that I have two entries in the last element of the list, so it is not exactly the same. However, a node restart will lead to Pid reuse, as was already observed in Freiburg just a hair over ten years ago :-) (Time flies!!)
Ah, very good! :)
I will concede that I must have forgotten to stop both nodes before making a new attempt. I see an opportunity for some more tests here, and a likelihood that gproc isn't doing what it's supposed to. The locks_leader branch, OTOH, has support for split-brain healing; I'll have to check how it behaves in the same situation.
I'll see if I get time to clean up that model into a non-embarrassing state; if so, I will share it and it can be extended to do something useful :-)
Made a pull request (#144)
For me the existing QuickCheck property failed horribly (I made some changes in the pull request), but I guess that is expected?
I came across this. I won't claim that it's a bug in Erlang - I assume there was a hiccup in the networking on my Mac - but I must say that I can't even remember having seen this before.
(I believe Hans Svensson talked about something similar in Freiburg 2007).
Running some gproc tests with two local nodes. OTP 18, for no particular reason.
Node A:
Node B:
Node A:
Node B:
A bit interesting to try to reproduce, I guess.