Closed: dskliarov closed this issue 9 years ago.
Could you describe the exact steps you take to reproduce this? I tried with gdict on two nodes, then adding a third, and saw no issues.
Also, in which situations do you see the get_locks/2 function failing because there are no locks? The assumption was that it will never be called unless there are locks. This assumption could of course be broken, but I'd like to understand when that happens.
Thank you for looking into the problem, Ulf. In my application, I have logic to form a cluster (reading the config from a binary file, then initiating node monitoring and process replication). It looks like, when locks is started before the cluster is formed, it creates a netsplit scenario (two nodes share one leader, and the new node is its own leader). In this case, in the locks_info function, the leader will be set to undefined in the #st record. After that, from_leader messages are ignored and the node never switches to gen_server/loop. I modified the from_leader function and it is working now:
from_leader(L, Msg, #st{leader = undefined} = S) ->
    S1 = S#st{leader = L},
    from_leader(L, Msg, S1);
from_leader(L, Msg, #st{leader = L, mod = M, mod_state = MSt} = S) ->
    ?event({processing_from_leader, L, Msg}, S),
    callback(M:from_leader(Msg, MSt, opaque(S)), S);
from_leader(_OtherL, _Msg, S) ->
    ?event({ignoring_from_leader, _OtherL, _Msg}, S),
    S.
I can issue a pull request.
About the get_locks function: I got a badmatch error when I shut down one of the nodes.
Just accepting the leader from 'from_leader' is problematic. The elected() and surrendered() callbacks need to be called appropriately. So before the handshake between leader and candidate, it's appropriate to ignore the 'from_leader' messages.
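As a minimal sketch of that point (reusing the #st record and helpers from the snippet above; this is not the library's actual code): only messages from the leader established by the elected()/surrendered() handshake get dispatched, and anything else, including messages arriving before the handshake, is ignored rather than adopted.

from_leader(L, Msg, #st{leader = L, mod = M, mod_state = MSt} = S) ->
    %% leader already established by the handshake: dispatch to the callback module
    callback(M:from_leader(Msg, MSt, opaque(S)), S);
from_leader(_OtherL, _Msg, S) ->
    %% no handshake yet, or an unrecognized sender: ignore rather than adopt it
    S.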
Could you try the changes in PR #8?
Never mind. Wrong thinking on my part. Have to fix.
Thank you
FWIW, using the gdict module in the examples, I was able to recreate the issue with get_locks failing with a badmatch error when a node shuts down. The uw-leader-hanging branch seems to have resolved the issue. To recreate the issue on master, I started three nodes, dev1, dev2, and dev3, all with the examples application started:
$> erl -pa ./ebin -pa deps/*/ebin -pa ./examples/ebin -sname dev${i} -setcookie locks -eval "application:ensure_all_started(examples)."
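(As an aside, connecting the shells can be done with net_adm:ping/1; a sketch, where 'myhost' is a placeholder for the actual local hostname used with -sname:)

%% run from the dev1 shell; 'myhost' stands in for the local hostname
pong = net_adm:ping('dev2@myhost'),
pong = net_adm:ping('dev3@myhost'),
nodes().   %% should now list the other two nodes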
Connect the nodes and start a new gdict on each node, starting from dev1, using:
{ok, Gdict} = gdict:new(dev).
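(A quick smoke test from any shell might look like the following; this assumes gdict mirrors dict's store/3 and find/2, which is not spelled out above:)

gdict:store(foo, 1, Gdict),
{ok, 1} = gdict:find(foo, Gdict).   %% expected result if the dict semantics hold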
I'm able to store and fetch values from all nodes. When I kill any node, I get the following stack trace on each node that's not the leader, in this case dev1:
=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
locks_agent: aborted
reason: {badmatch,[]}
trace: [{locks_agent,get_locks,2,
[{file,"src/locks_agent.erl"},{line,1195}]},
{locks_agent,get_locks,2,
[{file,"src/locks_agent.erl"},{line,1196}]},
{locks_agent,analyse,2,[{file,"src/locks_agent.erl"},{line,1210}]},
{locks_agent,handle_locks,1,
[{file,"src/locks_agent.erl"},{line,843}]},
{locks_agent,handle_info,2,
[{file,"src/locks_agent.erl"},{line,608}]},
{locks_agent,handle_msg,2,
[{file,"src/locks_agent.erl"},{line,256}]},
{locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,231}]},
{locks_agent,agent_init,3,
[{file,"src/locks_agent.erl"},{line,198}]}]
=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
Error in process <0.57.0> on node 'dev2' with exit value: {{badmatch,[]},[
{locks_agent,agent_init,3,[{file,"src/locks_agent.erl"},{line,205}]}]}
Let me know if providing any more information would be helpful.
Ok, thanks. I think I'll merge the uw-leader-hanging branch.
Is this still a problem?
I switched to gen_server for now. It is working without any issues. I am going to come back to gen_lock later. Thank you for both of these projects.
Ok. Closing this issue.
Not able to reproduce the issue on master, so no longer a problem as far as I can tell.
After I start locks on 2 nodes, everything works as expected: one node becomes the leader, and the locks_leader process on both nodes is in gen_server mode (its current function is gen_server/loop). When I start a 3rd node, the first 2 nodes become unresponsive to gen_server calls (the current function of the locks_leader process on these 2 nodes is locks_leader/safe_loop). locks_leader is waiting for the have_all_locks message, but never gets it.
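(For reference, the "current function" observations can be checked with erlang:process_info/2; in this sketch, Pid is a placeholder for the locks_leader pid on the node in question, found e.g. via the supervision tree or observer:)

%% Pid is a placeholder for the locks_leader process being inspected
{current_function, {M, F, A}} = erlang:process_info(Pid, current_function),
io:format("locks_leader is currently in ~p:~p/~p~n", [M, F, A]).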
I modified the check_if_done function (line 787) in locks_agent.erl to resolve it:
check_if_done(#state{pending = Pending} = State, Msgs) ->
    case ets:info(Pending, size) of
        0 ->
            Msg = {have_all_locks, []},
            notify_msgs([Msg|Msgs], have_all(State));
        _ ->
            check_if_done_(State, Msgs)
    end.
After the change, the leader node is in gen_server/loop, but the other 2 nodes are still in locks_leader/safe_loop.
I also modified the get_locks function (line 1194) to handle the case where the ets lookup returns an empty list:

get_locks([H|T], Ls) ->
    case ets_lookup(Ls, H) of
        [L] -> [L | get_locks(T, Ls)];
        []  -> get_locks(T, Ls)
    end;
get_locks([], _) ->
    [].
After all that, I am still having issues and will continue to debug. Could you please check and tell me if I am on the right path? Thank you.