uwiger / locks

A scalable, deadlock-resolving resource locker
Mozilla Public License 2.0

When using more than 2 nodes, the first 2 nodes hang and do not respond to gen_server calls #7

Closed dskliarov closed 9 years ago

dskliarov commented 9 years ago

After I start locks on 2 nodes, everything works as expected: one node becomes the leader, and the locks_leader process on both nodes is in gen_server mode (current function is gen_server:loop). When I start a 3rd node, the first 2 nodes become unresponsive to gen_server calls (the current function of the locks_leader process on these 2 nodes is locks_leader:safe_loop). locks_leader is waiting for the have_all_locks message but never gets it.
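(For reference, which loop a process is stuck in can be checked with process_info/2 from any connected node; a small diagnostic sketch, where Pid is assumed to be the pid of the locks_leader-based process on node Node:)

    %% Diagnostic sketch only: returns e.g. {current_function,{locks_leader,safe_loop,_}}
    rpc:call(Node, erlang, process_info, [Pid, current_function]).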

I modified the check_if_done function (line 787) in locks_agent.erl to resolve it:

check_if_done(#state{pending = Pending} = State, Msgs) ->
    case ets:info(Pending, size) of
        0 ->
            Msg = {have_all_locks, []},
            notify_msgs([Msg | Msgs], have_all(State));
        _ ->
            check_if_done_(State, Msgs)   %% fall through to the original check
    end.

After the change, the leader node is in gen_server:loop, but the other 2 nodes are in locks_leader:safe_loop.

I also modified the get_locks function (line 1194) to handle the case where the ets lookup comes back empty:

get_locks([H|T], Ls) ->
    case ets_lookup(Ls, H) of
        [L] -> [L | get_locks(T, Ls)];
        []  -> get_locks(T, Ls)
    end;
get_locks([], _) ->
    [].

After all that, I am still having issues and will continue to debug. Could you please check and tell me if I am on the right path? Thank you.

uwiger commented 9 years ago

Could you describe the exact steps you take to reproduce this? I tried with gdict on two nodes, then adding a third, and saw no issues.

Also, in which situations do you see the get_locks/2 function failing because there are no locks? The assumption was that it will never be called unless there are locks. This assumption could of course be broken, but I'd like to understand when that happens.
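(For context, the failing shape is presumably along these lines; this is a sketch, not the verified source. The lookup assumes every lock id has an entry, so an empty result raises {badmatch,[]}:)

    get_locks([H|T], Ls) ->
        [L] = ets_lookup(Ls, H),   %% crashes with {badmatch,[]} if H has no entry
        [L | get_locks(T, Ls)];
    get_locks([], _) ->
        [].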

dskliarov commented 9 years ago

Thank you for looking into the problem, Ulf. In my application, I have logic to form a cluster (reading the config from a binary file, initiating node monitoring and process replication). It looks like, when locks is started before the cluster is formed, it creates a netsplit scenario (2 nodes share one leader and the new node is its own leader). In this case, in the locks_info function, leader is set to undefined in the #st record. After that, from_leader messages are ignored and the node never switches to gen_server:loop. I modified the from_leader function and it is working now:

from_leader(L, Msg, #st{leader = undefined} = S) ->
    S1 = S#st{leader = L},
    from_leader(L, Msg, S1);
from_leader(L, Msg, #st{leader = L, mod = M, mod_state = MSt} = S) ->
    ?event({processing_from_leader, L, Msg}, S),
    callback(M:from_leader(Msg, MSt, opaque(S)), S);
from_leader(_OtherL, _Msg, S) ->
    ?event({ignoring_from_leader, _OtherL, _Msg}, S),
    S.

I can issue pull request.

About the get_locks function: I got a badmatch error when I shut down one of the nodes.

uwiger commented 9 years ago

Just accepting the leader from 'from_leader' is problematic. The elected() and surrendered() callbacks need to be called appropriately. So before the handshake between leader and candidate, it's appropriate to ignore the 'from_leader' messages.
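(To illustrate the concern with a sketch, assuming callback shapes like those in the gdict example; merge/2 and apply_delta/2 are hypothetical helpers. The candidate's state is only synchronized with the leader's in surrendered/3, so applying from_leader updates before that handshake would modify an unsynchronized state.)

    %% Sketch only, not the library's actual code.
    surrendered(MyState, LeaderSync, _Info) ->
        {ok, merge(LeaderSync, MyState)}.   %% full state sync at handshake time

    from_leader(Delta, MyState, _Info) ->
        {ok, apply_delta(Delta, MyState)}.  %% incremental updates after handshake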

uwiger commented 9 years ago

Could you try the changes in PR #8?

uwiger commented 9 years ago

Never mind. Wrong thinking on my part. Have to fix.

dskliarov commented 9 years ago

Thank you

ctbarbour commented 9 years ago

FWIW, I was able to recreate the issue where get_locks fails with a badmatch error when a node shuts down, using the gdict module in the examples. The uw-leader-hanging branch seems to have resolved the issue. To recreate the issue on master, I started three nodes, dev1, dev2, and dev3, all with the examples application started:

$> erl -pa ./ebin -pa deps/*/ebin -pa ./examples/ebin -sname dev${i} -setcookie locks -eval "application:ensure_all_started(examples)."

Connect the nodes and start a new gdict on each node starting from dev1 using:

{ok, Gdict} = gdict:new(dev).
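(A sketch of the store/fetch check on each node; store/3 and fetch/2 are assumed here to mirror the dict API, as the examples' gdict module does, and the nodes are assumed to be connected first, e.g. via net_adm:ping/1:)

    gdict:store(some_key, some_value, Gdict),
    gdict:fetch(some_key, Gdict).   %% expected to return some_value on every node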

I'm able to store and fetch values from all nodes. When I kill any node, I get the following stack trace on each node that's not the leader, in this case dev1:

=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
    locks_agent: aborted
    reason: {badmatch,[]}
    trace: [{locks_agent,get_locks,2,
                         [{file,"src/locks_agent.erl"},{line,1195}]},
            {locks_agent,get_locks,2,
                         [{file,"src/locks_agent.erl"},{line,1196}]},
            {locks_agent,analyse,2,[{file,"src/locks_agent.erl"},{line,1210}]},
            {locks_agent,handle_locks,1,
                         [{file,"src/locks_agent.erl"},{line,843}]},
            {locks_agent,handle_info,2,
                         [{file,"src/locks_agent.erl"},{line,608}]},
            {locks_agent,handle_msg,2,
                         [{file,"src/locks_agent.erl"},{line,256}]},
            {locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,231}]},
            {locks_agent,agent_init,3,
                         [{file,"src/locks_agent.erl"},{line,198}]}]

=ERROR REPORT==== 22-Dec-2014::17:35:58 ===
Error in process <0.57.0> on node 'dev2' with exit value: {{badmatch,[]},[
{locks_agent,agent_init,3,[{file,"src/locks_agent.erl"},{line,205}]}]}

Let me know if providing any more information would be helpful.

uwiger commented 9 years ago

Ok, thanks. I think I'll merge the uw-leader-hanging branch.

uwiger commented 9 years ago

Is this still a problem?

dskliarov commented 9 years ago

I switched to gen_server for now. It is working without any issues. I am going to come back to gen_lock later. Thank you for both of these projects.


uwiger commented 9 years ago

Ok. Closing this issue.

ctbarbour commented 9 years ago

Not able to reproduce the issue on master, so no longer a problem as far as I can tell.