x0id opened this issue 7 years ago
A race condition seems likely, since this does not happen every time.
Sorry for not digging into this yet, but I've been a bit swamped.
One thing to check when this happens is whether the locks themselves are consistent (e.g. by checking ets:tab2list(locks_server_locks) on the relevant nodes). Previous problems with locks_leader have tended to be caused by bugs in the notification logic, i.e. the locks agent will sometimes fail to detect that it needs to notify the leader instance of a relevant change in lock status. This could e.g. result in some leader not trying to claim a lock on a newly connected node.
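For reference, one way to gather those tables from every connected node in one go (a shell one-liner sketch; it assumes the locks_server_locks table name mentioned above) might be:

    %% Collect {Node, LockTable} pairs from all connected nodes for comparison.
    [{N, rpc:call(N, ets, tab2list, [locks_server_locks])} || N <- [node() | nodes()]].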
This happened recently in a group of 3 nodes; the result of ets:tab2list(locks_server_locks) on each node is below:
node@A (leader):
[{lock, [locks_leader, {asg_manager, asg_manager}], 8, <node@A.217.0>,
  [{w, [{entry, <node@A.257.0>, <node@A.255.0>, 3, direct}]},
   {w, [{entry, <node@B.266.0>, <node@B.265.0>, 8, direct}]}],
  []}]
node@B (newly connected node, hanging on subsequent locks_leader:call/2):
[{lock, [locks_leader, {asg_manager, asg_manager}], 13, <node@B.219.0>,
  [{w, [{entry, <node@A.257.0>, <node@A.255.0>, 11, direct}]},
   {w, [{entry, <node@B.266.0>, <node@B.265.0>, 11, direct}]},
   {w, [{entry, <node@C.263.0>, <node@C.261.0>, 13, direct}]}],
  []}]
node@C:
[{lock, [locks_leader, {asg_manager, asg_manager}], 2, <node@C.217.0>,
  [{w, [{entry, <node@B.266.0>, <node@B.265.0>, 1, direct}]},
   {w, [{entry, <node@C.263.0>, <node@C.261.0>, 2, direct}]}],
  []}]
Thanks, that's interesting (and sorry for missing your reply for so long). So A and C are unaware of each other. I will see if I can figure out what causes it, but it appears to be a variation of the problem I described above.
Not sure if the above solves anything, but it closes a possible loophole, and adds test cases. I have not yet been able to reproduce the problem.
This project https://github.com/ten0s/locks-test reproduces the issue.
All nodes are started with '-connect_all false'. A new node is about to join a connected cluster that already has a leader. It starts locks_leader without connecting to the others and becomes a leader in its own separate cluster. A bit later the nodes ping each other, the netsplit happens, and sometimes some nodes hang at https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L510.
If the new node connects to the others before starting locks_leader (https://github.com/ten0s/locks-test/blob/master/src/test.erl#L73), then everything works.
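For completeness, a minimal sketch of that workaround (the peer list argument and the asg_manager:start_link/0 call are illustrative assumptions, not the actual code in test.erl):

    %% Ping all known peers so the cluster is fully connected before the
    %% locks_leader-based process starts competing for leadership.
    connect_first(Peers) ->
        lists:foreach(fun(N) -> net_adm:ping(N) end, Peers),
        asg_manager:start_link().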
The above fix doesn't change anything.
FYI, I've been running your test program. What seems to happen, at least some of the time, is that a node detects that it's no longer the leader, broadcasts a leader_uncertain event, and enters the safe loop. In the cases I've observed where it gets stuck there, the actual leader (holding all the locks) still has that node in its synced list, i.e. it assumes that it has already told that node who the leader is.
One problem I found from visual inspection is that the synced list doesn't always seem to be cleared. Another potential problem is that the candidate needs to be sure that the agent will issue a new notification once the candidate enters the safe loop, which may mean that an asynchronous await_all_locks request should be issued (there is no such thing yet). Another alternative would be to call locks_agent:change_flag(Agent, notify, true), even though the flag is already set from the start. The way it's implemented, the agent will always respond with a notification of the current status.
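As a small sketch of that last alternative (the wrapper name is illustrative; only locks_agent:change_flag/3 comes from the text above):

    %% Re-setting the already-set notify flag causes the agent to respond with
    %% a notification of the current lock status, which gives a candidate
    %% sitting in the safe loop something to react to.
    request_fresh_notification(Agent) ->
        locks_agent:change_flag(Agent, notify, true).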
The good news is that the core locking algorithm doesn't seem to be at fault.
I will have to attend to other things for a while, but will get back to this. Thanks for the interesting test case, and apologies for the delay in responding.
Before pushing, I let the locks-test program run continuously for 10 minutes, and observed no hiccups or pauses.
I upgraded the locks version in https://github.com/ten0s/locks-test.
As far as I can see, the previous version c9b585adcf7cbd347e516ccc3406e03a04a46731 hangs within 30 seconds. The latest version 8e9b2e33deec6ed375a38fc88f67eff792b91504 is much more stable, but it also hangs within 30 minutes, in the same place: https://github.com/uwiger/locks/blob/master/src/locks_leader.erl#L517.
Check nodes: [{'slave10@127.0.0.1',false},
{'slave1@127.0.0.1',false},
{'slave5@127.0.0.1',
{'EXIT',
{timeout,{gen_server,call,[asg_manager,is_leader]}}}},
{'slave6@127.0.0.1',
{'EXIT',
{timeout,{gen_server,call,[asg_manager,is_leader]}}}},
{'slave4@127.0.0.1',false},
{'slave3@127.0.0.1',false},
{'slave9@127.0.0.1',false},
{'slave2@127.0.0.1',false},
{'slave7@127.0.0.1',false}]
Press Enter to continue
(slave5@127.0.0.1)6> process_info(whereis(asg_manager), [current_function]).
[{current_function,{locks_leader,safe_loop,1}}]
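A small helper in the same spirit (hypothetical; it assumes the registered name asg_manager from the test program) could run that check on every connected node:

    %% Return the nodes whose asg_manager process is currently sitting in
    %% locks_leader:safe_loop/1.
    stuck_nodes() ->
        [N || N <- [node() | nodes()],
              begin
                  Pid = rpc:call(N, erlang, whereis, [asg_manager]),
                  is_pid(Pid) andalso
                      rpc:call(N, erlang, process_info, [Pid, current_function])
                          =:= {current_function, {locks_leader, safe_loop, 1}}
              end].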
Thanks for being so tenacious! :) I will try to find time to take a look at that again. At least the window seems to have been significantly reduced, albeit not closed entirely.
I've traced the issue in more detail and found one stable pattern:
node_A: set node_A as leader
...
node_B: set node_B as leader
...
node_C: set node_B as announced leader
node_C: set node_A as announced leader
node_C: unset leader node_A as reported uncertain

At this point node_C is stuck in the safe_loop, since no further leader announcements are issued.
The key moment is that the am_leader message from node_A is sent before the am_leader message from node_B, but node_C receives these messages in the "reverse" order. This is entirely possible and completely breaks the logic implemented in locks_leader.
I couldn't find any easy way to fix locks_leader.
The locks app itself handles all of this asynchronous-communication imbalance pretty well, but locks_leader rolls some of that back... Would it be possible to use the internal locks info to deduce who the leader is, instead of passing the leadership around through announcement messages, which builds an extra layer of asynchronous communication?
Yes, locks_leader should rely on the internal locks info to decide who the leader is, but it seems to me as if the key problem here is deciding when another announcement is needed from the locks_agent. So the locks_leader gets stuck waiting for a new announcement that never arrives, since the locks agents believe they have already shared the latest info.
A possibility, albeit clumsy, might be to have a timeout on the safe_loop, where the locks_leader polls the agent for new info. Perhaps this could be limited to a period after nodeups arrive.
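The rough shape of such a timed safe loop might be something like the following (a sketch only; the message handler is passed in as a fun because the real locks_leader state and message handling are not modeled here):

    %% Wait for leader-related messages; if nothing arrives within the polling
    %% window, poke the agent so it re-sends the current lock status.
    timed_safe_loop(Agent, State, HandleMsg) ->
        receive
            Msg ->
                HandleMsg(Msg, State)
        after 5000 ->
            locks_agent:change_flag(Agent, notify, true),
            timed_safe_loop(Agent, State, HandleMsg)
        end.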
Currently locks_leader gets information about its own leadership via lock notifications, but it does not use the internal lock structures to check who the new leader is - it merely relies on getting am_leader message(s) from other nodes. This is what makes the bad scenario possible.
Regarding timeouts - this is what we do as a workaround in the app that uses locks_leader: when a timeout happens (due to being stuck in the safe_loop), the app stops locks_leader, which then restarts and tries again.
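A sketch of that application-level workaround (a hypothetical wrapper; it assumes the call exits with a timeout tuple like the one in the transcript above, and that a supervisor restarts the killed process):

    %% Wrap calls into the leader; on timeout, kill the registered process so
    %% its supervisor restarts it and leader election is retried.
    call_or_restart(Name, Request) ->
        try
            locks_leader:call(Name, Request)
        catch
            exit:{timeout, _} ->
                case whereis(Name) of
                    Pid when is_pid(Pid) -> exit(Pid, kill);
                    undefined -> ok
                end,
                {error, restarted}
        end.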
It might be possible to detect accidental reordering by saving a limited number of am_leader messages and checking lock versions. Another approach might be to double-check am_leader messages against the most recent lock_info, and issue leader_uncertain messages if any discrepancy is found.
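A very rough sketch of the second idea (the message shapes and the map of last-seen versions are assumptions for illustration, not the actual locks_leader message formats):

    %% Keep the highest lock version seen per announcing leader; treat an
    %% am_leader carrying an older version than already observed as suspect
    %% and answer it with a leader_uncertain message.
    check_am_leader(LeaderPid, Vsn, LastSeen) ->
        case maps:get(LeaderPid, LastSeen, 0) of
            Prev when Vsn < Prev ->
                LeaderPid ! {leader_uncertain, self()},
                {stale, LastSeen};
            _ ->
                {ok, maps:put(LeaderPid, Vsn, LastSeen)}
        end.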
I don't have time to experiment with it right now, but will try to get around to it soon.
Please note PR #42. It still doesn't solve all issues, but I think it goes further in allowing the problems to be identified.
Trying to use the locks app (master branch, c9b585a), I got an interesting failure in the scenario described below.
There were 4 nodes alive and connected to each other - A, B, C and D. Node D was the leader. At some point a new node E was started; it discovered the other running nodes and connected to them. Before node E had even connected to the other nodes, it decided it was a leader.
Once node E connected to the other nodes, it sent its leadership info to them. For each of the 3 non-leaders A, B and C, node E's locks_leader callback elected(State, Election, Pid) was called with the Pid of the joining node's (A, B and C) process. In turn, the locks_leader callback surrendered(State, Synch, Election) was called on nodes A, B and C.
When the new leader E connected to the old leader D, the netsplit happened. Node D won: its locks_leader callback elected(State, Election, undefined) was called, and all the other nodes (A, B, C and E) were notified via the surrendered(State, Synch, Election) callback, so node E was informed that it was no longer the leader.
Since then, all locks_leader:call/2 calls made on nodes A, B and C ended in a timeout. The same calls made on D and E worked as usual with no errors. So it seems that the internal locks_leader state on the "passive" nodes A, B and C was corrupted by the fighting leaders D and E...