I was testing locks with my own test case.
I believe there is a bug in how locks_server and locks_agent handle lock_info, which can cause a deadlock.
My test case has 3 concurrent clients/agents (C1, C2, and C3) and 3 locks ([1], [2], and [3]).
C1 requests locks in the order of [[1], [2], [3]]
C2 requests locks in the order of [[2], [3], [1]]
C3 requests locks in the order of [[3], [1], [2]]
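
To make the request orders concrete, here is a minimal Python sketch (not the actual test case; `rotated_orders` is a hypothetical helper) that builds the three orders as rotations of the same lock list. Each client starts on a different lock and then asks for one that another client started on, which is what produces the circular wait.

```python
def rotated_orders(locks):
    """Return one request order per client, each a rotation of `locks`."""
    return [locks[i:] + locks[:i] for i in range(len(locks))]

# Lock IDs written the same way as in the report.
orders = rotated_orders([[1], [2], [3]])

for client, order in zip(["C1", "C2", "C3"], orders):
    print(client, "requests", order)
```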
Here is a sketch of how the bug happened:
C1, C2, and C3 competed for the locks.
Thanks to the deadlock resolution algorithm, C1 and C2 eventually acquired all locks and finished.
During the resolution process, C3 received the lock_info of [2] (via locks_agent:send_indirects/1)
even though C3 had not yet reached the point of requesting it, which means C3 was not in [2]'s queue.
The locks_server then removed its local lock_info entry for [2], since the queue was now empty.
This effectively reset the vsn of the lock_info.
C3 then started requesting [2], but the locks_server responded with a lock_info that
had a lower vsn than the one C3 had already been told about. Thus C3 got stuck.
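
The sequence above can be modelled with a small toy in Python. This is only a sketch under assumptions, not the code of locks_server or locks_agent: `ToyServer` and `ToyAgent` are hypothetical, the version counter is assumed to restart when an entry is removed, and the agent is assumed to discard any lock_info that is not newer than what it already holds.

```python
class ToyServer:
    """Toy stand-in for a lock server: lock id -> (vsn, queue)."""

    def __init__(self):
        self.lock_info = {}

    def request(self, lock, client):
        # A missing entry starts again from vsn 0 (assumed behaviour).
        vsn, queue = self.lock_info.get(lock, (0, []))
        self.lock_info[lock] = (vsn + 1, queue + [client])
        return vsn + 1

    def release(self, lock, client):
        vsn, queue = self.lock_info[lock]
        queue = [c for c in queue if c != client]
        if queue:
            self.lock_info[lock] = (vsn + 1, queue)
        else:
            # Queue is empty: the entry is dropped, and the vsn with it.
            del self.lock_info[lock]


class ToyAgent:
    """Toy stand-in for an agent that ignores stale lock_info (assumed)."""

    def __init__(self):
        self.known = {}

    def receive(self, lock, vsn):
        if vsn > self.known.get(lock, 0):
            self.known[lock] = vsn
            return True
        return False  # not newer than what the agent holds: ignored


server = ToyServer()
c3 = ToyAgent()

# Lock [2] is keyed as the integer 2 here, since lists are not hashable.
server.request(2, "C1")
told = server.request(2, "C2")
c3.receive(2, told)              # C3 learns about [2] indirectly

server.release(2, "C1")
server.release(2, "C2")          # queue empty, entry removed

fresh = server.request(2, "C3")  # vsn restarts
accepted = c3.receive(2, fresh)  # stale from C3's point of view
print(told, fresh, accepted)
```

In this model C3 never accepts the server's reply, because the restarted vsn is lower than the one it received indirectly, which matches the hang described above.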
I tried to fix this by not removing lock_info entries in locks_server, but with that change the test fails in other ways. Maybe it breaks the algorithm?