Closed jennifer-richards closed 6 years ago
I have fixed this by eliminating the notify callback and its `cb->mutex`. This is much cleaner. I will create a pull request soon.
After further review, I am not certain that there was actually a deadlock here: the notify callback was called after releasing `trpc->mq->mutex`, which should prevent the issue. Still, the new method is much simpler, so we should keep it.
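To make the point about release order concrete, here is a minimal sketch of an add operation that drops the queue mutex before invoking the notify callback. It is not the project's actual code: `mq_sketch_t`, `notify_cb_t`, `set_ready()`, `mq_sketch_add()` and `add_once_demo()` are hypothetical names, and the queue is reduced to a counter. Under that assumption the adding thread never holds both mutexes at once, so it cannot take part in a lock-order cycle.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the notify callback state (cb->mutex, flag). */
typedef struct {
    pthread_mutex_t mutex;
    int ready;
} notify_cb_t;

/* Hypothetical stand-in for the queue; a counter replaces the message list. */
typedef struct {
    pthread_mutex_t mutex;
    int count;
    void (*notify)(void *);
    void *notify_arg;
} mq_sketch_t;

/* Notify callback: takes only the callback mutex. */
static void set_ready(void *arg) {
    notify_cb_t *cb = arg;
    pthread_mutex_lock(&cb->mutex);
    cb->ready = 1;
    pthread_mutex_unlock(&cb->mutex);
}

/* Add: the queue mutex is released BEFORE the callback runs, so the
 * callback mutex is never acquired while the queue mutex is held. */
void mq_sketch_add(mq_sketch_t *mq) {
    pthread_mutex_lock(&mq->mutex);
    mq->count++;
    pthread_mutex_unlock(&mq->mutex);
    if (mq->notify != NULL)
        mq->notify(mq->notify_arg);
}

/* Single-threaded check: returns 1 if one add set the flag and the count. */
int add_once_demo(void) {
    notify_cb_t cb;
    mq_sketch_t mq;
    int rc = pthread_mutex_init(&cb.mutex, NULL);
    assert(rc == 0);
    rc = pthread_mutex_init(&mq.mutex, NULL);
    assert(rc == 0);
    (void)rc;
    cb.ready = 0;
    mq.count = 0;
    mq.notify = set_ready;
    mq.notify_arg = &cb;
    mq_sketch_add(&mq);
    int ok = (cb.ready == 1 && mq.count == 1);
    pthread_mutex_destroy(&mq.mutex);
    pthread_mutex_destroy(&cb.mutex);
    return ok;
}
```

If the callback were instead invoked between the lock and unlock calls, the adding thread would acquire the queue mutex and then the callback mutex, which is the nested order that can conflict with a consumer locking in the opposite order.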
While debugging #69 a few days ago, I came across a deadlock between trps and trpc threads.
The cause of this is incorrect use of mutexes. There are two: the trpc message queue (`trpc->mq`) has a mutex (`trpc->mq->mutex`) that is locked/unlocked around each operation on the queue (i.e., add or pop). The trpc thread additionally has a mutex for its notify callback (`cb->mutex`). The latter is used to maintain a `msg_ready` flag that is set when a message is waiting on `trpc->mq`, and is used in conjunction with a condition variable.

The error is that the trpc thread waits for its condition variable to be signalled, and then holds a lock on `cb->mutex` while it does its work. Along the way, it does a `tr_mq_pop()` that implicitly locks `trpc->mq->mutex`. I.e., it locks `cb->mutex` then `trpc->mq->mutex`.

The main thread does its work without locking, then calls `tr_mq_add()` to add a message to the trpc queue. That first locks `trpc->mq->mutex`, then calls the notify callback, which locks `cb->mutex`. I.e., it locks `trpc->mq->mutex` then `cb->mutex`.

The two threads thus do not acquire the mutexes in a consistent order, which makes deadlock possible if the trpc thread holds `cb->mutex` and the main thread holds `trpc->mq->mutex`. Each can then block waiting for the other mutex to be unlocked, which will never occur.

I think the correct solution is to do away with `cb->mutex` and use `trpc->mq->mutex` as the sole lock. I need to verify that it is ok to use it in conjunction with a condition variable in this way, though.
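As a rough illustration of the single-lock idea, the sketch below pairs one mutex with one condition variable, which is the usual POSIX pattern: `pthread_cond_wait()` atomically releases the mutex while waiting and re-acquires it before returning. This is not the project's implementation; `mq_t`, `msg_t`, `mq_init()`, `mq_add()`, `mq_pop_wait()` and `run_demo()` are hypothetical names, and the message payload is reduced to an `int`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, simplified message and queue types. */
typedef struct msg {
    int payload;
    struct msg *next;
} msg_t;

typedef struct {
    pthread_mutex_t mutex;    /* the only lock: guards head and tail */
    pthread_cond_t  nonempty; /* signalled whenever a message is added */
    msg_t *head, *tail;
} mq_t;

void mq_init(mq_t *mq) {
    int rc = pthread_mutex_init(&mq->mutex, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&mq->nonempty, NULL);
    assert(rc == 0);
    (void)rc;
    mq->head = mq->tail = NULL;
}

/* Producer side: append and signal while holding the one mutex. */
void mq_add(mq_t *mq, int payload) {
    msg_t *m = malloc(sizeof *m);
    assert(m != NULL);
    m->payload = payload;
    m->next = NULL;
    pthread_mutex_lock(&mq->mutex);
    if (mq->tail != NULL) mq->tail->next = m; else mq->head = m;
    mq->tail = m;
    pthread_cond_signal(&mq->nonempty);
    pthread_mutex_unlock(&mq->mutex);
}

/* Consumer side: block until a message exists, then pop it. The queue
 * state itself is the wait predicate, so no separate ready flag or
 * second mutex is needed. The loop guards against spurious wakeups. */
int mq_pop_wait(mq_t *mq) {
    pthread_mutex_lock(&mq->mutex);
    while (mq->head == NULL)
        pthread_cond_wait(&mq->nonempty, &mq->mutex);
    msg_t *m = mq->head;
    mq->head = m->next;
    if (mq->head == NULL) mq->tail = NULL;
    pthread_mutex_unlock(&mq->mutex);
    int payload = m->payload;
    free(m);
    return payload;
}

typedef struct { mq_t *mq; int sum; } consumer_arg_t;

/* Consumer thread: pops three messages and sums their payloads. */
static void *consumer(void *p) {
    consumer_arg_t *a = p;
    for (int i = 0; i < 3; i++)
        a->sum += mq_pop_wait(a->mq);
    return NULL;
}

/* Adds payloads 1, 2, 3 from the calling thread; returns the consumer's sum. */
int run_demo(void) {
    mq_t mq;
    consumer_arg_t arg = { &mq, 0 };
    pthread_t tid;
    mq_init(&mq);
    if (pthread_create(&tid, NULL, consumer, &arg) != 0) return -1;
    for (int v = 1; v <= 3; v++)
        mq_add(&mq, v);
    pthread_join(tid, NULL);
    pthread_cond_destroy(&mq.nonempty);
    pthread_mutex_destroy(&mq.mutex);
    return arg.sum;
}
```

Because each thread only ever takes one mutex, no lock ordering exists to violate. The consumer does its work on the popped message after unlocking, so the producer is never blocked for longer than a queue operation.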