jennifer-richards commented 6 years ago

While debugging #69 a few days ago, I came across a deadlock between trps and trpc threads.

The cause of this is incorrect use of mutexes. There are two: the trpc message queue (trpc->mq) has a mutex (trpc->mq->mutex) that is locked/unlocked around each operation on the queue (i.e., add or pop). The trpc thread additionally has a mutex for its notify callback (cb->mutex). The latter is used to maintain a msg_ready flag that is to be set when a message is waiting on the trpc->mq, and is used in conjunction with a condition variable.

The error is that the trpc thread waits for its condition variable to be signalled, and then holds a lock on cb->mutex while it does its work. Along the way, it does a tr_mq_pop() that implicitly locks trpc->mq->mutex. I.e., it locks cb->mutex then trpc->mq->mutex.

The main thread does its work without locking, then calls tr_mq_add() to add a message to the trpc queue. That first locks trpc->mq->mutex, then calls the notify callback, which locks cb->mutex. I.e., it locks trpc->mq->mutex then cb->mutex.

The two thus do not respect mutex priority, which makes deadlock possible if the trpc thread holds the cb->mutex and the main thread holds the tr->mq->mutex. Each can then block waiting for the other to be unlocked, which will never occur.

I think the correct solution is to do away with the cb->mutex and use trpc->mq->mutex as the sole lock. I need to verify that it is ok to use it in conjunction with a condition variable in this way, though.

jennifer-richards commented 6 years ago

I have fixed this by eliminating the notify callback and its cb->mutex. This is much cleaner. I will create a pull request soon.

jennifer-richards commented 6 years ago

76 should resolve this. Closing without testing because this will have to be caught by general stability testing if the problem is still present.

jennifer-richards commented 6 years ago

After further review, I am not certain that there was actually a deadlock here- the notify callback was called after releasing the trpc->mq->mutex, which should prevent the issue. Still, the new method is much simpler, so we should keep the new method.

painless-security / trust-router

Deadlock in trpc messaging code #75

76 should resolve this. Closing without testing because this will have to be caught by general stability testing if the problem is still present.