painless-security / trust-router

Moonshot Trust Router

TR_MQ does not correctly handle tail pointer when adding high priority messages, may lose messages and leak memory #80

Closed jennifer-richards closed 6 years ago

jennifer-richards commented 6 years ago

The TR_MQ message queue, used to pass messages between threads, allows high and low priority messages to be queued. I've found a bug that can cause messages to be lost, resulting in failure to deliver the messages and a memory leak. This can only occur under heavy TRP load, with a mix of messages and peer connections being made and lost.

The issue

The issue is that the tail pointer in the queue is not updated when adding a high priority message to the queue. If the queue contains only high priority messages, the tail ends up pointing to the head of the list regardless of how many high priority messages are added.
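
To make this concrete, here is a minimal sketch of the kind of singly linked queue involved. The type and function names are hypothetical and the logic is simplified, not the actual TR_MQ code; it assumes high priority messages are inserted at the front of the queue, after any high priority messages already there (tracked by a separate "last high priority" pointer), while normal messages are appended at the tail.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real TR_MQ types. */
typedef struct msg {
    struct msg *next;
} msg_t;

typedef struct mq {
    msg_t *head;              /* next message to be popped */
    msg_t *tail;              /* last message in the list */
    msg_t *last_hi;           /* last high-priority message, if any */
} mq_t;

/* High-priority add exhibiting the reported bug: the message is linked
 * in near the front of the queue, after any earlier high-priority
 * messages, but the tail pointer is never advanced. */
static void mq_add_high_buggy(mq_t *q, msg_t *m)
{
    if (q->last_hi == NULL) {
        m->next = q->head;
        q->head = m;
        if (q->tail == NULL)
            q->tail = m;      /* the empty-queue case is handled */
    } else {
        m->next = q->last_hi->next;
        q->last_hi->next = m;
        /* BUG: if q->last_hi was the tail, the tail should move to m.
         * With only high-priority messages in the queue, the tail
         * stays stuck at the head however many messages are added. */
    }
    q->last_hi = m;
}
```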

Failure mode 1

If a message is then popped from the queue, the queue will be left non-empty (its head pointer still points to the remaining messages), but the tr_mq_pop() operation will notice that it just popped the element pointed to by the tail pointer and will set the tail to NULL. If a normal priority message is added to the queue before it becomes empty, a segfault will occur.
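
Continuing the hypothetical sketch above, a pop and a normal-priority append along the following lines would produce exactly this sequence; again, this is illustrative code, not the actual tr_mq.c source.

```c
/* Pop from the head; when the popped element is the one the tail
 * points at, the queue is assumed to be empty and the tail is cleared.
 * With the stale tail described above, this clears the tail while
 * messages are still on the list. */
static msg_t *mq_pop(mq_t *q)
{
    msg_t *m = q->head;
    if (m != NULL) {
        q->head = m->next;
        if (m == q->tail)
            q->tail = NULL;
        if (m == q->last_hi)
            q->last_hi = NULL;
        m->next = NULL;
    }
    return m;
}

/* Normal-priority append: a non-empty queue is expected to have a
 * valid tail, so the NULL tail left behind by the pop above is
 * dereferenced and the process segfaults. */
static void mq_add_normal(mq_t *q, msg_t *m)
{
    m->next = NULL;
    if (q->head == NULL) {
        q->head = q->tail = m;
    } else {
        q->tail->next = m;    /* q->tail may be NULL here: crash */
        q->tail = m;
    }
}
```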

Failure mode 2

If, instead of a pop, a normal priority message is added, the tail element's next pointer will be set to the new element. Because the tail was incorrectly pointing at the head of the queue, any messages after the head will be cut out of the linked list. These will never be returned by the pop operation and will never be freed, except when the message queue itself is freed (which frees them via talloc); in normal operation that does not happen until the trust router is shut down, so this is effectively a memory leak. Any high priority messages queued afterwards will also be lost, because they are added through a pointer that still points into the orphaned segment of the linked list.
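
Traced through the same hypothetical sketch, failure mode 2 looks like this (H = high priority, N = normal priority):

```
start (empty):     head = tail = last_hi = NULL
add H1 (high):     list: H1          head = tail = last_hi = H1
add H2 (high):     list: H1 -> H2    last_hi = H2, tail still = H1  (bug)
add N1 (normal):   list: H1 -> N1    tail = N1
```

The append followed the stale tail (H1) rather than the true last element (H2), so H2 and anything linked after it drop out of the list, and the last-high-priority pointer is left referring into the orphaned segment.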

In both failure modes, normal operation of the linked list will be restored once it is empty (unless the segfault is triggered). Messages lost in the second mode will never be recovered.

For these failures to occur, multiple messages of different priorities must accumulate in the message queue before it is emptied. That can only happen with a mixture of connection made/lost messages (sent as high priority) and TRP updates/requests (sent as normal priority). Under typical testing (and likely early production) loads, it is rare for more than one connection/disconnection message to be in the queue at a time, and even normal priority messages rarely accumulate before being processed. This bug is therefore unlikely to be exercised; it would require an unusually high load, including several near-simultaneous peer connections or disconnections.

How to correct this

The fix is straightforward. However, we should consider dropping the priority levels from the message queue entirely. We nominally use them, but they are not required and provide at best a trivial benefit; that benefit would not justify adding the feature now if it were not already present. The original motivation for the priority handling (actual failures caused by trying to send messages to a peer that had already disconnected) has been eclipsed by better solutions.
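
For reference, the straightforward fix looks like this in terms of the earlier hypothetical sketch: advance the tail whenever the newly inserted high priority message ends up as the last element. This is an illustration of the approach only, not the change that was actually merged.

```c
/* Corrected high-priority add: same insertion logic as before, but the
 * tail is advanced whenever the new message becomes the last element. */
static void mq_add_high_fixed(mq_t *q, msg_t *m)
{
    if (q->last_hi == NULL) {
        /* no high-priority messages queued: insert at the head */
        m->next = q->head;
        q->head = m;
    } else {
        /* insert after the last high-priority message */
        m->next = q->last_hi->next;
        q->last_hi->next = m;
    }
    if (m->next == NULL)      /* m is now the last element in the list */
        q->tail = m;
    q->last_hi = m;
}
```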

Note: Current use of the priority levels

High priority messages notify the main thread that trust router peer connections have been made or lost before it decides how to handle messages already in the queue. Those messages may be handled differently depending on whether a peer is connected - e.g., a message might be discarded if its intended recipient is not connected. With a simple message queue, deliverable messages might be discarded because the "peer connected" notice is still waiting later in the queue.

In practice, however, the benefit is minimal. The TRP protocol is designed to handle the case where a peer is temporarily unavailable, and the priority queue cannot make any useful guarantee that messages will be handled optimally; random timing is still the biggest factor.

jennifer-richards commented 6 years ago

This is resolved by #81. Since we probably never saw this bug manifest itself, there is no test. The offending code is no longer in the tree.