oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.

Potential priority inversion in closed receive protocols #1654

Open cbiffle opened 5 months ago

cbiffle commented 5 months ago

Background/problem

We have a (known) priority inversion opportunity in the OS. It has to do with our use of closed receive to implement mutual exclusion, and is most easily demonstrated by considering the SPI task:

  1. Unimportant task contacts SPI driver and requests to lock it. It succeeds.
  2. Important task tries to contact SPI driver, is queued. Now we have important work waiting on unimportant work, which isn't great but is hard to avoid.
  3. Medium-important task loses its mind and doesn't yield CPU to the unimportant task. Now the medium-important task is able to prevent the important task from making progress, which was not the design intent (and is also bad).

I knew this was going to be a risk, so some core features of Hubris are designed to make it relatively easy to mitigate -- but we ain't done it yet, and we prolly oughtta. Hence this bug report.

FWIW, we haven't seen any bugs caused by this in practice, likely because we don't use mutual exclusion patterns very much. To make this happen, I'm pretty sure you have to be using "closed receive." Closed receive is the sys_recv mode that only listens for a single task, instead of the highest priority queued sender. It's how we implement mutual exclusion on the SPI driver.
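
To make the pattern concrete, here is a rough sketch of a lock-style server loop. This is not the actual SPI driver, and the receive helpers are invented stand-ins for the open and closed receive modes, not the real userlib API:

```rust
// Illustrative only: invented names, not Hubris code.

#[derive(Clone, Copy)]
struct TaskId(u16);

enum Request {
    Lock,
    Release,
    Transfer, // stand-in for the actual SPI data-movement operations
}

// Stand-in for an open receive: accept the highest-priority queued sender.
fn recv_any() -> (TaskId, Request) {
    unimplemented!("stand-in for open receive")
}

// Stand-in for a closed receive: accept messages only from `holder`.
fn recv_from(holder: TaskId) -> Request {
    let _ = holder;
    unimplemented!("stand-in for closed receive")
}

fn server_loop() -> ! {
    loop {
        // While the device is unlocked, anyone may send.
        let (sender, req) = recv_any();
        if let Request::Lock = req {
            // The device is now held by `sender`: listen only to that task.
            // Every other client queues behind this closed receive, which is
            // where the inversion window described above opens up.
            loop {
                match recv_from(sender) {
                    Request::Release => break,
                    _ => { /* perform the operation on behalf of the holder */ }
                }
            }
        }
    }
}
```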

Proposed fix, in the abstract

I suspect the easiest way to fix this would be to use the Priority Ceiling Protocol (PCP) or a derivative of it. Priority Inheritance (PI) is a popular way to fix this in realtime operating systems, but ironically it has some features that make it hard to implement in constant time: to make unblocking an arbitrary waiter cheap, you need complex min-heap structures (and thus dynamic allocation); otherwise the cost of recomputing a task's priority approaches linear in the number of waiters. While we're not as aggressively realtime as some systems, I'd sure like to avoid building load-sensitive operations into the kernel, since we've largely avoided them until now.

An application of priority ceiling protocol for the SPI mutual exclusion case would change the original scenario as follows:

  1. Unimportant task contacts SPI driver and requests to lock it. It succeeds.
  2. Important task tries to contact SPI driver, is queued. Now we have important work waiting on unimportant work, which isn't great but is hard to avoid.
  3. At or before this point, the priority of the unimportant task is temporarily boosted to be at least as high as that of any client of the SPI task. (Variations described below.)
  4. Medium-important task does not have an opportunity to be scheduled because of the priority boost and can no longer interfere.
  5. Previously-unimportant-but-now-boosted task finishes its work and yields to the important task.
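
A toy rendition of that schedule, with made-up priority numbers and the (assumed here) convention that smaller numbers are more important:

```rust
fn main() {
    let spi_server = 2u8;   // SPI driver
    let important = 3u8;    // important client, blocked waiting on the driver
    let medium = 4u8;       // unrelated medium-important task
    let unimportant = 6u8;  // client currently holding the SPI lock

    // Without the boost: `important` is blocked, so the scheduler picks the
    // most important runnable task, which is `medium` (4) -- the lock holder
    // (6) never runs and `important` stays stuck behind it. That's the inversion.
    assert_eq!(most_important(&[medium, unimportant]), medium);

    // With the boost: the lock holder runs at least as high as any client of
    // the SPI task (here, `important`'s level, 3), so `medium` can no longer
    // starve it; the holder finishes its SPI work and yields to `important`.
    let boosted_holder = important;
    assert_eq!(most_important(&[medium, boosted_holder]), boosted_holder);

    let _ = (spi_server, unimportant);
}

// Smaller number wins under the assumed convention.
fn most_important(runnable: &[u8]) -> u8 {
    *runnable.iter().min().unwrap()
}
```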

PCP comes in several variations, which I alluded to above. There's the axis of "when to boost:"

And then there's the axis of "how to boost:"

Notes on potential implementation and open design questions

With the exception of certain dynamic tasks used for debugging, like hiffy's generic send support and udprpc, we can identify all potential clients of a service at compile time. So, we have the opportunity to do a more precise priority boost. That's nice.
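
For illustration, a hypothetical build-time pass could precompute a per-server ceiling from that static client list. The actual app-config and dist plumbing is not shown, and every name below is invented:

```rust
use std::collections::BTreeMap;

// Given each server's statically known clients and their priorities (smaller
// numbers assumed to mean more important), compute a per-server "ceiling":
// the most important priority any of its clients can have.
fn main() {
    let mut clients_of: BTreeMap<&str, Vec<(&str, u8)>> = BTreeMap::new();
    clients_of.insert("spi_driver", vec![("important", 3), ("unimportant", 6)]);

    let mut ceiling: BTreeMap<&str, u8> = BTreeMap::new();
    for (server, clients) in &clients_of {
        if let Some(best) = clients.iter().map(|&(_, prio)| prio).min() {
            ceiling.insert(*server, best);
        }
    }

    // Boosting whoever holds the SPI lock to the ceiling (3) makes it at least
    // as important as any other client of the SPI driver, and no higher than
    // it needs to be.
    assert_eq!(ceiling["spi_driver"], 3);
}
```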

To do this, there's some basic implementation work to do, but also some design work. First, the implementation work: we'd need to start adjusting task priorities. This is easy because I did most of it four years ago:

Now, the design work:

  1. How do we know when to boost a task's priority?
  2. How do we know when to stop boosting the priority?
  3. How do we know what priority to boost to?
  4. Who keeps track of which tasks are boosted?

...and probably other questions.

cbiffle commented 5 months ago

Hm, on reflection, it also occurs to me that we could potentially have the closed receive operation itself drive the priority boosting.

hawkw commented 5 months ago

> With the exception of certain dynamic tasks used for debugging, like hiffy's generic send support and udprpc, we can identify all potential clients of a service at compile time. So, we have the opportunity to do a more precise priority boost. That's nice.

I suppose we'd need some way to actually plumb the priority information, which is known when building the dist image, into the ... whoever ends up being responsible for boosting priority ... though?

cbiffle commented 5 months ago

Aye, that we would.

The simplest thing I can think of would be to (1) have the boosting triggered by something the server does, and (2) boost to either that server's priority, or that server's priority minus one. Probably the latter.

We'd need to think about whether this needs to be transitive, since servers are typically also clients, with rare exception. Generally one of the advantages of PCP over PI is that you don't need to implement transitive graph algorithms. We'd want to make sure that holds in whatever we do.

So if, say,

  1. The clients of a server are required to have priority lower than the server's base (unboosted) priority, even if the server is boosted for some reason.
  2. A client is boosted iff the server enters closed receive against that client, making closed receive a sort of "priority donation" operation.
  3. The boost is always to the server's base priority minus 1, so that the boosted client is at least as high priority as any other client of the server.

...I think it works out? Should probably sit down with some of the PCP papers to learn why I'm obviously wrong. :-)
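
As a sanity check on those three rules, here is a small worked example with made-up numbers (again assuming smaller numbers are more important) showing why no transitive propagation should be needed:

```rust
fn main() {
    let high_server = 1u8;  // a server that the SPI server itself calls into
    let spi_server = 2u8;   // rule 1: its clients must sit numerically above 2
    let lock_holder = 6u8;  // the client currently holding the SPI lock

    // Rule 3: boost to "one step below the server", numerically base + 1 under
    // the smaller-is-more-important convention.
    let boosted = spi_server + 1;

    // The boost reaches every legal client of the SPI server (all of them sit
    // at 3 or numerically higher)...
    assert!(boosted <= 3);
    // ...but stays strictly below the SPI server and anything above it, so
    // there is never a chain of boosts to propagate up the call graph.
    assert!(boosted > spi_server && boosted > high_server);

    let _ = lock_holder;
}
```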

cbiffle commented 5 months ago

In terms of the original questions I posted, the "closed receive as priority donation" approach gets us:

How do we know when to boost a task's priority? We boost a task x if, and only if, at least one (other) task is in state InRecv(Some(x)).

How do we know when to stop boosting the priority? We unboost the task immediately upon the waiting task resuming. This will be either in response to a message from the boosted client (implying that the client is now blocked, and so its priority doesn't matter), or a notification, which allows the server to implement a timeout or service an interrupt and then decide whether to re-boost the client by re-entering a closed receive, or to open receive instead.

How do we know what priority to boost to? We can directly compute it from the base (unboosted) priority of the task executing the closed-receive operation.

Who keeps track of which tasks are boosted? The kernel; it is precisely the set of task IDs x that appear in states matching the pattern InRecv(Some(x)). If a task leaves that state for any reason (successful receive, notification, or restart by supervisor/debugger) we unboost the referenced task immediately.
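
A self-contained model of that bookkeeping (not kernel code; type and field names are invented) might look like this:

```rust
// A task is boosted iff some other task is blocked in a closed receive naming
// it; its effective priority is then derived from the receiver's base priority.

#[derive(Clone, Copy, PartialEq)]
enum State {
    Runnable,
    InRecv(Option<usize>), // Some(x): closed receive against task index x
}

struct Task {
    base_priority: u8, // smaller = more important (assumed convention)
    state: State,
}

// Compute task `i`'s effective priority by scanning for closed receives that
// name it. Linear in the small, fixed number of tasks; no allocation.
fn effective_priority(tasks: &[Task], i: usize) -> u8 {
    let mut prio = tasks[i].base_priority;
    for t in tasks {
        if t.state == State::InRecv(Some(i)) {
            // Donate "one step below the receiver" -- the "minus one" from the
            // earlier comment -- which is numerically base + 1 here.
            prio = prio.min(t.base_priority + 1);
        }
    }
    prio
}

fn main() {
    let tasks = vec![
        Task { base_priority: 2, state: State::InRecv(Some(1)) }, // SPI server
        Task { base_priority: 6, state: State::Runnable },        // lock holder
        Task { base_priority: 4, state: State::Runnable },        // medium task
    ];
    // The lock holder is boosted to 3 and now outranks the medium task (4).
    assert_eq!(effective_priority(&tasks, 1), 3);
    assert_eq!(effective_priority(&tasks, 2), 4);
}
```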

Personal development note: I have gotten a lot better at spelling the word recieve. Erm, receive.

cbiffle commented 4 months ago

As of #1762, there's another potentially nice property to closed RECV acting as the "boost priority" operation:

It can be interrupted. Specifically, a server in a closed RECV can specify a non-zero notification mask to allow some subset of events, such as timers, to break it out of the closed RECV. When this happens, the boosted client would need to atomically lose the increased priority. It would regain it if, and only if, the server re-enters the closed RECV.

This lets a server cancel a client's boosted priority. It doesn't give the server any way to notify the client that this has occurred, however -- in particular, the server cannot reply-fault the client until the client eventually calls back, still assuming it has exclusive access. This possibility already existed before #1762: a server could be restarted by the supervisor or a debugger while in a closed RECV, so that when the client eventually calls back, it no longer has exclusive access.
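
For illustration, a server-side locked section under this scheme might look roughly like the following; the receive helper is an invented stand-in for a closed receive with a notification mask, not the real post-#1762 API:

```rust
#[derive(Clone, Copy)]
struct TaskId(u16);

enum Request {
    Release,
    Transfer, // stand-in for the lock holder's actual operations
}

enum Wake {
    Message(Request),  // the lock holder sent another request
    Notification(u32), // one of the allowed notification bits, e.g. a timer
}

// Stand-in for a closed receive with a non-zero notification mask: listen only
// to `holder`, but let the masked notifications break the server out.
fn recv_from_or_notify(holder: TaskId, mask: u32) -> Wake {
    let _ = (holder, mask);
    unimplemented!("stand-in for closed receive with a notification mask")
}

const TIMER_MASK: u32 = 1;

fn locked_section(holder: TaskId) {
    loop {
        match recv_from_or_notify(holder, TIMER_MASK) {
            Wake::Message(Request::Release) => break,
            Wake::Message(Request::Transfer) => { /* do the work */ }
            Wake::Notification(_) => {
                // Leaving the closed receive drops the holder's boost. The
                // server can now time out the lock (and reply-fault the holder
                // when it eventually calls back), or loop and re-enter the
                // closed receive, which re-boosts the holder.
                break;
            }
        }
    }
}
```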