cbiffle commented 5 months ago

"Does Hubris support DMA?" is a question I get occasionally, and DMA comes up in internal design discussions as well. I'm starting this issue as a place to record context and approaches.

Current situation

We use DMA today for Ethernet. The Ethernet controller on the STM32H7 (and many other similar parts) has its own dedicated DMA engine. This is the easy case, and I'll expand on why it's easy below.

Target hardware capabilities

Most of our target processors --- possibly all of them, in fact --- have at least a rudimentary DMA engine, capable of moving data from place to place. In every case, it is a vendor-specific DMA engine, different on different models. On the H7 in particular there are no fewer than three general-purpose DMA engines, each of which has slightly different connectivity to the bus matrix, making it best suited for slightly different purposes. (If you count the blitter in the graphics controller, which is arguably general purpose, there are four.)

None of our target processors have what the Big Computer folks would call an IOMMU. There is no memory protection unit that limits the capability of the DMA engine, or for that matter any of the other peripherals that are capable of initiating AXI/AHB transactions. Some of our target processors have limited facilities for approximating this --- the LPC55 in particular has some complicated stuff to keep firmware running in nonsecure mode from being able to DMA the DRM code out of secure mode --- but it tends to be limited and is often tied to other hardware features in awkward ways (on LPC55, it's tied to secure mode).

In other words, generally speaking, a DMA engine on our target processors is more privileged than the kernel. (Even the kernel goes through the MPU, though we have it set to only intercept kernel null pointer dereferences at the moment.)

Our processors are not at all unique in this respect, but it presents some unique challenges because of other aspects of our architecture.

DMA and task isolation

The Hubris kernel is architecture-specific but vendor-independent. We've been able to pull this off because the small set of peripherals required to support the kernel --- which is basically just a tick timer and a memory protection unit --- are specified by the ARMv6-M and later architectures, so we don't have a separate ST MPU vs NXP MPU. The remaining drivers, including all the vendor specific bits, are outside the kernel in tasks.

This is a valuable property, in my opinion, and I'd sure like to maintain it.

If we allow a task access to a DMA controller, in the absence of an IOMMU, that task becomes capable of destroying the kernel, escaping memory protection, escalating its own privileges, etc. The fact that a task has this capability is not necessarily a showstopper. Let's talk about Ethernet.

Currently, our use of DMA is limited to Ethernet in the netstack. The Ethernet controller has a limited DMA controller built in. Because it's limited, its ability to stomp on certain kinds of kernel state is reduced (for reasons not relevant here). But the current situation is that, if we were sufficiently motivated, we could probably bust out of the task containment for the net task.

Does this mean we've failed at our goal of isolated components that can fail separately? I don't think it does. It's not like net runs with the MPU off; it still can't execute code from RAM, overflow its stack, etc. "Mostly isolated with a DMA controller available" is still a useful intermediate step between "fully isolated" and "privileged." I tried to mitigate risk here by writing the Ethernet DMA code very carefully, basically.

The Ethernet example is the easy case, because the net task is the sole owner of...

The ETH_MAC peripheral where the DMA controller lives.
The SRAM bank where we aim the DMA controller.
The incoming and outgoing queues in the Ethernet controller itself.

This makes it harder for a bug in the net stack to cause it to execute "confused deputy" style attacks, abusing its access to DMA to subvert other mechanisms. (If anyone reading this wants to try and exploit the net task, I will help you! Talk to me.)

It also makes it more likely that the common classes of DMA-related mistakes, like running small-number-of-bytes over the end of a buffer, or getting buffers confused, or accidental aliasing of memory that the DMA is still using, will remain confined to the net task. It does not guarantee this, to be clear, but it makes it more likely.

So now the hard case.

Shared DMA is complicated

The general-purpose DMA controllers that vendors tend to include in our target processors are multi-function peripherals that manage N DMA channels (where N is often 8 or 16). They have two key attributes that make them a poor fit for our existing mechanisms in Hubris:

They are shared. Each DMA channel is not an independent peripheral with its own resources that can be separately mapped into a task. Generally speaking, to use the DMA, you have to poke a common bank of registers.
They are an allocatable resource. DMA channels are finite, and on most processors there's a complex web of which DMA channel can be used with which peripheral, and vice versa. Many operating systems would treat this as a dynamic pool that can be managed at runtime; we don't do that, because that's how you get hard-to-reproduce load-dependent failures.

The fact that the controller is shared suggests that there should maybe be a task responsible for it, acting as a server to other tasks who want DMA to happen. This could work. There are a couple of missing mechanisms required, however.

It's not clear how callers would tell the DMA what memory to use. Hubris IPC leases are the standard way of temporarily giving up control of some of your address space to a server, but they are deliberately opaque to the server. The server cannot discover the address where the memory actually lives, something that's important for DMA.
Leases are atomically revoked if the client gets restarted before the server is done processing the message. While rare, this is important for our security model. We can do this because all server accesses to loaned memory are kernel-mediated, and all task state can be atomically changed from the kernel's perspective (it never preempts itself). We do not currently have a way for the kernel to "know" that a task's memory is being used by a hardware device for DMA, and that some specific action must be performed to cancel that transfer before the task can be restarted.

(You're probably wondering how we avoid number 2 with the Ethernet driver. The answer is: by using vendor-specific knowledge. We treat the memory shared with the Ethernet DMA as uninitialized from Rust's perspective, and we carefully go through at net start and fill it in. Before doing this, we assert the reset line to the Ethernet DMA controller, which is a very heavy-handed way of ensuring that all DMA has stopped. We can do this because we have vendor-specific knowledge of the STM32H7 reset controller, clock tree, and Ethernet peripheral; the kernel does not have any such knowledge. If there were more than one task involved, this approach would fail because they could be restarted at different times.)

Vague ideas

We could make DMA a first-class operating system thing and build it into the kernel. This would add a nontrivial amount of driver code to the kernel, and that code would be vendor-specific. But, it would then be easy for the kernel to keep track of DMA state and abort transfers if required. This is my least favorite option, but I wanted to note it for completeness --- this is how almost every other privileged-mode kernel approaches DMA and we could definitely make it work.

If we wanted to do DMA in userspace, there are a handful of mechanisms we might add to support it.

It's important that DMA happen only to portions of the physical address space that are correctly configured for it --- for instance, the observing processor needs to either have set it to bypass the caches, or needs to know when to flush. We could add a mechanism for a task to mark a Lease being sent to a server as "DMA." (We already track this attribute for memory protection regions.)
A server receiving such a lease could potentially instruct the kernel to lock it. This would set a flag on the task, which would stick around until a corresponding unlock (say). While locked, a task cannot be restarted, even by the supervisor. While a client is locked, the server would gain the ability to learn the physical addresses of leased memory.
We would keep track of who locked the task (this is easy, since the locked task will by definition be in WaitingForReply state from the task that locked it). The locks should probably not reset when the server restarts, because we can't guarantee that DMA stops at server restart. So, servers need a way to discover their locked tasks at startup so they can unwind them. This would almost certainly merit a new task state for the clients, because currently all "client is waiting on server" states are cleared if the server restarts.
We would probably want to add a new kernel-to-supervisor notification on task lock / unlock, so that the supervisor could wait for a task to unlock and then restart it if needed.

There are a lot of parts of this handwavey sketch that I don't love -- in particular, locking a task seems like an availability risk.

Please post more ideas, half-baked or otherwise.

cbiffle commented 5 months ago

@andrewjstone suggested a spin on things that I hadn't considered, which I will attempt to summarize below.

Create a DMA task that owns all DMA-capable buffers and the DMA unit.
Designate certain other tasks as responsible for individual buffers.
Have the DMA task perform transfers internally on those tasks' behalf. (For the peripheral side of the peripheral-to-memory transfer, we could either use the uses data to ensure that the caller can actually access the target peripheral, or we could only permit a fixed peripheral location in the config file.)
The DMA task would then be willing to copy to/from one of its buffers and the buffer's associated task, but only if it knows there is no transfer in progress.

This would serve cases where

The transfer size is predictable
We want DMA mostly to make response latency predictable, rather than for throughput (because it adds at least one additional copy compared to normal DMA)

It has the advantage of, potentially, not requiring significant kernel changes, which means we could prototype it in a task and decide if we like it. (As currently described it would require one kernel change, which is that the kernel refuses to transfer to or from leased memory that is marked DMA, to keep it from potentially racing hardware. We'd need some mechanism to either reassure it, or toggle the attribute.)

We don't even need an allocator in the DMA task, because we would know the set of required buffers at build time (for memory allocation), and the set of potential DMA transfers (for channel allocation).

hawkw commented 5 months ago

This would serve cases where

The transfer size is predictable

We want DMA mostly to make response latency predictable, rather than for throughput (because it adds at least one additional copy compared to normal DMA)

This, and especially the second point, does seem to describe our target use cases pretty well, so that's certainly a plus.

cbiffle commented 5 months ago

Yeah, as far as I can tell we have only one performance-sensitive DMA use case right now, which is already covered in the Ethernet driver. The RoT in particular is very much a "correctness over performance" situation.

lzrd commented 5 months ago

Is it not possible to use an MPU slot and map the DMA buffers in the DMA bufffer section into the client task space as well? DMA descriptors are still owned by the dma_server task. Notifications and IPC communicate work todo/done. dma client task can access any time so you can have the zero-copy semantic.

cbiffle commented 5 months ago

It's technically possible to permanently map a properly aligned buffer into both tasks, yes. It presents some handling challenges, in that now we're trying to do safe DMA (and thus aliasing) across two tasks and not just within one task a time. We wouldn't be able to use Rust references into the buffer ever, but if that's acceptable we could probably devise something.

Again, though -- RoT. I would rather it be slow than exploited, and shared memory is always risky.

lzrd commented 5 months ago

I am in agreement w.r.t. RoT reliability over performance. Just thinking about what the general case could look like. POITROAE so without a motivating use case, I support the extra copies.

cbiffle commented 5 months ago

I definitely agree that there are a couple of potential paths for optimizing the copy-happy approach, whether it's using continuously shared memory, or mutable MPU region tables with some kind of a transfer. So, don't mean to shut you down on that!

cbiffle commented 5 months ago

Coming back to this idea after the weekend, it occurs to me that @andrewjstone's proposal is basically a "giant FIFO task." In the sense that the data is still going somewhere from which it needs to be collected (rather than directly deposited into task-owned buffers), and if some of these higher-rate peripherals had large hardware FIFOs, we'd probably just use those.

Thought that was kind of interesting.

oxidecomputer / hubris

General purpose DMA #1697

Current situation

Target hardware capabilities

DMA and task isolation

Shared DMA is complicated

Vague ideas