CX state context virtualization: providing compositionality when multiple CX libraries must share (time multiplex) a physical CX state context, used variously from the same thread

This long issue write-up is dedicated to the memory of my wonderful friend, Chris Brumme, who would write lengthy, head-spinningly technically dense blog posts while (for example) becalmed in a dentist's office waiting room. Dearly missed.

This issue is kin to #19, in which we contemplate changing the behavior of executing a custom instruction when mcx_selector is invalid, from setting cx_status.IC, to taking an illegal instruction exception.

The primary tenet of the composable extensions design is easy, robust composition of separately authored composable custom extensions, their software libraries, and their hardware implementations. If composition is not surefire and routine, we might as well not bother.

The design is clear about how a system may be configured with multiple composable extensions, each with multiple state contexts, and how these may then be used by multiple CX libraries, via CX multiplexing: discovering, selecting, and issuing custom instructions of each extension.

A subtle CX library composition problem

But consider this essential composition scenario, pertaining to just one CX:

M/S/U system running Linux and configured with one CX, configured with one state context.
An app in a user thread comprises two separately authored libraries.
Unbeknownst to all parties (and under library versioning (which is routine), unknowable), BOTH libraries will attempt to discover, select, and use the stateful CX. (Or they both independently call a third library that (now) does so. Same effect.)
The resource manager (here, the OS) can provide access to the CX's one state context only once. So, assuming the first library isn't finished with that stateful CX yet, the second library's attempt to discover and use the stateful CX must fail because the CX's one state context is already in use by the first library. The second library cannot use the CX, and must fall back to software. Ouch!

Thus CX library composition may fail when there are multiple libraries using the same stateful CX. Very bad! (We mean "fail" in the sense that all the advantages and utility of that stateful CX are not available to that second library, not necessarily "fail" in the correctness sense.)
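To make the failure concrete, here is a minimal sketch of a naive resource manager that hands out the CX's one state context at most once. The names naive_cx_open(), naive_cx_close(), library_init(), and CX_SEL_NONE are hypothetical, invented for this illustration; they are not part of the CX spec or of any real CX Runtime.

```c
#include <assert.h>

/* Toy model only: every name here is illustrative, not from the CX spec. */
typedef int cx_sel_t;
#define CX_SEL_NONE (-1)          /* hypothetical "open failed" result */

static int free_contexts = 1;     /* the CX is configured with one state context */

/* Naive policy: hand out a physical state context, or fail when none is left. */
static cx_sel_t naive_cx_open(void) {
    if (free_contexts == 0)
        return CX_SEL_NONE;
    free_contexts--;
    return 0;                     /* selector index of the only state context */
}

/* Release the one state context so that a later open can succeed. */
static void naive_cx_close(cx_sel_t sel) {
    assert(sel == 0);
    free_contexts++;
}

/* Each library tries the CX first. Returns 1 if it got the CX,
   0 if it must fall back to software. */
static int library_init(cx_sel_t *sel) {
    *sel = naive_cx_open();
    return *sel != CX_SEL_NONE;
}
```

With this policy, the first library_init() call succeeds and the second returns 0 until the first library closes its selector, which is exactly the "must fall back to software" outcome described above.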

Some inadequate mitigations

There are several stop-gap half-measures to partially mitigate our composition problem, but they ultimately prove inadequate. (Don't fret: below we explain how to make arbitrary CX state context virtualization work for all.)

The specs' CX state context configuration and management mechanisms (not policy) allow each configured CX in a system to be configured with hundreds of its CX state contexts. Therefore the number of physical CX state contexts may be "overprovisioned" (with respect to harts), so that any given software thread's resource manager / OS might draw new CX state contexts from a substantial pool of unassigned, idling CX state contexts. So the second CX library can obtain a second CX state context, as desired. But this is no panacea! Eventually "one more library" and "one more" again, and again, is invariably added to a sprawling software project, and then some CX's state contexts will be oversubscribed, as above, and CX library composition fails. Also, there are important classes of stateful CXs, such as tensor units, where overprovisioning / idling CX state contexts is prohibitively expensive.
We might constrain the stateful CX programming model so that a library may not retain or reserve a CX state context for very long. Imagine we required that a CX library must open (discover) a CX state context, select it, issue its custom instructions, and close it, i.e., "get in, do it, and get out!", each time the CX library function is called. In this model, the two libraries' uses of the stateful CX are safely time multiplexed, because each disjointly and repeatedly acquires and releases the CX state context and there is no moment when the state context is assigned to both libraries. However, this mitigation is inadequate because of two problems. Firstly, it is still not composition safe. Even if the first library is written to open, select, issue, and close the CX state context prior to returning to app client code, in the midst of that code sequence the first library may call a third library, and, oh dear! that library also attempts to open, select, issue, and close the same stateful CX. Yet that call from the first library to the third, from the midst of a first library function, is a fundamental practice in practical software development and cannot be forbidden (technically or expressively). A second and ultimately damning problem with this "do not retain a CX state context for long" approach is that certain stateful composable extensions by design must keep the state context open across CX library calls -- indeed open across the life of the process or even the uptime of the machine. In particular, a CX state context may be too large to reinitialize upon every single CX library call.
We might constrain the stateful CX programming model to mandate CX state context sharing: when the second library opens the CX state context already opened by the first library, it receives a shared reference to it, so that both the first and second library issue stateful CX custom instructions to the one self-same CX state context. It should be obvious that this too breaks composition. The work performed by each library upon the same state context is intermingled and probably corrupted. The behavior of the first library will change depending on whether it is composed with the second library. So this mitigation is also rejected. (But I must note, there are good but uncommon use cases and categories of stateful custom extensions in which we will want to provide a means for each such CX library to explicitly opt in to open a singleton shared instance of a CX state context (singleton per thread, singleton per process, singleton per machine).)
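The corruption caused by mandatory sharing can be shown with a toy stateful CX. The sketch below assumes a hypothetical CX whose entire state context is one accumulator; cx_clear(), cx_add(), cx_read(), and the two library functions are invented for illustration and stand in for real custom instructions.

```c
#include <assert.h>

/* Toy stateful CX: its single state context is one accumulator register. */
static int acc;                           /* the one shared physical state context */
static void cx_clear(void) { acc = 0; }
static void cx_add(int x)  { acc += x; }
static int  cx_read(void)  { return acc; }

/* Library B: sums two numbers on the CX. Its own result is always right. */
static int lib_b_sum2(int x, int y) {
    cx_clear();
    cx_add(x);
    cx_add(y);
    assert(cx_read() == x + y);
    return cx_read();
}

/* Library A: sums three numbers, optionally calling library B mid-sequence. */
static int lib_a_sum3(int x, int y, int z, int call_b) {
    cx_clear();
    cx_add(x);
    cx_add(y);
    if (call_b)
        (void)lib_b_sum2(100, 200);       /* clobbers A's partial sum */
    cx_add(z);
    return cx_read();
}
```

Called alone, lib_a_sum3(1, 2, 3, 0) returns 6. When it calls into library B mid-sequence, lib_a_sum3(1, 2, 3, 1) returns 303, because B's work landed in the same state context. Library A's behavior has changed merely because it was composed with library B.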

A precedent: threads as unlimited, virtual harts

These three mitigations do not achieve routine composition, i.e., allow both libraries to concurrently open / retain their own private (isolated) CX state context object, when there is physically only one CX state context available.

There are many examples of this type of problem in computer systems design. For 60 years we have used multiprogramming to concurrently run multiple programs upon one computer. This evolved to running multiple processes of multiple threads upon one or more harts. The user software, running within a thread, occasionally scheduled to a hart, is generally unaware that its thread is scheduled and descheduled and even rescheduled to a completely different hart. Software just goes about its business, oblivious to all this, happy with this abstraction of a virtualized hart.

Indeed, multiprogramming threads over RISC-V harts, in the presence of composable extensions and especially CX state contexts, is already carefully addressed in the CX spec. Suppose one thread uses a CX while scheduled to a hart, then that thread is descheduled and another thread is scheduled to the hart, and suppose further that the OS's CX state context resource manager must recycle that CX state context for the second thread about to resume on that hart. Here the OS performs a CX context save of the outgoing thread's CX state context and a CX context reload of the incoming thread's CX state context.
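As a rough illustration of that save and reload, the sketch below models the physical CX state context as a small array and gives each thread a save area. The types cx_state_t and thread_t, the size CX_STATE_WORDS, and the function cx_context_switch() are assumptions made for this example; a real OS would move the state using whatever state access mechanism the CX provides rather than memcpy.

```c
#include <assert.h>
#include <string.h>

#define CX_STATE_WORDS 4               /* illustrative state context size */

typedef struct { unsigned words[CX_STATE_WORDS]; } cx_state_t;
typedef struct { cx_state_t saved; } thread_t;   /* per-thread save area */

static cx_state_t physical;            /* the hart's one physical CX state context */

/* On a thread switch: save the outgoing thread's CX state,
   then reload the incoming thread's CX state. */
static void cx_context_switch(thread_t *out, thread_t *in) {
    assert(out && in);
    memcpy(&out->saved, &physical, sizeof physical);
    memcpy(&physical, &in->saved, sizeof physical);
}
```

Each thread sees its own values in the state context whenever it runs, even though only one physical copy exists.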

So, we already know how to virtualize multiple CX state contexts, one per thread, upon a single CX state context. We also have specified an event that may cause CX state context switching -- a thread context switch.

For the present library composition problem, we now must somehow determine how to virtualize multiple CX state contexts within one thread, multiplexing them upon a single CX state context. And we need a new event that triggers that, potentially as frequently as each subsequent CX selection switch.

Understanding the CX library programming model, CX Runtime API, and CX access control

For concreteness, here is a CX library programming model, which uses a likely CX Runtime API for uniformly discovering, selecting, and issuing custom instructions of a composable extension.

    // A simple CX Runtime API:
    typedef int cx_sel_t;          // CX selector descriptor -- opaque quantity, like a file descriptor
    cx_sel_t cx_open(CX_GUID);     // discovery: if CX is available, open (acquire) a selector for a state context of that CX
    cx_sel_t cx_select(cx_sel_t);  // select the specified CX (and state), return previous selection
    int cx_close(cx_sel_t);        // close (release) this CX and its CX state context and the selector

Here we should expect cx_open() and cx_close() to be OS kernel calls, performed infrequently (e.g. on CX library load or CX library init), whereas cx_select() may be used with much greater frequency, to repeatedly select this CX and then that one across the various CX libraries in the application.

cx_select() selects a new CX/state context, prior to issuing custom instructions to it. A cx_select() call should ideally compile into one instruction:

    cx_sel_t prev = cx_select(my_sel);
    ->
    csrrw a1,cx_index,a0    // a0: my_sel; a1: prev

which (%2.2.3, %2.2.4, %2.7) writes cx_index with my_sel, the new CX selector index; that write in turn fetches the corresponding CX selector entry from the OS managed CX selector table and copies that entry to the mcx_selector CSR.
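The effect of that single csrrw can be modeled in plain C. The following is only a software model under stated assumptions: the table size CX_TABLE_ENTRIES, the 32-bit selector width, and the helper lib_call() are hypothetical, and the variables cx_index and mcx_selector merely stand in for the CSRs of the same names.

```c
#include <assert.h>
#include <stdint.h>

typedef int cx_sel_t;
#define CX_TABLE_ENTRIES 8                    /* illustrative table size */

static uint32_t cx_table[CX_TABLE_ENTRIES];   /* models the OS managed CX selector table */
static cx_sel_t cx_index;                     /* models the cx_index CSR */
static uint32_t mcx_selector;                 /* models the mcx_selector CSR */

/* Models `csrrw rd, cx_index, rs1`: swap in the new index and copy the
   corresponding selector table entry to mcx_selector. */
static cx_sel_t cx_select(cx_sel_t sel) {
    assert(sel >= 0 && sel < CX_TABLE_ENTRIES);
    cx_sel_t prev = cx_index;
    cx_index = sel;
    mcx_selector = cx_table[sel];
    return prev;
}

/* A typical library pattern: select, issue custom instructions,
   then restore the caller's selection. */
static uint32_t lib_call(cx_sel_t my_sel) {
    cx_sel_t prev = cx_select(my_sel);
    uint32_t used = mcx_selector;             /* stand-in for issuing custom instructions */
    (void)cx_select(prev);
    return used;
}
```

Because cx_select() returns the previous selection, a library can restore its caller's selection on the way out, which is what lets nested library calls coexist.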

Then when the CX library issues CX custom instruction(s), the current mcx_selector CSR determines which CX and CX state context receives the custom instruction(s). But on the other hand, if mcx_selector ever holds an "invalid" selector value, then invoking a custom instruction is an error. Currently the spec will record the error in cx_status flags such as cx_status.IC -- invalid CX_ID (no such CX).

If, as is also discussed in #19, such an error is signaled not with a cx_status error flag but rather with an illegal instruction exception, this provides all the mechanism we might need for the OS to implement a policy of transparent virtualization of a CX state context across multiple library uses of the stateful CX, all within a single thread.

Inside a CX-aware OS, juggling entries in the hart's CX selector table

In response to a cx_open() call, the OS walks the CX Map to determine whether the requested CX is available on this system.

If so, the OS then acquires a state context of the CX for use by the system-caller. The OS selects a CX state context, according to some CX state context resource management policy. Assume (to keep this discussion simple) that the policy is that each stateful CX is configured with one state context per hart, so that the OS assigns exactly one CX state context (per CX) to each thread. At this point we have determined a valid selector value which is a tuple (cx_id,state_id) identifying a specific CX and its state context.

But (%2.7) the OS doesn't write that selector to mcx_selector. Rather, the OS allocates a new entry in the thread's CX selector table, copies the CX selector value into that entry, and returns the index of that entry back to user-code.
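The OS side of cx_open() described so far might look roughly like the sketch below. Everything in it is an assumption for illustration: the 64-bit cx_guid_t stand-in, the contents of cx_map, the MAKE_SELECTOR packing of (cx_id, state_id), and the choice to reserve table index 0 are not taken from the spec.

```c
#include <assert.h>
#include <stdint.h>

typedef int cx_sel_t;
typedef uint64_t cx_guid_t;                   /* stand-in for a real CX GUID */

/* Hypothetical packing of the (cx_id, state_id) tuple into a selector value. */
#define MAKE_SELECTOR(cx_id, state_id) \
    (((uint32_t)(cx_id) << 16) | (uint32_t)(state_id))

typedef struct { cx_guid_t guid; uint16_t cx_id; } cx_map_entry_t;
static const cx_map_entry_t cx_map[] = { { 0x1111, 3 }, { 0x2222, 5 } };

#define CX_TABLE_ENTRIES 8
static uint32_t cx_table[CX_TABLE_ENTRIES];   /* the thread's CX selector table */
static int      cx_table_used[CX_TABLE_ENTRIES];

/* Walk the CX map, build the selector, store it in a free table entry,
   and return that entry's index (or -1 on failure). */
static cx_sel_t os_cx_open(cx_guid_t guid, uint16_t state_id) {
    assert(cx_table_used[0] == 0);            /* index 0 is never handed out here */
    for (unsigned m = 0; m < sizeof cx_map / sizeof cx_map[0]; m++) {
        if (cx_map[m].guid != guid)
            continue;
        for (cx_sel_t i = 1; i < CX_TABLE_ENTRIES; i++) {
            if (!cx_table_used[i]) {
                cx_table_used[i] = 1;
                cx_table[i] = MAKE_SELECTOR(cx_map[m].cx_id, state_id);
                return i;
            }
        }
        return -1;                            /* selector table is full */
    }
    return -1;                                /* no such CX on this system */
}
```

User code only ever sees the returned index; the selector value itself stays in the OS managed table.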

But what happens when a second cx_open() call (from a separate CX library, for example) attempts to open a fresh CX state context for this specific CX? Since (we have assumed) this thread is only ever allotted one CX state context for this CX, it would be disastrous to look up the same CX selector value for this (same) physical CX state context, and then copy that value into a second entry in the CX selector table. Why so? Because when that second CX selector index is returned to user code, which then selects through it using the cx_select() API, it will end up executing its CX custom instructions using the same CX state context as a different CX library whose use of the CX state context is still underway, likely corrupting state for both libraries and breaking composition.

Virtual CX state contexts for multiple independent accesses to one CX state context from one thread

The previous paragraph is the wrong way. The right way: when the OS determines that there is already a CX selector table entry for this one state context, the OS should prepare a CX selector for it but mark that selector invalid -- a "poisoned" CX selector value -- and copy it into the new second entry in the CX selector table.

Next the index is returned to user code. User code selects it using cx_select(), which copies the invalid selector from the CX selector table entry into the mcx_selector CSR. Then the next time a custom instruction issues, it causes an illegal instruction exception.

This is the hook we need to swap virtualized CX state contexts, on demand, within one thread! The illegal instruction exception handler discovers that the faulting instruction is a custom instruction, and checks mcx_selector and cx_index. The OS now knows which CX state context is "swapped out" and must be "swapped in".

Still in the handler, the OS saves (just) this CX's current CX state context on the thread's hart, then reloads or initializes this CX state context for the second access from this thread. Having done this, it copies the valid CX selector for this CX state context into the second entry, and also replaces the first entry with the invalid poisoned selector value. It rewrites cx_index with the self-same value, which now writes mcx_selector with the valid selector for that specific CX state context. The OS returns from the exception and this time the custom instructions reissue and are correctly issued upon the second virtual CX state context of this thread, as desired.

Note we were careful to poison the selector of the first entry, whose virtualized CX state context is now "swapped out". When user code resumes and eventually attempts to select and issue custom instructions upon the first selector, this too takes an illegal instruction trap, with similar results, the handler being careful to CX context switch back to the first CX state context prior to reissuing custom instructions upon that CX / state context.
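The whole poison, trap, and swap sequence can be exercised in a small software simulation. This is a sketch under loud assumptions, not the proposed implementation: the CX state context is a single accumulator, POISON_BIT and VALID_SEL are hypothetical selector encodings, the trap is modeled as an ordinary function call from the toy custom instruction cx_add(), and all function names are invented here.

```c
#include <assert.h>
#include <stdint.h>

typedef int cx_sel_t;
#define N_ENTRIES  4
#define VALID_SEL  0x00050000u               /* hypothetical (cx_id=5, state_id=0) */
#define POISON_BIT 0x80000000u               /* hypothetical "invalid selector" flag */

static uint32_t cx_table[N_ENTRIES];         /* the thread's CX selector table */
static int      saved_acc[N_ENTRIES];        /* OS save area, one per virtual context */
static int      n_open;                      /* table entries handed out so far */
static cx_sel_t owner = -1;                  /* entry currently backed by hardware */
static cx_sel_t cx_index;                    /* models the cx_index CSR */
static uint32_t mcx_selector = POISON_BIT;   /* models the mcx_selector CSR */
static int      phys_acc;                    /* the one physical state context */
static int      swaps;                       /* number of on-demand context swaps */

/* First open gets the physical context; later opens get a poisoned selector. */
static cx_sel_t os_cx_open(void) {
    assert(n_open < N_ENTRIES);
    cx_sel_t i = n_open++;
    saved_acc[i] = 0;                        /* initial state of a fresh virtual context */
    if (owner < 0) { owner = i; cx_table[i] = VALID_SEL; phys_acc = 0; }
    else           { cx_table[i] = VALID_SEL | POISON_BIT; }
    return i;
}

/* Models writing cx_index: the table entry is copied to mcx_selector. */
static cx_sel_t cx_select(cx_sel_t i) {
    cx_sel_t prev = cx_index;
    cx_index = i;
    mcx_selector = cx_table[i];
    return prev;
}

/* Models the illegal instruction handler for a poisoned selector. */
static void os_cx_trap(void) {
    assert(owner >= 0);
    saved_acc[owner] = phys_acc;             /* save the outgoing context */
    cx_table[owner] |= POISON_BIT;           /* poison the swapped-out entry */
    phys_acc = saved_acc[cx_index];          /* reload (or initial) incoming state */
    cx_table[cx_index] = VALID_SEL;          /* faulting entry becomes valid */
    owner = cx_index;
    (void)cx_select(cx_index);               /* rewrite cx_index to refresh mcx_selector */
    swaps++;
}

/* Toy custom instruction: add to the accumulator and return it. */
static int cx_add(int x) {
    if (mcx_selector & POISON_BIT)
        os_cx_trap();                        /* trap, swap, then reissue */
    phys_acc += x;
    return phys_acc;
}
```

If two libraries each call os_cx_open() and then interleave cx_select() and cx_add() calls, each sees only its own running sum, and swaps counts one context swap for every change of selection that is followed by a custom instruction. Neither library can tell that its state context was ever swapped out.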

Summary

By changing the spec to take an illegal instruction fault upon issuing custom instructions when the CX selector CSR is invalid, we enable transparent on-demand use of a CX that is not backed with a physical CX or physical CX state context. In the former case, the trap handler emulates the particular CX custom instructions; here, in the latter case, the trap handler swaps CX state contexts in and out, virtualizing the CX state contexts available to any thread.

This solves the present composition issue, and so far we don't have a better or simpler way.

Also note, just as with multiprogramming, a CX library client of a virtualized CX state context need not be aware whether its CX state is virtualized. When it needs it, it's there.

Note that virtualization and emulation using a CX-aware illegal instruction trap handler can be made to work even if the CX access control (%2.7) CSRs mcx_table and cx_index are not implemented (e.g. simple austere M-mode only MCUs).

Issues #19 and this #24 together seem a compelling value proposition for this illegal instruction trap change and its downstream impact on cx_status. Further analysis is required to understand its impact, if any, on CXU-LI and CXU-LI compliant CPU cores.
