add CX scoped custom CSRs (CX CSRs)

grayresearch commented 8 months ago

Background

The spec requires CX multiplexing for conflict-free composition of independently authored composable custom extensions. Here "conflict-free" means each extension may use any custom opcode instructions. With CX multiplexing, we select the hart's current CX and state context prior to issuing custom instructions to that CX. Thus even if two composable extensions use the same custom opcodes for different custom instructions, the fact that the hart's current CX and state context is always selected ensures that the correct custom instruction is performed, by the selected CX, in response to that custom opcode.

CXs may be stateful. Each CX state context is private (isolated) and is accessible only via custom instructions issued through CX multiplexing. Also, the spec defines four mandatory custom instructions, cf_{read,write}_{status,state}, together called IStateContext, that enable uniform CX state context save/restore for any stateful CX.

One of the spec's design tenets is uniformity. A uniform programming model will help the RISC-V custom computing community achieve an ecosystem of reusable CX library software and CX unit hardware.

Some current spec shortcomings with respect to CSRs for CXs

1. Uniform CX CSRs: The spec does not define a uniform way to access CX-specific control register state. Nor does it specify whether/how the CSR access instructions (csrrw/csrrs/csrrc) might access CX state. In lieu of this, each CX will invent its own idiosyncratic way, using its own custom instructions, for accessing CX control register state.
2. Privileged CX CSRs: The spec’s CX access control system (mcx_table, cx_index) allows privileged software to grant or deny a hart access to a specific CX and state context, but once granted, it is all-access. So e.g. if a malicious user thread is granted access to a stateful CX by the OS, the user thread can use the cf_write_state custom instruction, provided for operating system CX state context reload, to poke arbitrary data into the current CX state context. That is not a scenario we want every CXU to defend against. Rather, we require a way to make certain CX state context accesses conditional on the current privilege level. Of course the existing CSR access instructions afford such fine-grained privileged access control: CSR "addresses" encode access privileges (user, system, machine) × (read-write, read-only). It would be great if this mechanism could be extended to provide fine-grained privileged access control for CSRs of CXs.
3. Conflict-free custom CSRs: CX multiplexing allows each composable extension to use any custom opcodes. This supports a decentralized composable extensions ecosystem. But what about custom CSRs? My CX might use certain custom CSR indices and your CX might use some of the same custom CSR indices for a different purpose. What can we do about custom CSR address collisions across separately authored CXs?
4. Mandatory stateful CX custom instructions: IStateContext’s mandatory custom function instructions reserve 16 CFIDs (i.e., 16 custom-0 instructions), of which four are cf_{read,write}_{status,state}. Some dislike this carve-out because it pollutes the custom instruction encoding space of composable extensions. Our aspirational message, "use any custom opcodes you like, conflict free", is now, with greater precision, "use any custom opcodes you like, except this reserved range, conflict free".

CX scoped custom CSRs

To address these shortcomings, this Issue proposes adding “CX scoped custom CSRs” (CX CSRs) to the spec.

Here are two different ways to do this.

CX CSR access custom instructions: Recall the spec reserves 16 custom function instructions (CFIDs 4080..4095), of which cf_{read,write}_{status,state} are four. To these four, add three new instructions echoing Zicsr's csrrw, csrrs, csrrc:
```
[CF_ID=4091] cf_csrrw rd,rs1,rs2 ::= rd = CX.CSR[rs1]; CX.CSR[rs1] = rs2
[CF_ID=4090] cf_csrrs rd,rs1,rs2 ::= rd = CX.CSR[rs1]; CX.CSR[rs1] |= rs2
[CF_ID=4089] cf_csrrc rd,rs1,rs2 ::= rd = CX.CSR[rs1]; CX.CSR[rs1] &= ~rs2
```
This fixes shortcomings #1 (uniform access) and #3 (conflict-free CX CSRs), but not #2 (privileged CX CSRs) nor #4 (undesirable mandatory stateful CX custom instructions).

Pros: it costs next to zero gates or LUTs in extensible processors that already implement the CX spec. CX CSRs are just more (processor-uninterpreted) custom function instructions forwarded to some (selected) CXU. It preserves unchanged the definition of a CX as a set of stateful custom instructions (only).

Cons: it introduces a new, redundant set of CSR access instructions that are used only to access CX CSRs. This will add unfortunate downstream work, e.g. in developer tools, compilers, debuggers, and program analysis tools.
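
As a rough illustration of the three pseudocode definitions above, here is a minimal Python sketch of a single CX state context's CSR file. The class name `CxCsrFile`, the 32-bit XLEN, and the choice that unwritten CSRs read as 0 are all assumptions made for the sketch, not anything the spec defines.

```python
# Hypothetical model of one CX state context's CSR file.
# Names and the XLEN=32 assumption are illustrative only.
XLEN_MASK = 0xFFFF_FFFF


class CxCsrFile:
    def __init__(self):
        # CSR index -> value; in this sketch, unwritten CSRs read as 0.
        self.csr = {}

    def cf_csrrw(self, index, value):
        """rd = CX.CSR[index]; CX.CSR[index] = value"""
        old = self.csr.get(index, 0)
        self.csr[index] = value & XLEN_MASK
        return old

    def cf_csrrs(self, index, mask):
        """rd = CX.CSR[index]; CX.CSR[index] |= mask"""
        old = self.csr.get(index, 0)
        self.csr[index] = (old | mask) & XLEN_MASK
        return old

    def cf_csrrc(self, index, mask):
        """rd = CX.CSR[index]; CX.CSR[index] &= ~mask"""
        old = self.csr.get(index, 0)
        self.csr[index] = old & ~mask & XLEN_MASK
        return old
```

Each method returns the old CSR value, mirroring the `rd` result of the corresponding instruction.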

Multiplexed custom CSR accesses: Here, when CX multiplexing is enabled, custom-space CSR accesses performed by existing csrr[wcs][i] instructions are read/write and privilege-mode-access checked and then performed using the hart's currently selected CX and CX state context.

This fixes all the shortcomings listed above.

Pros: it extends the use of custom CSRs to CX CSRs in a clean way. It retains the existing CSR access instructions without introducing another set of them for CXs.

Cons: The definition of a CX must change to be a set of stateful custom instructions and also a set of custom CSRs. (*) It may require changing the CXU-LI to convey not only custom function instruction requests/responses, but now also custom CSR access requests/responses. It may require additional wide multiplexers in the processor datapath to route CSR access instruction fields (e.g. 12b CSR address) into CXU-LI ports.

(*) This is analogous to defining a software interface abstraction as a set of methods / member functions (only), vs. defining it as a set of methods/member functions plus a set of data members.

Note the spec requires that "Attempts to access a non-existent CSR raise an illegal instruction exception." This may be challenging to achieve in the current spec, which does not signal an exception but rather sets an error flag for the analogous error of issuing a custom instruction that is not implemented by a CX.

Also note, any CX CSR access must follow the CSR access ordering rules per the priv spec.

Taking stock of the two options, the clear winner from a clean HW-SW ISA perspective is the second one. Adding uniform, privilege-checked CX CSRs, via existing CSR access instructions, and providing unlimited conflict-free CX CSRs, is a significant improvement over "no uniform support for CX CSRs, roll your own" in the current spec. However, we must take care that this approach does not inevitably force expensive new multiplexers into processor datapaths.

Impact of adding CX CSRs upon CXU-LI

Presently CXU-LI provides no means to convey a CSR access to a selected CXU. Here are two different ways to do this.

Don't change CXU Requests and Responses: Here the processor must express the CSR R/W access using existing CXU Request signaling. It could do this using certain reserved CF_IDs corresponding to the various CSR accesses. In other words, even if we adopt option #2, "multiplexed custom CSR access", as the ISA mechanism, the processor could nevertheless map the CX CSR access into a CX custom function instruction.
Change CXU Requests and Responses: Extend CXU-LI signaling to explicitly represent (signal) CX CSR accesses, distinct from other CX custom function instructions. Rather than add several expensive new ports, we might try to share the existing CXU request ports where that makes sense. The CSR address might be a new 12b port, or it might reuse req_data1[11:0] or extend-and-use req_func[9:0]. The new CSR value, already sourced on X[rs1], might as well arrive on req_data0[]. The 3 CSR access operations csrrw/csrrs/csrrc might be encoded and conveyed via a new 2-bit field req_type (or req_cmd); the fourth value might signal "NOT CSR access", i.e. that this is a custom function instruction, not a CSR access instruction.

If we adopt this encoding, the hardware cost of extending CXU-LI to carry CX CSR accesses is +2 signal bits per request, plus one 12-bit 2:1 mux in the CPU to route the CSR address field into req_data1[11:0]. Note that besides the extra LUTs for this mux, the extra LUT delay and wiring near req_data1[] is painful. For that reason it might indeed be better to convey the CSR address on req_func[], which is after all nowhere near the critical EX stage register operands, muxes, and ALU.

We must also convey CX CSR address errors. We might reuse cxu_status = CFU_ERROR_FUNC to signal an error that the addressed CSR is not implemented by this CX.

In all, the expected hardware cost of adding CX CSRs to CXU-LI is unfortunate but manageable.

Summary

We recommend adding CX scoped custom CSRs to the CX spec. This should be done by extending CX multiplexing to also multiplex CSR access instructions to custom CSR addresses when a CX state context is selected.

CXU-LI should be extended to explicitly represent and distinguish between CX custom instructions and CX custom CSR accesses, with care, so as to minimize the expected area and frequency impact of the new signaling.

grayresearch commented 7 months ago

896 of 4096 CSRs are custom CSRs:

 | Unprivileged and User-Level CSRs
 | BA | 98 | 7654 | Range       |   # | type
 | 10 | 00 | XXXX | 0x800-0x8FF | 256 | U RW
 | 11 | 00 | 11XX | 0xCC0-0xCFF |  64 | U RO
 | Supervisor-Level CSRs
 | 01 | 01 | 11XX | 0x5C0-0x5FF |  64 | S RW
 | 10 | 01 | 11XX | 0x9C0-0x9FF |  64 | S RW
 | 11 | 01 | 11XX | 0xDC0-0xDFF |  64 | S RO
 | Hypervisor and VS CSRs
 | 01 | 10 | 11XX | 0x6C0-0x6FF |  64 | H RW
 | 10 | 10 | 11XX | 0xAC0-0xAFF |  64 | H RW
 | 11 | 10 | 11XX | 0xEC0-0xEFF |  64 | H RO
 | Machine-Level CSRs
 | 01 | 11 | 11XX | 0x7C0-0x7FF |  64 | M RW
 | 10 | 11 | 11XX | 0xBC0-0xBFF |  64 | M RW
 | 11 | 11 | 11XX | 0xFC0-0xFFF |  64 | M RO
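
The 896 total can be checked directly from the ranges in the table. The short Python sketch below sums the range sizes and decodes the two address fields the table headers call BA (csr[11:10], access type) and 98 (csr[9:8], lowest privilege level). The helper names are hypothetical.

```python
# Custom CSR address ranges copied from the table above: (first, last).
CUSTOM_CSR_RANGES = [
    (0x800, 0x8FF), (0xCC0, 0xCFF),                  # unprivileged / user
    (0x5C0, 0x5FF), (0x9C0, 0x9FF), (0xDC0, 0xDFF),  # supervisor
    (0x6C0, 0x6FF), (0xAC0, 0xAFF), (0xEC0, 0xEFF),  # hypervisor / VS
    (0x7C0, 0x7FF), (0xBC0, 0xBFF), (0xFC0, 0xFFF),  # machine
]


def is_custom_csr(addr):
    """True if the 12-bit CSR address falls in a custom range."""
    return any(lo <= addr <= hi for lo, hi in CUSTOM_CSR_RANGES)


def min_privilege(addr):
    """csr[9:8]: lowest privilege level allowed to access the CSR."""
    return (addr >> 8) & 0x3


def is_read_only(addr):
    """csr[11:10] == 0b11 marks a read-only CSR."""
    return (addr >> 10) & 0x3 == 0x3


total_custom = sum(hi - lo + 1 for lo, hi in CUSTOM_CSR_RANGES)
print(total_custom)  # 256 + 10 * 64 = 896
```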

Division of responsibilities

In extending composable extension multiplexing to also multiplex custom CSR accesses, the CPU, not each CXU, shall first check the privilege access and read-write access of the CSR access, prior to forwarding the CSRR[WSC] access to the selected extension and state context. If the hart does not have sufficient privilege, or if the access writes a read-only CSR, the CSR access shall raise an illegal instruction exception per the priv spec.
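
A minimal sketch of this division of responsibilities, in Python: the CPU-side check runs first, and only accesses that pass are forwarded. The exception class, the function names, and the `forward_to_cx` callback are hypothetical stand-ins for the real pipeline and CXU-LI; the privilege comparison follows the usual priv-spec convention that csr[9:8] gives the lowest privilege level allowed and csr[11:10] == 0b11 means read-only.

```python
class IllegalInstruction(Exception):
    """Stand-in for raising an illegal instruction exception."""


def check_csr_access(addr, hart_priv, writes):
    """CPU-side check, performed before any CXU sees the access."""
    if hart_priv < ((addr >> 8) & 0x3):
        raise IllegalInstruction(f"insufficient privilege for CSR {addr:#05x}")
    if writes and ((addr >> 10) & 0x3) == 0x3:
        raise IllegalInstruction(f"write to read-only CSR {addr:#05x}")


def cpu_csr_access(addr, hart_priv, writes, forward_to_cx):
    """Check first, then forward to the selected CX and state context.

    forward_to_cx is a hypothetical callback modeling the CXU-LI request.
    """
    check_csr_access(addr, hart_priv, writes)
    return forward_to_cx(addr)
```

With this split, a CXU never has to reason about privilege modes: it only ever sees accesses the CPU has already admitted.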

grayresearch commented 7 months ago

CXU-LI changes for CX CSRs continued

It is now time to study the nitty-gritty LUT overhead of adding CX CSRs to CXU-LI and pick something.

First let's compare and contrast cx_reg, cx_imm, addi, csrrw, and csrrwi:

cf_id[9:3] rs2[4:0] rs1[4:0] cf_id[2:0] rd[4:0] custom0[6:0]   cx_reg
imm[7:0] cf_id[3:0] rs1[4:0]   000[2:0] rd[4:0] custom1[6:0]   cx_imm
imm[11:0]           rs1[4:0] func3[2:0] rd[4:0] op_imm_[6:0]   addi (I-type)
csr[11:0]           rs1[4:0] func3[2:0] rd[4:0] special[6:0]   csrrw
csr[11:0]          uimm[4:0] func3[2:0] rd[4:0] special[6:0]   csrrwi

It certainly seems the present cx_imm custom1 opcode design is a mistake. It is irregular compared with addi, and all that irregularity buys is a 4-bit cf_id[3:0] instead of the 3-bit cf_id that an addi/I-type func3 field would provide. The comments in the spec note: "This new, irregular immediate field encoding may have a disproportionate impact on area and critical path delay in the decode or execute pipeline stages of a RISC-V processor core." Also, "Seven-eighths of the custom-1 encoding space is reserved for future custom function instruction encodings." That is less of a concern now that mcx_selector.version exists to gracefully allow future custom instruction encodings.

There is nothing special about the current cx_imm encoding. It was decided arbitrarily during a 2020-21 design meeting, without full consideration of the cost of the irregularity or of the new requirement to HW-frugally support CX CSR accesses.

An I-type cx_imm supports only 8 CF_IDs per CX, but the present irregular cx_imm supports only 16. Either way, you must resort to cx_reg when you need more than a 3b or 4b CF_ID.

So for starters, let's assume we change cx_imm so it follows the same layout as addi, with the cf_id[2:0] supplied by func3[2:0]. Now let's recap our table:

cf_id[9:3] rs2[4:0] rs1[4:0] cf_id[2:0] rd[4:0] custom0[6:0]   cx_reg
imm[11:0]           rs1[4:0] cf_id[2:0] rd[4:0] custom1[6:0]   cx_imm (I-type)
csr[11:0]           rs1[4:0] func3[2:0] rd[4:0] special[6:0]   csrrw
csr[11:0]          uimm[4:0] func3[2:0] rd[4:0] special[6:0]   csrrwi

Here we see that like addi, this new I-type cx_imm takes the 12b immediate in insn[31:20] and sign-extends and muxes it into the second ALU operand register. This same value becomes the CXU-LI req_data1[] operand. Since we're already doing that, and since the CSR address csr[11:0] of CSRR[WSC][I] instructions is also at insn[31:20], it follows we can convey the CSR address to the CPU's CXU-LI request port req_data1[] for zero additional LUTs. (!!!)
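
To make the field overlap concrete, the Python sketch below encodes a csrrw and a cx_imm in the proposed I-type layout and extracts insn[31:20] from both. The helper names are hypothetical; the opcode values used are the standard SYSTEM (0x73) and custom-1 (0x2B) major opcodes, and the cx_imm operands are arbitrary example values.

```python
SYSTEM = 0x73    # major opcode of the CSR access instructions
CUSTOM_1 = 0x2B  # custom-1 major opcode, used here for the I-type cx_imm


def encode_itype(opcode, rd, func3, rs1, imm12):
    """Pack an I-type style word: imm/csr[11:0] rs1 func3 rd opcode."""
    return (((imm12 & 0xFFF) << 20) | (rs1 << 15) | (func3 << 12)
            | (rd << 7) | opcode)


def fields(insn):
    """Split a 32-bit instruction word into its I-type fields."""
    return {
        "opcode": insn & 0x7F,
        "rd": (insn >> 7) & 0x1F,
        "func3": (insn >> 12) & 0x7,
        "rs1": (insn >> 15) & 0x1F,
        "imm_or_csr": (insn >> 20) & 0xFFF,  # insn[31:20] in both cases
    }


# csrrw x5, 0x800, x6 (func3 = 0b001 selects csrrw)
csrrw = encode_itype(SYSTEM, rd=5, func3=0b001, rs1=6, imm12=0x800)
# Proposed I-type cx_imm with cf_id = 2 carried in the func3 position
cx_imm = encode_itype(CUSTOM_1, rd=5, func3=0b010, rs1=6, imm12=0x800)

print(hex(fields(csrrw)["imm_or_csr"]), hex(fields(cx_imm)["imm_or_csr"]))
```

Both instructions yield their 12-bit field from the same bit positions, which is why one extraction path can serve the immediate and the CSR address alike.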

The second thing the CPU must convey to the CXU request is that the request is a CSR access (either W, S, or C) or is a custom function instruction. Four possible request types (CF, CSRW, CSRS, CSRC) => 2b request type port.

This is a satisfactory encoding. By redefining cx_imm encoding to be more uniform, like addi (I-type), we simplify cx_imm's implementation cost, and support CX CSRs by adding just one CXU request port (two signal bits).

grayresearch commented 6 months ago

On further consideration, we redefine the req_func port's FUNC_ID type to have width CXU_FUNC_ID_W = 1 + CF_ID_W.

When the MSB is 0, req_func conveys a CF_ID.

When the MSB is 1, req_func[1:0] conveys a 2b CSR access type (CSRR, CSRRW, CSRRS, CSRRC). This enables CXUs to implement read-only vs. read-write/set/clear CX CSR access semantics.

In all, this adds only 1 bit of new control signal, minimizing impact across CXU interconnects, etc.
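
A minimal sketch of the widened req_func, in Python. It assumes CF_ID_W = 10 (from cf_id[9:0] in the cx_reg layout) and assumes the 2b access type is numbered in the order listed above (CSRR = 0, CSRRW = 1, CSRRS = 2, CSRRC = 3); neither the numbering nor the function names are fixed by this comment.

```python
CF_ID_W = 10                 # assumed from cf_id[9:0] in the cx_reg layout
CXU_FUNC_ID_W = 1 + CF_ID_W  # one extra MSB distinguishes CSR accesses
MSB = 1 << CF_ID_W

# Assumed 2b encoding, in the order the access types are listed above.
CSR_ACCESS = {"CSRR": 0, "CSRRW": 1, "CSRRS": 2, "CSRRC": 3}
CSR_ACCESS_NAMES = {code: name for name, code in CSR_ACCESS.items()}


def encode_cf(cf_id):
    """MSB = 0: req_func carries a CF_ID."""
    if not 0 <= cf_id < MSB:
        raise ValueError("cf_id out of range")
    return cf_id


def encode_csr(access):
    """MSB = 1: req_func[1:0] carries the CSR access type."""
    return MSB | CSR_ACCESS[access]


def decode(req_func):
    """Return ('csr', access_name) or ('cf', cf_id)."""
    if req_func & MSB:
        return ("csr", CSR_ACCESS_NAMES[req_func & 0x3])
    return ("cf", req_func & (MSB - 1))
```

A CXU that implements no CX CSRs only needs to test the MSB and report an error when it is set.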
