grayresearch / CX

Proposed RISC-V Composable Custom Extensions Specification
Apache License 2.0

Dynamic CFU dispatch #13

Closed gatecat closed 10 months ago

gatecat commented 2 years ago

I am working with Dirk Koch's group on eFPGA fabrics, and one of the key use cases we have in mind is dynamically adding instructions to CPUs, and I think it would be useful to make sure that's considered in the specification design. Actually, the plan we roughly have come up with is fairly similar to the "not recommended" idea here :)

Half-baked idea (not recommended): Imagine a dynamic facility by which any arbitrary instruction word, not just custom-0/-1/-2 format instructions, may be a CF instruction, issued to a CFU. This might be a table of (mask,pattern) tuples, or a 32-bit mcfu_opcodes_mask CSR bit vector of 5-bit major opcodes, identifying instructions to divert to the current CFU. Or perhaps, in the hardware domain, a CPU might first issue each instruction to the current CFU, and only execute the instruction in the CPU if the CFU delegates it back to the CPU.

A match table like this would enable instructions to be trapped early if not currently implemented in any eFPGA slots, and then a trap handler could either decide (e.g. following typical cache type heuristics) to just run a software implementation, or alternatively load a suitable bitstream to implement the instruction into an eFPGA slot, and update the table accordingly.

For our use case, it would also be important to have multiple CFU 'slots' active at once handling different sets of instructions, and thus have 'slot to dispatch to' as an entry in the table (the table would then roughly resemble a ternary CAM). It would also be interesting to have "latency" (perhaps both latency and allowed issue frequency) as dynamically configurable table entries (as well as a 'dynamic' flag for when an instruction does want to use ready/valid handshaking) to keep the amount the eFPGA part has to implement as simple as possible.
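The table described above can be sketched in software as a small priority-ordered ternary match. This is only an illustration under stated assumptions: the names `MatchEntry`, `lookup`, `dispatch`, and the `load_bitstream` callback are hypothetical and do not come from the spec or from any existing implementation, and the miss path stands in for whatever policy a real trap handler would apply.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class MatchEntry:
    """One row of a hypothetical ternary match table."""
    mask: int              # instruction bits that take part in the match
    pattern: int           # required values of the masked bits
    slot: int              # eFPGA slot to dispatch to
    latency: int = 1       # fixed latency in cycles, unused when dynamic is set
    dynamic: bool = False  # True: the slot uses ready/valid handshaking


def lookup(table: List[MatchEntry], inst: int) -> Optional[MatchEntry]:
    """Return the first entry whose masked bits equal its pattern."""
    for entry in table:
        if (inst & entry.mask) == entry.pattern:
            return entry
    return None


def dispatch(
    table: List[MatchEntry],
    inst: int,
    load_bitstream: Optional[Callable[[int], Optional[MatchEntry]]] = None,
) -> Tuple[str, Optional[int]]:
    """Model of the hit path and of the trap-handler miss path."""
    entry = lookup(table, inst)
    if entry is not None:
        return ("slot", entry.slot)
    # Miss: try to load an implementation and record it in the table.
    if load_bitstream is not None:
        new_entry = load_bitstream(inst)
        if new_entry is not None:
            table.append(new_entry)
            return ("slot", new_entry.slot)
    # Otherwise fall back to a software implementation.
    return ("emulate", None)
```

For example, an entry with `mask=0x707F` and `pattern=0x000B` matches custom-0 instructions whose funct3 field is 0. A miss with no loader returns `("emulate", None)`, and a miss with a loader appends the returned entry so that the next lookup hits.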

tcal-x commented 2 years ago

Hi @gatecat, just some random comments. At first this reminded me of Mike Wirthlin's "DISC", https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.1186.

We also used something like this compiling to Garp (my work back at Berkeley), although there the unit of acceleration was a complete extracted & accelerated loop, rather than something like a complex instruction. Its key for identifying configurations was their memory address (where the configurations were stored). It had a configuration cache, so if it saw that the correct configuration was already present, it could rapidly switch over.

You've probably talked with Charles and/or looked at the CfuPlugin code and seen that it can be configured to arbitrarily match bits in the instruction to send to the CFU interface, not just the CUSTOM0/1/2 opcodes. Here we match CUSTOM0 (7'b0001011), and matching CUSTOM1 (7'b0101011) is commented out: https://github.com/litex-hub/pythondata-cpu-vexriscv/blob/master/pythondata_cpu_vexriscv/verilog/src/main/scala/vexriscv/GenCoreDefault.scala#L251-L264 . You could for example send all Vector opcodes to the CFU interface.
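As a rough illustration of opcode-based matching, the sketch below models the 32-bit bit vector of 5-bit major opcodes mentioned in the quoted "half-baked idea". It is not the CfuPlugin logic; the helper names `major_opcode`, `opcodes_mask`, and `diverts_to_cfu` are made up for this example. The CUSTOM0 and CUSTOM1 encodings are the ones given above, and OP-V is the major opcode for vector arithmetic instructions (vector loads and stores use other major opcodes, so sending all vector instructions to the CFU would need more bits set).

```python
CUSTOM0 = 0b0001011
CUSTOM1 = 0b0101011
OP_V = 0b1010111  # vector arithmetic major opcode


def major_opcode(inst: int) -> int:
    """Bits [6:2] of a 32-bit instruction word."""
    return (inst >> 2) & 0x1F


def opcodes_mask(*opcodes7: int) -> int:
    """Build a 32-bit vector with one bit per selected 5-bit major opcode."""
    mask = 0
    for op in opcodes7:
        if (op & 0b11) != 0b11:
            raise ValueError("not a 32-bit instruction opcode")
        mask |= 1 << ((op >> 2) & 0x1F)
    return mask


def diverts_to_cfu(mask: int, inst: int) -> bool:
    """True if the instruction's major opcode is selected in the mask."""
    if (inst & 0b11) != 0b11:
        return False  # compressed encodings are not covered by this sketch
    return bool((mask >> major_opcode(inst)) & 1)
```

With `opcodes_mask(CUSTOM0)` only bit 2 is set, so a custom-0 instruction is diverted while an ordinary `add` (major opcode OP) stays in the CPU.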

The approach in CoreV Extension Interface is interesting in that it leaves the decoding to each extension/CFU -- the CPU sends any unrecognized instruction to the extension interface, where "pre-decoders" can indicate "I can take that instruction!". (I just noticed that their docs have been updated so I can't find the picture that I had in mind.)

I think picorv32's PCPI has a similar mechanism.

There are parts of this spec that I think would sync with what you propose, in particular, that the system provides a mapping from the globally unique ID of the desired CFU to its current index (slot). I'm not sure if we allowed the possibility that this mapping could change throughout program execution.

grayresearch commented 1 year ago

Hi @gatecat and @tcal-x. Defrosting an old thread (sorry for the neglect). Thank you for the use cases, suggestions, and ideas. I really like this general idea. I am just not sure about the impact on complexity. We are trying to achieve the simplest SW-HW-HW system that achieves robust composition of mainstream use cases.

I think there are several asks here:

  1. dynamic configuration of CFUs in a running system, over time -- not much anticipated or explored in the spec to date. But IMO should not be too onerous assuming work in (software layers) context switching logic and the custom interface runtime and (hardware layers) reconfigurable Mux CFUs.

  2. in response to issuing a not-implemented custom instruction, a means to transparently handle a CF_ID error (or a CFU_ID error), instead of flagging an error in cfu_status (which currently must then be inspected by software). 2a. Handling could involve reconfiguring the configured CFUs of the system and retrying the instruction. 2b. Handling could involve dispatch to a software emulation routine.

  3. A "multiple slots" means to multiplex CF instructions to multiple CFUs w/o repeatedly prefixing the instructions with mcfu_selector loads of one CI at a time.

I'm not sure how to do 2 and 3 and keep the spec relatively simple. For item 3, for which I've had at least one other request, one element could be multiple mcfu_selector CSRs. But then where does a custom function instruction "go" when two custom interfaces are selected and both implement that instruction (and how do you tell)?

We are trying to enable arbitrary robust composition of separately specified, implemented, and versioned interfaces, so we do not want to have to say "I'm using instruction custom-0/cf_id=5 for something -- hey is anyone else using that CF_ID?"

Another way to solve this within one organization / assigned-numbers-authority is for their catalog of custom interfaces and their corresponding CFUs to play well with one another -- 100% disjoint CF_IDs -- and then to define a union-of-CIs CI for these. Selecting this union-CI would select one CFU_ID, hence one configured CFU, but that CFU would be a kind of Mux CFU, provided by that organization, that delegates CFU requests to its subordinate CFUs, with privately enumerated sub-CFU_IDs. Sort of a "NAT" (yes, I do mean network address translation) model for such composite CFUs. But this only works when the union of CIs affords a unique mapping from CF_ID to respective CFU_ID.

In this scenario note that CF instructions issued on this union-of-CIs CI may fail with bad CF_ID if the final leaf CFU is not configured, but we allowed that already for so-called "configured interface subsets" (see spec).

Additional work in the union-of-CIs Mux CFU would be required for stateful custom interfaces so that system software (oblivious to these shenanigans) still achieves robust context switching etc.
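The union-of-CIs idea can be modelled in a few lines, purely as a thought aid. Everything here is an assumption made for illustration: the class `MuxCfu`, the exception `BadCfId`, and the way sub-CFU_IDs are assigned by enumeration are hypothetical, and the sketch covers stateless interfaces only, so the context-switching work mentioned above is not represented.

```python
from typing import Callable, Dict, Set


class BadCfId(Exception):
    """Raised when no configured leaf CFU implements the CF_ID."""


class MuxCfu:
    """Toy model of a Mux CFU that delegates by CF_ID to private sub-CFUs."""

    def __init__(self, interfaces: Dict[str, Set[int]]):
        # Map every CF_ID to a privately enumerated sub-CFU_ID and
        # reject catalogs whose CF_IDs are not 100% disjoint.
        self.route: Dict[int, int] = {}
        for sub_id, (name, cf_ids) in enumerate(interfaces.items()):
            for cf_id in cf_ids:
                if cf_id in self.route:
                    raise ValueError(f"CF_ID {cf_id} of {name} is already taken")
                self.route[cf_id] = sub_id
        self.leaves: Dict[int, Callable[[int, int, int], int]] = {}

    def configure(self, sub_id: int, leaf: Callable[[int, int, int], int]) -> None:
        """Mark a leaf CFU as currently configured."""
        self.leaves[sub_id] = leaf

    def request(self, cf_id: int, a: int, b: int) -> int:
        """Delegate a CFU request, or fail with a bad CF_ID."""
        sub_id = self.route.get(cf_id)
        leaf = self.leaves.get(sub_id) if sub_id is not None else None
        if leaf is None:
            raise BadCfId(cf_id)
        return leaf(cf_id, a, b)
```

Constructing the mux from two interfaces that both claim the same CF_ID raises `ValueError`, which is the "unique mapping" precondition; a request for a CF_ID whose leaf has not been configured raises `BadCfId`, mirroring the "configured interface subsets" behaviour.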

Bottom line, I am sympathetic to and enthusiastic about your use cases, but pending a simple clean mechanism, I prefer to keep the spec as is (as simple as possible) by requiring that application software check whether a selected CI is configured in the system, and by having the application software explicitly multiplex (via repeated mcfu_selector loads) the one current custom interface and interface state context at any given time.

What do you think? Thank you.