grayresearch / CX

Proposed RISC-V Composable Custom Extensions Specification

How does mcfu_selector disable the custom interface multiplexing? #14

Closed littlezpf666 closed 9 months ago

littlezpf666 commented 1 year ago

According to the spec, a CSR write to mcfu_selector provides the CFU_ID and STATE_ID that following custom function instructions use to multiplex the CFU: "When en=0, disable custom interface multiplexing. The cfu_id and state_id fields are ignored. No CFU is selected. Custom-0, custom-1, or custom-2 instructions execute the CPU's built-in custom instructions." Two things confuse me. First, what does "the cfu_id and state_id fields are ignored" mean? Will a following custom instruction be unable to assert the request valid signal when the CPU detects that the en field is not asserted? In other words, in terms of the hardware CFU interface, how does a disabled mcfu_selector prevent later custom instructions from selecting a CFU? Second, what does "custom instructions execute the CPU's built-in custom instructions" mean? What are the CPU's built-in custom instructions? Are they software instructions called by a runtime function? Does it mean that the CPU must check the en field of mcfu_selector before executing a CF, to decide what it should execute? I'm trying to implement the CFU-LI in a RISC-V soft core processor. I would be very grateful for a reply. @grayresearch

grayresearch commented 1 year ago

Thank you for your question. The mcfu_selector.en field determines whether custom interface multiplexing and CFU request/response dispatch are enabled.

When .en=0, custom interface multiplexing is not enabled. The behavior of custom-0/1/2 instructions is then (per CPU core) implementation defined: in response to a custom instruction, a CPU core may take an illegal instruction exception or may perform some other (non-CFU-mediated) custom instruction behavior. In any case, in this .en=0 mode, issuing custom instructions will not result in issuing a CFU request to any CFU.

For your second question: when mcfu_selector.en=1, a CPU that supports the custom interface multiplexing functionality in the spec, including mcfu_selector, performs a custom instruction by sending a CFU request to that system's configured DAG of CFUs and later receiving a CFU response. So yes, the CPU inspects the mcfu_selector.en field and the mcfu_selector.cfu_id field to determine which CFU, if any, receives the CFU request.
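
To make that concrete from the software side, here is a minimal, non-normative sketch. The CSR address, the field layout (en assumed in bit 31, state_id and cfu_id assumed in the low bits), and the .insn encoding are all illustrative assumptions, not the spec's normative values:

```c
/* Non-normative sketch: the CSR address, field layout, and .insn encoding
 * below are illustrative assumptions, not the spec's normative values. */

#define MCFU_SELECTOR_CSR "0xBC0"   /* hypothetical CSR address */

/* Enable custom interface multiplexing and select a CFU and state context. */
static inline void cfu_select(unsigned cfu_id, unsigned state_id)
{
    unsigned long sel = (1ul << 31)                      /* en = 1 (assumed bit) */
                      | ((unsigned long)state_id << 16)  /* assumed field position */
                      | cfu_id;
    __asm__ volatile ("csrw " MCFU_SELECTOR_CSR ", %0" : : "r"(sel));
}

/* Disable multiplexing: cfu_id/state_id are ignored and custom-0/1/2
 * revert to the core's built-in behavior (or trap). */
static inline void cfu_disable(void)
{
    __asm__ volatile ("csrw " MCFU_SELECTOR_CSR ", zero");
}

/* Issue one register-to-register custom function instruction on the
 * custom-0 opcode (0x0B); the funct3/funct7 values here are arbitrary. */
static inline long cf_op(long a, long b)
{
    long r;
    __asm__ volatile (".insn r 0x0B, 0, 3, %0, %1, %2"
                      : "=r"(r) : "r"(a), "r"(b));
    return r;
}
```

With something like this, software selects a CFU and state context once with cfu_select() and then issues cf_op() as often as needed; cfu_disable() returns the custom opcodes to the core's built-in behavior.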

littlezpf666 commented 1 year ago

Thanks for your quick reply. I get it: if mcfu_selector.en=0, the CPU will not assert the CFU request valid signal and will instead behave as you describe, which means the custom instruction is not executed as a CFU custom function. But I still don't understand what executing the CPU's built-in custom instructions means. From my perspective, built-in custom instructions are the macro code that wraps the custom instruction. Given that the CPU does not issue a CFU request, how can it execute built-in custom instructions? Are the CPU's built-in custom instructions pure software functions that implement the same functionality as the custom instruction? Could you explain further? Thanks.

grayresearch commented 1 year ago

Sure. There are preexisting CPU cores (pre-CFUs) that already implement some built-in custom instructions of their own, without resorting to any CFUs. See, for example, the PULP Platform Snitch core (https://pulp-platform.github.io/snitch/rm/custom_instructions/), which already has a variety of custom instructions in the custom-1 space.

Could a hypothetical Snitch++ CPU core also coexist with CFUs? Yes. How can Snitch custom instructions coexist with CFU custom function instructions? Via mcfu_selector.en=0/1.

When software disables custom interface multiplexing (mcfu_selector.en=0), custom-1 opcode instructions execute as the Snitch custom instructions (or raise illegal instruction exceptions if there is no such Snitch instruction).

When software enables custom interface multiplexing (mcfu_selector.en=1), custom-1 opcode instructions execute as CF instructions, i.e., they issue CFU requests to the selected CFU and state context.
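
A behavioral sketch of that decode-time decision, in C rather than HDL (the field names and the two stand-in functions are illustrative, not from the spec):

```c
/* Behavioral model of custom-opcode dispatch when the CPU supports
 * custom interface multiplexing. C pseudocode, not HDL; names are
 * illustrative assumptions. */
#include <stdlib.h>

typedef struct {
    unsigned en;        /* mcfu_selector.en       */
    unsigned cfu_id;    /* mcfu_selector.cfu_id   */
    unsigned state_id;  /* mcfu_selector.state_id */
} mcfu_selector_t;

/* Stand-in for the core's own built-in custom behavior (e.g. Snitch
 * instructions); here modeled as an illegal-instruction trap. */
static long built_in_custom_or_trap(unsigned func, long rs1, long rs2)
{
    (void)func; (void)rs1; (void)rs2;
    abort();
}

/* Stand-in for the CFU-LI path: send a request to (cfu_id, state_id),
 * return the CFU response. Placeholder arithmetic only. */
static long cfu_request(unsigned cfu_id, unsigned state_id,
                        unsigned func, long rs1, long rs2)
{
    (void)cfu_id; (void)state_id; (void)func;
    return rs1 + rs2;
}

long execute_custom_opcode(mcfu_selector_t sel, unsigned func, long rs1, long rs2)
{
    if (!sel.en)
        /* en=0: no CFU request is ever issued; built-in behavior or trap */
        return built_in_custom_or_trap(func, rs1, rs2);

    /* en=1: dispatch as a CFU request to the selected CFU and state context */
    return cfu_request(sel.cfu_id, sel.state_id, func, rs1, rs2);
}
```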

As an aside, note that by design the scope of custom function instructions is limited to possibly-private-stateful sequences of operations against the integer register file. Some custom extensions, such as Snitch's stream semantic registers, which directly access memory, cannot currently (spec v.1) be expressed as a custom interface of custom function instructions nor encapsulated in a CFU. So even with custom interface multiplexing and CFUs, there will always be some custom instructions that are out of scope but can still be supported via the .en=0 mechanism.

littlezpf666 commented 1 year ago

So, this selector.en field gives the CPU the ability to execute other custom instruction schemes and compensates for the shortcomings of the CFU approach at the present stage.

As you said, the CFU's access scope is limited to the CPU register file. I understand this improves the reliability of the CFU scheme, and some other custom instruction schemes, such as ARM's, also take this approach.

But recently I noticed that if you want to execute one CF, you must add at least one instruction to load data from memory and one instruction to store the result to memory. I am not sure whether this reduces the acceleration efficiency, and to what extent. I've considered making all the source data immediates, but as mentioned in the spec, that may become a critical path in the CPU.

Could you evaluate the speed loss resulting from the data exchange between memory and registers, or give me some recommendations for reducing this efficiency loss?

grayresearch commented 1 year ago

It is not correct that the composable custom extensions and CFU spec requires memory accesses. The extra cost of custom interface multiplexing is only the CSR write to mcfu_selector, and that is amortized across the many custom function instructions that follow. Otherwise the performance of custom function instructions can be the same as integer ALU operations, sourcing operands from and writing result data back to the register file.
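
As a hedged illustration of that amortization, reusing the hypothetical cfu_select()/cf_op() helpers from the earlier sketch: the selector CSR write happens once outside the hot loop, and each iteration then issues a register-to-register CF instruction, just like an ordinary ALU op.

```c
/* Hypothetical helpers from the earlier sketch: select a CFU/state context
 * and issue one register-to-register custom function instruction. */
void cfu_select(unsigned cfu_id, unsigned state_id);
long cf_op(long a, long b);

void apply_cf_to_array(long *dst, const long *src, long k, int n)
{
    cfu_select(2, 0);                /* one-time CSR write, amortized over the loop */
    for (int i = 0; i < n; i++) {
        /* the load/store here belong to the workload itself, not to the
         * multiplexing mechanism; the CF only reads rs1/rs2 and writes rd */
        dst[i] = cf_op(src[i], k);
    }
}
```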

Note that in some use cases, since the spec supports stateful interfaces, functions, and CFUs, you can implement a stateful accelerator that reduces CPU memory access traffic. For example, a multiply-accumulate (MAC) custom function instruction could keep the accumulator value as CFU state, so it need not be read from and written to the register file for each MAC instruction issued. As another example, a matrix multiply accelerator could hold row and column vectors as state (loaded into the accelerator by custom function instructions such as "set-vector-element xyz") and then perform the N^2 multiplies of each row by each column on those state elements without further CPU memory accesses.
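
A hedged sketch of the MAC pattern from the software side: the mac_clear/mac_step/mac_read wrappers and their encodings below are hypothetical, and the accumulator lives in the selected CFU state context rather than in a CPU register.

```c
/* Hypothetical stateful MAC custom functions; the custom-0 opcode use and
 * the funct7 values 8/9/10 are arbitrary choices for this sketch. */
static inline void mac_clear(void)           /* CF: acc = 0 (in CFU state) */
{
    __asm__ volatile (".insn r 0x0B, 0, 8, x0, x0, x0");
}

static inline void mac_step(long a, long b)  /* CF: acc += a * b */
{
    __asm__ volatile (".insn r 0x0B, 0, 9, x0, %0, %1" : : "r"(a), "r"(b));
}

static inline long mac_read(void)            /* CF: return acc */
{
    long r;
    __asm__ volatile (".insn r 0x0B, 0, 10, %0, x0, x0" : "=r"(r));
    return r;
}

long dot_product(const long *x, const long *y, int n)
{
    mac_clear();
    for (int i = 0; i < n; i++)
        mac_step(x[i], y[i]);   /* no per-iteration accumulator load/store */
    return mac_read();
}
```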

There is a vast design space of possible accelerator categories, including stateless, stateful scalar regs, vector regs, accessing memory or not, pure compute, or also control/branching, async, request/completion queues, etc. This specification addresses one modest corner of that space. It is intentionally scoped to enable composition of separately authored, possibly stateful, ALU-like custom function instructions and nothing more.

If you have a workload where all the data resides in memory and you need to issue an accelerated, possibly autonomous computation against that data, the proposed extension interfaces may not be appropriate for your application.