grayresearch / CX

Proposed RISC-V Composable Custom Extensions Specification
Apache License 2.0

How to improve the efficiency of CF using limited integer register? #15

Closed littlezpf666 closed 9 months ago

littlezpf666 commented 1 year ago
    It is not correct that the composable custom extensions and CFU spec require memory accesses. The only extra cost of custom interface multiplexing is the CSR write to mcfu_selector, which is amortized across many custom function instructions. Otherwise the performance of custom function instructions can be the same as that of integer ALU operations, sourcing operands from and writing results back to the register file.

Note, in some use cases, since the spec supports stateful interfaces, functions, and CFUs, you can implement a stateful accelerator which can reduce CPU memory access traffic. For example, a multiply accumulate (MAC) custom function instruction could keep the accumulator value as CFU state so it need not be read from and written to the register file for each MAC instruction issued. For another example, a matrix multiply accelerator could have row and column vectors as state (loaded into the accelerator by custom function instructions "set-vector-element xyz" etc.) but then perform the N^2 multiplies of each row by each column on these elements of its state without further CPU memory accesses.
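To make the MAC example concrete, here is a minimal software model in C. It is only a sketch of the idea, not anything from the spec: the type `mac_cfu_t`, the functions `mac_clear`, `mac_step`, `mac_read`, and `dot`, and the 64-bit accumulator width are all assumptions made for illustration. Each function stands in for one custom function instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a stateful MAC CFU: the accumulator lives
 * inside the CFU, not in the integer register file. */
typedef struct {
    int64_t acc; /* CFU-internal state (width chosen arbitrarily here) */
} mac_cfu_t;

/* Models a "clear accumulator" custom function instruction. */
static void mac_clear(mac_cfu_t *cfu) { cfu->acc = 0; }

/* Models a MAC custom function instruction: two register operands
 * go in, and the sum is updated inside the CFU. */
static void mac_step(mac_cfu_t *cfu, int32_t a, int32_t b) {
    cfu->acc += (int64_t)a * (int64_t)b;
}

/* Models a "read accumulator" custom function instruction. */
static int64_t mac_read(const mac_cfu_t *cfu) { return cfu->acc; }

/* Dot product: the running sum is moved to a register only once,
 * after the loop, instead of on every MAC. */
static int64_t dot(mac_cfu_t *cfu, const int32_t *x, const int32_t *y,
                   size_t n) {
    assert(x != NULL && y != NULL);
    mac_clear(cfu);
    for (size_t i = 0; i < n; i++)
        mac_step(cfu, x[i], y[i]);
    return mac_read(cfu);
}
```

In this model, the loop body passes only the two operands per MAC; the accumulator never appears as a source or destination register until `mac_read`.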

There is a vast design space of possible accelerator categories, including stateless, stateful scalar regs, vector regs, accessing memory or not, pure compute, or also control/branching, async, request/completion queues, etc. This specification addresses one modest corner of that space. It is intentionally scoped to enable composition of separately authored, possibly stateful, ALU-like custom function instructions and nothing more.

If you have a workload where all the data resides in memory and you need to issue an accelerated, possibly autonomous computation against that data, the proposed extension interfaces may not be appropriate for your application.

Originally posted by @grayresearch in https://github.com/grayresearch/CFU/issues/14#issuecomment-1356818773

Thanks for your comprehensive explanation. I understand that the computation of a composable custom instruction does not depend on memory accesses: the source operands and the result are in the register file. But I think the transfer of data from memory to registers in preparation for executing a custom instruction, as with an ALU operation, cannot be avoided, because the data is not initially in registers. Besides, it is difficult to use register data directly while also guaranteeing efficient reuse of registers, especially in C, because registers are allocated by the compiler. For example, I use the macro below, from Google's CFU Playground demo, to encapsulate the custom instruction.

    #define opcode_R(opcode, func3, func7, rs1, rs2) \
    ({ \
        register unsigned long result; \
        asm volatile( \
            ".word ((" #opcode ") | \
            (regnum_%[result] << 7) | \
            (regnum_%[arg1] << 15) | \
            (regnum_%[arg2] << 20) | \
            ((" #func3 ") << 12) | \
            ((" #func7 ") << 25));\n" \
            CUSTOM_INSTRUCTION_NOP \
            : [result] "=r" (result) \
            : [arg1] "r" (rs1), [arg2] "r" (rs2) \
        ); \
        result; \
    })

    cfu_result = opcode_R(CUSTOM0, 0, 0, 1, 1);

    9ae: 4785        li   a5,1
    9b0: 4705        li   a4,1
    9b2: 00e7878b    0xe7878b
    9b6: 80be        mv   ra,a5
    9b8: 8706        mv   a4,ra
    9ba: 6785        lui  a5,0x1

This method is friendly to C users. The compiler treats the macro like a function and allocates the argument registers for the custom function, which avoids register allocation conflicts. But every time the custom function is called, the compiler generates extra load and store instructions. I was wondering whether embedding an entire block of assembly code for the composable custom instructions in the C code would make more efficient use of the limited integer registers, but would it make software development more difficult?
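As a sanity check on the listing, the word 00e7878b can be reproduced by packing the R-type fields the same way the `.word` expression does. The sketch below is illustrative only: `encode_r` is a hypothetical helper, not part of CFU Playground, and it assumes CUSTOM0 is the standard RISC-V custom-0 major opcode 0x0b and that a5 and a4 are registers x15 and x14.

```c
#include <assert.h>
#include <stdint.h>

#define CUSTOM0 0x0bu /* assumed: RISC-V custom-0 major opcode */

/* Hypothetical helper: packs R-type fields in the same bit positions
 * the macro's .word expression uses. */
static uint32_t encode_r(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                         uint32_t rd, uint32_t rs1, uint32_t rs2) {
    assert(opcode < 128 && funct3 < 8 && funct7 < 128);
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return opcode            /* bits  6:0  */
         | (rd     << 7)     /* bits 11:7  */
         | (funct3 << 12)    /* bits 14:12 */
         | (rs1    << 15)    /* bits 19:15 */
         | (rs2    << 20)    /* bits 24:20 */
         | (funct7 << 25);   /* bits 31:25 */
}
```

With rd = a5 (15), rs1 = a5 (15), rs2 = a4 (14), and both function fields zero, `encode_r(CUSTOM0, 0, 0, 15, 15, 14)` gives 0x00e7878b, so the custom instruction reads a5 and a4 and writes a5.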

For the stateful example: if I keep a value in the CFU as CFU state, do I need to change mcfu_selector.id every time I use this value in a calculation, so that the CFU knows which state it is currently operating on? In the MAC case, is the number of MAC operations the state ID?

grayresearch commented 9 months ago

Sorry, I do not understand your question. The mcxu_selector.state_id selects the current state context of the hart's current CXU (formerly CFU). The state context can be any data useful to operation of the composable extension / CXU. For example, it may be a vector register file. This can definitely be used to not have to load and reload and reload etc. data from memory to registers to CXU. However, the current CX version one does limit CX custom instructions to access integer registers or the CXU's state context only. A CX custom instruction cannot directly access memory (CX V1).
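One way to picture the state context is the small software model below. It is a sketch under assumptions, not the spec's definition: `cxu_model_t`, `NUM_STATES`, and the `cxu_*` functions are hypothetical names, and a single accumulator stands in for whatever data a real state context holds.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STATES 4 /* hypothetical number of state contexts in this CXU */

typedef struct {
    uint32_t state_id;        /* models mcxu_selector.state_id */
    int64_t  acc[NUM_STATES]; /* one accumulator per state context */
} cxu_model_t;

/* Models the CSR write that selects the current state context. */
static void cxu_select_state(cxu_model_t *cxu, uint32_t state_id) {
    assert(state_id < NUM_STATES);
    cxu->state_id = state_id;
}

/* Models a MAC custom instruction acting on the selected context. */
static void cxu_mac(cxu_model_t *cxu, int32_t a, int32_t b) {
    cxu->acc[cxu->state_id] += (int64_t)a * (int64_t)b;
}

/* Models a "read accumulator" custom instruction for the selected context. */
static int64_t cxu_read_acc(const cxu_model_t *cxu) {
    return cxu->acc[cxu->state_id];
}
```

In this model the state ID names which context is in use; it is not a count of MAC operations. The selector is written when switching contexts, and any number of MACs can then run against the selected context without touching it again.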

Referring to your example above, it is not necessary for every invocation of a CX custom instruction to perform explicit RISC-V loads and stores and/or the various li's etc. in your example. That is the consequence of the way you compile the CX custom instructions into your code.
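As a rough illustration of an alternative way to compile such an instruction, the sketch below wraps it in an inline function so the compiler can keep operands in whatever registers they already occupy. Everything here is an assumption for illustration: `cx_op` and `cx_reduce` are hypothetical names, the instruction is taken to be custom-0 with funct3 = 0 and funct7 = 0, the GNU assembler's `.insn r` directive is used in place of the `.word` expression, and the non-RISC-V branch is an arbitrary stand-in (addition) so the code also builds on a host machine.

```c
#include <stdint.h>

/* Hypothetical wrapper for one custom-0 R-type instruction. */
static inline unsigned long cx_op(unsigned long a, unsigned long b) {
    unsigned long result;
#if defined(__riscv)
    /* .insn r opcode, funct3, funct7, rd, rs1, rs2: the compiler
     * chooses the registers for %0, %1, %2. */
    __asm__ volatile(".insn r 0x0b, 0, 0, %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
#else
    /* Host stand-in only; the real operation is defined by the CFU. */
    result = a + b;
#endif
    return result;
}

/* Chains the custom instruction over an array. With optimization
 * enabled, r can stay in a register across iterations, so each step
 * needs one load for v[i] and no li/mv shuffling of constants. */
static unsigned long cx_reduce(const unsigned long *v, unsigned n) {
    unsigned long r = 0;
    for (unsigned i = 0; i < n; i++)
        r = cx_op(r, v[i]);
    return r;
}
```

Whether the extra instructions disappear depends on the compiler and its optimization level, which this sketch does not control; inspecting the disassembly, as in the listing above, is the way to confirm it.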

I am going to close this issue. If you have a follow-on question, please open a new issue.