stnolting / neorv32

:desktop_computer: A small, customizable and extensible MCU-class 32-bit RISC-V soft-core CPU and microcontroller-like SoC written in platform-independent VHDL.
https://neorv32.org
BSD 3-Clause "New" or "Revised" License

[idea] Add indirect CSR accesses (Smcsrind ISA extension)? #768

Closed stnolting closed 8 months ago

stnolting commented 10 months ago

In a very custom version of the processor I am (mis)using the Smcsrind ISA extension to add further CPU-local hardware accelerators. Even though those accelerators are not linked to the actual CPU pipeline, accesses via (indirect) CSR operations are much faster than if they were memory-mapped (fewer clock cycles and no additional SoC bus traffic/congestion).

The ISA extension is about to be ratified and I am curious what other people are thinking. Maybe this would be a nice additional option to couple custom hardware modules??

What do you think? 🤔

Unike267 commented 9 months ago

Hi @stnolting

I'm currently working on connecting accelerators to NEORV32 as part of an internship in the GDED research group at the University of the Basque Country.

The final goal is to connect to NEORV32 a parametric and configurable hardware circuit which implements multiple activation functions (Sigmoid, Tanh, ReLU, Leaky ReLU, etc.) to accelerate computations in neural network operations.

At the moment I've connected a simple accelerator (multiplier + FIFOs) to NEORV32 in four different ways:

The first goal is to determine which of these methods provides the best results in terms of throughput, latency, etc., and then choose one to connect the configurable hardware circuit to NEORV32.

the Smcsrind ISA extension to add further CPU-local hardware accelerators.

I don’t know much about indirect CSR access. However, if you are going to add this functionality to the core, I will be happy to test it.

In a certain sense it seems to me to be similar to CFS, since CFS uses the associated CSR registers, although I do not know if it does so by direct or indirect access.

Can you clarify the difference between the proposed method and CFS?

The ISA extension is about to be ratified and I am curious what other people are thinking. Maybe this would be a nice additional option to couple custom hardware modules??

I would also be interested in comparing the performance of this method with that of a custom instruction via the CFU. At the moment, that is the approach with which I've obtained the best results.

Cheers! :smile:


/cc @umarcor

stnolting commented 9 months ago

Hey @Unike267.

The final goal is to connect to NEORV32 a parametric and configurable hardware circuit which implements multiple activation functions (Sigmoid, Tanh, ReLU, Leaky ReLU, etc.) to accelerate computations in neural network operations.

That sounds really interesting! Can you say something about the actual application of that CNN?

At the moment I've connected a simple accelerator (multiplier + FIFOs) to NEORV32 in four different ways: [...] The first goal is to determine which of these methods provides the best results in terms of throughput, latency, etc., and then choose one to connect the configurable hardware circuit to NEORV32.

Nice! Basically, these are the four "primary" extension options of the core. FYI, we have tried to provide a comparative summary for (most of) these extension options in the user guide: https://stnolting.github.io/neorv32/ug/#_comparative_summary

In a certain sense it seems to me to be similar to CFS, since CFS uses the associated CSR registers, although I do not know if it does so by direct or indirect access.

You are talking about the CFU, right?

The CFU provides 4 CSRs to exchange additional data. However, those are just four registers that can be used directly. With the indirect approach, there would be several "sets" of those interface registers, making data exchanges a little more efficient.

Just have a look at the latest indirect CSR access proposal. The spec. is quite short but illustrates the concepts very well: https://github.com/riscv/riscv-indirect-csr-access/releases/download/v1.0.0-rc8frz/riscv-indirect-csr-access-v1.0.0-rc8frz.pdf
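
The "sets" idea described above can be illustrated with a small host-side model in C. This is only a sketch under stated assumptions, not NEORV32 code: a select register (named after miselect) picks one of several register sets, and a fixed window of six data registers (named after mireg ... mireg6) then aliases the selected set. The type and function names, the number of sets, and the choice to ignore accesses to an unimplemented select value are all hypothetical; real hardware may handle that last case differently.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model only; both sizes are illustrative assumptions. */
#define NUM_SETS     4u  /* hypothetical number of register sets */
#define REGS_PER_SET 6u  /* window registers: mireg, mireg2, ..., mireg6 */

typedef struct {
    uint32_t miselect;                      /* picks the active register set */
    uint32_t regs[NUM_SETS][REGS_PER_SET];  /* storage behind the window */
} indirect_csr_model_t;

/* Model of a write to the select register. */
static void model_write_miselect(indirect_csr_model_t *m, uint32_t sel) {
    m->miselect = sel;
}

/* Model of a write to window register idx (0 = mireg, ..., 5 = mireg6).
 * Writes with an unimplemented select value are ignored (model simplification). */
static void model_write_mireg(indirect_csr_model_t *m, unsigned idx, uint32_t value) {
    assert(idx < REGS_PER_SET);
    if (m->miselect < NUM_SETS) {
        m->regs[m->miselect][idx] = value;
    }
}

/* Model of a read from window register idx; reads 0 for an unimplemented set. */
static uint32_t model_read_mireg(const indirect_csr_model_t *m, unsigned idx) {
    assert(idx < REGS_PER_SET);
    return (m->miselect < NUM_SETS) ? m->regs[m->miselect][idx] : 0u;
}

/* Pass a six-word operand vector to set `sel`:
 * one select write followed by six window writes. */
static void model_send_operands(indirect_csr_model_t *m, uint32_t sel,
                                const uint32_t data[REGS_PER_SET]) {
    model_write_miselect(m, sel);
    for (unsigned i = 0; i < REGS_PER_SET; i++) {
        model_write_mireg(m, i, data[i]);
    }
}
```

In this model the same six window registers reach a different set after each select write, whereas the four CFU CSRs always address the same four registers.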

Unike267 commented 9 months ago

Hi @stnolting

That sounds really interesting! Can you say something about the actual application of that CNN?

Yes, of course, the application is hyper-spectral image processing. In short, we have a Photonfocus camera, integrating imec’s hyperspectral sensor. Each member of the group has their own tasks. Mine, supervised by @umarcor, is to receive the preprocessed information, store it in a buffer such as DDR memory, and use NEORV32 connected to a specific circuit (via CFU) to do image segmentation/classification. At the moment, the purpose of the specific circuit is to calculate the result of the activation function with custom instructions. However, the final goal is to perform all the neural network operations in specific hardware and use NEORV32 to manage the input/output data.

You are talking about the CFU, right?

Yes, my bad.

The CFU provides 4 CSRs to exchange additional data. However, those are just four registers that can be used directly. With the indirect approach, there would be several "sets" of those interface registers, making data exchanges a little more efficient.

Ok, thanks for the clarification.

Just have a look at the latest indirect CSR access proposal. The spec. is quite short but illustrates the concepts very well: https://github.com/riscv/riscv-indirect-csr-access/releases/download/v1.0.0-rc8frz/riscv-indirect-csr-access-v1.0.0-rc8frz.pdf

I will have to check this.

In conclusion, it seems to be an interesting way to connect circuits to NEORV32. I encourage you to implement it.

Cheers! :smiley:

stnolting commented 8 months ago

Sorry for the late response..

Yes, of course, the application is hyper-spectral image processing. In short, we have a Photonfocus camera, integrating imec’s hyperspectral sensor.

That's a cool project! Are you planning a publication about it? :wink:

I will have to check this.

The basic idea was to provide several CSRs for passing wide operand vectors to a hardware accelerator, for example:

uint32_t data[6];

neorv32_cpu_csr_write(mireg,  data[0]);
neorv32_cpu_csr_write(mireg2, data[1]);
neorv32_cpu_csr_write(mireg3, data[2]);
neorv32_cpu_csr_write(mireg4, data[3]);
neorv32_cpu_csr_write(mireg5, data[4]);
neorv32_cpu_csr_write(mireg6, data[5]);

However, my experiments have shown that the overhead is quite high compared to two r5-type CFU instructions. So, for now, I think I'm dropping this idea.

In conclusion, it seems to be an interesting way to connect circuits to NEORV32. I encourage you to implement it.

Thanks for your feedback anyway! :+1: