openhwgroup / cva5

The CORE-V CVA5 is an Application class 5-stage RISC-V CPU specifically targetting FPGA implementations.
Apache License 2.0
57 stars 14 forks source link

Interest in Flopoco-based FPU, Instruction invalidation on self-modifying code or Verilator Improvements? #20

Open ramonwirsch opened 10 months ago

ramonwirsch commented 10 months ago

In my fork, I have added some functionality to CVA5. I have a significantly modified environment in which I test, so cannot simply issue pull-requests right now. Is there interest in somehow getting some or all of these changes upstream? And if so, which?

The main additions are:

CKeilbar commented 10 months ago

I'd like to preface this comment by stating that I'm a Masters student at Simon Fraser University under Dr. Lesley Shannon, whose team created Taiga before it was adopted by OpenHW and renamed to CVA5. My comments reflect my position, not that of OpenHW and I cannot speak for them.

My research focuses on the memory systems, and I am in the process of updating various units (like the icache/dcache/LS unit/arbiter), and creating new ones (prefetchers, L2 cache) that will result in proper support for a coherent multicore system.

ramonwirsch commented 10 months ago

Correct me if I'm wrong, but instruction invalidation on data writes is controlled by the IFENCE instruction from the Zifencei extension.

Yes, that is what the ISA defines. But since you cannot simply clear the entire ICache in a single cycle on FPGA, like the simplest solution the ISA suggests, the instruction would either need to walk the line set of the entire ICache and clear every valid line (same for branch-predictor data) or invalidation needs to be handled concurrently to all memory writes (what my implementation does), so that when an IFENCE occurs, it simply needs to wait for all invalidation queues to run empty. Handling the invalidation concurrently also saves you from clearing entirely untouched data form the cache that the naive solution of flushing the entire cache would have.

That this concurrent invalidation can be turned off and is off by default is not ISA-compliant / an extension to the ISA, but it could still be used in a ISA-compliant implementation of the IFENCE instruction for FPGA. My implementation simply uses software/the CSR register for controlling the invalidation to also wait for all invalidation queues to be flushed. This is what the IFENCE instruction could do in HW (as long as the invalidation has already been active the entire time) to make this compliant.

Because my implementation being active slows down the processor a bit (stalls due to the invalidation queues becoming full, because invalidation is handled with lower priority than lookups) and I know when precisely I need it, the time was not spent to try to further optimize loss of speed caused by stalling on full invalidation queues.

e-matthews commented 10 months ago

I agree with Chris that improvements to the Verilator infrastructure are very much welcome. A few fixes already exist on the accelerators branch (which is intended to make its way to the main branch soon), but any improvements are welcome.

As a side note: the focus of the accelerators branch was to make integration of custom units easier and more self-contained. Its main improvements are:

In terms of the FPU, one of the things that is a goal of mine is to make the support needed for complex custom units (including FPUs) more flexible. With the distributed decode in the accelerators branch, the remaining pieces are: the regfile, writeback (to support FP exception data) and the renamer (I'll take a more detailed look at your regfile changes).

If you're not planning any unique behaviour for your FPU, I think your work and Simon Fraser University's FPU work should be largely compatible (potentially to the point where you could mix and match units). I've taken a quick look at your repo, and at least for decode_and_issue, renamer, registerfile and the FPU units I didn't see anything unexpected.

So long story short, I think your FPU units will be compatible, and if you have any requirements/requests/ideas about how the generic FPU support is handled we can work together on that.

ramonwirsch commented 9 months ago

I saw the restructuring to that distributed code after I had started work on my unit. I did not want to rebase but also figured there would be no performance benefits for me. Good to know that the data cache has improved.

I think I already listed the next improvements to my FPU (using only 2 register ports instead of 3 would require extensions / changes in CVA5. as long as you do not have an FMAC implementation that uses all 3 operands from the start), but at this point it should be good enough for my use case. I am also developing an external hardware accelerator, so one of my interests is to have the various operations be similarly efficient in my accelerator as in the CPU and for both to run at iso-frequency to be better comparable. So these enhancements would likely only make my hardware accelerator less comparable to the CPU (it does not even have FMAC it simply uses the Mul & Add operators that my FMAC is built from separately.)

My changes will probably not achieve the highest frequencies though, since I am being limited by other external components and do not need to be faster than 100 MHz for my use case.

Regarding pulling other changes like my verilator changes in: let me know if you need any of my files that are not inside the CVA5 repo (verilator build scripts, test setup for my interrupt tests etc.). Most files are in a repo similar to the taiga-project repo that includes peripherals, drivers, build scripts, my particular hardware configurations and tests/benchmarks for it. But I am not willing to make the whole repo public as of now, as it also includes my hardware accelerator. So I'd need to either provide those files on their own or move them to other / new repos. But happy to share anything related to CVA5.