Interest in Flopoco-based FPU, Instruction invalidation on self-modifying code or Verilator Improvements?

In my fork, I have added some functionality to CVA5. I test in a significantly modified environment, so I cannot simply issue pull requests right now. Is there interest in getting some or all of these changes upstream? And if so, which?

The main additions are:

a Flopoco-based FPU with customizable pipeline latencies across 4 CVA5 pipeline modules and an FP register file. The FPU supports all single-precision operations, but no subnormals, rounding modes, exceptions or FPU CSR registers yet. Since Flopoco generates VHDL code, there is a Verilator DPI-C implementation for each operator that is a drop-in replacement for the actual Flopoco implementation. Regenerating matching Flopoco implementations from a user-supplied Flopoco binary, instead of using the pre-generated files in my own git submodule, would require additional work, but the Verilator implementation runs out of the box. FP latencies are not yet configurable from the CPU_CONFIG structure, but could easily be, as they are already parameterized inside the pipelines.
- The FP RF is a separate instance of the existing RF (now more parameterizable), with 64 physical registers that are also handled by the renamer. Its registers are 2 bits wider than the GP registers to match Flopoco's wider format. A separate RF also allows for cheaper 3r1w ports independent of the GP RF (although the FP-MAC implementation does not use the 3rd operand simultaneously, so enhanced decode and issue stages could make do with only 2 read ports). I have not investigated the synthesis impact of using a shared pool of physical registers to avoid allocating 2x 64 registers or to reduce the need for separate infrastructure in the renamer.
Optional (build-time and runtime, controlled via CSR) instruction invalidation for all data writes. The instruction cache and branch predictor were not kept coherent with data when it changed, so bootloader functionality was problematic. I have not kept up to date with what CVA5 can do in this regard out of the box (there seems to be some early-branch-flush feature to at least handle this in the predictor). This invalidation can slow down the processor a bit, as each write is signaled to a configurable number of FIFOs to check for needed invalidation. The invalidation is off by default and needs to be enabled via a custom CSR for as long as overwriting existing instructions is possible.
I have reworked the Verilator implementation with new command-line options/parsing and features. Among them:
- Extensible. I can have my own build with different top-level file with more ports by switching out one C file and reusing all the rest
- can terminate on reaching infinite loops (optional) or a user-exit magic-nop
- configurable stall limit (an RTOS with WFI instructions will hit the hard-coded limit very fast)
- UART redirectable to file, including inputs. Can be used with socat to simulate bootloaders communicating over UART with the actual host-side loader tool
- new combined format for memory contents. I have devised a text format that lists an arbitrary number of binary files, each with offsets and ranges from which the actual memory contents and reference contents can be loaded. Verilator can initialize both local memory and DRAM from this format. It is essentially a dumbed-down, human-readable ELF header table, which means that in most cases my tool (written in Kotlin, complex, reads and understands ELF files) just generates this index file, while the actual memory contents are read from the original ELF binary. Additional contents can easily be mixed in or overlaid. Since the format is sparse and supports zero-initialization, it can save a lot of space compared to the existing hex files.
- local memory and DRAM can be initialized separately, even from the existing hw_init hex formats
- use FST format instead of VCD (but configurable at build-time). Much faster and more space-saving
- out-of-tree: I now build Verilator with CMake, which builds faster and is more comfortable; this is also where the Flopoco source files are integrated right now
out-of-tree/WIP: Zephyr port, intended to build a multi-threaded application that manages many things, including bootloading via an additional UART port (supports user mode and UART, but works around the lack of RISC-V PMP, so user mode is not actually isolated by any means)
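
For readers unfamiliar with why the FP registers in the list above are 2 bits wider than the GP registers: as far as I understand FloPoCo's floating-point format, it prepends a 2-bit exception field (00 zero, 01 normal, 10 infinity, 11 NaN) to sign, exponent and fraction, giving 34 bits for single precision. The following is only an illustrative sketch of such a conversion, not code from the fork. The function names are hypothetical, and zeroing the exponent and fraction for infinity and NaN is my own choice, since those fields should not matter in that case.

```python
import struct

W_E, W_F = 8, 23  # single-precision exponent and fraction widths


def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE-754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def ieee_to_flopoco(bits32: int) -> int:
    """Convert a 32-bit IEEE pattern to a 34-bit FloPoCo-style word.

    Assumed layout (MSB first): exn[2] | sign[1] | exponent[8] | fraction[23].
    """
    sign = (bits32 >> 31) & 0x1
    exp = (bits32 >> W_F) & 0xFF
    frac = bits32 & ((1 << W_F) - 1)

    if exp == 0:
        # Zero, or a subnormal flushed to zero (no subnormal support).
        exn, exp, frac = 0b00, 0, 0
    elif exp == 0xFF:
        # Infinity (fraction == 0) or NaN (fraction != 0).
        exn = 0b10 if frac == 0 else 0b11
        exp, frac = 0, 0  # sketch choice: clear the unused fields
    else:
        exn = 0b01  # normal number

    return (exn << (W_E + W_F + 1)) | (sign << (W_E + W_F)) | (exp << W_F) | frac
```

For example, `ieee_to_flopoco(float_to_bits(1.0))` sets the exception field to 01 and leaves the lower 32 bits identical to the IEEE pattern.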

I'd like to preface this comment by stating that I'm a Master's student at Simon Fraser University under Dr. Lesley Shannon, whose team created Taiga before it was adopted by OpenHW and renamed to CVA5. My comments reflect my own position, not that of OpenHW, and I cannot speak for them.

My research focuses on memory systems, and I am in the process of updating various units (like the icache/dcache/LS unit/arbiter) and creating new ones (prefetchers, L2 cache) that will result in proper support for a coherent multicore system.

With regards to the FPU, I have one in the works that is almost ready for merging, outlined in https://ieeexplore.ieee.org/document/10171529/ and will hopefully be published and merged in the coming weeks/months. It supports the FD extensions, including subnormal, exceptions, and the FCSR registers.
Correct me if I'm wrong, but instruction invalidation on data writes is controlled by the FENCE.I instruction from the Zifencei extension. As it stands now, I think this instruction is either not implemented or not fully working in the mainline repo, but I have already addressed this in my private WIP fork.
Those sound like good Verilator changes. I've made some of my own that line up with yours, but I didn't go as far. I don't know if you already knew this, but the accelerators-2023 branch was published with a number of changes/improvements (and I believe the switch to .fst files was one of them).

Correct me if I'm wrong, but instruction invalidation on data writes is controlled by the FENCE.I instruction from the Zifencei extension.

Yes, that is what the ISA defines. But since you cannot simply clear the entire ICache in a single cycle on an FPGA, which is the simplest solution the ISA suggests, the instruction would either need to walk the entire ICache and clear every valid line (same for branch-predictor data), or invalidation needs to be handled concurrently with all memory writes (what my implementation does), so that when a FENCE.I occurs, it simply needs to wait for all invalidation queues to run empty. Handling the invalidation concurrently also avoids clearing entirely untouched data from the cache, which the naive solution of flushing the entire cache would do.
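
To make the idea concrete, here is a minimal behavioural sketch in Python of invalidation that runs alongside data writes, with the fence only waiting for the queue to drain. It is not the actual RTL; the class name, the 32-byte line size and the one-invalidation-per-cycle rate are all assumptions for illustration.

```python
from collections import deque

LINE_BYTES = 32  # assumed cache line size, for illustration only


class ICacheModel:
    """Toy model: tracks which lines are valid and a queue of pending invalidations."""

    def __init__(self):
        self.valid_lines = set()    # line indices currently held in the ICache
        self.inval_queue = deque()  # invalidation requests not yet processed

    def fetch(self, addr):
        # An instruction fetch makes the containing line valid.
        self.valid_lines.add(addr // LINE_BYTES)

    def data_write(self, addr):
        # Every data write is signalled to the invalidation queue.
        self.inval_queue.append(addr // LINE_BYTES)

    def step(self):
        # Process at most one pending invalidation per cycle.
        if self.inval_queue:
            self.valid_lines.discard(self.inval_queue.popleft())

    def fence_i(self):
        # The fence only waits for the queue to run empty; returns cycles waited.
        cycles = 0
        while self.inval_queue:
            self.step()
            cycles += 1
        return cycles
```

In this model, a write to a line that was fetched earlier removes only that line, while lines that were never written stay valid, which is the advantage over flushing the whole cache.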

That this concurrent invalidation can be turned off and is off by default is not ISA-compliant (it is an extension to the ISA), but it could still be used in an ISA-compliant implementation of the FENCE.I instruction on FPGA. My implementation simply uses software and the CSR that controls the invalidation to also wait for all invalidation queues to be flushed. This is what the FENCE.I instruction could do in HW (as long as the invalidation has already been active the entire time) to make this compliant.

Because my implementation slows down the processor a bit while it is active (stalls occur when the invalidation queues become full, because invalidation is handled with lower priority than lookups) and I know precisely when I need it, I have not spent time on further reducing the speed lost to stalling on full invalidation queues.
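
The stall behaviour can be illustrated with a small Python model. This is only a sketch under simplifying assumptions, not the fork's actual arbitration logic: one invalidation drains per cycle in which no lookup uses the port, a stalled pipeline issues no lookups, and `count_stalls` and its trace format are hypothetical.

```python
def count_stalls(trace, depth):
    """Count stall cycles caused by a full invalidation queue.

    trace: iterable of (lookup, write) booleans, one pair per cycle.
    depth: capacity of the invalidation queue.
    """
    pending = 0  # entries currently waiting in the invalidation queue
    stalls = 0
    for lookup, write in trace:
        if not lookup and pending > 0:
            pending -= 1  # lookups have priority; drain only when the port is idle
        if write:
            while pending >= depth:
                stalls += 1   # pipeline stalls, so the port is free to drain
                pending -= 1
            pending += 1      # the write's invalidation request is enqueued
    return stalls
```

With back-to-back writes that coincide with lookups, a shallow queue fills up and every further write costs a stall cycle, while a deeper queue or idle lookup cycles absorb the same traffic without stalling.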

I agree with Chris that improvements to the Verilator infrastructure are very much welcome. A few fixes already exist on the accelerators branch (which is intended to make its way to the main branch soon), but any improvements are welcome.

As a side note: the focus of the accelerators branch was to make integration of custom units easier and more self-contained. Its main improvements are:

A move to a distributed decode organization, such that decode_and_issue.sv does not need to be modified for new units
Improved data cache (lower latency and support for uncacheable regions)
Improved runtime stat collection (for simulation only)

In terms of the FPU, one of the things that is a goal of mine is to make the support needed for complex custom units (including FPUs) more flexible. With the distributed decode in the accelerators branch, the remaining pieces are: the regfile, writeback (to support FP exception data) and the renamer (I'll take a more detailed look at your regfile changes).

If you're not planning any unique behaviour for your FPU, I think your work and Simon Fraser University's FPU work should be largely compatible (potentially to the point where you could mix and match units). I've taken a quick look at your repo, and at least for decode_and_issue, renamer, registerfile and the FPU units I didn't see anything unexpected.

So long story short, I think your FPU units will be compatible, and if you have any requirements/requests/ideas about how the generic FPU support is handled we can work together on that.

I saw the restructuring to distributed decode after I had started work on my unit. I did not want to rebase, and I also figured there would be no performance benefit for me. Good to know that the data cache has improved.

I think I already listed the next improvements to my FPU (using only 2 register ports instead of 3 would require extensions or changes in CVA5, as long as you do not have an FMAC implementation that uses all 3 operands from the start), but at this point it should be good enough for my use case. I am also developing an external hardware accelerator, so one of my interests is for the various operations to be similarly efficient in my accelerator and in the CPU, and for both to run at the same frequency so that they are easier to compare. These enhancements would therefore likely only make my hardware accelerator less comparable to the CPU (it does not even have an FMAC; it simply uses, separately, the Mul and Add operators that my FMAC is built from).

My changes will probably not achieve the highest frequencies though, since I am being limited by other external components and do not need to be faster than 100 MHz for my use case.

Regarding pulling in other changes, like my Verilator changes: let me know if you need any of my files that are not inside the CVA5 repo (Verilator build scripts, test setup for my interrupt tests etc.). Most files are in a repo similar to the taiga-project repo that includes peripherals, drivers, build scripts, my particular hardware configurations and tests/benchmarks for it. I am not willing to make the whole repo public as of now, as it also includes my hardware accelerator, so I'd need to either provide those files on their own or move them to other or new repos. But I am happy to share anything related to CVA5.
