Open ramonwirsch opened 10 months ago
I'd like to preface this comment by stating that I'm a Masters student at Simon Fraser University under Dr. Lesley Shannon, whose team created Taiga before it was adopted by OpenHW and renamed to CVA5. My comments reflect my position, not that of OpenHW and I cannot speak for them.
My research focuses on the memory systems, and I am in the process of updating various units (like the icache/dcache/LS unit/arbiter), and creating new ones (prefetchers, L2 cache) that will result in proper support for a coherent multicore system.
With regards to the FPU, I have one in the works that is almost ready for merging, outlined in https://ieeexplore.ieee.org/document/10171529/ and will hopefully be published and merged in the coming weeks/months. It supports the FD extensions, including subnormal, exceptions, and the FCSR registers.
Correct me if I'm wrong, but instruction invalidation on data writes is controlled by the IFENCE instruction from the Zifencei extension. As it stands now, I think this instruction is either not implemented or not fully working in the mainline repo, but I have already addressed this in my private WIP fork.
Those sound like good Verilator changes. I've made some of my own that line up with yours, but I didn't go as far. I don't know if you already knew this, but the accelerators-2023 branch was published with a number of changes/improvements (and I believe the switch to .fst files was one of them).
Correct me if I'm wrong, but instruction invalidation on data writes is controlled by the IFENCE instruction from the Zifencei extension.
Yes, that is what the ISA defines. But since you cannot simply clear the entire ICache in a single cycle on FPGA, like the simplest solution the ISA suggests, the instruction would either need to walk the line set of the entire ICache and clear every valid line (same for branch-predictor data) or invalidation needs to be handled concurrently to all memory writes (what my implementation does), so that when an IFENCE occurs, it simply needs to wait for all invalidation queues to run empty. Handling the invalidation concurrently also saves you from clearing entirely untouched data form the cache that the naive solution of flushing the entire cache would have.
That this concurrent invalidation can be turned off and is off by default is not ISA-compliant / an extension to the ISA, but it could still be used in a ISA-compliant implementation of the IFENCE instruction for FPGA. My implementation simply uses software/the CSR register for controlling the invalidation to also wait for all invalidation queues to be flushed. This is what the IFENCE instruction could do in HW (as long as the invalidation has already been active the entire time) to make this compliant.
Because my implementation being active slows down the processor a bit (stalls due to the invalidation queues becoming full, because invalidation is handled with lower priority than lookups) and I know when precisely I need it, the time was not spent to try to further optimize loss of speed caused by stalling on full invalidation queues.
I agree with Chris that improvements to the Verilator infrastructure are very much welcome. A few fixes already exist on the accelerators branch (which is intended to make its way to the main branch soon), but any improvements are welcome.
As a side note: the focus of the accelerators branch was to make integration of custom units easier and more self-contained. Its main improvements are:
In terms of the FPU, one of the things that is a goal of mine is to make the support needed for complex custom units (including FPUs) more flexible. With the distributed decode in the accelerators branch, the remaining pieces are: the regfile, writeback (to support FP exception data) and the renamer (I'll take a more detailed look at your regfile changes).
If you're not planning any unique behaviour for your FPU, I think your work and Simon Fraser University's FPU work should be largely compatible (potentially to the point where you could mix and match units). I've taken a quick look at your repo, and at least for decode_and_issue, renamer, registerfile and the FPU units I didn't see anything unexpected.
So long story short, I think your FPU units will be compatible, and if you have any requirements/requests/ideas about how the generic FPU support is handled we can work together on that.
I saw the restructuring to that distributed code after I had started work on my unit. I did not want to rebase but also figured there would be no performance benefits for me. Good to know that the data cache has improved.
I think I already listed the next improvements to my FPU (using only 2 register ports instead of 3 would require extensions / changes in CVA5. as long as you do not have an FMAC implementation that uses all 3 operands from the start), but at this point it should be good enough for my use case. I am also developing an external hardware accelerator, so one of my interests is to have the various operations be similarly efficient in my accelerator as in the CPU and for both to run at iso-frequency to be better comparable. So these enhancements would likely only make my hardware accelerator less comparable to the CPU (it does not even have FMAC it simply uses the Mul & Add operators that my FMAC is built from separately.)
My changes will probably not achieve the highest frequencies though, since I am being limited by other external components and do not need to be faster than 100 MHz for my use case.
Regarding pulling other changes like my verilator changes in: let me know if you need any of my files that are not inside the CVA5 repo (verilator build scripts, test setup for my interrupt tests etc.). Most files are in a repo similar to the taiga-project repo that includes peripherals, drivers, build scripts, my particular hardware configurations and tests/benchmarks for it. But I am not willing to make the whole repo public as of now, as it also includes my hardware accelerator. So I'd need to either provide those files on their own or move them to other / new repos. But happy to share anything related to CVA5.
In my fork, I have added some functionality to CVA5. I have a significantly modified environment in which I test, so cannot simply issue pull-requests right now. Is there interest in somehow getting some or all of these changes upstream? And if so, which?
The main additions are: