FPGA Optimizations - Githubissues

vogelpi commented 4 years ago

This issue for tracking different work items for optimizing Ibex for FPGA deployment. @ganoam will join us in January and as part of his Master thesis look into this. Things to investigate:

[x] Register file: Leverage RAM32M primitives or similar instead of FFs/latches.
[x] ALU: Add a new single-cycle multiplier that makes better use of the available DSP slices. (https://github.com/lowRISC/ibex/issues/527)
[x] CSRs: Investigate the effect of the counter width of performance monitors on resource utilization and timing (64-bit counters are known to map badly to FPGA). (https://github.com/lowRISC/ibex/issues/601), PR to infer DSP slices for performance monitor counters: #624
Further ideas:
- [x] Using DSP slices to implement the bit shifting inside the ALU: #625 Initial explorations have shown only very little or no resource savings at all (depending on tool settings) can be gained. This is not going to be investigated further.

GregAC commented 4 years ago

In addition to looking at Xilinx FPGAs let's investigate fitting an Ibex into the ICE40 family (http://www.latticesemi.com/iCE40). They're popular for open hardware FPGA boards and an open source tool chain exists for them.

I tried a brief experiment targeting the LP8K and it just about fit in terms of LUTs but I terminated it before it finished place & route (I suspect it will fail to fit in the device because of this). I can't remember whether I used the E register file or the full register file and whether I hacked out any CSRs. I'll see if I can find it and I'll report back more details.

Bonus points for getting it working with the IceStorm (http://www.clifford.at/icestorm/) flow.

asb commented 4 years ago

CC @wallento who I think started looking at Ibex+ice40?

This is great as a tracking issue (thanks Pirmin!), but we should move any further discussion of the more concrete steps to linked issues (at least the reg file and MUL).

wallento commented 4 years ago

Hi, yes, I have been starting to port it to the Fomu during the summit, but it did not fit so far. Happy to share what I have done as a patch to the Fomu workshop repository, which is a great starting point. Will do so asap. Cheers

ganoam commented 4 years ago

Hi everyone, Thank you @vogelpi for the introduction. I am looking forward to working with you. see you soon, Noam

ganoam commented 4 years ago

Hi everyone, I have implemented a first version of the new reg file. See Pull Request #555. This is my first PR, so if I did anything wrong, please let me know. cheers, Noam

ganoam commented 4 years ago

Hi everyone, A single cycle multiplier implementation is ready for review in Pull Request #559 . I am looking forward to your feedback! cheers, Noam

ganoam commented 4 years ago

Hi everyone,

In my premature Pull Request, implementing a single cycle multiplier implementation, there has been a discussion weather to implement a full single cycle multiplier, or implementing a version, which would execute MUL instructions in a single cycle and MULH instructions in 2 cycles. It was suggested that the former would significantly increase the critical path while not resulting in significant benefits in terms of IPC.

I have now investigated the issue with an implementation of both versions. I have based the evaluation on two possible core configurations, using @GregAC 's draft PR adding a third-stage (#568 ), once including the new Branch Target ALU and the third pipeline stage and once using the existing 2 stage configuration. The evaluated implementations have been synthesized with the corresponding parameters. Also, since the new multiplier implementation will primarily be targeted to FPGA synthesis, I have included the FPGA register file. The utilization and timing results are listed in the following table:

Core Configuration | Mult. Implementation | Logic LUTs | Critical Path /ns | Max. Freq /MHz
-------------------------------------------------------------------------------------------
2 stage            | Baseline             | 2358       | 14.023            | 71.3
  + FPGA Reg. File | MUL 1cc MULH 1cc     | 2270       | 14.755            | 67.8 (-3.5)
                   | MUL 1cc MULH 2cc     | 2300       | 14.436            | 69.3 (-2.0)
-------------------------------------------------------------------------------------------
3 stage + BT ALU   | Baseline             | 2303       | 15.235            | 65.6
  + FPGA Reg. File | MUL 1cc MULH 1cc     | 2288       | 15.499            | 64.5 (-1.1)
                   | MUL 1cc MULH 2cc     | 2317       | 14.997            | 66.8 (+1.2)
-------------------------------------------------------------------------------------------

Implementation was executed using Vivado's Performance_explore implementation strategy.

For the 2stage configuration we observe a decrease of the maximum frequency for both versions of the new multiplier. However MUL 1cc MULH 2cc yields a 1.5 MHz higher maximum frequency than MUL 1cc MULH 2cc. For the 3 stage core configuration we even observe an increase of the maximum frequency for MUL 1cc MULH 2cc However, IMO, the benefits are pretty small.

I have also run coremark in order to compare the impact of the design decision on the IPC as a measure of performance. The results are listed in the following table (The core configuration has no influence):

 Mult. Implementation | Total Ticks        | Ret. Instructions | IPC
-------------------------------------------------------------------------------
 Baseline             | 416639             | 275768            | 0.662
 MUL 1cc MULH 1cc     | 407243             |                   | 0.677 (+0.015)
 MUL 1cc MULH 2cc     | 407243             |                   | 0.677 (+0.015)
-------------------------------------------------------------------------------

Apparently, in coremark the MULH instruction is not used once. I therefore think that indeed we could implement the additional latency without significant performance loss.

Another consideration brought in by @davideschiavone and @vogelpi, is, if the marginal gain in maximum frequency for MUL 1cc MULH 2cc is worth the additional complexity of the code which has to be maintained. I will attach the two versions of my implementation here, for your consideration.

multdiv_versions.zip

What are your opinions on this? Which route should we go?

Also, how should I integrate the new multiplier into the existing codebase? ( @GregAC has suggested to introduce a SingleCycleMultiply parameter into the ibex_multdiv_fast module. Which I think would be a good idea.)

cheers, Noam

vogelpi commented 4 years ago

Nice work @ganoam ! Thanks for providing all the numbers.

There is no notable difference in terms of resource utilization. As you mentioned, there is also little difference in terms of max clock frequency only. I think the +2.3 MHz we get from the from the "MUL 1cc MULH 2cc" version with the 3-stage pipeline currently are not really worth the code complexity (although the code is also not super complex, it's not too bad actually). However, if I remember correctly the critical path for the "MUL 1cc MULH 1cc" version always goes through one of the DSP slices, whereas for the "MUL 1cc MULH 2cc" version, the critical path was somewhere else. Meaning this could further improve depending on how the final 3-stage pipeline implementation will look like.

On the other hand, if we are also considering ASICs, the "MUL 1cc MULH 1cc" might give better results as a smart synthesis tool can optimize more.

GregAC commented 4 years ago

It's important to consider the impact on ASIC implementation also. From a quick look at the RTL the MUL 1cc MULH 2cc version is using 3x16-bit multipliers + a 3 way A + B + C 34-bit adder. The MUL 1cc MULH 1cc version uses a single 1x33-bit multiplier.

I'd suspect the MUL 1cc MULH 2cc version has area advantage in ASIC implementation, less certain about how much it matter to timing.

ganoam commented 4 years ago

About the frequencies: The results I reported are not very rigid. Yesterday, I have truncated some unused signals from the MUL 1cc MULH 1cc design (which I am quite sure, the synthesis tool optimized away anyways) and re-run the implementation twice, once with a target period of 14ns and 14.5ns. The reported critical paths were 15.656ns and 14.708ns respectively for the two-stage core configuration. For the 3 stage configuration it reported 15.054ns (with a target period of 14ns), which is a lot closer to the MUL 1cc MULH 2cc version than originally measured, and better than the baseline. So I think what we can say is, the new multiplier is about half a ns slower for the 2 stage configuration, with the MUL 1cc MULH 2cc version having a slight advantage over the MUL 1cc MULH 2cc version. But the differences between the two versions is very small.
For the 3 stage configuration it is really very hard to tell. Not only because of the apparent volatility of the synthesis tool but also because it was measured with what is likely not the final implementation.

@vogelpi Yes, the critical path seems to be more likely to traverse a DSP slice in the MUL1cc MULH cc implementation. Using the implementation runs I reported yesterday, I have extracted the longest paths through the multiplier (i.e. paths that include the signal which is assigned with the result of the multiplication for MUL 1cc MULH 1cc or the accumulation register input for the other versions.)

Core Configuration | Mult. Implementation | Longest Path Through Mult /ns
---------------------------------------------------------------------
2 stage            | Baseline             | 12.879
  + FPGA Reg. File | MUL 1cc MULH 1cc     | 14.755 (crit. path)
                   | MUL 1cc MULH 2cc     | 14.338
---------------------------------------------------------------------
3 stage + BT ALU   | Baseline             | 13.676
  + FPGA Reg. File | MUL 1cc MULH 1cc     | 15.088
                   | MUL 1cc MULH 2cc     | 14.137
---------------------------------------------------------------------

@GregAC I have not yet done any ASIC analysis, since I thought that this optimization would be targeting FPGAs which have the necessary hardware to spare. I am happy to do the investigation, but I think, considering the pretty modest IPC advantage, it would not be worth it to chose either one of the new implementations for ASICs: I am quite sure that both versions will have quite unsatisfactory results: The MUL 1cc MULH2cc version will consume at least 3x the area of the original implementation (3 of multipliers of the original size + a larger addition structure). The 33-bit multiplier, as @davideschiavone explained to me would blow up in contrast to the 17 bit multiplier he implemented, which is why he implemented it the way he did. As for the timing, I really don't have the experience to speculate.

I am not sure if we should base our decision on the ASIC synthesis. I think it is really unlikely anyone would want to spend the extra area for this little benefit. And I suspect, also timing will be worse.

vogelpi commented 4 years ago

Thanks @ganoam for the additional info. I think we should go for the MUL 1cc MULH 2cc version.

There is a slight advantage on the FPGA in terms of timing but more importantly the multiplier/DSP is not in the critical path for the 3-stage implementation. I would not go into the ASIC domain yet. You will do this once working on the B-extension. Looking at the original paper, one can also see that the 32-bit multiplier of RI5CY is roughly 4x bigger than the baseline 16-bit multiplier (Zero-riscy in the paper). So even if the MUL 1cc MULH 2cc is 3x bigger than the baseline, it is still around 25% smaller than the 32-bit multiplier.

So, unless @GregAC , @tomroberts-lowrisc or @imphil oppose, go ahead and update your PR with the MUL 1cc MULH 2cc version.

GregAC commented 4 years ago

I'm happy with the MUL 1cc MULH 2cc version. I'll push it through my Yosys flow when the PR is ready and see what it makes of it. Thanks @ganoam

ganoam commented 4 years ago

Alright. I'll prepare the PR then. Do you want me to make it dependent on a parameter within the original ibex_multdiv_fast.sv?

GregAC commented 4 years ago

I was thinking about that. Having three separate modules seems bad as there's code duplication. However having the multiply configuration handled by two separate parameters (choose fast or slow then single cycle or not) seems messy.

I think ideally we'd have a div module and multiple different multiplier modules, though that's out of the scope for this task.

So implementing this within multdiv_fast with a parameter to choose single cycle mul or not is good enough for now. Then let's open an issue for someone to take a look at it all and see if we can refactor into something cleaner.

vogelpi commented 4 years ago

Yes, I agree with @GregAC .

Let's include it in the ibex_multdiv_fast.sv for now (we are still sharing some logic with the divider, so this makes sense).

imphil commented 4 years ago

Sounds good to me. No matter where the code ends up living, I'd like to keep a single MultiplierImplementation parameter on the toplevel.

ganoam commented 4 years ago

Ok, the PR (#559) is finally ready for review.

I have shuffled around quite a bit of code, trying to arrange everything somewhat clearly. however, the division structures and the original multiplier are unchanged.

Making it available to chose from the top level required a little bit of meddling with the ex block. I am not sure if this is the proper way to do things... I have also updated the documentation.

vogelpi commented 4 years ago

The single-cycle multiplier for FPGA has been merged. Nice work @ganoam! I have created a new issue to discuss options and progress on the CSR/performance counter optimizations here: https://github.com/lowRISC/ibex/issues/601

ganoam commented 4 years ago

To keep this Issue up to date:

A PR concerning inference of DSP counters for performance monitors is available: #624.
I have had a quick look at using DSPs for a barrel shifter: #625

vogelpi commented 4 years ago

This work has been concluded. I am thus closing the issue.

lowRISC / ibex

FPGA Optimizations #522