lowRISC / ibex

Ibex is a small 32 bit RISC-V CPU core, previously known as zero-riscy.
https://www.lowrisc.org
Apache License 2.0
1.33k stars 515 forks source link

Potential timing issues around pc_set #305

Closed GregAC closed 4 years ago

GregAC commented 4 years ago

The pc_set signal factors in both exceptions that have just occurred and branch conditions and directly feeds into the top-level instr_addr_o output (and associated signals). This is a long path making the instruction memory interface outputs late.

The paper 'Slow and Steady Wins the Race? A Comparison of Ultra-Low-Power RISC-V Cores for Internet-of-Things Applications' (https://www.researchgate.net/profile/Francesco_Conti4/publication/321121782_Slow_and_steady_wins_the_race_A_comparison_of_ultra-low-power_RISC-V_cores_for_Internet-of-Things_applications/links/5a37aa3ea6fdccdd41fc99e1/Slow-and-steady-wins-the-race-A-comparison-of-ultra-low-power-RISC-V-cores-for-Internet-of-Things-applications.pdf) which looked at zero-riscy highlighted this path as the worst in their test system.

The simple fix (stall and set the pc cycle after branch decision/exception) adds a cycle latency to any branch. A simple branch predictor may be able to reduce the impact of this.

vogelpi commented 4 years ago

I think we should also seriously consider your previous idea of adding a separate adder in the IF stage for computing the branch/jump target. Or would this reduce another path into instr_addr_o but not the most critical one of pc_set?

GregAC commented 4 years ago

I think it will help this path, but you've still have an reg file addr -> reg file out -> alu -> branch_target -> iside stuff -> instr_addr_o path. The dedicated ALU should help knock a few stages of logic off but nothing dramatic.

GregAC commented 4 years ago

Actually the branch condition logic remains the same and I realise I'm confusing two long paths here. We have seperately:

branch condition/exception found -> pc_set -> ...

branch target calc -> iside -> instr_addr_o

Though pc_set also factors into instr_addr_o as it chooses the branch target as the fetch address rather than what we would next prefetch.

vogelpi commented 4 years ago

Yep I see. Another question: what about issuing the instruction fetch independently of the branch decision and then deal with it once the IF stage gets back the response?

For example, if the branch is not taken, we just do not write the returned instruction into the fetch FIFO.

GregAC commented 4 years ago

Following our offline discussion, @vogelpi pointed out the branch condition -> pc_set -> ... path doesn't exist (maybe it used to in some version of zero-riscy?) because we calculate branch condition in one cycle then the following cycle calculate target and set PC. However the one beginning with the exception found does.

So it is suggested we break the exception found version of the path. This adds a cycle latency when we take an exception but this probably isn't a big deal.

However if we look at introducing a dedicated branch target ALU we end up bringing back the branch condition -> pc_set -> ... path so how do we deal with it?

The suggestion from @vogelpi above could work well, especially if the prefetch buffer has got a couple of entries in, effectively assume taken and start fetching using the dedicated ALU for branch target calc then once branch decision is known if we don't branch hopefully prefetch buffer will have enough entries in for the next instruction and avoid any stall from the wasted iside fetch.

Will need some prefetch buffer adjustments. I should also investigate what the impact is of putting a small decode for just jumps/branches into the prefetch buffer is and put the branch target ALU there instead.

This will however increase the proportion of useless instruction fetches from the core which has power implications (it's already non-zero due to the prefetch buffer fetching beyond branches).

tjaychen commented 4 years ago

@GregAC @vogelpi I'm not sure if i'm looking at this right, but it looks like in the controller there is also a rd_data -> instr_addr_o path via the jump instruction decoding. I've not seen this show up anywhere in the top 10 paths, so maybe it's not an issue?

Anyways I think you guys will know best.

vogelpi commented 4 years ago

So it is suggested we break the exception found version of the path. This adds a cycle latency when we take an exception but this probably isn't a big deal.

Actually, we can do this without adding more latency but just a couple of flops. Already now, exceptions take two cycles:

  1. In the first cycle, the exception is detected. The controller FSM then goes into the FLUSH state.
  2. In the second cycle, the controller flushes the pipeline and sets the new PC. For all exceptions except LSU exceptions, the exception signal is still set in the second cycle. For LSU exceptions (only high during one cycle), we use two FFs and then evaluate these in the second cycle. I suggest to do the same for the other exceptions to break this critical path without adding anymore latency.

I think the path mentioned by @tjaychen will sooner or later also be among the most critical ones. For this we would need the second branch/jump target ALU.

vogelpi commented 4 years ago

I had a look at this issue and created a timing report (when using Ibex inside PULPissimo). It seems that the main problem is in the detection of illegal CSR operations. This also results in an illegal instruction exception but unlike the ones detected inside the decoder, this path goes through the ALU (computes the CSR address).

ibex_pulpissimo.txt

GregAC commented 4 years ago

Thanks for your insights @vogelpi nice simple fix in the end!

@tjaychen We would expect a path from the instruction data -> instr_addr_o path due to jumps. You need to pull the immediate out of the jump instruction to compute the jump address.

tjaychen commented 4 years ago

@GregAC that makes sense. I just wasn't sure if this was showing up as a critical path for you guys as well. We are going to break the FPGA timing path on our end, but probably keep the asic path as is.