ultraembedded / biriscv

32-bit Superscalar RISC-V CPU
Apache License 2.0

about Branch target buffer #15

Open cool-ic opened 2 years ago

cool-ic commented 2 years ago

I noticed that your branch target buffer is implemented as a register file, which has no read latency. So I have two questions:

Is it usual to implement the BTB as a register file rather than a block of SRAM? And if we use SRAM, there may be a one-cycle read delay compared to a register file. I think that read delay would disturb the branch prediction design. How should it be handled?

Best regards

ultraembedded commented 2 years ago

Good question! I’ve been thinking about this recently...

The Rocket and WD SweRV cores use flops for the BTB. Those cores and this one use 28-32 fully associative BTB entries. A larger SRAM-based BTB would be possible, I think. You would lose the fully associative capability, but it could be much bigger to compensate (and would be much more FPGA friendly). As you note, there would be a read latency to deal with. The less-than-ideal workaround is to read ahead by one address and accept that, following a predicted branch, the BTB would not be able to provide another prediction for one cycle.
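To make the read-ahead trade-off concrete, here is a rough Python sketch of a toy fetch model (not how biRISC-V implements it). It assumes 4-byte instructions, always-taken branches, and an unbounded BTB. The `run` function, its parameters, and the example programs are hypothetical. It only counts taken branches the BTB failed to predict and ignores flush timing.

```python
def run(branches, start, steps, latency):
    """Count taken branches that the BTB failed to predict.

    branches: dict pc -> target of always-taken branches (hypothetical program)
    latency:  0 models a flop BTB (same-cycle lookup),
              1 models an SRAM BTB that reads ahead at pc + 4
    """
    btb = {}          # learned pc -> target (capacity and aliasing ignored)
    pc = start
    prev_pc = None    # address fetched in the previous cycle
    missed = 0
    for _ in range(steps):
        if latency == 0:
            lookup_valid = True
        else:
            # The read issued last cycle was for prev_pc + 4, so the result
            # is only usable if fetch actually went there.
            lookup_valid = prev_pc is not None and prev_pc + 4 == pc
        predicted = btb.get(pc) if lookup_valid else None
        actual = branches.get(pc)
        if actual is not None and predicted != actual:
            missed += 1
            btb[pc] = actual  # train the BTB on the resolved branch
        prev_pc = pc
        pc = actual if actual is not None else pc + 4
    return missed


# Two branches that jump straight to each other: worst case for read-ahead.
ping_pong = {0x00: 0x10, 0x10: 0x00}
# A three-instruction loop whose branch target is not itself a branch.
simple_loop = {0x08: 0x00}

print(run(ping_pong, 0, 10, latency=0), run(ping_pong, 0, 10, latency=1))
print(run(simple_loop, 0, 12, latency=0), run(simple_loop, 0, 12, latency=1))
```

In this model the one-cycle BTB only loses out when a branch target is itself a taken branch, which fits the observation that the latency hurts on some benchmarks without being terrible.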

I have been modelling various branch prediction designs, and it seems that a BTB with one cycle latency is indeed worse for various benchmarks, but it’s not terrible.

stephenry commented 2 years ago

@cool-ic BTBs come in varying sizes. Some CPUs have an initial micro-BTB implemented in flops and a secondary, larger BTB implemented in SRAM. It's not possible to look up the second, larger BTB on a cycle-by-cycle basis, but its presence nevertheless allows the branch target to be resolved earlier in the pipeline when the address misses in the micro-BTB (although not at the very start). You will only really see this in pipelines that have a couple of stages at the front end.
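A minimal behavioural sketch of such a two-level arrangement, in Python, might look like the following. The class name, the LRU replacement in the micro-BTB, the unbounded main BTB, and the `main_latency` value are all assumptions for illustration and are not taken from any particular core.

```python
from collections import OrderedDict


class TwoLevelBTB:
    """Toy model: small flop-based micro-BTB backed by a larger, slower BTB."""

    def __init__(self, micro_entries=4, main_latency=2):
        self.micro = OrderedDict()   # fully associative, LRU (assumed policy)
        self.main = {}               # stands in for the SRAM BTB, unbounded here
        self.micro_entries = micro_entries
        self.main_latency = main_latency  # hypothetical stages until redirect

    def update(self, pc, target):
        """Record a resolved taken branch in both levels."""
        self.main[pc] = target
        self.micro[pc] = target
        self.micro.move_to_end(pc)
        if len(self.micro) > self.micro_entries:
            self.micro.popitem(last=False)  # evict least recently used entry

    def lookup(self, pc):
        """Return (target, stages_late), or None if neither level hits."""
        if pc in self.micro:
            self.micro.move_to_end(pc)
            return self.micro[pc], 0
        if pc in self.main:
            return self.main[pc], self.main_latency
        return None


btb = TwoLevelBTB(micro_entries=2, main_latency=2)
btb.update(0x100, 0x200)
btb.update(0x104, 0x300)
btb.update(0x108, 0x400)   # pushes 0x100 out of the micro-BTB
print(btb.lookup(0x108))   # micro-BTB hit, no delay
print(btb.lookup(0x100))   # main BTB hit, redirect arrives later
print(btb.lookup(0x999))   # miss in both levels
```

A main-BTB hit still costs a few front-end bubbles in this model, but that is cheaper than waiting for the branch to resolve in the back end.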