I found the bug. The problem occurred when the fetch unit was being stalled by an exception. In that case we want to set `flushing_counter` to the number of items in the pipeline, but `m.d.sync += flushing_counter.eq(flushing_counter - 1)` took precedence and set the counter to 0. This wasn't a problem before the change, because `stall_exception` would block the second-stage transaction and wouldn't let it override the flushing counter.
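To see why the decrement "took precedence": in Amaranth, when two synchronous assignments to the same signal both execute in a cycle, the one added later during elaboration wins. Here is a minimal sketch of that mechanism (hypothetical names, not the actual coreblocks code):

```python
from amaranth import Elaboratable, Module, Signal


class LastAssignmentWins(Elaboratable):
    """Toy reduction of the bug: two m.d.sync assignments to one counter."""

    def __init__(self):
        self.flush = Signal()
        self.counter = Signal(4)

    def elaborate(self, platform):
        m = Module()
        # Earlier assignment: reload the counter when flushing.
        with m.If(self.flush):
            m.d.sync += self.counter.eq(5)
        # Later assignment: whenever both conditions hold, this one wins,
        # so a flush during an active countdown keeps decrementing instead
        # of reloading -- the same shape as the flushing_counter bug.
        with m.If(self.counter != 0):
            m.d.sync += self.counter.eq(self.counter - 1)
        return m
```

Swapping the order of the two blocks reverses the priority, which is exactly what the diff below does for `flushing_counter`.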
The fix:
```diff
diff --git a/coreblocks/frontend/fetch/fetch.py b/coreblocks/frontend/fetch/fetch.py
index c3736066..bd7806e9 100644
--- a/coreblocks/frontend/fetch/fetch.py
+++ b/coreblocks/frontend/fetch/fetch.py
@@ -93,9 +93,6 @@ class FetchUnit(Elaboratable):
         def flush():
             m.d.comb += flush_now.eq(1)
 
-        with m.If(flush_now):
-            m.d.sync += flushing_counter.eq(req_counter.count_next)
-
         current_pc = Signal(self.gen_params.isa.xlen, reset=self.gen_params.start_pc)
         stalled_unsafe = Signal()
@@ -365,6 +362,9 @@ class FetchUnit(Elaboratable):
             with branch(flushing_counter != 0):
                 m.d.sync += flushing_counter.eq(flushing_counter - 1)
+            with m.If(flush_now):
+                m.d.sync += flushing_counter.eq(req_counter.count_next)
+
         @def_method(m, self.resume, ready=(stalled & (flushing_counter == 0)))
         def _(pc: Value, resume_from_exception: Value):
             log.info(m, True, "Resuming new_pc=0x{:x} from exception={}", pc, resume_from_exception)
```
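If you want to watch the difference in simulation, here is a sketch assuming the generator-based simulator API of older Amaranth releases, reusing the toy module from above:

```python
from amaranth.sim import Simulator

dut = LastAssignmentWins()
sim = Simulator(dut)
sim.add_clock(1e-6)


def bench():
    # Pulse flush to load the counter, let it count down for a cycle,
    # then flush again mid-countdown.  With the buggy ordering the
    # decrement wins on that cycle; with the fixed ordering the reload
    # wins and the counter jumps back to 5.
    yield dut.flush.eq(1)
    yield
    yield dut.flush.eq(0)
    yield
    yield dut.flush.eq(1)
    yield
    yield dut.flush.eq(0)
    yield
    print("counter =", (yield dut.counter))


sim.add_sync_process(bench)
sim.run()
```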
aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
---|---|---|---|---|---|---|---|
🔺 0.406 (+0.002) | 🔻 0.457 (-0.000) | 🔻 0.308 (-0.003) | 🔺 0.644 (+0.000) | 🔻 0.345 (-0.000) | 🔻 0.255 (-0.001) | 🔻 0.304 (-0.000) | 🔺 0.398 (+0.000) |
You can view all the metrics here.
Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
---|---|---|---|---|
🔻 22060 (-541) | 5560 (0) | 802 (0) | 1004 (0) | 🔺 51 (+2) |
Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
---|---|---|---|---|
🔻 30843 (-2808) | 8802 (0) | 🔺 1970 (+32) | 1184 (0) | 🔻 40 (-3) |
Looks like the benchmark numbers didn't change much. But if you look into the metrics, you can see that ROB utilization increased (see e.g. `rob.size/bucket-32`). The performance gain was probably eaten up by the mispredict penalty.
aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
---|---|---|---|---|---|---|---|
🔺 0.406 (+0.002) | 🔻 0.457 (-0.000) | 🔻 0.308 (-0.003) | 🔺 0.644 (+0.000) | 🔻 0.345 (-0.000) | 🔻 0.255 (-0.001) | 🔻 0.304 (-0.000) | 🔺 0.398 (+0.000) |
You can view all the metrics here.
Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
---|---|---|---|---|
🔻 22377 (-1464) | 5560 (0) | 802 (0) | 1004 (0) | 🔺 51 (+1) |
Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
---|---|---|---|---|
🔺 32174 (+434) | 8802 (0) | 1938 (0) | 1184 (0) | 🔺 41 (+3) |
The second stage of the fetcher couldn't proceed with flushing when the output method was not ready, which limited the fetch throughput. This PR fixes that.
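For intuition, a minimal sketch of the idea in plain Amaranth (hypothetical ready/valid signals, not coreblocks' actual Transactron methods): before, even an item that was about to be flushed waited for the downstream method to be ready; after, items covered by the flushing counter are consumed unconditionally.

```python
from amaranth import Elaboratable, Module, Signal


class SecondStage(Elaboratable):
    """Toy second pipeline stage that discards items while flushing."""

    def __init__(self):
        self.in_valid = Signal()
        self.in_ready = Signal()
        self.out_valid = Signal()
        self.out_ready = Signal()
        self.flushing = Signal(4)  # items still to be discarded

    def elaborate(self, platform):
        m = Module()
        flushing_item = self.flushing != 0
        # Old behaviour: every item, flushed or not, waited for the sink:
        #   m.d.comb += self.in_ready.eq(self.out_ready)
        # New behaviour: flushed items drain regardless of the sink, so a
        # stalled output method no longer adds wait cycles to the flush.
        m.d.comb += self.in_ready.eq(self.out_ready | flushing_item)
        m.d.comb += self.out_valid.eq(self.in_valid & ~flushing_item)
        with m.If(self.in_valid & self.in_ready & flushing_item):
            m.d.sync += self.flushing.eq(self.flushing - 1)
        return m
```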
This change makes the loop in the fibonacci test run at basically IPC=1, instead of 0.66 (there were two unnecessary wait cycles, which this PR removes).

Edit: Looks like this change somehow uncovered a correctness problem.