kuznia-rdzeni / coreblocks

RISC-V out-of-order core for education and research purposes
https://kuznia-rdzeni.github.io/coreblocks/
BSD 3-Clause "New" or "Revised" License

Fix fetcher throughput #667

Closed · tilk closed this 2 months ago

tilk commented 2 months ago

The second stage of the fetcher couldn't proceed with flushing when the output method was not ready, which limited the fetch throughput. This PR fixes that.

This change makes the loop in the fibonacci test run at essentially IPC=1 instead of 0.66 (the PR removes two unnecessary wait cycles per loop iteration).
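As a back-of-the-envelope check of those numbers (the 4-instruction loop body below is an assumption for illustration, not stated in the PR), steady-state IPC of a loop is the instruction count divided by the cycles per iteration:

```python
def loop_ipc(n_instr: int, wait_cycles: int) -> float:
    """Steady-state IPC of a loop that retires n_instr instructions
    per iteration and stalls for wait_cycles extra cycles each time."""
    return n_instr / (n_instr + wait_cycles)

# Hypothetical 4-instruction loop body: two wasted wait cycles per
# iteration give IPC = 4/6 ~= 0.67; removing them restores IPC = 1.0.
print(round(loop_ipc(4, 2), 2))  # 0.67
print(loop_ipc(4, 0))            # 1.0
```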

Edit: Looks like this change somehow uncovered a correctness problem.

xThaid commented 2 months ago

I found the bug. The problem occurred when the fetch unit was stalled by an exception. In that case we want to set flushing_counter to the number of items in the pipeline, but m.d.sync += flushing_counter.eq(flushing_counter - 1) took precedence and set the counter to 0. This wasn't a problem before the change, because stall_exception would block the second-stage transaction and wouldn't let it override the flushing counter. The fix:

```diff
diff --git a/coreblocks/frontend/fetch/fetch.py b/coreblocks/frontend/fetch/fetch.py
index c3736066..bd7806e9 100644
--- a/coreblocks/frontend/fetch/fetch.py
+++ b/coreblocks/frontend/fetch/fetch.py
@@ -93,9 +93,6 @@ class FetchUnit(Elaboratable):
         def flush():
             m.d.comb += flush_now.eq(1)

-        with m.If(flush_now):
-            m.d.sync += flushing_counter.eq(req_counter.count_next)
-
         current_pc = Signal(self.gen_params.isa.xlen, reset=self.gen_params.start_pc)

         stalled_unsafe = Signal()
@@ -365,6 +362,9 @@ class FetchUnit(Elaboratable):
                 with branch(flushing_counter != 0):
                     m.d.sync += flushing_counter.eq(flushing_counter - 1)

+        with m.If(flush_now):
+            m.d.sync += flushing_counter.eq(req_counter.count_next)
+
         @def_method(m, self.resume, ready=(stalled & (flushing_counter == 0)))
         def _(pc: Value, resume_from_exception: Value):
             log.info(m, True, "Resuming new_pc=0x{:x} from exception={}", pc, resume_from_exception)
```
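The fix relies on Amaranth's rule that when the same signal is assigned more than once in a clock domain, the assignment emitted later in elaboration order wins. A minimal plain-Python model of the post-fix priority (the function name and arguments are illustrative, not from the codebase):

```python
def next_flushing_counter(counter: int, second_stage_fires: bool,
                          flush_now: bool, pipeline_items: int) -> int:
    """Model of the post-fix assignment order: the flush reload is
    emitted after the decrement, so it wins when both conditions hold."""
    nxt = counter
    if second_stage_fires and counter != 0:
        nxt = counter - 1      # earlier sync assignment: count down
    if flush_now:
        nxt = pipeline_items   # later assignment overrides (the fix)
    return nxt

# Exception stall while a flush fires: the counter must be reloaded to
# the number of in-flight requests. Pre-fix, the decrement won and the
# counter reached 0 too early.
print(next_flushing_counter(1, True, True, 3))  # 3
```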
github-actions[bot] commented 2 months ago

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| 🔺 0.406 (+0.002) | 🔻 0.457 (-0.000) | 🔻 0.308 (-0.003) | 🔺 0.644 (+0.000) | 🔻 0.345 (-0.000) | 🔻 0.255 (-0.001) | 🔻 0.304 (-0.000) | 🔺 0.398 (+0.000) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5) | LUTs used as DFF (ECP5) | LUTs used as carry (ECP5) | LUTs used as ram (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 🔻 22060 (-541) | 5560 (0) | 802 (0) | 1004 (0) | 🔺 51 (+2) |

Synthesis benchmarks (full)

| Device utilisation (ECP5) | LUTs used as DFF (ECP5) | LUTs used as carry (ECP5) | LUTs used as ram (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 🔻 30843 (-2808) | 8802 (0) | 🔺 1970 (+32) | 1184 (0) | 🔻 40 (-3) |
tilk commented 2 months ago

Looks like the benchmark numbers didn't change much. But if you look at the metrics, you can see that ROB utilization increased (see e.g. rob.size/bucket-32). The performance gain was probably eaten up by the mispredict penalty.

github-actions[bot] commented 2 months ago

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| 🔺 0.406 (+0.002) | 🔻 0.457 (-0.000) | 🔻 0.308 (-0.003) | 🔺 0.644 (+0.000) | 🔻 0.345 (-0.000) | 🔻 0.255 (-0.001) | 🔻 0.304 (-0.000) | 🔺 0.398 (+0.000) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation (ECP5) | LUTs used as DFF (ECP5) | LUTs used as carry (ECP5) | LUTs used as ram (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 🔻 22377 (-1464) | 5560 (0) | 802 (0) | 1004 (0) | 🔺 51 (+1) |

Synthesis benchmarks (full)

| Device utilisation (ECP5) | LUTs used as DFF (ECP5) | LUTs used as carry (ECP5) | LUTs used as ram (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 🔺 32174 (+434) | 8802 (0) | 1938 (0) | 1184 (0) | 🔺 41 (+3) |