kuznia-rdzeni / coreblocks

RISC-V out-of-order core for education and research purposes
https://kuznia-rdzeni.github.io/coreblocks/
BSD 3-Clause "New" or "Revised" License

Make RS feed FUs with garbage if flushing #740

Open Arusekk opened 1 month ago

Arusekk commented 1 month ago

See #598; does not skip FUs but shows the concept.
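
The idea in the title can be illustrated with a toy behavioural model in plain Python. This is only a sketch based on a reading of the title; it is not Amaranth and not the code in this PR. The names `RSEntry`, `ToyRS`, `select` and `drain` are hypothetical. The model assumes that a reservation station (RS) normally issues an entry only when both operands are ready, and that while flushing it ignores readiness, so entries reach the FUs with whatever (possibly garbage) operand values they hold and the RS drains without waiting.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RSEntry:
    """One reservation-station slot (hypothetical, heavily simplified)."""
    name: str
    s1_ready: bool
    s2_ready: bool
    s1_val: int = 0  # may hold garbage if the operand never arrived
    s2_val: int = 0


@dataclass
class ToyRS:
    """Toy RS: a list of entries with an in-order scan for an issuable one."""
    entries: List[RSEntry] = field(default_factory=list)

    def insert(self, entry: RSEntry) -> None:
        self.entries.append(entry)

    def select(self, flushing: bool) -> Optional[RSEntry]:
        """Remove and return the first entry that may issue, or None.

        Assumption: while flushing, operand readiness is ignored, so the
        entry is handed to the FU even if its operands are garbage.
        """
        for i, e in enumerate(self.entries):
            if flushing or (e.s1_ready and e.s2_ready):
                return self.entries.pop(i)
        return None


def drain(rs: ToyRS, flushing: bool, max_cycles: int = 10) -> List[str]:
    """Issue at most one entry per cycle; return the issued names in order."""
    issued = []
    for _ in range(max_cycles):
        e = rs.select(flushing)
        if e is None:
            break
        issued.append(e.name)
    return issued


def example_rs() -> ToyRS:
    """Three entries; 'b' waits for an operand that will never arrive."""
    rs = ToyRS()
    rs.insert(RSEntry("a", True, True, 1, 2))
    rs.insert(RSEntry("b", True, False, 3, 0))
    rs.insert(RSEntry("c", True, True, 4, 5))
    return rs
```

Calling `drain(example_rs(), flushing=False)` leaves entry `b` stuck in the RS, while `drain(example_rs(), flushing=True)` empties it. As the comment above notes, this PR still sends the entries through the FUs; the model does not attempt to show skipping them.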

github-actions[bot] commented 1 month ago

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| 0.421 | 0.513 | 0.339 | 0.655 | 0.364 | 0.29 | 0.328 | 0.43 |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 15885 | 6043 | 834 | 1068 | 43 |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| 28877 | 9298 | 1790 | 1248 | 40 |

github-actions[bot] commented 2 weeks ago

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.421 (+0.004) | 0.513 (0.000) | ▲ 0.339 (+0.002) | ▲ 0.655 (+0.000) | ▲ 0.364 (+0.003) | 0.290 (0.000) | ▲ 0.328 (+0.002) | ▼ 0.430 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 14911 (+396) | 6043 (0) | 834 (0) | 1068 (0) | ▼ 41 (-14) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 24301 (-546) | 9298 (0) | ▲ 1790 (+32) | 1248 (0) | ▼ 35 (-10) |

github-actions[bot] commented 1 week ago

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.421 (+0.004) | 0.513 (0.000) | ▲ 0.339 (+0.002) | ▲ 0.655 (+0.000) | ▲ 0.364 (+0.003) | 0.290 (0.000) | ▲ 0.328 (+0.002) | ▼ 0.430 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 15122 (+858) | 6043 (0) | 834 (0) | 1068 (0) | ▼ 41 (-16) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▲ 25382 (+506) | 9298 (0) | 1790 (0) | 1248 (0) | ▼ 32 (-11) |

github-actions[bot] commented 1 week ago

Benchmarks summary

Performance benchmarks

| aha-mont64 | crc32 | minver | nettle-sha256 | nsichneu | slre | statemate | ud |
|---|---|---|---|---|---|---|---|
| ▲ 0.421 (+0.004) | 0.513 (0.000) | ▲ 0.339 (+0.002) | ▲ 0.655 (+0.000) | ▲ 0.364 (+0.003) | 0.290 (0.000) | ▲ 0.328 (+0.002) | ▼ 0.430 (-0.001) |

You can view all the metrics here.

Synthesis benchmarks (basic)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 13982 (-282) | 6043 (0) | 834 (0) | 1068 (0) | ▼ 39 (-18) |

Synthesis benchmarks (full)

| Device utilisation: (ECP5) | LUTs used as DFF: (ECP5) | LUTs used as carry: (ECP5) | LUTs used as ram: (ECP5) | Max clock frequency (Fmax) |
|---|---|---|---|---|
| ▼ 23200 (-1676) | 9298 (0) | 1790 (0) | 1248 (0) | ▼ 33 (-9) |

piotro888 commented 1 week ago

[additional comments to discussion from meeting]

I checked that a synchronous flushing signal would work in RSInsertion: because the FreeRF/RF valid bits are also updated in the sync domain, the effect would be visible in the next cycle (and the old RF entry would be inserted into the RS). The change in RSInsertion would also not cause any performance loss, and it is definitely safe in the RS too.

The proposal to reset RF valid in Register Allocation would be problematic with checkpointing, which pushes a new instruction immediately.

The last place is the LSU. LSU operations have a very high cost, so I don't see why we should de-optimize it if this part of the LSU is not on the critical path (unless it is).