Closed rw1nkler closed 5 months ago
@proppy Here are the additional reports from benchmark_synth and xls_benchmark_verilog that you asked for:
slow_if
slow_if_asap7_synth_benchmark.log slow_if_verilog_benchmark.log
fast_if
fast_if_asap7_synth_benchmark.log fast_if_verilog_benchmark.log
I pushed the code here so that anyone interested can take a look at it.
Indeed, the synth reports look identical:
Chip area for module '\slow_if': 1269.334800
Longest topological path in slow_if (length=5):
Flop count: 3073 objects.
Liberty: asap7-sc7p5t_rev28_rvt-ccs_ss_SS.lib
End of script. Logfile hash: 59bbcb4cbb, CPU: user 15.37s system 0.30s, MEM: 819.83 MB peak
Chip area for module '\fast_if': 1269.334800
Longest topological path in fast_if (length=5):
Flop count: 3073 objects.
Liberty: asap7-sc7p5t_rev28_rvt-ccs_ss_SS.lib
End of script. Logfile hash: 6024f4c68a, CPU: user 15.51s system 0.25s, MEM: 820.87 MB peak
We've identified the issue, have a prototype fix, and are working to polish it for commit.
Short version: it turns out the timing characterizer has been using a slightly misconfigured script for Yosys (and specifically ABC), meaning that its synthesis runs did a poor job of handling high-fanout logic. We're working on piping the relevant information through; we'll then need to rerun the delay model characterization.
Sounds great! We are looking forward to seeing the fixes. Is the characterization process documented anywhere? Is there a place we can look to better understand it?
@richmckeever might be able to comment on the current state of the characterization process.
It looks like the delay estimates are still wrong. I ran the IR benchmark for the code provided by @rw1nkler using the latest main branch (769ae4e), and the results are as follows:
slow_if
Critical path delay: 221ps
Critical path entry count: 2
Critical path:
0ps as synthesized (+0ps); 0.00%
221ps (+221ps): sel.10: bits[1024] = sel(cond, cases=[arg2, arg1], id=10, pos=[(0,2,4)])
0ps (+ 0ps): arg1: bits[1024] = param(arg1, id=2)
fast_if
Critical path delay: 63ps
Critical path entry count: 4
Critical path:
0ps as synthesized (+0ps); 0.00%
63ps (+ 19ps): or.27: bits[1024] = or(and.25: bits[1024], nor.26: bits[1024], id=27, pos=[(0,8,18)])
44ps (+ 27ps): nor.26: bits[1024] = nor(not.24: bits[1024], mask: bits[1024], id=26, pos=[(0,8,26)])
17ps (+ 17ps): mask: bits[1024] = sign_ext(cond: bits[1], new_bit_count=1024, id=23, pos=[(0,7,15)])
0ps (+ 0ps): cond: bits[1] = param(cond, id=1)
Also, the delay of sel depends on the operand width, while the fast_if delay is constant.
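For readers following along, the two critical paths above can be read as two formulations of the same mux. The Python sketch below models them on plain integers and checks that they agree. It is an illustration only, not the original DSLX source: the function bodies are reconstructed from the IR lines in the log, and it assumes that and.25 combines arg1 with the mask and that not.24 inverts arg2, neither of which the log shows directly.

```python
import random

WIDTH = 1024                  # bit width taken from the bits[1024] types in the log
ALL_ONES = (1 << WIDTH) - 1   # WIDTH-bit value with every bit set


def slow_if(cond: bool, arg1: int, arg2: int) -> int:
    """Models sel(cond, cases=[arg2, arg1]): cond=0 picks arg2, cond=1 picks arg1."""
    return arg1 if cond else arg2


def fast_if(cond: bool, arg1: int, arg2: int) -> int:
    """Mask-based mux built from AND/OR, as in the fast_if critical path."""
    # sign_ext(cond, new_bit_count=WIDTH): replicate the 1-bit condition.
    mask = ALL_ONES if cond else 0
    # Assumed shape of and.25: keep arg1 where the mask is set.
    and_term = arg1 & mask
    # nor(not(arg2), mask) is equivalent to arg2 & ~mask (assumed shape of nor.26).
    nor_term = ~((~arg2 & ALL_ONES) | mask) & ALL_ONES
    return and_term | nor_term


def check_equivalence(trials: int = 100, seed: int = 0) -> bool:
    """Compares both formulations on random WIDTH-bit operands."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.getrandbits(WIDTH)
        b = rng.getrandbits(WIDTH)
        for cond in (False, True):
            if slow_if(cond, a, b) != fast_if(cond, a, b):
                return False
    return True
```

Calling check_equivalence() exercises both branches of the condition. Since the two forms compute the same function, any difference in estimated delay comes from how the delay model scores the individual operations, not from the logic itself.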
Have you tried actually synthesizing these at varying widths? I expect that fast_if's delay is not in fact constant with width, since there is a delay effect from fanning out the input in the sign_ext.
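To illustrate why fanning out a 1-bit condition to a wide mask is unlikely to be free, here is a rough Python sketch that counts how many buffer levels a balanced buffer tree would need. This is a simplified textbook model, not how Yosys, ABC, or the XLS delay model actually compute anything; the per-cell fanout limit of 4 is an arbitrary assumption and not an ASAP7 figure.

```python
def buffer_tree_levels(fanout: int, max_fanout_per_cell: int = 4) -> int:
    """Returns the number of buffer levels needed between one source and
    `fanout` loads so that no cell drives more than `max_fanout_per_cell`
    loads. Zero means the source can drive all loads directly.

    max_fanout_per_cell is a hypothetical limit chosen for illustration.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if max_fanout_per_cell < 2:
        raise ValueError("max_fanout_per_cell must be at least 2")
    levels = 0
    reachable = max_fanout_per_cell  # loads reachable with no buffers
    while reachable < fanout:
        # Each added level multiplies the reachable load count.
        reachable *= max_fanout_per_cell
        levels += 1
    return levels


# Sweep a few widths; the level count grows roughly logarithmically.
for width in (32, 128, 1024):
    print(width, buffer_tree_levels(width))
```

Under this model the number of levels grows with the logarithm of the width, so a width-dependent delay for the sign_ext fanout is what one would expect, even if it grows slowly.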
Also, the fact is that these are much closer after the update. However, I'm a little surprised - I think our optimizer should probably be expected to convert the slow_if's select to the equivalent of your fast_if, since it is in fact better (likely because the synthesis tool can do this optimization for you in this case). I'll be happy to look into that under a new issue!
@ericastor I checked that and you are right - the delay does depend on the operands' widths.
I created the issue as you asked: https://github.com/google/xls/issues/1475. Please take a look at it.
Describe the bug
While working on improving the performance of our designs, we noticed that the delay estimate for the IR sel operation is surprisingly high for wide signals in the ASAP7 PDK. We have found that the delay introduced by sel is (in our case) often a major component of the critical path. When we replaced the if statements (translated to sel) with a hand-written hardware mux based on AND and OR operations, we obtained significantly smaller delay estimates and shorter critical paths.
The actual place-and-route in ASAP7 (with a single pipeline stage) seems to yield similar results for a standard if and for the hand-written mux logic, which may suggest that the delay model for the sel operation is inaccurate.
To Reproduce
The code below can be used to notice the issue:
The standard if (slow_if) is translated to a ternary operator (?) in Verilog:
The mux logic (fast_if) is translated as follows:
Here are the results of the xls_benchmark_ir rule for these two functions:
slow_if:
fast_if:
In both cases, the P&R results from the ASAP7 PDK (for one pipeline stage) are comparable, and smaller than the delay reported for slow_if.
Here are the additional files useful for investigating the problem:
fast_if
fast_if.v.txt fast_if_verilog.opt.ir.txt fast_ir_opt_ir_benchmark.log fast_if_place_and_route__global_routing.log
slow_if
slow_if.v.txt slow_if_verilog.opt.ir.txt slow_ir_opt_ir_benchmark.log slow_if_place_and_route__global_routing.log
Expected behavior
The delay estimate for the standard if (slow_if) should be similar to that of the manual mux approach (fast_if).
Additional context
Here is the delay information for the sel operation retrieved from the ASAP7 PDK: