The delay estimates for `sel` operation may be inaccurate for ASAP7

rw1nkler commented 5 months ago

Describe the bug

While working on improving the performance of our designs, we noticed that the delay estimate for IR sel operation is surprisingly high for wide signals in ASAP7 PDK. We have found that the delay introduced by sel is (in our case) often a major component of the critical path. When we replaced the if statements (translated to sel), with a hand-written hardware mux based on AND and OR operations, we obtained significantly smaller delay estimates and shorter critical paths.

The actual place-and-route in ASAP7 (with a single pipeline stage) seems to yield similar results for a standard if, and hand-written mux logic, which may suggest that the delay model for the sel operation is inaccurate.

To Reproduce

The code below can be used to notice the issue:


// standard if statement
fn slow_if<N: u32>(cond: bool, arg1: uN[N], arg2: uN[N]) -> uN[N] {
    if cond { arg1 } else { arg2 }
}

// hardware mux logic for if
fn fast_if<N: u32>(cond: bool, arg1: uN[N], arg2: uN[N]) -> uN[N] {
    let mask = if (cond) {!bits[N]:0} else {bits[N]:0};
    (arg1 & mask) | (arg2 & !mask)
}

fn slow_if_inst(cond: bool, arg1: uN[1024], arg2: uN[1024]) -> uN[1024] {
    slow_if<u32:1024>(cond, arg1, arg2)
}

fn fast_if_inst(cond: bool, arg1: uN[1024], arg2: uN[1024]) -> uN[1024] {
    fast_if<u32:1024>(cond, arg1, arg2)
}

The standard if (slow_if) is translated to a ternary if operator (?) in Verilog:

wire [1023:0] p1_sel_20_comb;
assign p1_sel_20_comb = p0_cond ? p0_arg1 : p0_arg2;

While the mux logic fast_if is translated as follows:

assign p1_mask_comb = {1024{p0_cond}};
assign p1_or_41_comb = p0_arg1 & p1_mask_comb | ~(~p0_arg2 | p1_mask_comb);

Here are the results of xls_benchmark_ir rule for these two functions:

slow_if:

Critical path delay: 11608ps
Critical path entry count: 2
Critical path:
  0ps as synthesized (+0ps); 0.00%
11608ps (+11608ps): sel.10: bits[1024] = sel(cond, cases=[arg2, arg1], id=10, pos=[(0,1,4)])
  0ps (+  0ps): arg1: bits[1024] = param(arg1, id=2)

fast_if:

Critical path delay: 73ps
Critical path entry count: 4
Critical path:
  0ps as synthesized (+0ps); 0.00%
 73ps (+ 18ps): or.27: bits[1024] = or(and.25: bits[1024], nor.26: bits[1024], id=27, pos=[(0,6,18)])
 55ps (+ 24ps): and.25: bits[1024] = and(arg1: bits[1024], mask: bits[1024], id=25, pos=[(0,6,10)])
 31ps (+ 31ps): mask: bits[1024] = sign_ext(cond: bits[1], new_bit_count=1024, id=23, pos=[(0,5,15)])
  0ps (+  0ps): cond: bits[1] = param(cond, id=1)

In both cases, the P&R results from ASAP7 PDK (for one pipeline stage) are comparable and smaller than the delay reported by the slow_if.

Here are the additional files useful for investigating the problem:

Expected behavior

The standard if (slow_if) delay estimate should be similar to the manual mux approach (fast_if).

Additional context

Here is the delay information for the sel operation retrieved from ASAP7 PDK:

xls/delay_model/models/asap7.textproto#L2020

rw1nkler commented 5 months ago

@proppy Here are the additional reports from the benchmark_synth and xls_benchmark_verilog that you asked for:

slow_if

slow_if_asap7_synth_benchmark.log slow_if_verilog_benchmark.log

fast_if

fast_if_asap7_synth_benchmark.log fast_if_verilog_benchmark.log

I pushed the code here so that anyone interested can take a look at it.

proppy commented 5 months ago

indeed the synth report look identical:

Chip area for module '\slow_if': 1269.334800
Longest topological path in slow_if (length=5):
Flop count: 3073 objects.
Liberty: asap7-sc7p5t_rev28_rvt-ccs_ss_SS.lib
End of script. Logfile hash: 59bbcb4cbb, CPU: user 15.37s system 0.30s, MEM: 819.83 MB peak

Chip area for module '\fast_if': 1269.334800
Longest topological path in fast_if (length=5):
Flop count: 3073 objects.
Liberty: asap7-sc7p5t_rev28_rvt-ccs_ss_SS.lib
End of script. Logfile hash: 6024f4c68a, CPU: user 15.51s system 0.25s, MEM: 820.87 MB peak

ericastor commented 5 months ago

We've identified the issue, have a prototype fix, and are working to polish it for commit.

Short version: it turns out the timing characterizer has been using a slightly misconfigured script for Yosys (and specifically ABC), meaning that its synthesis runs did a poor job of handling high-fanout logic. We're working on piping the relevant information through; we'll then need to rerun the delay model characterization.

rw1nkler commented 5 months ago

Sounds great! We are looking forward to seeing the fixes. Is the characterization process described in any way, is there any place that we can look at to better understand the process?

proppy commented 5 months ago

@richmckeever might be able to comment on the current state of the characterization process.

m-torhan commented 5 months ago

It looks like the delay estimates are still wrong. I ran IR benchmark for the code provided by @rw1nkler using latest main branch (769ae4e) and the results are as follows:

slow_if

Critical path delay: 221ps
Critical path entry count: 2
Critical path:
  0ps as synthesized (+0ps); 0.00%
221ps (+221ps): sel.10: bits[1024] = sel(cond, cases=[arg2, arg1], id=10, pos=[(0,2,4)])
  0ps (+  0ps): arg1: bits[1024] = param(arg1, id=2)

fast_if

Critical path delay: 63ps
Critical path entry count: 4
Critical path:
  0ps as synthesized (+0ps); 0.00%
 63ps (+ 19ps): or.27: bits[1024] = or(and.25: bits[1024], nor.26: bits[1024], id=27, pos=[(0,8,18)])
 44ps (+ 27ps): nor.26: bits[1024] = nor(not.24: bits[1024], mask: bits[1024], id=26, pos=[(0,8,26)])
 17ps (+ 17ps): mask: bits[1024] = sign_ext(cond: bits[1], new_bit_count=1024, id=23, pos=[(0,7,15)])
  0ps (+  0ps): cond: bits[1] = param(cond, id=1)

Also the delay of sel depends on operands width, while the fast_if delay is constant.

ericastor commented 5 months ago

Have you tried actually synthesizing these at varying widths? I expect that fast_if's delay is not in fact constant with width, since there is a delay effect from fanning out the input in the sign_ext.

Also, the fact is that these are much closer after the update. However, I'm a little surprised - I think our optimizer should probably be expected to convert the slow_if's select to the equivalent of your fast_if, since it is in fact better (likely because the synthesis tool can do this optimization for you in this case). I'll be happy to look into that under a new issue!

m-torhan commented 5 months ago

@ericastor I checked that and you are right - the delay does depend on the operands' widths.

I created the issue as you asked https://github.com/google/xls/issues/1475. Please take a look into it.

google / xls

The delay estimates for `sel` operation may be inaccurate for ASAP7 #1461