Closed: litghost closed this issue 4 years ago
Things I've checked so far:
Using inter-cluster delay: 2.7287e-09
Consider an actual inter-cluster delay from the timing report:
Example 1:
$auto$alumacc.cc:474:replace_alu$1295.slice[0].carry4_1st_full.O2 (CARRY4_VPR) 0.558 4.982
$abc$6342$auto$alumacc.cc:474:replace_alu$1233.BB[2].in[0] (.names) 1.748 6.730
Example 2:
$abc$6342$auto$blifparse.cc:492:parse_blif$6431.T0.in[0] (.names) 4.067 15.048
Net delays of 1.748 ns and 4.067 ns are in the right ballpark of the inter-cluster delay approximation.
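As a quick sanity check (the constant above is in seconds, while the report prints nanoseconds), the single inter-cluster approximation does fall between the two observed net delays:

```python
# The placer's single inter-cluster constant is given in seconds, while
# the timing report prints nanoseconds; converting shows it does sit
# between the two observed net delays.
inter_cluster_delay_s = 2.7287e-09
observed_net_delays_ns = [1.748, 4.067]

inter_cluster_delay_ns = inter_cluster_delay_s * 1e9   # ~2.729 ns
assert min(observed_net_delays_ns) < inter_cluster_delay_ns < max(observed_net_delays_ns)
```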
I also checked that the placer slack approximation is close to the router approximation.
Final placer slack:
------- ------- ---------- ---------- ---------- ---------- ------- ---------- -------- ------- ------- ------- ------ --------- ------
T Cost Av BB Cost Av TD Cost Av Tot Del P to P Del CPD sTNS sWNS Ac Rate Std Dev R limit Exp Tot Moves Alpha
------- ------- ---------- ---------- ---------- ---------- ------- ---------- -------- ------- ------- ------- ------ --------- ------
0.000 0.9998 25.9159 1.6597e-07 4.701e-06 2.0955e-09 32.573 -1.55e+03 -22.573 0.0029 0.0001 1.0000 8.000 153762 0.000
Initial router slack:
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter Time pres BBs Heap Re-Rtd Re-Rtd Overused RR Nodes Wirelength CPD sTNS sWNS hTNS hWNS Est Succ
(sec) fac Updt push Nets Conns (ns) (ns) (ns) (ns) (ns) Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
1 1.4 0.0 0 8837045 678 2238 2047 ( 0.133%) 125716 ( 1.2%) 22.053 -437.0 -12.053 0.000 0.000 N/A
The placer overestimated the delay significantly, so maybe the placer timing approximation is causing poor placer behavior? I'll need to dig in to determine whether the quality of the placer approximation explains the result.
@vaughnbetz I've noticed that the placer is fairly sensitive to its parameters. For example:
--timing_tradeoff 0.5:
Final critical path: 16.9727 ns, Fmax: 58.9183 MHz
Setup Worst Negative Slack (sWNS): -6.97265 ns
Setup Total Negative Slack (sTNS): -211.903 ns
--timing_tradeoff 0.3:
Final critical path: 14.5773 ns, Fmax: 68.5998 MHz
Setup Worst Negative Slack (sWNS): -4.5773 ns
Setup Total Negative Slack (sTNS): -162.235 ns
--timing_tradeoff 0.2:
Final critical path: 20.8454 ns, Fmax: 47.9721 MHz
Setup Worst Negative Slack (sWNS): -10.8454 ns
Setup Total Negative Slack (sTNS): -263.465 ns
Is it expected that the placer be so sensitive to its parameters? Is this an indication of a degenerate grid/rrgraph? Or is it expected given the placer design?
Another thought: the placer is fairly fast, but appears to have a large impact on the final QoR w.r.t. critical path. Given that the placer is fast, does it make sense for the placer to have an outer loop on critical parameters? For example, optimize timing_tradeoff for some number of iterations. In the example above, the default (0.5) doesn't appear to be the "right" choice; something a little lower leads to a better result.
I've noticed that the placer is fairly sensitive to its parameters.
Yes, this is the case. A small tweak in parameters can cause the placer to take a different randomized walk through the solution space. One thing you could try is performing a sweep of several seed values (which change only the random number generator sequence used by the placer). That will give you an idea of what the variability is like on this design.
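A seed sweep along those lines can be scripted directly; the sketch below only builds the command lines (the architecture/circuit file names and channel width are placeholders, and it assumes VPR's `--seed` option as described above):

```python
import shlex

# Sketch of a placer-seed sweep. arch.xml / circuit.blif and the channel
# width are placeholders; --seed changes only the RNG sequence, so the
# spread of results across seeds estimates the placer's noise floor.
BASE_CMD = "vpr arch.xml circuit.blif --route_chan_width 100"

def seed_sweep_commands(seeds):
    return [shlex.split(BASE_CMD) + ["--seed", str(s)] for s in seeds]

cmds = seed_sweep_commands(range(1, 6))
for cmd in cmds:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment (and import subprocess) to run
```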
This is what I see as the critical path after placement:
And the corresponding path report (generated with --post_place_timing_report):
#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$4587.Q (FDRE_ZINI clocked by clk)
Endpoint : $auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI clocked by clk)
Path Type : setup
Point Incr Path
--------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge) 0.000 0.000
clock source latency 0.000 0.000
clk.inpad (.input) 0.000 0.000
$auto$simplemap.cc:442:simplemap_dffe$4587.C (FDRE_ZINI) 4.119 4.119
$auto$simplemap.cc:442:simplemap_dffe$4587.Q (FDRE_ZINI) [clock-to-output] 0.010 4.129
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.S1 (CARRY4_VPR) 1.716 5.845
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.CO_CHAIN (CARRY4_VPR) 0.528 6.373
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.CIN (CARRY4_VPR) 1.756 8.130
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.O2 (CARRY4_VPR) 0.256 8.386
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].in[0] (.names) 1.702 10.088
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].out (.names) 0.068 10.156
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.S2 (CARRY4_VPR) 4.789 14.945
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR) 0.376 15.321
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CIN (CARRY4_VPR) 1.756 17.077
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 17.191
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CIN (CARRY4_VPR) 1.756 18.947
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 19.061
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CIN (CARRY4_VPR) 1.756 20.818
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 20.932
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CIN (CARRY4_VPR) 1.756 22.688
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 22.802
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CIN (CARRY4_VPR) 1.756 24.558
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 24.672
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CIN (CARRY4_VPR) 1.756 26.429
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 26.543
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CIN (CARRY4_VPR) 1.756 28.299
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR) 0.293 28.592
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.S (MUXF8) 1.407 29.999
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.O (MUXF8) 0.010 30.009
$abc$6342$new_n484_.in[1] (.names) 1.295 31.304
$abc$6342$new_n484_.out (.names) 0.068 31.372
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.in[2] (.names) 1.308 32.680
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.out (.names) 0.068 32.748
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.I0 (MUXF6) 0.000 32.748
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.O (MUXF6) 0.010 32.758
$auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI) 1.308 34.066
data arrival time 34.066
clock clk (rise edge) 10.000 10.000
clock source latency 0.000 10.000
clk.inpad (.input) 0.000 10.000
$auto$simplemap.cc:442:simplemap_dffe$3756.C (FDRE_ZINI) 1.502 11.502
clock uncertainty 0.000 11.502
cell setup time -0.010 11.492
data required time 11.492
--------------------------------------------------------------------------------------------------------------------------
data required time 11.492
data arrival time -34.066
--------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED) -22.573
Both paths show a large 1.76 ns delay from COUT to CIN between blocks, which seems very high, since this is usually a highly delay-optimized part of the architecture.
To me that likely indicates something going wrong with the placer delay model, probably relating to direct-connects for the carry chain links (which aren't getting sampled correctly). A good thing to check would be the placer delay model echo files (generated with --echo_file on).
There is an alternate delay model which I've been working on which tries to ensure override delays are always correct.
Trying with --place_delay_model delta_override I get:
And the corresponding post-place timing report:
#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$4586.Q (FDRE_ZINI clocked by clk)
Endpoint : $auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI clocked by clk)
Path Type : setup
Point Incr Path
--------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge) 0.000 0.000
clock source latency 0.000 0.000
clk.inpad (.input) 0.000 0.000
$auto$simplemap.cc:442:simplemap_dffe$4586.C (FDRE_ZINI) 3.348 3.348
$auto$simplemap.cc:442:simplemap_dffe$4586.Q (FDRE_ZINI) [clock-to-output] 0.010 3.358
$abc$6342$auto$alumacc.cc:474:replace_alu$1286.S[0].in[0] (.names) 1.308 4.666
$abc$6342$auto$alumacc.cc:474:replace_alu$1286.S[0].out (.names) 0.068 4.734
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.S0 (CARRY4_VPR) 1.362 6.096
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.CO_CHAIN (CARRY4_VPR) 0.508 6.604
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.CIN (CARRY4_VPR) 0.000 6.604
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.O2 (CARRY4_VPR) 0.256 6.860
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].in[0] (.names) 1.395 8.254
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].out (.names) 0.068 8.322
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.S2 (CARRY4_VPR) 4.789 13.111
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR) 0.376 13.487
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CIN (CARRY4_VPR) 0.000 13.487
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 13.601
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CIN (CARRY4_VPR) 0.000 13.601
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 13.715
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CIN (CARRY4_VPR) 0.000 13.715
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 13.829
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CIN (CARRY4_VPR) 0.000 13.829
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 13.943
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CIN (CARRY4_VPR) 0.000 13.943
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 14.057
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CIN (CARRY4_VPR) 0.000 14.057
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 14.171
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CIN (CARRY4_VPR) 0.000 14.171
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR) 0.293 14.464
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.S (MUXF8) 1.531 15.995
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.O (MUXF8) 0.010 16.005
$abc$6342$new_n484_.in[1] (.names) 1.266 17.271
$abc$6342$new_n484_.out (.names) 0.068 17.339
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.in[2] (.names) 1.308 18.647
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.out (.names) 0.068 18.715
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.I0 (MUXF6) 0.000 18.715
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.O (MUXF6) 0.010 18.725
$auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI) 1.308 20.033
data arrival time 20.033
clock clk (rise edge) 10.000 10.000
clock source latency 0.000 10.000
clk.inpad (.input) 0.000 10.000
$auto$simplemap.cc:442:simplemap_dffe$3756.C (FDRE_ZINI) 2.388 12.388
clock uncertainty 0.000 12.388
cell setup time -0.010 12.378
data required time 12.378
--------------------------------------------------------------------------------------------------------------------------
data required time 12.378
data arrival time -20.033
--------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED) -7.655
Which is an improvement (from 34 ns to 20 ns arrival time), with all the previous 1.76 ns delays replaced with 0 ns (since the direct-connect delays are accurately considered).
Since the placer now has a more accurate delay model, this also reduces the difference between the post-place (20 ns) and initial (congestion-oblivious) routed delay (22.9 ns):
Confirming router algorithm: TIMING_DRIVEN.
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter Time pres BBs Heap Re-Rtd Re-Rtd Overused RR Nodes Wirelength CPD sTNS sWNS hTNS hWNS Est Succ
(sec) fac Updt push Nets Conns (ns) (ns) (ns) (ns) (ns) Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 4095: 17 timing startpoints were not constrained during timing analysis
Warning 4096: 17 timing endpoints were not constrained during timing analysis
1 1.5 0.0 0 1.1e+07 678 2238 1946 ( 0.126%) 134648 ( 1.2%) 22.975 -580.5 -12.975 0.000 0.000 N/A
2 60.9 0.5 255 2.1e+07 482 1663 416 ( 0.027%) 79216 ( 0.7%) 19.289 -253.7 -9.289 0.000 0.000 N/A
3 2.3 0.6 113 1.6e+07 279 1037 260 ( 0.017%) 79242 ( 0.7%) 24.144 -279.9 -14.144 0.000 0.000 N/A
4 2.5 0.8 72 1.5e+07 177 774 159 ( 0.010%) 78276 ( 0.7%) 20.891 -204.1 -10.891 0.000 0.000 N/A
5 1.5 1.1 46 9431957 96 514 31 ( 0.002%) 77312 ( 0.7%) 20.824 -242.7 -10.824 0.000 0.000 N/A
6 0.4 1.4 42 2445967 27 184 8 ( 0.001%) 76929 ( 0.7%) 18.906 -196.5 -8.906 0.000 0.000 N/A
7 0.1 1.9 37 745465 9 59 9 ( 0.001%) 77867 ( 0.7%) 20.815 -258.9 -10.815 0.000 0.000 N/A
8 0.2 2.4 28 938236 14 65 9 ( 0.001%) 77064 ( 0.7%) 19.330 -217.5 -9.330 0.000 0.000 N/A
9 0.2 3.1 27 876362 11 62 3 ( 0.000%) 77146 ( 0.7%) 19.137 -189.6 -9.137 0.000 0.000 N/A
10 0.1 4.1 29 841558 11 62 2 ( 0.000%) 77540 ( 0.7%) 18.326 -205.8 -8.326 0.000 0.000 12
11 0.1 5.3 26 173852 2 14 0 ( 0.000%) 77353 ( 0.7%) 17.376 -181.1 -7.376 0.000 0.000 12
Both paths show a large 1.76 ns delay from COUT to CIN between blocks, which seems very high, since this is usually a highly delay-optimized part of the architecture.
The critical path delay model appears to be accurate; I've compared the post-route timing reports to the truth model. In the case of post-placement timing, you are right that it appears to be adding a net delay where one doesn't exist. This might explain the difference I was seeing in https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/592#issuecomment-494981636
Compare that path with the post-route timing model:
#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$5609.Q (FDRE_ZINI clocked by clk)
Endpoint : $auto$simplemap.cc:442:simplemap_dffe$5605.D (FDRE_ZINI clocked by clk)
Path Type : setup
Point Incr Path
-------------------------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge) 0.000 0.000
clock source latency 0.000 0.000
clk.inpad (.input) 0.000 0.000
$auto$simplemap.cc:442:simplemap_dffe$5609.C (FDRE_ZINI) 0.378 0.378
$auto$simplemap.cc:442:simplemap_dffe$5609.Q (FDRE_ZINI) [clock-to-output] 0.010 0.388
$auto$alumacc.cc:474:replace_alu$1295.slice[1].carry4_full.S1 (CARRY4_VPR) 4.128 4.516
$auto$alumacc.cc:474:replace_alu$1295.slice[1].carry4_full.O3 (CARRY4_VPR) 0.618 5.134
$abc$6342$auto$alumacc.cc:474:replace_alu$1233.BB[7].in[0] (.names) 1.615 6.749
$abc$6342$auto$alumacc.cc:474:replace_alu$1233.BB[7].out (.names) 0.068 6.817
$auto$alumacc.cc:474:replace_alu$1251.slice[1].carry4_full.S3 (CARRY4_VPR) 2.284 9.102
$auto$alumacc.cc:474:replace_alu$1251.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR) 0.380 9.482
$auto$alumacc.cc:474:replace_alu$1251.slice[2].carry4_full.CIN (CARRY4_VPR) 0.000 9.482
$auto$alumacc.cc:474:replace_alu$1251.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 9.596
$auto$alumacc.cc:474:replace_alu$1251.slice[3].carry4_full.CIN (CARRY4_VPR) 0.000 9.596
$auto$alumacc.cc:474:replace_alu$1251.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 9.710
$auto$alumacc.cc:474:replace_alu$1251.slice[4].carry4_full.CIN (CARRY4_VPR) 0.000 9.710
$auto$alumacc.cc:474:replace_alu$1251.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 9.824
$auto$alumacc.cc:474:replace_alu$1251.slice[5].carry4_full.CIN (CARRY4_VPR) 0.000 9.824
$auto$alumacc.cc:474:replace_alu$1251.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 9.938
$auto$alumacc.cc:474:replace_alu$1251.slice[6].carry4_full.CIN (CARRY4_VPR) 0.000 9.938
$auto$alumacc.cc:474:replace_alu$1251.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 10.052
$auto$alumacc.cc:474:replace_alu$1251.slice[7].carry4_full.CIN (CARRY4_VPR) 0.000 10.052
$auto$alumacc.cc:474:replace_alu$1251.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 10.166
$auto$alumacc.cc:474:replace_alu$1251.slice[8].carry4_part.CIN (CARRY4_VPR) 0.000 10.166
$auto$alumacc.cc:474:replace_alu$1251.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR) 0.293 10.459
$abc$6342$auto$blifparse.cc:492:parse_blif$6625.fpga_mux_0.S (MUXF6) 1.587 12.046
$abc$6342$auto$blifparse.cc:492:parse_blif$6625.fpga_mux_0.O (MUXF6) 0.010 12.056
$abc$6342$techmap$techmap\output_logic.$procmux$581.$ternary$?techmap.v?:445$4523_Y[1].in[3] (.names) 1.682 13.738
$abc$6342$techmap$techmap\output_logic.$procmux$581.$ternary$?techmap.v?:445$4523_Y[1].out (.names) 0.068 13.806
$auto$simplemap.cc:442:simplemap_dffe$5605.D (FDRE_ZINI) 2.163 15.969
data arrival time 15.969
Maybe the placer is overestimating the carry chain delay?
Since the placer now has a more accurate delay model, this also reduces the difference between the post-place (20 ns) and initial (congestion-oblivious) routed delay (22.9 ns):
Awesome. Why does the placer default to an estimate that doesn't account for carry chains, or other direct paths?
Awesome. Why does the placer default to an estimate that doesn't account for carry chains, or other direct paths?
For most of the architectures we usually run it doesn't make much of a difference; the standard delay model usually captures the direct-connect delay impacts correctly.
Also, modelling the direct-connect delays correctly comes with a run-time hit during placement (~+20%), which is why we've left it off by default so far.
Also, modelling the direct-connect delays correctly comes with a run-time hit during placement (~+20%), which is why we've left it off by default so far.
Fair enough. In this case placement appears to be a small fraction of the total pack/place/route time, so a modest increase in a small fraction is still small.
the standard delay model usually captures the direct-connect delay impacts correctly.
Is the idea that dy = 1, dx = 0 should be 0 in this case? That would underestimate delays on non-directly connected paths.
Is the idea that dy = 1, dx = 0 should be 0 in this case? That would underestimate delays on non-directly connected paths.
Yep, which motivated creating delta_override, but I found it had no real benefit on the architectures I evaluated it on.
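To make the distinction concrete, here is a toy sketch (not VPR's actual data structures) of a plain delta model next to a delta_override-style model with exact direct-connect entries:

```python
# Toy sketch (not VPR's actual internals) contrasting the plain "delta"
# placer delay model with a "delta_override" style model.

# Plain delta model: one profiled delay per (|dx|, |dy|) displacement.
# A carry COUT->CIN hop is dx=0, dy=1, so it is charged the generic
# one-tile delay (~1.76 ns in the reports above) even though the real
# direct connect is essentially free.
delta_delay = {(0, 0): 0.0, (0, 1): 1.756e-9, (1, 0): 1.5e-9}  # seconds

# Override table: exact delays for specific direct-connect pin pairs.
overrides = {("COUT", "CIN", 0, 1): 0.0}

def placer_delay(src_pin, sink_pin, dx, dy, use_override=True):
    key = (src_pin, sink_pin, dx, dy)
    if use_override and key in overrides:
        return overrides[key]
    return delta_delay[(abs(dx), abs(dy))]

# Generic model overcharges the carry hop; the override model does not.
assert placer_delay("COUT", "CIN", 0, 1, use_override=False) == 1.756e-9
assert placer_delay("COUT", "CIN", 0, 1) == 0.0
```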
Another thing to try is changing the lookahead used by the router. The map lookahead tries to profile the routing architecture in a more general way than the default (whether it captures the details of your architecture correctly I'm not sure).
With --router_lookahead map, delay is further reduced:
## Computing router lookahead map
## Computing router lookahead map took 1.34 seconds (max_rss 2831.4 MiB, delta_rss +0.0 MiB)
Confirming router algorithm: TIMING_DRIVEN.
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter Time pres BBs Heap Re-Rtd Re-Rtd Overused RR Nodes Wirelength CPD sTNS sWNS hTNS hWNS Est Succ
(sec) fac Updt push Nets Conns (ns) (ns) (ns) (ns) (ns) Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 4095: 17 timing startpoints were not constrained during timing analysis
Warning 4096: 17 timing endpoints were not constrained during timing analysis
1 94.7 0.0 0 4.3e+08 678 2238 1341 ( 0.087%) 104155 ( 1.0%) 18.210 -271.4 -8.210 0.000 0.000 N/A
2 64.0 0.5 203 2.8e+08 415 1433 417 ( 0.027%) 84404 ( 0.8%) 16.711 -142.0 -6.711 0.000 0.000 N/A
3 45.7 0.6 172 2.0e+08 279 1066 248 ( 0.016%) 83450 ( 0.8%) 15.644 -115.9 -5.644 0.000 0.000 N/A
4 30.8 0.8 107 1.4e+08 193 811 172 ( 0.011%) 84394 ( 0.8%) 16.756 -154.1 -6.756 0.000 0.000 N/A
5 27.6 1.1 85 1.2e+08 124 660 50 ( 0.003%) 84970 ( 0.8%) 15.232 -110.7 -5.232 0.000 0.000 N/A
6 12.5 1.4 74 5.6e+07 59 308 23 ( 0.001%) 85181 ( 0.8%) 15.856 -119.9 -5.856 0.000 0.000 N/A
7 3.7 1.9 66 1.7e+07 33 144 15 ( 0.001%) 85313 ( 0.8%) 16.489 -125.1 -6.489 0.000 0.000 N/A
8 2.2 2.4 48 1.0e+07 14 76 2 ( 0.000%) 85256 ( 0.8%) 16.088 -122.5 -6.088 0.000 0.000 N/A
9 1.2 3.1 43 5856273 8 39 0 ( 0.000%) 85670 ( 0.8%) 16.297 -128.3 -6.297 0.000 0.000 N/A
Restoring best routing
Critical path: 16.2967 ns
Successfully routed after 9 routing iterations.
And the routed timing report:
#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$3852.Q (FDRE_ZINI clocked by clk)
Endpoint : $auto$simplemap.cc:420:simplemap_dff$2676.D (FDRE_ZINI clocked by clk)
Path Type : setup
Point Incr Path
--------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge) 0.000 0.000
clock source latency 0.000 0.000
clk.inpad (.input) 0.000 0.000
$auto$simplemap.cc:442:simplemap_dffe$3852.C (FDRE_ZINI) 0.378 0.378
$auto$simplemap.cc:442:simplemap_dffe$3852.Q (FDRE_ZINI) [clock-to-output] 0.010 0.388
$abc$6342$auto$alumacc.cc:474:replace_alu$1292.S[0].in[0] (.names) 0.796 1.184
$abc$6342$auto$alumacc.cc:474:replace_alu$1292.S[0].out (.names) 0.068 1.252
$auto$alumacc.cc:474:replace_alu$1292.slice[0].carry4_1st_full.S0 (CARRY4_VPR) 0.958 2.210
$auto$alumacc.cc:474:replace_alu$1292.slice[0].carry4_1st_full.CO_CHAIN (CARRY4_VPR) 0.508 2.718
$auto$alumacc.cc:474:replace_alu$1292.slice[1].carry4_full.CIN (CARRY4_VPR) 0.000 2.718
$auto$alumacc.cc:474:replace_alu$1292.slice[1].carry4_full.O2 (CARRY4_VPR) 0.256 2.974
$abc$6342$auto$alumacc.cc:474:replace_alu$1260.BB[6].in[0] (.names) 1.793 4.766
$abc$6342$auto$alumacc.cc:474:replace_alu$1260.BB[6].out (.names) 0.068 4.834
$auto$alumacc.cc:474:replace_alu$1260.slice[1].carry4_full.S2 (CARRY4_VPR) 2.914 7.748
$auto$alumacc.cc:474:replace_alu$1260.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR) 0.376 8.124
$auto$alumacc.cc:474:replace_alu$1260.slice[2].carry4_full.CIN (CARRY4_VPR) 0.000 8.124
$auto$alumacc.cc:474:replace_alu$1260.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 8.238
$auto$alumacc.cc:474:replace_alu$1260.slice[3].carry4_full.CIN (CARRY4_VPR) 0.000 8.238
$auto$alumacc.cc:474:replace_alu$1260.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 8.352
$auto$alumacc.cc:474:replace_alu$1260.slice[4].carry4_full.CIN (CARRY4_VPR) 0.000 8.352
$auto$alumacc.cc:474:replace_alu$1260.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 8.466
$auto$alumacc.cc:474:replace_alu$1260.slice[5].carry4_full.CIN (CARRY4_VPR) 0.000 8.466
$auto$alumacc.cc:474:replace_alu$1260.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 8.580
$auto$alumacc.cc:474:replace_alu$1260.slice[6].carry4_full.CIN (CARRY4_VPR) 0.000 8.580
$auto$alumacc.cc:474:replace_alu$1260.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 8.694
$auto$alumacc.cc:474:replace_alu$1260.slice[7].carry4_full.CIN (CARRY4_VPR) 0.000 8.694
$auto$alumacc.cc:474:replace_alu$1260.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR) 0.114 8.808
$auto$alumacc.cc:474:replace_alu$1260.slice[8].carry4_part.CIN (CARRY4_VPR) 0.000 8.808
$auto$alumacc.cc:474:replace_alu$1260.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR) 0.293 9.101
$abc$6342$auto$blifparse.cc:492:parse_blif$6359.fpga_mux_2.S (MUXF8) 0.831 9.932
$abc$6342$auto$blifparse.cc:492:parse_blif$6359.fpga_mux_2.O (MUXF8) 0.010 9.942
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.T0.in[3] (.names) 2.434 12.376
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.T0.out (.names) 0.068 12.444
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.fpga_mux_0.I0 (MUXF6) 0.000 12.444
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.fpga_mux_0.O (MUXF6) 0.010 12.454
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.T0.in[4] (.names) 3.567 16.021
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.T0.out (.names) 0.068 16.089
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.fpga_mux_0.I0 (MUXF6) 0.000 16.089
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.fpga_mux_0.O (MUXF6) 0.010 16.099
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_mux_0.I1 (MUXF7) 0.000 16.099
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_mux_0.O (MUXF7) 0.010 16.109
$auto$simplemap.cc:420:simplemap_dff$2676.D (FDRE_ZINI) 0.556 16.665
data arrival time 16.665
clock clk (rise edge) 10.000 10.000
clock source latency 0.000 10.000
clk.inpad (.input) 0.000 10.000
$auto$simplemap.cc:420:simplemap_dff$2676.C (FDRE_ZINI) 0.378 10.378
clock uncertainty 0.000 10.378
cell setup time -0.010 10.368
data required time 10.368
--------------------------------------------------------------------------------------------------------------------------
data required time 10.368
data arrival time -16.665
--------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED) -6.297
Now the fact that the router is able to find a substantially faster path than the placer likely implies that the placer delay model is still off somewhat. If that is the case then the placer is likely optimizing the wrong parts of the circuit.
One idea is to try to compare paths in the placer and router reports to identify where the placer delay model is inaccurate. The option --timing_report_detail aggregated produces timing reports with more detailed delay breakdowns, which may be helpful.
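One way to automate that comparison is to parse the Incr column of both reports and flag points whose incremental delay diverges. A simplified sketch (the parsing assumes the "Point / Incr / Path" layout shown in the reports above):

```python
import re

# Toy diff of two VPR timing reports (post-place estimate vs routed) to
# flag path points whose incremental delay diverges. Parsing assumes the
# "Point  Incr  Path" layout shown in the reports above, and is simplified.
LINE_RE = re.compile(r"^(?P<point>.+?)\s+(?P<incr>-?\d+\.\d+)\s+-?\d+\.\d+$")

def parse_incrs(report):
    """Map each path point to its Incr (ns) column."""
    incrs = {}
    for line in report.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            incrs[m.group("point")] = float(m.group("incr"))
    return incrs

def divergent_points(placed_report, routed_report, tol_ns=0.5):
    placed, routed = parse_incrs(placed_report), parse_incrs(routed_report)
    return {p: (placed[p], routed[p])
            for p in placed.keys() & routed.keys()
            if abs(placed[p] - routed[p]) > tol_ns}

# Tiny made-up excerpts: the carry CIN hop is overcharged by the placer.
placed = "x.CIN (CARRY4_VPR) 1.756 8.130\nx.O2 (CARRY4_VPR) 0.256 8.386"
routed = "x.CIN (CARRY4_VPR) 0.000 6.604\nx.O2 (CARRY4_VPR) 0.256 6.860"
diffs = divergent_points(placed, routed)
assert list(diffs) == ["x.CIN (CARRY4_VPR)"]
```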
Now the fact that the router is able to find a substantially faster path than the placer likely implies that the placer delay model is still off somewhat. If that is the case then the placer is likely optimizing the wrong parts of the circuit.
Ya, I figured that. So you would expect that the placement delay estimate should be less than the actual routed delay? This makes sense to me, I just wasn't confident in the reasoning.
Now the fact that the router is able to find a substantially faster path than the placer likely implies that the placer delay model is still off somewhat. If that is the case then the placer is likely optimizing the wrong parts of the circuit.
One idea I had to avoid this "placer delay estimate" problem was to have the placer directly sample the routing graph. This was very slow, but also didn't lower the estimates. Is it possible that there is something different in the placement route delay function (calculate_delay) that causes the placer to have worse routes than the router itself?
The VPR router uses a lookahead to estimate the cost to reach a destination through the RR graph (A* search). If the lookahead doesn't do a good job the router will produce poorer quality results.
How aggressively the router uses the lookahead is controlled by --astar_fac:
The VPR default is astar_fac = 1.2, so lowering it can sometimes improve results at the cost of increased run-time.
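The effect is easy to demonstrate on a toy graph with a deliberately bad lookahead (the graph, delays, and lookahead values below are made up for illustration): node priority is backward path cost plus astar_fac times the expected remaining cost, so astar_fac = 0 degenerates to plain Dijkstra:

```python
import heapq

# Toy graph illustrating astar_fac: node priority is
# backward_path_cost + astar_fac * expected_cost_to_target.
graph = {                      # node -> [(neighbor, delay_ns)]
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("t", 5.0)],
    "b": [("t", 1.0)],
    "t": [],
}
bad_lookahead = {"s": 0.0, "a": 0.0, "b": 10.0, "t": 0.0}  # wildly overestimates b

def route(astar_fac):
    """Return (delay to 't', nodes popped). astar_fac=0 is plain Dijkstra:
    immune to the bad lookahead, but it expands more nodes."""
    heap = [(astar_fac * bad_lookahead["s"], 0.0, "s")]
    visited, pops = set(), 0
    while heap:
        _, cost, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        pops += 1
        if node == "t":
            return cost, pops
        for nxt, d in graph[node]:
            heapq.heappush(heap, (cost + d + astar_fac * bad_lookahead[nxt],
                                  cost + d, nxt))
    raise RuntimeError("unroutable")

d_exact, pops_exact = route(0.0)   # finds the true 5.0 ns path via b
d_astar, pops_astar = route(1.2)   # misled to the 6.0 ns path via a
assert d_exact <= d_astar and pops_exact >= pops_astar
```

This mirrors the trade-off described above: a larger astar_fac trusts the lookahead more, which prunes the search (fewer pops) but can produce worse paths when the lookahead is wrong.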
is something different in the placement route delay function (calculate_delay) that causes the placer to have worse routes than the router itself?
Potentially. The sub-optimal nature of the path finding can interact with congestion and result in the router exploring a different part of the search space and finding a better result than a single connection would otherwise. However, this usually indicates the lookahead is doing a poor job.
Thanks for that.
In an interesting turn of events, I found that the version of the rrgraph without timing actually returned a better (!!!) final critical path delay (as measured by the truth model). In order to reduce variables, I copied the .net and .place files from the better result and attempted to re-route that design with the timing values present in the rrgraph.
True critical path delay on design routed without correct timing present in rrgraph: -4.206 ns
True critical path delay on same clustering and placement, but with timing present in rrgraph: -14.988 ns
I'll explore with astar_fac = 0 to see if I can close that gap. The fact that there is such variability on the same clustering and placement is very interesting.
It is very strange that using an rr-graph without timing improved results. I suspect this means there is a problem in the timing annotated on the rr-graph.
Since this is a very small design, you can turn the quality parameters way up to try to improve result quality, which may help you focus on root issues (poor delay estimation in the placer and/or router for example). So I'd set a high inner_num (e.g. 20) and a low astar_fac (e.g. 0.5 or even 0). They will cost cpu time, but at this point you don't care.
It is very strange that using an rr-graph without timing improved results. I suspect this means there is a problem in the timing annotated on the rr-graph.
I agree. I'm working on a way to use VPR's timing analysis (with timing annotations) on the route file generated without timing. I expect that it will reveal an incorrect timing calculation, which is pushing the placer/router in the wrong direction.
I've confirmed that simply zeroing all the timing annotations in the rrgraph enables the router to generate a route with a true critical path setup slack of -2.322 ns. What is interesting is that the VPR analysis using the annotated rrgraph returns a completely reasonable critical path delay of 9.15997 ns. FYI, the fact that this design closes using the VPR timing annotations is not surprising, as the delay from Cinternal is not included in VPR yet.
I'm not sure how to explain this. From my simple experiment, the timing annotations are correct, and yet they cause the router to fail to converge to a valid solution.
I'll upload a new tarball with:
At this point I'm very confused.
One thing I immediately notice with the zeroed-out routing graph is that routing takes significantly longer. Is it possible that having zeroed-out timing causes the router to work harder to find good paths?
How close is the VPR timing analysis to Vivado's estimate of this circuit, or a hand estimate? That would provide a coarse check of the timing values.
I've never seen deleting timing info improve the router's critical path performance (at least not consistently; on a small circuit there's always a chance you're getting lucky with a non-timing driven algorithm). So I think something is broken.
I'd try running with astar_fac = 0. Maybe the router timing model is OK, but the timing part of the lookahead is totally wrong. astar_fac = 0 essentially makes the lookahead irrelevant (but slows down the router) so running that experiment helps eliminate or implicate the lookahead in the problem.
How close is the VPR timing analysis to Vivado's estimate of this circuit, or a hand estimate? That would provide a coarse check of the timing values.
The estimate is good, if a little low. The delays are lower in VPR's model because Cinternal support isn't there yet; as a result, fanout RC delays are not included.
I'd try running with astar_fac = 0. Maybe the router timing model is OK, but the timing part of the lookahead is totally wrong. astar_fac = 0 essentially makes the lookahead irrelevant (but slows down the router) so running that experiment helps eliminate or implicate the lookahead in the problem.
I'll try that next. I tried a low (like 0.1) astar_fac, but it didn't help. astar_fac = 0 seemed really slow, but maybe that is required until the problem with the lookahead can be diagnosed.
I'd try running with astar_fac = 0. Maybe the router timing model is OK, but the timing part of the lookahead is totally wrong. astar_fac = 0 essentially makes the lookahead irrelevant (but slows down the router) so running that experiment helps eliminate or implicate the lookahead in the problem.
Confirmed: --astar_fac 0 results in timing closure, at the cost of a very significant increase in runtime.
@kmurray @vaughnbetz How/where is the router lookahead computed? The runtime performance with --astar_fac 0 is pretty bad, so identifying what is happening here is something I'll do tomorrow. Any insights into the router lookahead's operation would help.
@kmurray @vaughnbetz How/where is the router look ahead computed?
It looks like both router lookahead methods are based on segment definitions, which we originally elided, as it wasn't clear how segments were being used. I'll need to think about how to define segments that enable a reasonable lookahead.
Is there a non-segment based approach that might make sense here?
Is there a non-segment based approach that might make sense here?
I have some ideas about non-segment based lookaheads, but that is more of a research project than something that can be addressed quickly.
Probably your best approach is to fill in the segment information with reasonable values to get the current lookaheads to work better.
Probably your best approach is to fill in the segment information with reasonable values to get the current lookaheads to work better.
Given that we don't really generate nodes from the segments, what are the important aspects to capture with the segment definitions? For example, are the sb/cb definitions important?
are the sb/cb definitions important?
I believe those are only relevant for guiding RR graph construction, which shouldn't matter if you're loading an externally generated RR graph.
what is the important aspects to capture with the segment definitions?
You want to capture the delay relevant characteristics of the different wire types. Usually you'll have a different segment definition for each unique wire length, since they'll have different Rmetal/Cmetal. That is the key information needed for the lookaheads.
Each CHANX/CHANY RR node should then reference the segment type it represents (i.e. segments are a flyweight of common RR node characteristics).
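The flyweight idea above can be sketched as follows (struct and field names are hypothetical, not VPR's actual data structures): per-wire-type electrical data lives once in a segment table, and each CHANX/CHANY node just stores an index into it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: segments as a flyweight for shared wire characteristics.
struct SegmentInfo {
    int length;     // wire length in tiles
    double Rmetal;  // resistance per tile
    double Cmetal;  // capacitance per tile
};

struct RRNode {
    std::size_t segment_id;  // which segment type this wire instance represents
};

// Crude distributed-RC (Elmore-style) delay estimate for one wire,
// derived purely from its segment's shared characteristics.
double segment_delay(const RRNode& node,
                     const std::vector<SegmentInfo>& segments) {
    const SegmentInfo& s = segments[node.segment_id];
    double R = s.Rmetal * s.length;
    double C = s.Cmetal * s.length;
    return 0.5 * R * C;
}
```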
The imported routing fabric is a mix of horizontal and vertical wires, which should be describable with segments. The thing I don't know how to describe is the L-shaped wires. For example, SE6BEG3 is a southeast (SE), length 6 (dx = 3, dy = 3) wire. I don't know how to expose that wire to the lookahead.
In addition, near the edge of the fabric, the regularity of the fabric breaks down due to the boundary. How important is the lookahead near the boundary of the fabric (which may be in the interior when it reaches hard blocks)?
Modeling the boundary well isn't that important. The lookahead is a "best guess" of what is going to happen in the future in terms of what wires will be used, and the resulting delay and wiring cost. So if it models most cases well it works pretty well, and having some things it doesn't model that well isn't catastrophic, although it can hurt QoR some if they're relatively common mispredictions.
You can also compensate for a less than ideal lookahead by lowering astar_fac, but of course that comes at a cost of increased cpu time.
So I'd focus on getting the most common wires and routing paths correct or nearly correct in the lookahead.
For length 6 wires (SE6BEG3), are those only for direct connect (i.e. you drive this wire only from a logic block or other block, never from another routing wire)? Or can other routing wires also drive them? If they can only be reached from a block output modeling them well in the lookahead won't matter much or at all, as they'll be explored at the start of a net's routing no matter what, and won't be explored again if the routing has to go a long distance.
If they can be reached by other routing wires, then modeling them in the lookahead is more important. The best way would probably start with modeling them as two length 3 wires with a non-configurable (short) edge between them.
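The decomposition suggested above could look roughly like this (all names hypothetical; this is a sketch of the idea, not VPR's graph-construction code): the L-shaped wire becomes two straight length-3 nodes joined by a non-configurable edge, so each half maps onto an ordinary straight segment definition.

```cpp
// Hypothetical sketch: splitting an L-shaped wire (e.g. SE6BEG3, dx=3, dy=3)
// into two straight length-3 legs joined by an electrical short.
struct Node { int dx, dy, segment_id; };
struct Edge { int src, dst; bool configurable; };

struct LWireModel {
    Node horizontal;  // first length-3 leg (CHANX-like)
    Node vertical;    // second length-3 leg (CHANY-like)
    Edge short_edge;  // non-configurable edge: no switch, no extra delay
};

LWireModel split_l_wire(int length3_segment_id) {
    return LWireModel{
        {3, 0, length3_segment_id},
        {0, 3, length3_segment_id},
        {0, 1, /*configurable=*/false},
    };
}
```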
You should also use the more sophisticated, "map", lookahead. @kmurray, I don't see it listed in the command line option documentation. Has it been brought out to the command line?
If they can be reached by other routing wires, then modeling them in the lookahead is more important. The best way would probably start with modeling them as two length 3 wires with a non-configurable (short) edge between them.
The L-shaped wires are modeled correctly as nodes and edges, exactly as you said. However the lookahead doesn't examine the real graph. It uses the segment definitions.
The map lookahead is exposed as an option, but it seems to use segments rather than the real graph too.
For length 6 wires (SE6BEG3), are those only for direct connect (i.e. you drive this wire only from a logic block or other block, never from another routing wire)? Or can other routing wires also drive them?
These L-shaped wires are general routing fabric wires, so having them in the look ahead is important.
Ok, agreed. I think the map lookahead will do pretty well with them, as it profiles / samples paths through the routing graph to build its tables / predictions. It uses the segment definitions only to decide what table to look up the prediction in -- the prediction is a function of how far you have to go (dx,dy) and what type of wire you are currently on (segment type/id). So it doesn't encode any significant assumptions about how the segments interact; it just needs to know the grouping of which wires are which kind of segment so it can look up in the proper table.
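The lookup described above can be sketched as a simple table indexed by segment type and remaining distance (the table layout and names here are hypothetical, not the actual VPR cost map): segments only select which table to consult, and the profiled data carries the rest.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical sketch of a map-style lookahead table:
// delay[segment_id][|dx|][|dy|], filled offline by sampling/profiling
// paths through the real routing graph.
struct CostMap {
    std::vector<std::vector<std::vector<double>>> delay;

    // Prediction is purely a function of the current wire's segment type
    // and the remaining (dx, dy) distance to the target.
    double expected_delay(int segment_id, int dx, int dy) const {
        return delay[segment_id][std::abs(dx)][std::abs(dy)];
    }
};
```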
So it doesn't encode any significant assumptions about how the segments interact; it just needs to know the grouping of which wires are which kind of segment so it can look up in the proper table.
This sounds promising. So effectively I need to create a segment for each variant of wire. So (dx = 12, dy = 0) is one segment, (dx = 3, dy = 3) is one segment, (dx = 3, dy = -3) is one segment, etc. That should be pretty straightforward. I'll read up on the paper linked in the comment and see if I can grok what is going on.
You can read about the map lookahead here https://ieeexplore.ieee.org/document/7577326 (see enhanced router lookahead section) or here https://tspace.library.utoronto.ca/bitstream/1807/75854/3/Petelin_Oleg_201611_MAS_thesis.pdf (page 33-35).
Yes, the commented out lines should do essentially the same thing as the .T_linear etc. lines. Kevin found there was a small difference, which cost CPU time as a lookahead underpredict slows the router down. There is some CAD noise / runtime vs. quality trade-off in exactly how you set a lookahead though, so frankly if we remeasured and/or slightly retuned astar_fac there's a good chance we'd be happy with the commented out code's performance. So if you make that work there's no fundamental reason that can't become the new overall code path after some testing and tuning.
Vaughn
On Fri, May 24, 2019 at 1:00 PM litghost notifications@github.com wrote:
So it does look like the map method is still using the segment data rather than the node data:
But that is marked with a TODO. The currently commented-out logic doesn't correctly handle shorted nodes, but that is a straightforward fix.
Should also note that the placer uses algorithmic profiling of the routing fabric to create its estimates of delay vs. distance travelled. If the rr-graph and delays are OK those look ups should be OK, but when the router lookahead is in shape it is also worth checking the placer delay model in and around https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/0aab7670747ec21e0fcecbddd42c3588d82089c4/vpr/src/place/timing_place_lookup.cpp#L268
@vaughnbetz @kmurray Is there a way to output the A* true delay vs heuristic delay? From [my limited understanding of A*](https://en.wikipedia.org/wiki/A*_search_algorithm#Properties), if the heuristic delay was underestimating the delay, and astar_fac < 1, then it should've returned the optimal path.
Maybe there should be a diagnostic mode that checks that the A* heuristic is admissible?
Is there a way to output the A* true delay vs heuristic delay?
Nothing built in, but you could code it up yourself.
To get the optimal delay you'd run the router with astar_fac = 0.
The lookaheads are wrapped in a class; if you specify the criticality to be 1.0 (params.criticality = 1.) the returned cost should be the lookahead's delay estimate.
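Combining the two suggestions above gives a simple per-connection admissibility check (a sketch, assuming you've collected both numbers yourself; the function and tolerance are hypothetical): the astar_fac = 0 run yields the true congestion-free delay, the criticality = 1.0 lookahead query yields the estimate, and comparing them flags connections where the heuristic overpredicts.

```cpp
// Hypothetical diagnostic: returns true when the lookahead's estimate
// exceeds the true (astar_fac = 0) delay, i.e. the heuristic is not
// admissible for this connection. rel_tol absorbs floating-point noise.
bool lookahead_overpredicts(double lookahead_estimate, double true_delay,
                            double rel_tol = 1e-9) {
    return lookahead_estimate > true_delay * (1.0 + rel_tol);
}
```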
Maybe there should be diagnostic mode that checks that the A* heurstic is admissible?
The lookaheads/heuristics VPR uses are actually not strictly admissible (cf. earlier discussions). This isn't a big issue, since there are various factors (e.g. congestion, multi-fanout nets) that would prevent us from getting the true optimal path for a particular connection anyway.
(Of course having a good lookahead which guides the router in the correct direction is important, it just doesn't have to be strictly admissible.)
Expected Behaviour
VPR should satisfy the timing constraint.
Current Behaviour
Timing constraints are violated.
Possible Solution
It is unclear at this time why VPR fails to close timing on the attached design. I can imagine many possibilities, but before exploring those possibilities, I'd like to get other eyes on the results to see if there are some obvious problems (like a command line flag, etc).
Steps to Reproduce
Context
We are bringing up a realistic VPR arch that matches the 7-series architecture. The attached arch and routing graph is an approximation of 1/5 of the XC7A35T-1CPG236C fabric, and a relatively small initial design.
Your Environment