verilog-to-routing / vtr-verilog-to-routing

Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org
Other
1.01k stars 388 forks source link

VPR not closing timing on small (150 clusters/500 LUT) design #592

Closed litghost closed 4 years ago

litghost commented 5 years ago

Expected Behaviour

VPR should satisfy timing constraint.

Current Behaviour

Timing constraints are violated.

-------------------------------------------------------------------------------------------------------------------------------------------
data required time                                                                                                                   12.424
data arrival time                                                                                                                   -21.948
-------------------------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                                                                                                                     -9.524

Possible Solution

It is unclear at this time why VPR fails to close timing on the attached design. I can imagine many possibilities, but before exploring those possibilities, I'd like to get other eyes on the results to see if there are some obvious problems (like a command line flag, etc).

Steps to Reproduce

  1. Download arch, routing graph, and eblif
  2. Build VPR at https://github.com/SymbiFlow/vtr-verilog-to-routing/pull/57/commits/352ef1f57010a06e36b38455c5bf556d0f9f1fe8
  3. Run:
vpr \
 arch.unique_pack.xml \
 top.eblif \
 --device xc7a50t-basys3-test \
 --read_rr_graph rr_graph_xc7a50t-basys3_test.rr_graph.real.xml \
 --min_route_chan_width_hint 100 \
 --max_router_iterations 500 \
 --routing_failure_predictor off \
 --router_high_fanout_threshold -1 \
 --constant_net_method route \
 --route_chan_width 500 \
 --clock_modeling route \
 --allow_unrelated_clustering on \
 --clustering_pin_feasibility_filter off \
 --disable_check_route on \
 --fix_pins top_io.place \
 --sdc_file basys3.sdc --pack --place --route

Context

We are bringing up a realistic VPR arch that matches the 7-series architecture. The attached arch and routing graph is an approximation of the 1/5 of the XC7A35T-1CPG236C fabric, and relatively small initial design.

Your Environment

litghost commented 5 years ago

Things I've checked so far:

Example 1:

$auto$alumacc.cc:474:replace_alu$1295.slice[0].carry4_1st_full.O2 (CARRY4_VPR)                                              0.558     4.982
$abc$6342$auto$alumacc.cc:474:replace_alu$1233.BB[2].in[0] (.names)                                                         1.748     6.730

Example 2:

$abc$6342$auto$blifparse.cc:492:parse_blif$6431.T0.in[0] (.names)                                                           4.067    15.048

A net delay of 1.748 ns and 4.067 is in the right ballpark as the inter-cluster delay approximation.

litghost commented 5 years ago

I also checked that the placer slack approximation is close to the router approximation.

Final placer slack:

------- ------- ---------- ---------- ---------- ---------- ------- ---------- -------- ------- ------- ------- ------ --------- ------
      T    Cost Av BB Cost Av TD Cost Av Tot Del P to P Del     CPD       sTNS     sWNS Ac Rate Std Dev R limit    Exp Tot Moves  Alpha
------- ------- ---------- ---------- ---------- ---------- ------- ---------- -------- ------- ------- ------- ------ --------- ------
  0.000  0.9998    25.9159 1.6597e-07 4.701e-06  2.0955e-09  32.573  -1.55e+03  -22.573  0.0029  0.0001  1.0000  8.000   153762  0.000

Initial router slack:

---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
   1    1.4     0.0    0 8837045     678    2238    2047 ( 0.133%)  125716 ( 1.2%)   22.053     -437.0    -12.053      0.000      0.000      N/A

The placer was overestimated the delay significantly, so maybe the placer timing approximation is poor behavior for the placer? I'll need to dig in to determine how to debug if the quality of the placer approximation explains the result.

litghost commented 5 years ago

@vaughnbetz I've noticed that the placer is fairly sensitive to it's parameters. For example:

--timing_tradeoff 0.5:

Final critical path: 16.9727 ns, Fmax: 58.9183 MHz
Setup Worst Negative Slack (sWNS): -6.97265 ns
Setup Total Negative Slack (sTNS): -211.903 ns

--timing_tradeoff 0.3:

Final critical path: 14.5773 ns, Fmax: 68.5998 MHz
Setup Worst Negative Slack (sWNS): -4.5773 ns
Setup Total Negative Slack (sTNS): -162.235 ns

--timing_tradeoff 0.2:

Final critical path: 20.8454 ns, Fmax: 47.9721 MHz
Setup Worst Negative Slack (sWNS): -10.8454 ns
Setup Total Negative Slack (sTNS): -263.465 ns

Is it expected that the placer be so sensitive to it's parameters? Is this an indication of a degenerate grid/rrgraph? Or is it expected given the placer design?

litghost commented 5 years ago

Another thought, the placer is fairly fast, but appears to have a large impact on the final QoR w.r.t. critical path. Given that the placer is fast, does it make sense for the placer to have an outer loop on critical parameters? For example, optimize timing_tradeoff for some number of iterators. In the example above, the default (.5) doesn't appear to be the "right" choice, something a little lower leads to a better result.

kmurray commented 5 years ago

I've noticed that the placer is fairly sensitive to it's parameters.

Yes this is the case. A small tweak in parameters can cause the placer to take a different randomized walk through the solution space. One thing you could try is performing a sweep of several seed values (which change only the random number generator sequence used by the placer). That will give you an idea for what the variability is like on this design.

kmurray commented 5 years ago

This is what I see as the critical path after placement: Selection_186

And the corresponding path report (generated with --post_place_timing_report):

#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$4587.Q (FDRE_ZINI clocked by clk)
Endpoint  : $auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI clocked by clk)
Path Type : setup

Point                                                                                                       Incr      Path
--------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge)                                                                                      0.000     0.000
clock source latency                                                                                       0.000     0.000
clk.inpad (.input)                                                                                         0.000     0.000
$auto$simplemap.cc:442:simplemap_dffe$4587.C (FDRE_ZINI)                                                   4.119     4.119
$auto$simplemap.cc:442:simplemap_dffe$4587.Q (FDRE_ZINI) [clock-to-output]                                 0.010     4.129
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.S1 (CARRY4_VPR)                             1.716     5.845
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.CO_CHAIN (CARRY4_VPR)                       0.528     6.373
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.CIN (CARRY4_VPR)                                1.756     8.130
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.O2 (CARRY4_VPR)                                 0.256     8.386
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].in[0] (.names)                                        1.702    10.088
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].out (.names)                                          0.068    10.156
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.S2 (CARRY4_VPR)                                 4.789    14.945
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.376    15.321
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CIN (CARRY4_VPR)                                1.756    17.077
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    17.191
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CIN (CARRY4_VPR)                                1.756    18.947
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    19.061
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CIN (CARRY4_VPR)                                1.756    20.818
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    20.932
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CIN (CARRY4_VPR)                                1.756    22.688
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    22.802
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CIN (CARRY4_VPR)                                1.756    24.558
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    24.672
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CIN (CARRY4_VPR)                                1.756    26.429
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    26.543
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CIN (CARRY4_VPR)                                1.756    28.299
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR)                         0.293    28.592
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.S (MUXF8)                                       1.407    29.999
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.O (MUXF8)                                       0.010    30.009
$abc$6342$new_n484_.in[1] (.names)                                                                         1.295    31.304
$abc$6342$new_n484_.out (.names)                                                                           0.068    31.372
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.in[2] (.names)                                          1.308    32.680
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.out (.names)                                            0.068    32.748
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.I0 (MUXF6)                                      0.000    32.748
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.O (MUXF6)                                       0.010    32.758
$auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI)                                                  1.308    34.066
data arrival time                                                                                                   34.066

clock clk (rise edge)                                                                                     10.000    10.000
clock source latency                                                                                       0.000    10.000
clk.inpad (.input)                                                                                         0.000    10.000
$auto$simplemap.cc:442:simplemap_dffe$3756.C (FDRE_ZINI)                                                   1.502    11.502
clock uncertainty                                                                                          0.000    11.502
cell setup time                                                                                           -0.010    11.492
data required time                                                                                                  11.492
--------------------------------------------------------------------------------------------------------------------------
data required time                                                                                                  11.492
data arrival time                                                                                                  -34.066
--------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                                                                                                   -22.573

Both of which show a large 1.76ns delay from COUT to CIN between blocks which seems very high, as usually this is a highly delay optimized part of the architecture.

kmurray commented 5 years ago

To me that likely indicates something going wrong with the placer delay model, probably relating to direct-connects for the carry chain links (which aren't getting sampled correctly). A good thing to check would be the placer delay model echo files (generated with --echo_file on).

There is an alternate delay model which I've been working on which tries to ensure override delays are always correct.

Trying with --place_delay_model delta_override I get: Selection_187

And the corresponding post-place timing report:

#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$4586.Q (FDRE_ZINI clocked by clk)
Endpoint  : $auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI clocked by clk)
Path Type : setup

Point                                                                                                       Incr      Path
--------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge)                                                                                      0.000     0.000
clock source latency                                                                                       0.000     0.000
clk.inpad (.input)                                                                                         0.000     0.000
$auto$simplemap.cc:442:simplemap_dffe$4586.C (FDRE_ZINI)                                                   3.348     3.348
$auto$simplemap.cc:442:simplemap_dffe$4586.Q (FDRE_ZINI) [clock-to-output]                                 0.010     3.358
$abc$6342$auto$alumacc.cc:474:replace_alu$1286.S[0].in[0] (.names)                                         1.308     4.666
$abc$6342$auto$alumacc.cc:474:replace_alu$1286.S[0].out (.names)                                           0.068     4.734
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.S0 (CARRY4_VPR)                             1.362     6.096
$auto$alumacc.cc:474:replace_alu$1286.slice[0].carry4_1st_full.CO_CHAIN (CARRY4_VPR)                       0.508     6.604
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.CIN (CARRY4_VPR)                                0.000     6.604
$auto$alumacc.cc:474:replace_alu$1286.slice[1].carry4_full.O2 (CARRY4_VPR)                                 0.256     6.860
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].in[0] (.names)                                        1.395     8.254
$abc$6342$auto$alumacc.cc:474:replace_alu$1273.BB[6].out (.names)                                          0.068     8.322
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.S2 (CARRY4_VPR)                                 4.789    13.111
$auto$alumacc.cc:474:replace_alu$1273.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.376    13.487
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CIN (CARRY4_VPR)                                0.000    13.487
$auto$alumacc.cc:474:replace_alu$1273.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    13.601
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CIN (CARRY4_VPR)                                0.000    13.601
$auto$alumacc.cc:474:replace_alu$1273.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    13.715
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CIN (CARRY4_VPR)                                0.000    13.715
$auto$alumacc.cc:474:replace_alu$1273.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    13.829
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CIN (CARRY4_VPR)                                0.000    13.829
$auto$alumacc.cc:474:replace_alu$1273.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    13.943
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CIN (CARRY4_VPR)                                0.000    13.943
$auto$alumacc.cc:474:replace_alu$1273.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    14.057
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CIN (CARRY4_VPR)                                0.000    14.057
$auto$alumacc.cc:474:replace_alu$1273.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114    14.171
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CIN (CARRY4_VPR)                                0.000    14.171
$auto$alumacc.cc:474:replace_alu$1273.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR)                         0.293    14.464
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.S (MUXF8)                                       1.531    15.995
$abc$6342$auto$blifparse.cc:492:parse_blif$6357.fpga_mux_2.O (MUXF8)                                       0.010    16.005
$abc$6342$new_n484_.in[1] (.names)                                                                         1.266    17.271
$abc$6342$new_n484_.out (.names)                                                                           0.068    17.339
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.in[2] (.names)                                          1.308    18.647
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.T0.out (.names)                                            0.068    18.715
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.I0 (MUXF6)                                      0.000    18.715
$abc$6342$auto$blifparse.cc:492:parse_blif$6398.fpga_mux_0.O (MUXF6)                                       0.010    18.725
$auto$simplemap.cc:442:simplemap_dffe$3756.CE (FDRE_ZINI)                                                  1.308    20.033
data arrival time                                                                                                   20.033

clock clk (rise edge)                                                                                     10.000    10.000
clock source latency                                                                                       0.000    10.000
clk.inpad (.input)                                                                                         0.000    10.000
$auto$simplemap.cc:442:simplemap_dffe$3756.C (FDRE_ZINI)                                                   2.388    12.388
clock uncertainty                                                                                          0.000    12.388
cell setup time                                                                                           -0.010    12.378
data required time                                                                                                  12.378
--------------------------------------------------------------------------------------------------------------------------
data required time                                                                                                  12.378
data arrival time                                                                                                  -20.033
--------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                                                                                                    -7.655

Which is an improvement (from 34ns to 22ns arrival time), with all the previous 1.76ns delays replaced with 0ns (since the direct connect delays are accurately considered).

Since the placer now has a more accurate delay model this also reduces the difference between the post-place (20ns) and initial (congestion oblivious) routed delay (22.9ns):

Confirming router algorithm: TIMING_DRIVEN.
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 4095: 17 timing startpoints were not constrained during timing analysis
Warning 4096: 17 timing endpoints were not constrained during timing analysis
   1    1.5     0.0    0 1.1e+07     678    2238    1946 ( 0.126%)  134648 ( 1.2%)   22.975     -580.5    -12.975      0.000      0.000      N/A
   2   60.9     0.5  255 2.1e+07     482    1663     416 ( 0.027%)   79216 ( 0.7%)   19.289     -253.7     -9.289      0.000      0.000      N/A
   3    2.3     0.6  113 1.6e+07     279    1037     260 ( 0.017%)   79242 ( 0.7%)   24.144     -279.9    -14.144      0.000      0.000      N/A
   4    2.5     0.8   72 1.5e+07     177     774     159 ( 0.010%)   78276 ( 0.7%)   20.891     -204.1    -10.891      0.000      0.000      N/A
   5    1.5     1.1   46 9431957      96     514      31 ( 0.002%)   77312 ( 0.7%)   20.824     -242.7    -10.824      0.000      0.000      N/A
   6    0.4     1.4   42 2445967      27     184       8 ( 0.001%)   76929 ( 0.7%)   18.906     -196.5     -8.906      0.000      0.000      N/A
   7    0.1     1.9   37  745465       9      59       9 ( 0.001%)   77867 ( 0.7%)   20.815     -258.9    -10.815      0.000      0.000      N/A
   8    0.2     2.4   28  938236      14      65       9 ( 0.001%)   77064 ( 0.7%)   19.330     -217.5     -9.330      0.000      0.000      N/A
   9    0.2     3.1   27  876362      11      62       3 ( 0.000%)   77146 ( 0.7%)   19.137     -189.6     -9.137      0.000      0.000      N/A
  10    0.1     4.1   29  841558      11      62       2 ( 0.000%)   77540 ( 0.7%)   18.326     -205.8     -8.326      0.000      0.000       12
  11    0.1     5.3   26  173852       2      14       0 ( 0.000%)   77353 ( 0.7%)   17.376     -181.1     -7.376      0.000      0.000       12
litghost commented 5 years ago

Both of which show a large 1.76ns delay from COUT to CIN between blocks which seems very high, as usually this is a highly delay optimized part of the architecture.

The critical path delay model appears to accurate, I've compared the post-route timing reports to the truth model. In the case of post-placement timing, you are right that it appears to be adding a net delay where one doesn't exist. This might explain the difference I was seeing in https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/592#issuecomment-494981636

Compare that path with the post-route timing model:

#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$5609.Q (FDRE_ZINI clocked by clk)
Endpoint  : $auto$simplemap.cc:442:simplemap_dffe$5605.D (FDRE_ZINI clocked by clk)
Path Type : setup

Point                                                                                                                        Incr      Path
-------------------------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge)                                                                                                       0.000     0.000
clock source latency                                                                                                        0.000     0.000
clk.inpad (.input)                                                                                                          0.000     0.000
$auto$simplemap.cc:442:simplemap_dffe$5609.C (FDRE_ZINI)                                                                    0.378     0.378
$auto$simplemap.cc:442:simplemap_dffe$5609.Q (FDRE_ZINI) [clock-to-output]                                                  0.010     0.388
$auto$alumacc.cc:474:replace_alu$1295.slice[1].carry4_full.S1 (CARRY4_VPR)                                                  4.128     4.516
$auto$alumacc.cc:474:replace_alu$1295.slice[1].carry4_full.O3 (CARRY4_VPR)                                                  0.618     5.134
$abc$6342$auto$alumacc.cc:474:replace_alu$1233.BB[7].in[0] (.names)                                                         1.615     6.749
$abc$6342$auto$alumacc.cc:474:replace_alu$1233.BB[7].out (.names)                                                           0.068     6.817
$auto$alumacc.cc:474:replace_alu$1251.slice[1].carry4_full.S3 (CARRY4_VPR)                                                  2.284     9.102
$auto$alumacc.cc:474:replace_alu$1251.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.380     9.482
$auto$alumacc.cc:474:replace_alu$1251.slice[2].carry4_full.CIN (CARRY4_VPR)                                                 0.000     9.482
$auto$alumacc.cc:474:replace_alu$1251.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.114     9.596
$auto$alumacc.cc:474:replace_alu$1251.slice[3].carry4_full.CIN (CARRY4_VPR)                                                 0.000     9.596
$auto$alumacc.cc:474:replace_alu$1251.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.114     9.710
$auto$alumacc.cc:474:replace_alu$1251.slice[4].carry4_full.CIN (CARRY4_VPR)                                                 0.000     9.710
$auto$alumacc.cc:474:replace_alu$1251.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.114     9.824
$auto$alumacc.cc:474:replace_alu$1251.slice[5].carry4_full.CIN (CARRY4_VPR)                                                 0.000     9.824
$auto$alumacc.cc:474:replace_alu$1251.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.114     9.938
$auto$alumacc.cc:474:replace_alu$1251.slice[6].carry4_full.CIN (CARRY4_VPR)                                                 0.000     9.938
$auto$alumacc.cc:474:replace_alu$1251.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.114    10.052
$auto$alumacc.cc:474:replace_alu$1251.slice[7].carry4_full.CIN (CARRY4_VPR)                                                 0.000    10.052
$auto$alumacc.cc:474:replace_alu$1251.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR)                                            0.114    10.166
$auto$alumacc.cc:474:replace_alu$1251.slice[8].carry4_part.CIN (CARRY4_VPR)                                                 0.000    10.166
$auto$alumacc.cc:474:replace_alu$1251.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR)                                          0.293    10.459
$abc$6342$auto$blifparse.cc:492:parse_blif$6625.fpga_mux_0.S (MUXF6)                                                        1.587    12.046
$abc$6342$auto$blifparse.cc:492:parse_blif$6625.fpga_mux_0.O (MUXF6)                                                        0.010    12.056
$abc$6342$techmap$techmap\output_logic.$procmux$581.$ternary$?techmap.v?:445$4523_Y[1].in[3] (.names)                       1.682    13.738
$abc$6342$techmap$techmap\output_logic.$procmux$581.$ternary$?techmap.v?:445$4523_Y[1].out (.names)                         0.068    13.806
$auto$simplemap.cc:442:simplemap_dffe$5605.D (FDRE_ZINI)                                                                    2.163    15.969
data arrival time                                                                                                                    15.969

Maybe the placer is overestimating the carry chain delay?

litghost commented 5 years ago

Since the placer now has a more accurate delay model this also reduces the difference between the post-place (20ns) and initial (congestion oblivious) routed delay (22.9ns):

Awesome. Why does the placer default to an estimate that doesn't account for carry chains, or other direct paths?

kmurray commented 5 years ago

Awesome. Why does the placer default to an estimate that doesn't account for carry chains, or other direct paths?

For most of the architectures we usually run it doesn't make much of a difference, the standard delay model usually captures the direct-connect delay impacts correctly.

Also modelling the direct-connect delays correctly comes with a run-time hit during placement (~+20%) which is why we've left it off by default so far.

litghost commented 5 years ago

Also modelling the direct-connect delays correctly comes with a run-time hit during placement (~+20%) which is why we've left it off by default so far.

Fair enough. In this case placement appears to be a small fraction of the total pack/place/route time, so a modest increase in a small fraction is still small.

litghost commented 5 years ago

the standard delay model usually captures the direct-connect delay impacts correctly.

Is the idea that dy = 1, dx = 0 should be 0 in this case? That would underestimate delays on non-directly connected paths.

kmurray commented 5 years ago

Is the idea that dy = 1, dx = 0 should be 0 in this case? That would underestimate delays on non-directly connected paths.

Yep, which motivated creating delta_override, but I found it had no real benefit on the architectures I evaluated it on.

kmurray commented 5 years ago

Another thing to try is changing the lookahead used by the router. The map lookahead tries to profile the routing architecture in a more general way than the default (whether it captures the details of your architecture correctly I'm not sure).

With --router_lookahead map delay is further reduced:

## Computing router lookahead map
## Computing router lookahead map took 1.34 seconds (max_rss 2831.4 MiB, delta_rss +0.0 MiB)
Confirming router algorithm: TIMING_DRIVEN.
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 4095: 17 timing startpoints were not constrained during timing analysis
Warning 4096: 17 timing endpoints were not constrained during timing analysis
   1   94.7     0.0    0 4.3e+08     678    2238    1341 ( 0.087%)  104155 ( 1.0%)   18.210     -271.4     -8.210      0.000      0.000      N/A
   2   64.0     0.5  203 2.8e+08     415    1433     417 ( 0.027%)   84404 ( 0.8%)   16.711     -142.0     -6.711      0.000      0.000      N/A
   3   45.7     0.6  172 2.0e+08     279    1066     248 ( 0.016%)   83450 ( 0.8%)   15.644     -115.9     -5.644      0.000      0.000      N/A
   4   30.8     0.8  107 1.4e+08     193     811     172 ( 0.011%)   84394 ( 0.8%)   16.756     -154.1     -6.756      0.000      0.000      N/A
   5   27.6     1.1   85 1.2e+08     124     660      50 ( 0.003%)   84970 ( 0.8%)   15.232     -110.7     -5.232      0.000      0.000      N/A
   6   12.5     1.4   74 5.6e+07      59     308      23 ( 0.001%)   85181 ( 0.8%)   15.856     -119.9     -5.856      0.000      0.000      N/A
   7    3.7     1.9   66 1.7e+07      33     144      15 ( 0.001%)   85313 ( 0.8%)   16.489     -125.1     -6.489      0.000      0.000      N/A
   8    2.2     2.4   48 1.0e+07      14      76       2 ( 0.000%)   85256 ( 0.8%)   16.088     -122.5     -6.088      0.000      0.000      N/A
   9    1.2     3.1   43 5856273       8      39       0 ( 0.000%)   85670 ( 0.8%)   16.297     -128.3     -6.297      0.000      0.000      N/A
Restoring best routing
Critical path: 16.2967 ns
Successfully routed after 9 routing iterations.

And the routed timing report:

#Path 1
Startpoint: $auto$simplemap.cc:442:simplemap_dffe$3852.Q (FDRE_ZINI clocked by clk)
Endpoint  : $auto$simplemap.cc:420:simplemap_dff$2676.D (FDRE_ZINI clocked by clk)
Path Type : setup

Point                                                                                                       Incr      Path
--------------------------------------------------------------------------------------------------------------------------
clock clk (rise edge)                                                                                      0.000     0.000
clock source latency                                                                                       0.000     0.000
clk.inpad (.input)                                                                                         0.000     0.000
$auto$simplemap.cc:442:simplemap_dffe$3852.C (FDRE_ZINI)                                                   0.378     0.378
$auto$simplemap.cc:442:simplemap_dffe$3852.Q (FDRE_ZINI) [clock-to-output]                                 0.010     0.388
$abc$6342$auto$alumacc.cc:474:replace_alu$1292.S[0].in[0] (.names)                                         0.796     1.184
$abc$6342$auto$alumacc.cc:474:replace_alu$1292.S[0].out (.names)                                           0.068     1.252
$auto$alumacc.cc:474:replace_alu$1292.slice[0].carry4_1st_full.S0 (CARRY4_VPR)                             0.958     2.210
$auto$alumacc.cc:474:replace_alu$1292.slice[0].carry4_1st_full.CO_CHAIN (CARRY4_VPR)                       0.508     2.718
$auto$alumacc.cc:474:replace_alu$1292.slice[1].carry4_full.CIN (CARRY4_VPR)                                0.000     2.718
$auto$alumacc.cc:474:replace_alu$1292.slice[1].carry4_full.O2 (CARRY4_VPR)                                 0.256     2.974
$abc$6342$auto$alumacc.cc:474:replace_alu$1260.BB[6].in[0] (.names)                                        1.793     4.766
$abc$6342$auto$alumacc.cc:474:replace_alu$1260.BB[6].out (.names)                                          0.068     4.834
$auto$alumacc.cc:474:replace_alu$1260.slice[1].carry4_full.S2 (CARRY4_VPR)                                 2.914     7.748
$auto$alumacc.cc:474:replace_alu$1260.slice[1].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.376     8.124
$auto$alumacc.cc:474:replace_alu$1260.slice[2].carry4_full.CIN (CARRY4_VPR)                                0.000     8.124
$auto$alumacc.cc:474:replace_alu$1260.slice[2].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114     8.238
$auto$alumacc.cc:474:replace_alu$1260.slice[3].carry4_full.CIN (CARRY4_VPR)                                0.000     8.238
$auto$alumacc.cc:474:replace_alu$1260.slice[3].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114     8.352
$auto$alumacc.cc:474:replace_alu$1260.slice[4].carry4_full.CIN (CARRY4_VPR)                                0.000     8.352
$auto$alumacc.cc:474:replace_alu$1260.slice[4].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114     8.466
$auto$alumacc.cc:474:replace_alu$1260.slice[5].carry4_full.CIN (CARRY4_VPR)                                0.000     8.466
$auto$alumacc.cc:474:replace_alu$1260.slice[5].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114     8.580
$auto$alumacc.cc:474:replace_alu$1260.slice[6].carry4_full.CIN (CARRY4_VPR)                                0.000     8.580
$auto$alumacc.cc:474:replace_alu$1260.slice[6].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114     8.694
$auto$alumacc.cc:474:replace_alu$1260.slice[7].carry4_full.CIN (CARRY4_VPR)                                0.000     8.694
$auto$alumacc.cc:474:replace_alu$1260.slice[7].carry4_full.CO_CHAIN (CARRY4_VPR)                           0.114     8.808
$auto$alumacc.cc:474:replace_alu$1260.slice[8].carry4_part.CIN (CARRY4_VPR)                                0.000     8.808
$auto$alumacc.cc:474:replace_alu$1260.slice[8].carry4_part.CO_FABRIC0 (CARRY4_VPR)                         0.293     9.101
$abc$6342$auto$blifparse.cc:492:parse_blif$6359.fpga_mux_2.S (MUXF8)                                       0.831     9.932
$abc$6342$auto$blifparse.cc:492:parse_blif$6359.fpga_mux_2.O (MUXF8)                                       0.010     9.942
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.T0.in[3] (.names)                                          2.434    12.376
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.T0.out (.names)                                            0.068    12.444
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.fpga_mux_0.I0 (MUXF6)                                      0.000    12.444
$abc$6342$auto$blifparse.cc:492:parse_blif$6391.fpga_mux_0.O (MUXF6)                                       0.010    12.454
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.T0.in[4] (.names)                               3.567    16.021
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.T0.out (.names)                                 0.068    16.089
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.fpga_mux_0.I0 (MUXF6)                           0.000    16.089
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_lut_1.fpga_mux_0.O (MUXF6)                            0.010    16.099
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_mux_0.I1 (MUXF7)                                      0.000    16.099
$abc$6342$auto$blifparse.cc:492:parse_blif$6634.fpga_mux_0.O (MUXF7)                                       0.010    16.109
$auto$simplemap.cc:420:simplemap_dff$2676.D (FDRE_ZINI)                                                    0.556    16.665
data arrival time                                                                                                   16.665

clock clk (rise edge)                                                                                     10.000    10.000
clock source latency                                                                                       0.000    10.000
clk.inpad (.input)                                                                                         0.000    10.000
$auto$simplemap.cc:420:simplemap_dff$2676.C (FDRE_ZINI)                                                    0.378    10.378
clock uncertainty                                                                                          0.000    10.378
cell setup time                                                                                           -0.010    10.368
data required time                                                                                                  10.368
--------------------------------------------------------------------------------------------------------------------------
data required time                                                                                                  10.368
data arrival time                                                                                                  -16.665
--------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                                                                                                    -6.297

Now the fact that the router is able to find a substantially faster path than the placer likely implies that the placer delay model is still off somewhat. If that is the case then the placer is likely optimizing the wrong parts of the circuit.

kmurray commented 5 years ago

One idea is to try and compare paths in the placer and router reports to identify where the placer delay model is inaccurate. The option --timing_report_detail aggregated produces timing reports with more detailed delay breakdowns which may be helpful.

litghost commented 5 years ago

Now the fact that the router is able to find a substantially faster path than the placer likely implies that the placer delay model is still off somewhat. If that is the case then the placer is likely optimizing the wrong parts of the circuit.

Ya, I figured that. So you would expect that the placement delay estimate should be less than the actual routed delay? This makes sense to me, I just wasn't confident in the reasoning.

litghost commented 5 years ago

Now the fact that the router is able to find a substantially faster path than the placer likely implies that the placer delay model is still off somewhat. If that is the case then the placer is likely optimizing the wrong parts of the circuit.

One idea I had to avoid this "placer delay estimate" problem was to have the placer directly sample the routing graph. This was very slow, but also didn't lower the estimates. Is it possible that there is something different in the placement route delay function (calculate_delay) that causes the placer to have worse routes than the router itself?

kmurray commented 5 years ago

The VPR router uses a lookahead to estimate the cost to reach a destination through the RR graph (A* search). If the lookahead doesn't do a good job the router will produce poorer quality results.

How aggressively the router uses the lookahead is controlled by --astar_fac:

The VPR default is astar_fac = 1.2, so lowering it can sometimes improve results at the cost of increased run-time.

is something different in the placement route delay function (calculate_delay) that causes the placer to have worse routes than the router itself?

Potentially. The sub-optimal nature of the path finding can interact with congestion and result in the router exploring a different part of the search space and finding a better result than a single connection would otherwise. However usually this indicates the lookahead is doing a poor job.

litghost commented 5 years ago

Thanks for that.

In an interesting turn of events, I found that the version of the rrgraph without timing actually returned a better (!!!) final critical path delay (as measured by the truth model). In order to reduce variables, I copied the .net and .place from the better result and attempted to re-route that design with the timing values present in the rrgraph.

True critical path delay on design routed without correct timing present in rrgraph: -4.206 ns True critical path delay on same clustering and placement, but with timing present in rrgraph: -14.988 ns

I'll explore with an astar_fac = 0 to see if I can close that gap. The fact that on the same clustering and placement there is such variability is very interesting.

vaughnbetz commented 5 years ago

It is very strange that using an rr-graph without timing improved results. I suspect this means there is a problem in the timing annotated on the rr-graph.

Since this is a very small design, you can turn the quality parameters way up to try to improve result quality, which may help you focus on root issues (poor delay estimation in the placer and/or router for example). So I'd set a high inner_num (e.g. 20) and a low astar_fac (e.g. 0.5 or even 0). They will cost cpu time, but at this point you don't care.

litghost commented 5 years ago

It is very strange that using an rr-graph without timing improved results. I suspect this means there is a problem in the timing annotated on the rr-graph.

I agree. I'm working on a way to use VPR's timing analysis (with timing annotations) on the route file generated without timing. I expect that it will reveal an incorrect timing calculation, which is pushing the placer/router in the wrong direction.

litghost commented 5 years ago

I've confirmed that simply zeroing all the timing annotations in the rrgraph enables the router to generate a route with a true critical path setup slack of -2.322 ns. What is interesting is that the VPR analysis using the annotated rrgraph returns a completely reasonable critical path delay of 9.15997 ns. FYI, the fact that this design closes using the VPR timing annotations is not surprising as the delay from Cinternal is not included in VPR yet.

I'm not sure how to explain this. From my simple experiment, the timing annotations are correct, and yet it causes the router to fail to converge to a valid solution.

I'll upload a new tarball with:

At this point I'm very confused.

litghost commented 5 years ago

One thing I immediately notice with the zero'd out routing graph is that routing takes significantly slower. Is it possible that having zero'd out timing would cause the router to work harder to find good paths?

vaughnbetz commented 5 years ago

How close is the VPR timing analysis to Vivado's estimate of this circuit, or a hand estimate? That would provide a coarse check of the timing values.

I've never seen deleting timing info improve the router's critical path performance (at least not consistently; on a small circuit there's always a chance you're getting lucky with a non-timing driven algorithm). So I think something is broken.

I'd try running with astar_fac = 0. Maybe the router timing model is OK, but the timing part of the lookahead is totally wrong. astar_fac = 0 essentially makes the lookahead irrelevant (but slows down the router) so running that experiment helps eliminate or implicate the lookahead in the problem.

litghost commented 5 years ago

How close is the VPR timing analysis to Vivado's estimate of this circuit, or a hand estimate? That would provide a coarse check of the timing values.

The estimate is good, if a little low. The reason the delays are lower in VPR's model is because Cinternal support isn't there yet. As a result fanout RC delays are not included.

I'd try running with astar_fac = 0. Maybe the router timing model is OK, but the timing part of the lookahead is totally wrong. astar_fac = 0 essentially makes the lookahead irrelevant (but slows down the router) so running that experiment helps eliminate or implicate the lookahead in the problem.

I'll try that next. I tried a low (like .1) astar_fac, but it didn't help. astar_fac = 0 seemed really really slow, but maybe that is required until the problem with the lookahead can be diagnosed.

litghost commented 5 years ago

I'd try running with astar_fac = 0. Maybe the router timing model is OK, but the timing part of the lookahead is totally wrong. astar_fac = 0 essentially makes the lookahead irrelevant (but slows down the router) so running that experiment helps eliminate or implicate the lookahead in the problem.

Confirm, --astar_fac 0 results in timing closing, at the cost of very significant increase in runtime.

litghost commented 5 years ago

@kmurray @vaughnbetz How/where is the router look ahead computed? The runtime performance with --astar_fac 0 is pretty bad, so identifying what is happening here is going to be something I'll do tommorow. Any insights into the router look ahead's operation will help.

litghost commented 5 years ago

@kmurray @vaughnbetz How/where is the router look ahead computed?

It looks like both router look ahead methods are based on segment definitions, which we originally elided, as it wasn't clear how segments were being used. I'll need to think about how to define segments that enable reasonable look ahead.

Is there a non-segment based approach that might make sense here?

kmurray commented 5 years ago

Is there a non-segment based approach that might make sense here?

I have some ideas about non-segment based lookaheads, but that is more of a research project than something that can be addressed quickly.

Probably your best approach is to fill in the segment information with reasonable values to get the current lookaheads to work better.

litghost commented 5 years ago

Probably your best approach is to fill in the segment information with reasonable values to get the current lookaheads to work better.

Given that we don't really generate nodes from the segments, what is the important aspects to capture with the segment definitions? For example, are the sb/cb definitions important?

kmurray commented 5 years ago

are the sb/cb definitions important?

I believe those are only relevant for guiding RR graph construction which shouldn't matter if your loading an externally generated RR graph.

what is the important aspects to capture with the segment definitions?

You want to capture the delay relevant characteristics of the different wire types. Usually you'll have a different segment definition for each unique wire length, since they'll have different Rmetal/Cmetal. That is the key information needed for the lookaheads.

Each CHANX/CHANY RR node should then reference the segment type it represents (i.e. segments are a flyweight of common RR node characteristics).

litghost commented 5 years ago

The imported routing fabric is a mix of horizontal and vertical wires, that should be able to be described with segments. The thing I don't know how to describe is the L-shaped wires. For example, SE6BEG3 is a Southeast (SE), length 6 (dx = 3, dy = 3), wire. I don't know how to expose that wire to the lookahead.

In addition, near the edge of the fabric, the regularity of the fabric breaks down due to the boundary. How important is lookahead near the boundry of the fabric (which may be in the interior when it reachs hard blocks).

vaughnbetz commented 5 years ago

Modeling the boundary well isn't that important. The lookahead is a "best guess" of what is going to happen in the future in terms of what wires will be used, and the resulting delay and wiring cost. So if it models most cases well it works pretty well, and having some things it doesn't model that well isn't catastrophic, although it can hurt QoR some if they're relatively common mispredictions.

You can also compensate for a less than ideal lookahead by lowering astar_fac, but of course that comes at a cost of increased cpu time.

So I'd focus on getting the most common wires and routing paths correct or nearly correct in the lookahead.

For length 6 wires (SE6BEG3), are those only for direct connect (i.e. you drive this wire only from a logic block or other block, never from another routing wire)? Or can other routing wires also drive them? If they can only be reached from a block output modeling them well in the lookahead won't matter much or at all, as they'll be explored at the start of a net's routing no matter what, and won't be explored again if the routing has to go a long distance.

If they can be reached by other routing wires, then modeling them in the lookahead is more important. The best way would probably start with modeling them as two length 3 wires with a non-configurable (short) edge between them.

You should also use the more sophisticated, "map", lookahead. @kmurray, I don't see it listed in the command line option documentation. Has it been brought out to the command line?

litghost commented 5 years ago

If they can be reached by other routing wires, then modeling them in the lookahead is more important. The best way would probably start with modeling them as two length 3 wires with a non-configurable (short) edge between them.

The L-shaped wires are modeled correctly as nodes and edges, exactly as you said. However the lookahead doesn't examine the real graph. It uses the segment definitions.

The map lookahead is exposed as an option, but it seems to uses segments over the real graph too.

litghost commented 5 years ago

For length 6 wires (SE6BEG3), are those only for direct connect (i.e. you drive this wire only from a logic block or other block, never from another routing wire)? Or can other routing wires also drive them?

These L-shaped wires are general routing fabric wires, so having them in the look ahead is important.

vaughnbetz commented 5 years ago

Ok, agreed. I think the map lookahead will do pretty well with them, as it profiles / samples paths through the routing graph to build its tables / predictions. It uses the segment definitions only to decide what table to look up the prediction in -- the prediction is a function of how far you have to go (dx,dy) and what type of wire you are currently on (segment type/id). So it doesn't encode any significant assumptions about how the segments interact; it just needs to know the grouping of which wires are which kind of segment so it can look up in the proper table.

litghost commented 5 years ago

So it doesn't encode any significant assumptions about how the segments interact; it just needs to know the grouping of which wires are which kind of segment so it can look up in the proper table.

This sounds promising. So effectively I need to create a segment for each variant of wire. So (dx = 12, dy = 0) is one segment, and (dx = 3, dy = 3) is one segment, (dx = 3, dy = -3) is one segment, etc. That should be pretty straight forward. I'll read up on the paper linked in the comment and see if I can grok what is going on.

vaughnbetz commented 5 years ago

You can read about the map lookahead here https://ieeexplore.ieee.org/document/7577326 (see enhanced router lookahead section) or here https://tspace.library.utoronto.ca/bitstream/1807/75854/3/Petelin_Oleg_201611_MAS_thesis.pdf (page 33-35).

vaughnbetz commented 5 years ago

Yes, the commented out lines should do essentially the same thing as the .T_linear etc. lines. Kevin found there was a small difference, which cost CPU time as a lookahead underpredict slows the router down. There is some CAD noise / runtime vs. quality trade-off in exactly how you set a lookahead though, so frankly if we remeasured and/or slightly retuned astar_fac there's a good chance we'd be happy with the commented out code's performance. So if you make that work there's no fundamental reason that can't become the new overall code path after some testing and tuning.

Vaughn

On Fri, May 24, 2019 at 1:00 PM litghost notifications@github.com wrote:

So it does look like the map method is still using the segment data rather than the node data:

https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vpr/src/route/router_lookahead_map.cpp#L161

But that is marked with a TODO. The currently commented logic doesn't correctly handle shorted nodes, but that is a straight forward fix.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/592?email_source=notifications&email_token=ABNDPJ3GZ432QYGYL4LOJB3PXANK7A5CNFSM4HOYD5B2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWF7C2Y#issuecomment-495710571, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNDPJ5NUXGXX5SMV7JNXPDPXANK7ANCNFSM4HOYD5BQ .

vaughnbetz commented 5 years ago

Should also note that the placer uses algorithmic profiling of the routing fabric to create its estimates of delay vs. distance travelled. If the rr-graph and delays are OK those look ups should be OK, but when the router lookahead is in shape it is also worth checking the placer delay model in and around https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/0aab7670747ec21e0fcecbddd42c3588d82089c4/vpr/src/place/timing_place_lookup.cpp#L268

litghost commented 5 years ago

@vaughnbetz @kmurray Is there a way to output the A true delay vs heuristic delay? From [my limited understanding of A](https://en.wikipedia.org/wiki/A*_search_algorithm#Properties), if the heuristic delay was under estimating the delay, and astar_fac < 1, then it should've returned the optimal path.

Maybe there should be diagnostic mode that checks that the A* heurstic is admissible?

kmurray commented 5 years ago

Is there a way to output the A* true delay vs heuristic delay?

Nothing built in, but you could code it up yourself.

To get the optimal delay you'd run the router with astar_fac = 0.

The lookaheads are wrapped in a class, if you specify the criticality to be 1.0 (params.criticality = 1.) the returned cost should be the lookahead's delay estimate.

Maybe there should be diagnostic mode that checks that the A* heurstic is admissible?

The lookaheads/heuristics VPR uses are actually not strictly admissible (c.f. earlier discussions). This isn't a big issue since there are various reasons (e.g. congestion, multi-fanout nets) which would prevent us from getting the true optimal path for a particular connection anyway.

(Of course having a good lookahead which guides the router in the correct direction is important, it just doesn't have to be strictly admissible.)