Handling IPIN delays at rrgraph / pb_type interface

litghost commented 5 years ago

Context

This question is a mix of modelling and route behavior. I'll start with a description of the 7-series rrgraph import behavior around "site pin delays", and then describe the observed effect this has on the router. Then a short discussion around potential alternatives.

Modelling of delay between the rrgraph and pb_type

The 7-series rrgraph part description has 3 principle parts:

Nodes
Pips
Site pins

Pips connect two nodes together. This is modelled in VPR as an edge between two CHANX / CHANX rr nodes.

Site pins represent the boundary between the routing graph (e.g. nodes and pips) and a site (BELs, routing BELs, site wires). In VPR terms, a "site" is roughly a pb_type. Site pins contain timing information, in particular:

An intrinsic delay, similar to a VPR switch
Output drive resistance (for site pins that connect from the site to the routing graph)
Input capacitance (for site pins that connect from the routing graph to the site)

How the site pin is currently modeled

Because the site pins lives on the boundary between the rrgraph and the pb_type, there are some choices where it is modelling. The output drive resistance and input capacitance pretty much have to live in the rrgraph to be models, so currently the 7-series graph import models the site pin as an additional CHANX/Y and edge between the IPIN or OPIN node. Currently the 7-series routing import places the intrinsic delay on this edge.

Digression about BEL delays

The choice really comes about where to model the intrinsic delay of the site pin. There is an additional point of information that is very relevant here, which about BEL timing of BELs that connect directly to an IPIN. The combination timing information on those BEL can be expressed without loss of generality as:

output of BEL = delay from BEL input to output + site pin delay

therefore there is a free variable in how the delays are modeled. I'm explicitly mentioning this because of how the LUT6 delays are modeled. The BEL delays are the same for each pin of the LUT6, which I've pasted below:

    (CELL
        (CELLTYPE "LUT6")
        (INSTANCE SLICEL/A6LUT)
        (DELAY
            (ABSOLUTE
                (IOPATH A1 O6 (0.045::0.056)(0.1::0.124))
                (IOPATH A2 O6 (0.045::0.056)(0.1::0.124))
                (IOPATH A3 O6 (0.045::0.056)(0.1::0.124))
                (IOPATH A4 O6 (0.045::0.056)(0.1::0.124))
                (IOPATH A5 O6 (0.045::0.056)(0.1::0.124))
                (IOPATH A6 O6 (0.045::0.056)(0.1::0.124))
            )
        )
    )

What you can see is that the delay from BEL input to output shows 0 delay differences between the 6 inputs, which is surprising. The LUT structure should likely have some variation between the inputs, and they do. However this delay variation is modelling not in the BEL delay but in the site pin delay. Example is below. The delays for each input of the LUT are A1 - A6 respectively.

"A1": {
    "cap": "0.000",
    "delay": [
        "0.172",
        "0.214",
        "0.416",
        "0.516"
    ],
    "wire": "CLBLL_L_A1"
},
"A2": {
    "cap": "0.000",
    "delay": [
        "0.170",
        "0.212",
        "0.409",
        "0.507"
    ],
    "wire": "CLBLL_L_A2"
},
"A3": {
    "cap": "0.000",
    "delay": [
        "0.107",
        "0.133",
        "0.278",
        "0.344"
    ],
    "wire": "CLBLL_L_A3"
},
"A4": {
    "cap": "0.000",
    "delay": [
        "0.086",
        "0.107",
        "0.229",
        "0.284"
    ],
    "wire": "CLBLL_L_A4"
},
"A5": {
    "cap": "0.000",
    "delay": [
        "0.033",
        "0.042",
        "0.091",
        "0.112"
    ],
    "wire": "CLBLL_L_A5"
},
"A6": {
    "cap": "0.000",
    "delay": [
        "0.002",
        "0.002",
        "0.004",
        "0.005"
    ],
    "wire": "CLBLL_L_A6"
},

As a speculation, this was done to allow the Vivado router the choice of LUT inputs at route time to perform LUT rotation, rather than doing LUT rotation during pack/place.

Implication of where the site pin intrinsic delay is modelled

There two implications to modelling the intrinsic delay in the routing graph:

The packer is not aware of the delay variation between the LUT pins, so the LUT rotation done in the packer is likely poorer than it could be.
The router encounters a cost "jump" just before reaching the IPIN/SINK node

The first point is not as important at this minute, so lets focus on the implication of the second point. Let me layout an example.

Router behavior example

The router wants to route from CLBLL_R_X17Y138/CLBLL_L_CMUX to CLBLM_R_X29Y18/CLBLM_M_A2, this is an on graph distance of about (-28, -125). The final three nodes in the route will always be:

rt_node: CLBLM_R_X29Y18/CLBLM_IMUX2 (3061138) (CHANY)
  rt_node: CLBLM_R_X29Y18/CLBLM_M_A2 (3061293) (CHANY)
    rt_node: CLBLM_R_X29Y18/CLBLM_M_A2 (3061416) (IPIN)
      rt_node: 3061364 (SINK)

Node 3061293 and 3061293 -> 3061416 edge is the node/edge that models the site pin delay. When routing with an A* factor = 1.0, the router pops node 3061293 until after 4017 pops. The router doesn't pop node 3061416 until after 1210952 pops. This is because the backward cost of moving from node 3061293 to 3061416 increases from 9.7148e-10 to 1.48348e-09 (e.g. increase of 5.12e-10). This increase in delay can be completely attributed to the site pin intrinsic delay:

"A2": {
    "cap": "0.000",
    "delay": [
        "0.174",
        "0.216",
        "0.413",
        "0.512"
    ],
    "wire": "CLBLM_M_A2"

The lookahead doesn't account for this level of sharp delay at the end, because it varies dramatically from pin to pin, consider from the same tile:

"A6": {
    "cap": "0.000",
    "delay": [
        "0.007",
        "0.008",
        "0.013",
        "0.016"
    ],
    "wire": "CLBLM_M_A6"

litghost commented 5 years ago

After writing out https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/1027#issue-515780120, I realized a simple solution to fix the site pin "hump" behavior. There is still an open question of whether the modelling strategy should be changed to accommodate LUT rotation.

litghost commented 5 years ago

@vaughnbetz @kmurray @HackerFoo

HackerFoo commented 5 years ago

@litghost What is the simple solution?

litghost commented 5 years ago

@litghost What is the simple solution?

There is already a connection box annotation per IPIN node, so we can add the site pin delay to the that annotation, and combine the generic delay matrix with specific site pin delay.

During computation of the delay matrix, we can also subtract out the specific site pin delay, so that we have compatible data.

kmurray commented 5 years ago

Node 3061293 and 3061293 -> 3061416 edge is the node/edge that models the site pin delay. When routing with an A* factor = 1.0, the router pops node 3061293 until after 4017 pops. The router doesn't pop node 3061416 until after 1210952 pops. This is because the backward cost of moving from node 3061293 to 3061416 increases from 9.7148e-10 to 1.48348e-09 (e.g. increase of 5.12e-10)

I'd classify this as a lookahead issue. If you aren't capturing these characteristics then the router will continue to explore since it doesn't realize there is no better path.

There is already a connection box annotation per IPIN node, so we can add the site pin delay to the that annotation, and combine the generic delay matrix with specific site pin delay.

During computation of the delay matrix, we can also subtract out the specific site pin delay, so that we have compatible data.

Yes, fixing the lookahead to capture the effect of these different delays seems like the right approach.

kmurray commented 5 years ago

I'll add that at the moment the packer router is not timing driven, so it only performs LUT rotations for routability purposes.

On the long-term roadmap the plan is to unify the inter and intra block RR graphs which would allow a single stage routing which would do LUT rotations correctly for delay.

vaughnbetz commented 5 years ago

Modeling the average input pin/site delay in the lookahead will help avoid back-tracking. You could at least model the smallest delay, which will lead to some backtracking, but less backtracking than if you model 0 ps in the lookahead for these elements. If the router really has no ability to choose a better input and get a faster result, then modeling the average delay in the lookahead and changing the rr-graph to match that also seems possible. I suggest we discuss this when I'm at Google tomorrow, since it will be easier to have an interactive discussion about the details.

verilog-to-routing / vtr-verilog-to-routing