Wire lookahead runtime scales poorly with number of switch/segment types

petergrossmann21 commented 3 days ago

During testing of the delay modeling approach I am developing (related to https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/2664, although I am not using that feature just yet to help stage the debugging process), my testing flow has matured to the point where I can comment more definitively on the impact of adding a large number of switch and segment types on router lookahead runtime. This was flagged as a possible concern by @rs-dhow in previous discussions.

It appears that when the router lookahead algorithm has a large number of switch/segment types to manage (I am generating segments and switches in matched pairs, so the number of segment types and switch types in my architecture models are equal), the lookahead time does indeed increase. I'm putting together some slides and additional experiments to describe the results in more detail.

Expected Behaviour

For a small architecture (1500 LUTs), I would expect little to no increase in router lookahead runtime if I increase the segment count by a significant amount.

Current Behaviour

For a small architecture (~1500 LUTs), I observe that when the switch list and segment lists counts increase from 2 to ~200, the router lookahead runtime increases from ~6 seconds to ~50 seconds and dominates total router runtime. I expect to be able to show further degradation with larger architectures; experiments to quantify such a trend are in flight.

Possible Solution

I have not looked into possible solutions yet, but this is central enough to my IC development needs that I could invest time into developing a PR to resolve the issue. A collaboration with more seasoned VPR developers is likely the most effective approach, even if I do most of the heavy lifting.

Steps to Reproduce

Please contact me to make arrangments for test case sharing, as I have not yet open-sourced all components used in generating the results described above.

Context

The goal of the delay model I am attempting to generate is to harvest ASIC EDA software output for use in delay annotation of the components of an eFPGA that will be productized. The delay information is unique on a per tile-type, per routing channel basis, so the number of unique delay parameters is O(1K) to O(10K). For accurate delay modeling, VPR's algorithms thus need to scale well as the number of segment types and switch types become large.

Your Environment

VTR revision used: de31f094aa4f894a5e6e0dc32c66365f4b341190
Operating System and version: Ubuntu 20.04
Compiler version: GCC 10.5.0

soheilshahrouz commented 1 day ago

The size of map router lookahead grows lineaerly with the number of segment types.

https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/66f35d9f26255de8e01d68f0395b84ed006ca28f/vpr/src/route/router_lookahead_map.h#L43-L53

Router lookahead computation involves iterating over all segment types, sampling nodes for each type, and building cost maps. This likely explains why computation time increases as the number of segment types grows.

https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/66f35d9f26255de8e01d68f0395b84ed006ca28f/vpr/src/route/router_lookahead_map.cpp#L522-L554

Two potential solutions are:

Compute cost maps for different segment types in parallel.
Use a smaller set of representative segment types to generate the lookahead.

With the first solution, the memory usage of the router lookahead still increases linearly with the number of segment types. However, if each segment type is confined to a limited area of the device, this approach reduces reliance on the translation-invariance assumption in the router lookahead.

vaughnbetz commented 12 hours ago

If the key reason for having many different segment types (hundreds to thousands) is to have different delays on them, the most direct solution would be to allow the RRgraph to store a different delay (or delay offset) per wire (not wire type). We would hide the details behind a rr_node.delay() or some such method. This would allow you to use a small number of truly distinct segment types, each of which gives the average delay of a wire of that type (this makes things backward-compatible: if you don't need per wire delays, just don't specify them as the average == the delay of node(i)).

Then we'd need to make some method(s) to load those delay offsets per rr_node. They could be stored in a vector from [0..rr_node.size()-1] so only those who want them pay the memory for them. One method could load every delay for every rr_node. Another might load them from a smaller file that gives a list of segment type / track number and delay offsets.

Do you need R and C to be different for each of these wire types, or just the delay? If we need more per wire data, we might want to have rr_node.phys_info_id instead of delay_offset, but then use a similar approach to load all the per wire unique data, with the segment level data again being defined to be the average.

verilog-to-routing / vtr-verilog-to-routing