verilog-to-routing / vtr-verilog-to-routing

Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org
Other
1k stars 388 forks source link

Update BinaryHeap to FourAryHeap #2627

Closed nedsels closed 3 months ago

nedsels commented 3 months ago

A new FourAryHeap has been added, and this has been made the default VPR heap implementation.

Description

A new abstract class KAryHeap has been created. This class is like the previous BinaryHeap, except it doesn't implement is_valid(), get_heap_head(), or sift_down(). Also, a new struct heap_elem has been added, which contains a t_heap pointer and a float cost for a certain node. This is now what the heap_ vector contains, instead of just t_heap pointers. In KAryHeap's methods, instead of dereferencing a t_heap pointer to get its cost for node comparisons, it is stored in heap_elem.

BinaryHeap and FourAryHeap now inherit from KAryHeap, and implement its pure virtual methods (all other methods are not specific to either of these two heap types). FOUR_ARY_HEAP has been added to the e_heap_type enum, and has been made the default heap implementation.

Motivation and Context

  1. Previously, BinaryHeap's heap_ vector contains t_heap pointers for each node in the heap. To get a node's cost, the respective pointer would be dereferenced, which increase runtime since this would occur whenever there was a comparison between two nodes. Now, instead of heap_ being made of these pointers, it is of a struct heap_elem which contains both the pointer and the node's cost. This cost is attained by dereferencing the pointer once in add_to_heap(), eliminating the need for dereferencing in future comparisons.
  2. This heap_elem is 12 bytes (pointer and float), which means 5 can fit on a 64-byte cache line; so, using a 4-ary heap instead of a binary heap is more cache-friendly, and improves runtime even with more comparisons (between 4 nodes vs 2).

For the future: if, instead of using t_heap pointers, a 4-byte ID can be used, heap_elem can be reduced to 8 bytes. Then, an 8-ary heap might be even better, even with many more comparisons. An 8-ary heap is fastest when using t_heap pointers in heap_ (8 can fit on a cache line) and this heap implementation as VPR's heap.

How Has This Been Tested?

Three changes were considered:

  1. Using heap_elem instead of t_heap pointers in the heap_ vector to eliminate unnecessary dereferencing
  2. Using a 4-ary heap instead of a binary heap
  3. Using memory alignment (8 bytes for when change (1.) is off, 16 bytes for when (1.) is on)

The 8 combinations of these changes were tested on the koios_medium benchmarks, and some combinations on titan benchmarks, on desktop. For smaller circuits, the results are inconsistent, but on titan, the both changes (1.) and (2.) result in runtime improvements. Change (3.) seems to have a negligible effect on runtime. Using changes (1.) and (2.) together results in a 4.2% reduction in runtime.

On titan benchmarks (all values are geomeans, averaged across 8 runs): Total Routing Runtime (s) Runtime Improvement CPD (ns) Wirelength Memory Footprint (max_rss)
Old VPR Heap 186.8157 0.00% 17.1627 3200614 5501.6439
New VPR Heap 178.8911 4.24% 17.1809 3201187 5503.7495

Full Results

Types of changes

Checklist:

nedsels commented 3 months ago

@vaughnbetz I'm wondering about two things:

  1. Is it worth adding regression tests that test both BinaryHeap and FourAryHeap?
  2. There are separate .h and .cpp files for BinaryHeap and FourAryHeap, but they are fairly small. Should I put these new classes in k_ary_heap.h and .cpp instead?
vaughnbetz commented 3 months ago

You'll have to check out the CI results. I look at vtr_nightly_test1_odin, and it is OK. A few small circuits have QoR changes that are beyond the tolerances, but for these small circuits they are not concerning. One has too good a minimum channel width, while the others have ones that are a bit worse but not concerning for small designs. You update the golden results and check them in (see the developer guide) to make those pass. Below are the tests from vtr_nightly_test1_odin as an example.

Amin was also going to widen the error tolerance for some of these tests; I'll check on that PR as rebasing after it is merged may save you some work.

You should also attach your QoR spreadsheets so we can see the results (which you have summarized) on big circuits. We'd want to see a speedup and no overall critical path increase or wirelength increase (or sub-1% so it is within noise).

I'm fine with the current code and the small binary heap file -- more smaller files is fine.

2024-06-21T20:47:50.7138774Z 20:47:50 | Calculating QoR results... 2024-06-21T20:47:50.7139378Z 20:47:50 | 2024-06-21T20:47:50.7140262Z 20:47:50 | regression_tests/vtr_reg_nightly_test1_odin/vpr_reg_mcnc_equiv...[Fail] 2024-06-21T20:47:50.7141338Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7142816Z 20:47:50 | k6_N10_40nm.xml/tseng.pre-vpr.blif/common min_chan_width relative value 0.17582417582417584 outside of range [0.25,1.3] and not equal to golden value: 182.0 2024-06-21T20:47:50.7144133Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7145671Z 20:47:50 | k6_N10_40nm.xml/tseng.pre-vpr.blif/common min_chan_width_routing_area_total relative value 0.2212341391243292 outside of range [0.7,1.3] and not equal to golden value: 1595120.0 2024-06-21T20:47:50.7147087Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7148618Z 20:47:50 | k6_N10_40nm.xml/tseng.pre-vpr.blif/common min_chan_width_routing_area_per_tile relative value 0.22123501890118832 outside of range [0.7,1.3] and not equal to golden value: 9438.56 2024-06-21T20:47:50.7150057Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7151603Z 20:47:50 | k6_N10_40nm.xml/tseng.pre-vpr.blif/common crit_path_routing_area_total relative value 0.2147835400787127 outside of range [0.7,1.3] and not equal to golden value: 2007300.0 2024-06-21T20:47:50.7152993Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7154521Z 20:47:50 | k6_N10_40nm.xml/tseng.pre-vpr.blif/common crit_path_routing_area_per_tile relative value 0.21478341401810147 outside of range [0.7,1.3] and not equal to golden value: 11877.5 2024-06-21T20:47:50.7156147Z 20:47:50 | regression_tests/vtr_reg_nightly_test1_odin/vpr_reg_mcnc...[Pass] 2024-06-21T20:47:50.7157289Z 20:47:50 | regression_tests/vtr_reg_nightly_test1_odin/vtr_reg_fpu_soft_logic_arch...[Pass] 2024-06-21T20:47:50.7158497Z 20:47:50 | regression_tests/vtr_reg_nightly_test1_odin/vtr_reg_fpu_hard_block_arch...[Pass] 2024-06-21T20:47:50.7159596Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/figure_8...[Pass] 2024-06-21T20:47:50.7160606Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/FIR_filters...[Pass] 2024-06-21T20:47:50.7161696Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/FIR_filters_frac...[Pass] 2024-06-21T20:47:50.7162512Z 20:47:50 | 2024-06-21T20:47:50.7163420Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/multless_consts...[Fail] 2024-06-21T20:47:50.7164302Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7165605Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_022.v/common min_chan_width relative value 1.4666666666666666 outside of range [0.25,1.3] and not equal to golden value: 30.0 2024-06-21T20:47:50.7166939Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7168337Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_022.v/common min_chan_width_routing_area_total relative value 1.4112435202627822 outside of range [0.7,1.3] and not equal to golden value: 526063.0 2024-06-21T20:47:50.7169764Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7171207Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_022.v/common min_chan_width_routing_area_per_tile relative value 1.4112421647100186 outside of range [0.7,1.3] and not equal to golden value: 1820.29 2024-06-21T20:47:50.7172665Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7174032Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_022.v/common crit_path_routing_area_total relative value 1.4061912035217121 outside of range [0.7,1.3] and not equal to golden value: 666494.0 2024-06-21T20:47:50.7175545Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7176930Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_022.v/common crit_path_routing_area_per_tile relative value 1.406190242866001 outside of range [0.7,1.3] and not equal to golden value: 2306.21 2024-06-21T20:47:50.7178342Z 20:47:50 | 2024-06-21T20:47:50.7179070Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/multless_consts...[Fail] 2024-06-21T20:47:50.7179892Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7181245Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_045.v/common min_chan_width relative value 1.3333333333333333 outside of range [0.25,1.3] and not equal to golden value: 30.0 2024-06-21T20:47:50.7182652Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7184024Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_045.v/common crit_path_routing_area_total relative value 1.3088865016039155 outside of range [0.7,1.3] and not equal to golden value: 666494.0 2024-06-21T20:47:50.7185435Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7186818Z 20:47:50 | fixed_k6_N8_gate_boost_0.2V_22nm.xml/mult_045.v/common crit_path_routing_area_per_tile relative value 1.3088834061078565 outside of range [0.7,1.3] and not equal to golden value: 2306.21 2024-06-21T20:47:50.7188231Z 20:47:50 | 2024-06-21T20:47:50.7188965Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/multless_consts...[Fail] 2024-06-21T20:47:50.7189788Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7191204Z 20:47:50 | fixed_k6_frac_2ripple_N8_22nm.xml/mult_007.v/common min_chan_width_routing_area_total relative value 1.4869164661785093 outside of range [0.7,1.3] and not equal to golden value: 706193.0 2024-06-21T20:47:50.7192700Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7194097Z 20:47:50 | fixed_k6_frac_2ripple_N8_22nm.xml/mult_007.v/common min_chan_width_routing_area_per_tile relative value 1.4869085522061893 outside of range [0.7,1.3] and not equal to golden value: 2443.58 2024-06-21T20:47:50.7195532Z 20:47:50 | 2024-06-21T20:47:50.7196265Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/multless_consts...[Fail] 2024-06-21T20:47:50.7197088Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7198333Z 20:47:50 | fixed_k6_frac_N8_22nm.xml/mult_012.v/common min_chan_width relative value 1.3333333333333333 outside of range [0.25,1.3] and not equal to golden value: 36.0 2024-06-21T20:47:50.7199606Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7200944Z 20:47:50 | fixed_k6_frac_N8_22nm.xml/mult_012.v/common min_chan_width_routing_area_total relative value 1.3335469993281848 outside of range [0.7,1.3] and not equal to golden value: 648988.0 2024-06-21T20:47:50.7202360Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7203829Z 20:47:50 | fixed_k6_frac_N8_22nm.xml/mult_012.v/common min_chan_width_routing_area_per_tile relative value 1.3335500505426092 outside of range [0.7,1.3] and not equal to golden value: 2245.63 2024-06-21T20:47:50.7205459Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/open_cores...[Pass] 2024-06-21T20:47:50.7206495Z 20:47:50 | vtr_reg_nightly_test1_odin/arithmetic_tasks/open_cores_frac...[Pass] 2024-06-21T20:47:50.7207621Z 20:47:50 | regression_tests/vtr_reg_nightly_test1_odin/power_extended_arch_list...[Pass] 2024-06-21T20:47:50.7208488Z 20:47:50 | 2024-06-21T20:47:50.7209294Z 20:47:50 | regression_tests/vtr_reg_nightly_test1_odin/power_extended_circuit_list...[Fail] 2024-06-21T20:47:50.7210181Z 20:47:50 | [Fail] 2024-06-21T20:47:50.7211558Z 20:47:50 | k6_N10_I40_Fi6_L4_frac1_ff1_45nm.xml/mkDelayWorker32B.v/common total_power relative value 0.8973242065961418 outside of range [0.9,1.1] and not equal to golden value: 0.1607 2024-06-21T20:47:50.7212942Z 20:47:50 | 2024-06-21T20:47:50.7213746Z 20:47:50 | Test 'vtr_reg_nightly_test1_odin' had 19 qor test failures 2024-06-21T20:47:50.7214572Z 20:47:50 | 2024-06-21T20:47:50.7215303Z 20:47:50 | Test 'vtr_reg_nightly_test1_odin' had 0 run failures

nedsels commented 3 months ago

Looks good. I suggest some more comments in various places, as listed in the detailed feedback.

These issues should now be fixed

nedsels commented 3 months ago

I added the heap section to the API documentation. However, I think this instruction may be outdated or incorrect; I could not run it successfully (doc/conf.py imports markdown_code_symlinks; pip install markdown_code_symlinks gives an error "ERROR: No matching distribution found for markdown_code_symlinks").

nedsels commented 3 months ago

@vaughnbetz All tests now passing