verilog-to-routing / vtr-verilog-to-routing

Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org
Other
979 stars 378 forks source link

Placer reporting cost errors during consistency checks #1402

Open litghost opened 3 years ago

litghost commented 3 years ago

Expected Behaviour

Placer places circuit and writes placement file.

Current Behaviour

Placer aborts with the following error:

BB estimate of min-dist (placement) wire length: 310
Error 1: timing_cost_check: 1.75857e-08 and timing_cost: 1.17164e-09 differ in check_place.
Incr Slack updates 1 in 7.5417e-05 sec
Full Max Req/Worst Slack updates 1 in 1.8772e-05 sec
Incr Max Req/Worst Slack updates 0 in 0 sec
Incr Criticality updates 0 in 0 sec
Full Criticality updates 1 in 1.5237e-05 sec
# Placement took 0.23 seconds (max_rss 673.7 MiB, delta_rss +0.0 MiB)
Error 2: 
Type: Placement
File: vtr-verilog-to-routing/vpr/src/place/place.cpp
Line: 2689
Message: 
Completed placement consistency check, 1 errors found.
Aborting program.

Possible Solution

Working on reduced test case. Circuit in this case was very simple, multiple inpad -> outpad connections.

Steps to Reproduce

1.Get input test data:

Context

Your Environment

litghost commented 3 years ago

I'm working on identifying when this abort started happening. @acomodi believes it was around when the incremental timing analysis was added.

litghost commented 3 years ago

I tested a circuit that was a little more complicated (a counter), rather than simple straight inport -> outport, same issue. I'll see how much I can reduce the circuit and still produce the error.

HackerFoo commented 3 years ago

I also ran into this issue when working on https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1205. Disabling incremental timing avoided the error.

litghost commented 3 years ago

I also ran into this issue when working on #1205. Disabling incremental timing avoided the error.

Which graph were you running on? So far I've noticed the placer error showing up on 7-series ROI based graphs, but I would've assumed you would've been testing on either full 7-series graphs or a titan arch.

litghost commented 3 years ago

@acomodi mentioned this, but if the ASSERT level is increased from 2 to 3 the error changes to:

vtr-verilog-to-routing/vpr/src/place/place.cpp:1963 comp_td_connection_cost: Assertion 'ipin > 0' failed (Shouldn't be calculating connection timing cost for driver pins).
acomodi commented 3 years ago

An additional note, the function that provides the driver pin, which causes this assertion error is the following:

https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/6af4a382ff4187bd4ac7659bdb427d9eff6bcec0/vpr/src/place/place.cpp#L1862-L1923

HackerFoo commented 3 years ago

@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code.

litghost commented 3 years ago

@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code.

No I don't believe your new placement code along trigger this issue. However in my limited testing, the full 50T graph did not produce this error, so it possible that ROI or full 50T + new placement triggers the issues. There is some kind of latent issue being uncovered here.

litghost commented 3 years ago

I did a quick test with https://github.com/HackerFoo/vtr-verilog-to-routing/commit/7e35d152ce2d1f05aa70bbbc017f9b8f44668138 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:

vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' 
failed (Connection delays should already be updated).  

WIll continue investigating this week.

litghost commented 3 years ago

@kmurray Why is the assertion at https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/6af4a382ff4187bd4ac7659bdb427d9eff6bcec0/vpr/src/place/place.cpp#L1963 valid? In the parent context DRIVER pin types are ignored, see here: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blame/master/vpr/src/place/place.cpp#L1889

I'm not clear why ipin == 0 is an assertion violation?

litghost commented 3 years ago

I did a quick test with HackerFoo@7e35d15 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:

vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' 
failed (Connection delays should already be updated).  

WIll continue investigating this week.

The assertion from here should probably be behind the incremental #ifdef.

vaughnbetz commented 3 years ago

The assumption that pin #0 on a net is the driver is deeply embedded in the netlist data structures. So the assert looks valid. I'd run the sanitizers to check memory faults, since this is a strange failure. I agree that this assertion doesn't look like it should fire based on the code, so something strange is going on (memory corruption? bad data in the architecture file?).

vaughnbetz commented 3 years ago

@Bill-hbrhbr this may be of interest; hard to debug without being able to run Symbiflow though.

litghost commented 3 years ago

@vaughnbetz / @Bill-hbrhbr I've added instructions for replicating our failure. I can provide more blif's if that is useful, however the current one is just a single input / output.

vaughnbetz commented 3 years ago

@Bill-hbrhbr : Keith has provided instructions on how to reproduce at the top of this issue -- can you take a look?

Bill-hbrhbr commented 3 years ago

The inherent rule is that the criticalities, delays, and timing costs are only calculated for sink pins, not the driver pins. All the data structures (or at least for Kevin's incremental sta code) are based on this fact, including the PlacerTimingCosts's heap in timing_place.h

An additional note, the function that provides the driver pin, which causes this assertion error is the following:

https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/6af4a382ff4187bd4ac7659bdb427d9eff6bcec0/vpr/src/place/place.cpp#L1862-L1923

Indeed, the bug appeared since place_crit.pins_with_modified_criticality() provided a driver pin in the failed test case and had its timing cost computed, which messed up the heap data structure for calculating the total timing cost (sum of all sink pins' costs).

This effect can be generated by a lot of different test cases, and it also depends on how the make_range function orders the elements in the vtr_id_set structure (which in turn affects how the heap is updated).

This should not happen. I suspect that the cause is within the following function: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/6af4a382ff4187bd4ac7659bdb427d9eff6bcec0/vpr/src/place/timing_place.cpp#L31-L87

Specifically, this line

cluster_pins_with_modified_criticality_.insert(pins.begin(), pins.end()); 

created the problem. Only sink pins should be added according to the logic of timing cost computation. I suppose the logic is the same for non-incremental timing cost computation, so both line #42 and line #72 need to be modified. Regarding how disabling the incremental timing will bypass this bug, I'm not so sure about the cause, so correct my reasonings above if they are wrong.

I implemented a fix in my local repo, which only adds sink pins to be computed for criticalities. It made the bug go away and still passed vtr_reg_strong tests. I'll open a PR upon the request of anyone in this thread.

vaughnbetz commented 3 years ago

Nice work finding this so quickly Bill! I agree that code looks wrong. Perhaps this is also the underlying reason Kevin had to implement a more complex summation (tree summation) when he added incremental timing analysis, in order to get the timing costs close enough to pass the incremental vs. brute-force timing cost consistency check (i.e. perhaps the new summation order masked this bug for the test cases run). Should definitely be fixed. Please do a QoR run with your fixed code and add the results (spreadsheets of QoR vs. master for vtr and titan designs). Assuming they look good, you should open a PR and fix this.

litghost commented 3 years ago

I'll open a PR upon the request of anyone in this thread.

If you have a fix, please create a PR as soon as possible. This bug is currently blocking downstream VTR + symbiflow integration, and it would be good to confirm if your fix works or not.

vaughnbetz commented 3 years ago

Whoops, fat finger. Bill has a fix.

Bill-hbrhbr commented 3 years ago

Yep working on it right now

litghost commented 3 years ago

@Bill-hbrhbr : We are still seeing:

/tmpfs/src/github/vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' failed (Connection delays should already be updated).

which I believe you will replicate if you increase the ASSERT level from 2 to 3.

litghost commented 3 years ago

Maybe there needs to be an additional check to the assert that "crit_exponent == last_critexponent && ..."?

Bill-hbrhbr commented 3 years ago

looking into it right now

litghost commented 3 years ago

The way I see it, the logic that you modified does a brute force retiming if crit_exponent changes. However, if the underlying result is not a function of the crit_exponent, (e.g. 0 ^ crit_exponent => 0), then the assertion might fail?

Bill-hbrhbr commented 3 years ago

@litghost I haven't been able to replicate this bug. Is this produced on a different set of benchmarks?

litghost commented 3 years ago

@Bill-hbrhbr I've been able to replicate the failing assertion issue, here are the latest replication instructions:

1.Get input test data:

2d2c0b6490c7914383bf326bee973feac455269b has your earlier fix, so if you change cmake -DVTR_ASSERT_LEVEL=3 .. to cmake -DVTR_ASSERT_LEVEL=2 .. it doesn't assert, and completes without error.

litghost commented 3 years ago

@vaughnbetz / @Bill-hbrhbr I believe this assertion is not in error, but I'm not totally sure. I'm debugging on my side, let me know if you find anything.

vaughnbetz commented 3 years ago

Responding to a prior comment: I don't think this assert has anything to do with crit_exponent. The assert checks that the delay of a connection stored in a certain matrix entry matches what we get if we ask for the delay of that connection right now. We're failing that test. It looks like the routine is called in two places: once by the regular (incremental) timing cost update routine, and once by the brute-force routine that walks through all the connections. Bill, if you set a breakpoint on the assert and trace back to which caller it was, which connection it was (inet, ipin) and how different the stored delay is versus the recomputed one that may shed some light on this. Is there anything strange about this connection?

litghost commented 3 years ago

Is there anything strange about this connection?

Nope! In fact this design is super simple, one inpad and one outpad connected. I working on adding diagnostics prints to see if can find the issue first.

vaughnbetz commented 3 years ago

Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything.

Or, if it happens on larger designs too, it's not that ...

litghost commented 3 years ago

Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything.

They are not locked. I've already added a print checking a move took place.

litghost commented 3 years ago

Looks like the bug is in driven_by_moved_block. I'll provide a print trace in a minute that points in that direction.

litghost commented 3 years ago

This design has 1 net, with two pins 1 (driver) and 0 (sink). You can see in the trace that the placer only moves one block, but the incremental logic sees that the sink has been affected, but still skips the update because it believes the driver has also moved. Not sure why just yet:

Moves per temperature: 2
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost

---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Tnum   Time       T Av Cost Av BB Cost Av TD Cost     CPD       sTNS     sWNS Ac Rate Std Dev  R lim Crit Exp Tot Moves  Alpha
      (sec)                                          (ns)       (ns)     (ns)                                                 
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
   1    0.0 8.8e+00   1.370       0.11 1.2145e-09   1.180      -1.18   -1.180   1.000  0.1962  161.0     1.00         2  0.500
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
   2    0.0 4.4e+00   0.647       0.04 1.2145e-09   1.235      -1.23   -1.235   1.000  0.1128  161.0     1.00         4  0.500
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
   3    0.0 2.2e+00   1.000       0.02 1.2145e-09   1.235      -1.23   -1.235   0.000  0.0000  161.0     1.00         6  0.950
/usr/local/google/home/keithrothman/cat_x/vtr-verilog-to-routing/vpr/src/place/place.cpp:1987 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == delay' failed (Connection delays should already be updated, 0[1] 1.21452e-09 != 8.1553e-10).
litghost commented 3 years ago

It looks like on iteration 1, both the driver and the sink move, which makes sense. However in further iterations, only the sink moves, but the driven_by_moved_block still returns true, causing the timing update to be skipped.

litghost commented 3 years ago

The bug is in the "PinType" annotation in the clustered netlist. It looks like the driver pin is being marked as a sink.

litghost commented 3 years ago

I've added an assertion to confirm what I believe is the PinType bug:

      VTR_ASSERT_SAFE_MSG((pin_type == PinType::DRIVER && cluster_ctx.clb_nlist.net_driver(net) == pin) || (pin_type == PinType::SINK && cluster_ctx.clb_nlist.net_driver(net) != pin),
              "Net driver pin should be a DRIVER pin type, or not the driver!"); 

which does throw an assertion violation:

Moves per temperature: 2
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved

---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Tnum   Time       T Av Cost Av BB Cost Av TD Cost     CPD       sTNS     sWNS Ac Rate Std Dev  R lim Crit Exp Tot Moves  Alpha
      (sec)                                          (ns)       (ns)     (ns)                                                 
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId di[1] ClusterPinId 1 PinType 1
vtr-verilog-to-routing/vpr/src/place/place.cpp:1625 update_td_delta_costs: Assertion '(pin_type == PinType::DRIVER && cluster_ctx.clb_nlist.net_driver(net) == pin) || (pin_type == PinType::SINK && cluster_ctx.clb_nlist.net_driver(net) != pin)' failed (Net driver pin should be a DRIVER pin type, or not the driver!).
litghost commented 3 years ago

I added the following sanity check to check_netlist, and it is failing:

diff --git a/vpr/src/base/check_netlist.cpp b/vpr/src/base/check_netlist.cpp
index e0344415e..f60af17a2 100644
--- a/vpr/src/base/check_netlist.cpp
+++ b/vpr/src/base/check_netlist.cpp
@@ -73,6 +73,45 @@ void check_netlist(int verbosity) {
         }
     }

+    // Verify clustered netlist relationships and assumptions.
+    for (ClusterNetId net_id : cluster_ctx.clb_nlist.nets()) {
+        ClusterPinId driver_pin = cluster_ctx.clb_nlist.net_driver(net_id);
+        for (ClusterPinId pin : cluster_ctx.clb_nlist.net_pins(net_id)) {
+            if (cluster_ctx.clb_nlist.pin_net(pin) != net_id) {
+                VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                        "Net %zu pin %zu are inconsistently mapped\n", size_t(net_id), size_t(pin));
+            }
+
+            PinType pin_type = cluster_ctx.clb_nlist.pin_type(pin);
+            int pin_net_index = cluster_ctx.clb_nlist.pin_net_index(pin);
+            if(pin == driver_pin) {
+                if(pin_type != PinType::DRIVER) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is the net driver, but pin_type(%d) != PinType::DRIVER",
+                            size_t(pin), size_t(net_id), pin_type);
+                }
+
+                if(pin_net_index != 0) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is the net driver, but pin_net_index(%d) != 0",
+                            size_t(pin), size_t(net_id), pin_net_index);
+                }
+            } else {
+                if(pin_type != PinType::SINK) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is a net sink, but pin_type(%d) != PinType::SINK",
+                            size_t(pin), size_t(net_id), pin_type);
+                }
+
+                if(pin_net_index == 0) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is the net sink, but pin_net_index(%d) == 0",
+                            size_t(pin), size_t(net_id), pin_net_index);
+                }
+            }
+        }
+    }
+
     error += check_for_duplicated_names();

     if (error != 0) {

If that check looks good, we should probably merged that, and I'll create a PR.

litghost commented 3 years ago

I add a call to Netlist::verify and got a similar assertion violation:

diff --git a/vpr/src/base/check_netlist.cpp b/vpr/src/base/check_netlist.cpp
index e0344415e..ac2b7a963 100644
--- a/vpr/src/base/check_netlist.cpp
+++ b/vpr/src/base/check_netlist.cpp
@@ -39,6 +39,7 @@ void check_netlist(int verbosity) {

     /* This routine checks that the netlist makes sense. */
     auto& cluster_ctx = g_vpr_ctx.mutable_clustering();
+    cluster_ctx.clb_nlist.verify();
Error 1: 
Type: Unrecognized Error
File: vpr/src/base/netlist.tpp
Line: 1699
Message: Driver pin not found at expected index in net
The entire flow of VPR took 0.41 seconds (max_rss 67.2 MiB)

I think this confirms my earlier investigation. Now to determine where the problem is coming from.

litghost commented 3 years ago

Something weird is going on:

Detected 0 constant generators (to see names run with higher pack verbosity)
Adding pin 0 to net 0 that is pin_type 0
Adding pin 1 to net 0 that is pin_type 1
Finished loading packed FPGA netlist file (took 0.017435 seconds).
# Load Packing took 0.02 seconds (max_rss 67.6 MiB, delta_rss +36.3 MiB)
Warning 14: Netlist contains 0 global net to non-global architecture pin connections
Warning 15: Logic block #1 (di) has only 1 input pin 'di.inpad[0]' -- the whole block is hanging logic that should be swept.
Error 1: 
Type: Other
File: /usr/local/google/home/keithrothman/cat_x/vtr-verilog-to-routing/vpr/src/base/check_netlist.cpp
Line: 92
Message: Pin 0 for net 0 is the net driver, but pin_type(1) != PinType::DRIVER
The entire flow of VPR took 0.23 seconds (max_rss 67.8 MiB)

So pin 0 is added to net 0, and is pin type = 0 (DRIVER). Then a little while later, pin 0 is pin type 1? Still investigating.

litghost commented 3 years ago

I've located the bug, making a fix now.

litghost commented 3 years ago

PR opened for fix. I still need to make a replicating regression test, which I'll start on now.

vaughnbetz commented 3 years ago

Nice bug hunting on this Keith! You probably already thought of this, but just in case: you can turn on sanitizers in the makefile if you suspect memory corruption.