Open litghost opened 3 years ago
I'm working on identifying when this abort started happening. @acomodi believes it was around when the incremental timing analysis was added.
I tested a circuit that was a little more complicated (a counter), rather than simple straight inport -> outport, same issue. I'll see how much I can reduce the circuit and still produce the error.
I also ran into this issue when working on https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1205. Disabling incremental timing avoided the error.
I also ran into this issue when working on #1205. Disabling incremental timing avoided the error.
Which graph were you running on? So far I've noticed the placer error showing up on 7-series ROI based graphs, but I would've assumed you would've been testing on either full 7-series graphs or a titan arch.
@acomodi mentioned this, but if the ASSERT level is increased from 2 to 3 the error changes to:
vtr-verilog-to-routing/vpr/src/place/place.cpp:1963 comp_td_connection_cost: Assertion 'ipin > 0' failed (Shouldn't be calculating connection timing cost for driver pins).
An additional note, the function that provides the driver pin, which causes this assertion error is the following:
@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code.
@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code.
No I don't believe your new placement code along trigger this issue. However in my limited testing, the full 50T graph did not produce this error, so it possible that ROI or full 50T + new placement triggers the issues. There is some kind of latent issue being uncovered here.
I did a quick test with https://github.com/HackerFoo/vtr-verilog-to-routing/commit/7e35d152ce2d1f05aa70bbbc017f9b8f44668138 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:
vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)'
failed (Connection delays should already be updated).
WIll continue investigating this week.
@kmurray Why is the assertion at https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/6af4a382ff4187bd4ac7659bdb427d9eff6bcec0/vpr/src/place/place.cpp#L1963 valid? In the parent context DRIVER pin types are ignored, see here: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blame/master/vpr/src/place/place.cpp#L1889
I'm not clear why ipin == 0
is an assertion violation?
I did a quick test with HackerFoo@7e35d15 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:
vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' failed (Connection delays should already be updated).
WIll continue investigating this week.
The assertion from here should probably be behind the incremental #ifdef
.
The assumption that pin #0 on a net is the driver is deeply embedded in the netlist data structures. So the assert looks valid. I'd run the sanitizers to check memory faults, since this is a strange failure. I agree that this assertion doesn't look like it should fire based on the code, so something strange is going on (memory corruption? bad data in the architecture file?).
@Bill-hbrhbr this may be of interest; hard to debug without being able to run Symbiflow though.
@vaughnbetz / @Bill-hbrhbr I've added instructions for replicating our failure. I can provide more blif's if that is useful, however the current one is just a single input / output.
@Bill-hbrhbr : Keith has provided instructions on how to reproduce at the top of this issue -- can you take a look?
The inherent rule is that the criticalities, delays, and timing costs are only calculated for sink pins, not the driver pins. All the data structures (or at least for Kevin's incremental sta code) are based on this fact, including the PlacerTimingCosts's heap in timing_place.h
An additional note, the function that provides the driver pin, which causes this assertion error is the following:
Indeed, the bug appeared since place_crit.pins_with_modified_criticality() provided a driver pin in the failed test case and had its timing cost computed, which messed up the heap data structure for calculating the total timing cost (sum of all sink pins' costs).
This effect can be generated by a lot of different test cases, and it also depends on how the make_range function orders the elements in the vtr_id_set structure (which in turn affects how the heap is updated).
This should not happen. I suspect that the cause is within the following function: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/6af4a382ff4187bd4ac7659bdb427d9eff6bcec0/vpr/src/place/timing_place.cpp#L31-L87
Specifically, this line
cluster_pins_with_modified_criticality_.insert(pins.begin(), pins.end());
created the problem. Only sink pins should be added according to the logic of timing cost computation. I suppose the logic is the same for non-incremental timing cost computation, so both line #42 and line #72 need to be modified. Regarding how disabling the incremental timing will bypass this bug, I'm not so sure about the cause, so correct my reasonings above if they are wrong.
I implemented a fix in my local repo, which only adds sink pins to be computed for criticalities. It made the bug go away and still passed vtr_reg_strong tests. I'll open a PR upon the request of anyone in this thread.
Nice work finding this so quickly Bill! I agree that code looks wrong. Perhaps this is also the underlying reason Kevin had to implement a more complex summation (tree summation) when he added incremental timing analysis, in order to get the timing costs close enough to pass the incremental vs. brute-force timing cost consistency check (i.e. perhaps the new summation order masked this bug for the test cases run). Should definitely be fixed. Please do a QoR run with your fixed code and add the results (spreadsheets of QoR vs. master for vtr and titan designs). Assuming they look good, you should open a PR and fix this.
I'll open a PR upon the request of anyone in this thread.
If you have a fix, please create a PR as soon as possible. This bug is currently blocking downstream VTR + symbiflow integration, and it would be good to confirm if your fix works or not.
Whoops, fat finger. Bill has a fix.
Yep working on it right now
@Bill-hbrhbr : We are still seeing:
/tmpfs/src/github/vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' failed (Connection delays should already be updated).
which I believe you will replicate if you increase the ASSERT level from 2 to 3.
Maybe there needs to be an additional check to the assert that "crit_exponent == last_critexponent && ..."?
looking into it right now
The way I see it, the logic that you modified does a brute force retiming if crit_exponent
changes. However, if the underlying result is not a function of the crit_exponent
, (e.g. 0 ^ crit_exponent => 0), then the assertion might fail?
@litghost I haven't been able to replicate this bug. Is this produced on a different set of benchmarks?
@Bill-hbrhbr I've been able to replicate the failing assertion issue, here are the latest replication instructions:
1.Get input test data:
wget https://storage.googleapis.com/debug_data/replicate2.tar.bz2
mkdir vpr_failing_test
pushd vpr_failing_test
tar -xf replicate2.tar.bz2
popd
git clone https://github.com/litghost/vtr-verilog-to-routing.git
cd vtr-verilog-to-routing
git checkout -b test
git reset --hard 2d2c0b6490c7914383bf326bee973feac455269b
mkdir build
cd build
cmake -DVTR_ASSERT_LEVEL=3 ..
make -j
cd ../../vpr_failing_test
cp ../vtr-verilog-to-routing/vpr/vpr .
./run.sh
2d2c0b6490c7914383bf326bee973feac455269b has your earlier fix, so if you change cmake -DVTR_ASSERT_LEVEL=3 ..
to cmake -DVTR_ASSERT_LEVEL=2 ..
it doesn't assert, and completes without error.
@vaughnbetz / @Bill-hbrhbr I believe this assertion is not in error, but I'm not totally sure. I'm debugging on my side, let me know if you find anything.
Responding to a prior comment: I don't think this assert has anything to do with crit_exponent. The assert checks that the delay of a connection stored in a certain matrix entry matches what we get if we ask for the delay of that connection right now. We're failing that test. It looks like the routine is called in two places: once by the regular (incremental) timing cost update routine, and once by the brute-force routine that walks through all the connections. Bill, if you set a breakpoint on the assert and trace back to which caller it was, which connection it was (inet, ipin) and how different the stored delay is versus the recomputed one that may shed some light on this. Is there anything strange about this connection?
Is there anything strange about this connection?
Nope! In fact this design is super simple, one inpad
and one outpad
connected. I working on adding diagnostics prints to see if can find the issue first.
Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything.
Or, if it happens on larger designs too, it's not that ...
Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything.
They are not locked. I've already added a print checking a move took place.
Looks like the bug is in driven_by_moved_block
. I'll provide a print trace in a minute that points in that direction.
This design has 1 net, with two pins 1 (driver) and 0 (sink). You can see in the trace that the placer only moves one block, but the incremental logic sees that the sink has been affected, but still skips the update because it believes the driver has also moved. Not sure why just yet:
Moves per temperature: 2
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Tnum Time T Av Cost Av BB Cost Av TD Cost CPD sTNS sWNS Ac Rate Std Dev R lim Crit Exp Tot Moves Alpha
(sec) (ns) (ns) (ns)
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
1 0.0 8.8e+00 1.370 0.11 1.2145e-09 1.180 -1.18 -1.180 1.000 0.1962 161.0 1.00 2 0.500
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
2 0.0 4.4e+00 0.647 0.04 1.2145e-09 1.235 -1.23 -1.235 1.000 0.1128 161.0 1.00 4 0.500
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
3 0.0 2.2e+00 1.000 0.02 1.2145e-09 1.235 -1.23 -1.235 0.000 0.0000 161.0 1.00 6 0.950
/usr/local/google/home/keithrothman/cat_x/vtr-verilog-to-routing/vpr/src/place/place.cpp:1987 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == delay' failed (Connection delays should already be updated, 0[1] 1.21452e-09 != 8.1553e-10).
It looks like on iteration 1, both the driver and the sink move, which makes sense. However in further iterations, only the sink moves, but the driven_by_moved_block
still returns true, causing the timing update to be skipped.
The bug is in the "PinType" annotation in the clustered netlist. It looks like the driver pin is being marked as a sink.
I've added an assertion to confirm what I believe is the PinType bug:
VTR_ASSERT_SAFE_MSG((pin_type == PinType::DRIVER && cluster_ctx.clb_nlist.net_driver(net) == pin) || (pin_type == PinType::SINK && cluster_ctx.clb_nlist.net_driver(net) != pin),
"Net driver pin should be a DRIVER pin type, or not the driver!");
which does throw an assertion violation:
Moves per temperature: 2
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Tnum Time T Av Cost Av BB Cost Av TD Cost CPD sTNS sWNS Ac Rate Std Dev R lim Crit Exp Tot Moves Alpha
(sec) (ns) (ns) (ns)
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId di[1] ClusterPinId 1 PinType 1
vtr-verilog-to-routing/vpr/src/place/place.cpp:1625 update_td_delta_costs: Assertion '(pin_type == PinType::DRIVER && cluster_ctx.clb_nlist.net_driver(net) == pin) || (pin_type == PinType::SINK && cluster_ctx.clb_nlist.net_driver(net) != pin)' failed (Net driver pin should be a DRIVER pin type, or not the driver!).
I added the following sanity check to check_netlist
, and it is failing:
diff --git a/vpr/src/base/check_netlist.cpp b/vpr/src/base/check_netlist.cpp
index e0344415e..f60af17a2 100644
--- a/vpr/src/base/check_netlist.cpp
+++ b/vpr/src/base/check_netlist.cpp
@@ -73,6 +73,45 @@ void check_netlist(int verbosity) {
}
}
+ // Verify clustered netlist relationships and assumptions.
+ for (ClusterNetId net_id : cluster_ctx.clb_nlist.nets()) {
+ ClusterPinId driver_pin = cluster_ctx.clb_nlist.net_driver(net_id);
+ for (ClusterPinId pin : cluster_ctx.clb_nlist.net_pins(net_id)) {
+ if (cluster_ctx.clb_nlist.pin_net(pin) != net_id) {
+ VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+ "Net %zu pin %zu are inconsistently mapped\n", size_t(net_id), size_t(pin));
+ }
+
+ PinType pin_type = cluster_ctx.clb_nlist.pin_type(pin);
+ int pin_net_index = cluster_ctx.clb_nlist.pin_net_index(pin);
+ if(pin == driver_pin) {
+ if(pin_type != PinType::DRIVER) {
+ VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+ "Pin %zu for net %zu is the net driver, but pin_type(%d) != PinType::DRIVER",
+ size_t(pin), size_t(net_id), pin_type);
+ }
+
+ if(pin_net_index != 0) {
+ VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+ "Pin %zu for net %zu is the net driver, but pin_net_index(%d) != 0",
+ size_t(pin), size_t(net_id), pin_net_index);
+ }
+ } else {
+ if(pin_type != PinType::SINK) {
+ VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+ "Pin %zu for net %zu is a net sink, but pin_type(%d) != PinType::SINK",
+ size_t(pin), size_t(net_id), pin_type);
+ }
+
+ if(pin_net_index == 0) {
+ VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+ "Pin %zu for net %zu is the net sink, but pin_net_index(%d) == 0",
+ size_t(pin), size_t(net_id), pin_net_index);
+ }
+ }
+ }
+ }
+
error += check_for_duplicated_names();
if (error != 0) {
If that check looks good, we should probably merged that, and I'll create a PR.
I add a call to Netlist::verify
and got a similar assertion violation:
diff --git a/vpr/src/base/check_netlist.cpp b/vpr/src/base/check_netlist.cpp
index e0344415e..ac2b7a963 100644
--- a/vpr/src/base/check_netlist.cpp
+++ b/vpr/src/base/check_netlist.cpp
@@ -39,6 +39,7 @@ void check_netlist(int verbosity) {
/* This routine checks that the netlist makes sense. */
auto& cluster_ctx = g_vpr_ctx.mutable_clustering();
+ cluster_ctx.clb_nlist.verify();
Error 1:
Type: Unrecognized Error
File: vpr/src/base/netlist.tpp
Line: 1699
Message: Driver pin not found at expected index in net
The entire flow of VPR took 0.41 seconds (max_rss 67.2 MiB)
I think this confirms my earlier investigation. Now to determine where the problem is coming from.
Something weird is going on:
Detected 0 constant generators (to see names run with higher pack verbosity)
Adding pin 0 to net 0 that is pin_type 0
Adding pin 1 to net 0 that is pin_type 1
Finished loading packed FPGA netlist file (took 0.017435 seconds).
# Load Packing took 0.02 seconds (max_rss 67.6 MiB, delta_rss +36.3 MiB)
Warning 14: Netlist contains 0 global net to non-global architecture pin connections
Warning 15: Logic block #1 (di) has only 1 input pin 'di.inpad[0]' -- the whole block is hanging logic that should be swept.
Error 1:
Type: Other
File: /usr/local/google/home/keithrothman/cat_x/vtr-verilog-to-routing/vpr/src/base/check_netlist.cpp
Line: 92
Message: Pin 0 for net 0 is the net driver, but pin_type(1) != PinType::DRIVER
The entire flow of VPR took 0.23 seconds (max_rss 67.8 MiB)
So pin 0 is added to net 0, and is pin type = 0 (DRIVER). Then a little while later, pin 0 is pin type 1? Still investigating.
I've located the bug, making a fix now.
PR opened for fix. I still need to make a replicating regression test, which I'll start on now.
Nice bug hunting on this Keith! You probably already thought of this, but just in case: you can turn on sanitizers in the makefile if you suspect memory corruption.
Expected Behaviour
Placer places circuit and writes placement file.
Current Behaviour
Placer aborts with the following error:
Possible Solution
Working on reduced test case. Circuit in this case was very simple, multiple inpad -> outpad connections.
Steps to Reproduce
1.Get input test data:
wget https://storage.googleapis.com/debug_data/vpr_failing_test.tar.bz2
tar -xf vpr_failing_test.tar.bz2
git clone https://github.com/litghost/vtr-verilog-to-routing.git
cd vtr-verilog-to-routing
git checkout -b test
git reset --hard 87c4d426a5096dc6b8ab7e9adb0c6a39b2154bd2
mkdir build
cd build
cmake -DVTR_ASSERT_LEVEL=3 ..
make -j
cd ../../vpr_failing_test
cp ../vtr-verilog-to-routing/vpr/vpr .
./run.sh
Context
Your Environment