litghost opened this issue 5 years ago
33a6c955d0f9a9f08ba0ff807c0fb655d977ace8 is the first bad commit
commit 33a6c955d0f9a9f08ba0ff807c0fb655d977ace8
Author: kmurray <k.murray@utoronto.ca>
Date: Wed Aug 8 16:31:47 2018 -0400
vpr: Fix default behaviour of --cluster_seed_type
Previously the seed type incorrectly defaulted to TIMING unless
manually specified, which disagreed with the option description.
It now defaults to BLEND if timing driven, and MAX_INPUTS as described
in the help message.
:040000 040000 370c45d2f4fcf55b42b0fba56ca1562287d50938 cdbc337bbf6e8b09a55b2cfaf883222c8f12049c M vpr
Per the bisect, prior to 33a6c955d0f9a9f08ba0ff807c0fb655d977ace8 the default was --cluster_seed_type timing; afterwards it is --cluster_seed_type blend.
With --cluster_seed_type timing, the consistency error does not arise.
I suspect that's not the fundamental cause. The change in cluster seed just changes how the clusters are created (so a different packing result). It should have no effect on the timing calculation during placement.
(However it may be possible the clusters formed with a different seed have some characteristic which exacerbates the incremental vs from-scratch round-off issue which the failing check is addressing during placement.)
Another thing to try would be to leave the cluster seed type unchanged and instead try a couple of different placement seeds (--seed). Changing the seed will result in a different optimization result during placement. If changing the seed causes the issue to disappear as well, it would indicate the issue is more general and not tied to the clustering seed change.
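To make the round-off concern above concrete, here is a minimal, self-contained toy (not VPR code; every name in it is made up for illustration) showing how an incrementally updated cost can drift away from a from-scratch recomputation once many small floating-point deltas have been accumulated:

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Toy model: a "placement cost" that is the sum of per-net costs. After each
// "move" the total is updated incrementally (subtract old, add new); every
// 100000 moves it is recomputed from scratch to measure the drift.
// float is used for the running total to make the drift visible quickly;
// the same effect occurs in double, just more slowly.
int main() {
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    const int num_nets = 10000;
    std::vector<float> net_cost(num_nets);
    for (float& c : net_cost) c = dist(rng);

    float incremental_total = 0.0f;
    for (float c : net_cost) incremental_total += c;

    std::uniform_int_distribution<int> pick(0, num_nets - 1);
    for (int move = 1; move <= 1000000; ++move) {
        int n = pick(rng);
        float new_cost = dist(rng);
        incremental_total += new_cost - net_cost[n]; // incremental update
        net_cost[n] = new_cost;

        if (move % 100000 == 0) {
            double from_scratch = 0.0;
            for (float c : net_cost) from_scratch += c;
            double rel_err = std::fabs(incremental_total - from_scratch) / from_scratch;
            std::printf("move %7d: incremental=%.4f from_scratch=%.4f rel_err=%.3e\n",
                        move, incremental_total, from_scratch, rel_err);
        }
    }
    return 0;
}

The relative error typically creeps upward across the checkpoints, which is exactly the accumulation that a periodic from-scratch recompute in the placer is meant to keep in check.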
Changing the seed indeed changes the behavior. Both cluster_seed_type settings fail with different seed values.
Thanks for confirming that!
That likely means a closer look into the full vs incremental cost calculation code is needed (although I don't think it's changed). More likely the round-off checking test should be reconsidered to see if it's just set too sensitively, or if we should be doing the full calculation more often to prevent the accumulation of round-off.
I can confirm that the code prior to 33a6c95 shares the seed behavior.
The question is how much is too much?
I've seen errors on the order of 7% to 15%, which seems "big".
The attached graph diverges very quickly, so I think it may indicate a numerical instability in the incremental cost calculations. Probably worth investigating if a restructuring can help.
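One classic restructuring that bounds this kind of accumulation is compensated (Kahan) summation for the incrementally maintained total. The sketch below is an illustration of the idea only; it is not taken from the VPR code and is not a specific patch proposal:

#include <cstdio>

// Kahan (compensated) summation: carry a running correction term so that small
// deltas are not lost when they are added to a much larger running total.
// float is used only to make the effect obvious quickly; the same drift exists
// in double, just more slowly. (Compile without -ffast-math, which may optimize
// the compensation away.)
struct KahanSum {
    float sum = 0.0f;
    float compensation = 0.0f;

    void add(float value) {
        float y = value - compensation;
        float t = sum + y;
        compensation = (t - sum) - y; // recovers the low-order bits lost in t
        sum = t;
    }
};

int main() {
    float naive = 1e8f;
    KahanSum kahan;
    kahan.add(1e8f);

    // Ten million tiny updates, as an incremental cost loop might apply.
    for (int i = 0; i < 10000000; ++i) {
        naive += 1e-3f;
        kahan.add(1e-3f);
    }

    std::printf("naive summation: %.1f\n", naive);      // stuck at 1e8: the deltas vanish
    std::printf("kahan summation: %.1f\n", kahan.sum);  // close to the exact 1.0001e8
    std::printf("exact:           %.1f\n", 1e8 + 1e7 * 1e-3);
    return 0;
}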
So I've taken a look and you seem to have some unrealistic timing-related values on your RR nodes.
For example most CHANX/CHANY wires are something like:
<node capacity="1" direction="INC_DIR" id="416" type="CHANX">
<loc ptc="0" xhigh="8" xlow="1" yhigh="0" ylow="0"/>
<timing C="1" R="1"/>
<segment segment_id="0"/>
</node>
This specifies a 1 Farad capacitance (which is huge; typical values would be in the picofarads) and a 1 Ohm resistance.
As a result you get huge critical path delays. For example:
------- ------- ---------- ---------- ---------- ---------- ------- ---------- -------- ------- ------- ------- ------ --------- ------
T Cost Av BB Cost Av TD Cost Av Tot Del P to P Del CPD sTNS sWNS Ac Rate Std Dev R limit Exp Tot Moves Alpha
------- ------- ---------- ---------- ---------- ---------- ------- ---------- -------- ------- ------- ------- ------ --------- ------
0.293 0.9679 4.7485 32.515 38.294 1.3889 6000000000.000 -6e+09 -6000000000.000 0.9444 0.0221 9.0000 1.000 36 0.900
0.264 1.0671 5.0496 32.296 37.25 1.3889 6000000000.000 -6e+09 -6000000000.000 1.0000 0.0259 9.0000 1.000 72 0.500
0.132 1.0105 5.3049 29 36.765 1.2778 5000000000.000 -5e+09 -5000000000.000 0.9444 0.0202 9.0000 1.000 108 0.900
0.119 0.9758 5.0298 33.213 38.621 1.3889 6000000000.000 -6e+09 -6000000000.000 0.9167 0.0263 9.0000 1.000 144 0.900
0.107 1.0132 5.1083 33.272 37.176 1.3889 6000000000.000 -6e+09 -6000000000.000 0.9444 0.0360 9.0000 1.000 180 0.900
0.096 0.9545 4.8848 30.443 36.531 1.3889 6000000000.000 -6e+09 -6000000000.000 0.8889 0.0301 9.0000 1.000 216 0.900
0.087 1.0113 4.7086 29.464 36.094 1.3519 6000000000.000 -6e+09 -6000000000.000 0.8889 0.0255 9.0000 1.000 252 0.900
0.078 1.0405 4.9132 29.176 36.618 1.2407 6000000000.000 -6e+09 -6000000000.000 0.9444 0.0173 9.0000 1.000 288 0.900
0.070 0.9696 4.8756 32.424 37.235 1.3889 6000000000.000 -6e+09 -6000000000.000 0.9444 0.0211 9.0000 1.000 324 0.900
0.063 0.9970 4.7331 27.575 34.3 1.3519 6000000000.000 -6e+09 -6000000000.000 0.9722 0.0268 9.0000 1.000 360 0.500
0.032 1.0094 4.6616 26.353 33.532 1.2778 6000000000.000 -6e+09 -6000000000.000 0.8611 0.0242 9.0000 1.000 396 0.900
shows a Critical Path Delay (CPD) of 6 billion seconds.
Since these values are so far out of the standard range it's less surprising that the round-off checks fail.
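As a rough sanity check on the magnitudes, a back-of-the-envelope illustration (this is not VPR's actual delay model, just the RC time constant of a single segment; the "typical" capacitance is only an assumed value for comparison):

#include <cstdio>

// RC time constants for one wire segment: the values in the posted rr_graph
// versus a more typical wire capacitance.
int main() {
    const double r_ohm = 1.0;       // resistance from the rr_graph above

    const double c_graph = 1.0;     // 1 F, as specified in the rr_graph above
    const double c_typical = 1e-12; // ~1 pF, a more realistic wire capacitance

    std::printf("RC with C = 1 F : %g s\n", r_ohm * c_graph);   // 1 second per segment
    std::printf("RC with C = 1 pF: %g s\n", r_ohm * c_typical); // ~1 picosecond per segment
    return 0;
}

With every segment contributing on the order of a second, second-scale (or larger) critical path delays follow directly, roughly twelve orders of magnitude away from normal on-chip wire delays.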
Two potential workarounds:

1. If you don't have a valid timing model, you may be better off running VPR with --timing_analysis off, which will cause VPR to do a non-timing-driven implementation (which avoids the incremental timing check causing you issues).

2. Remove the unrealistic timing values:
# Zero out the invalid R/C values
$ cat rr_graph_4x4_dummy.rr_graph.real.xml | sed 's/C="1" R="1"/C="0" R="0"/g' > rr_graph2.xml
$ vpr arch.unique_pack.xml top.eblif --read_rr_graph rr_graph2.xml --device 4x4 --max_router_iterations 500 --route_chan_width 20 --place
Making the timing conform to better values actually doesn't help.
Here is a version of the same routing graph with more "reasonable" timing values, tolerance_issue_v2.zip.
Usually to debug issues like this, I turn MAX_MOVES_PER_RECOMPUTE way down (to 1, if you can tolerate the CPU time hit for a while). If you still fail the test then it definitely isn't round-off. Given the magnitude of the difference you're seeing, it doesn't look like round-off. Also note that this check is there to catch optimization flaws, not correctness issues, so if this is holding you up you can set ERROR_TOL to something big (e.g. 1 or 10) and nothing bad will happen.
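For illustration, here is a hedged sketch of the kind of periodic recompute-and-compare check being described; it is not the actual VPR placement code, and MAX_MOVES_PER_RECOMPUTE / ERROR_TOL below simply mirror the constants named above:

#include <cmath>
#include <cstdio>
#include <functional>

// Illustrative only: the constants mirror the knobs discussed above, but this
// is not VPR's placement code.
constexpr int MAX_MOVES_PER_RECOMPUTE = 1;  // 1 => recompute after every move while debugging
constexpr double ERROR_TOL = 0.01;          // relative tolerance before the check fires

// Compare the incrementally maintained cost against a from-scratch
// recomputation, then resynchronize so round-off cannot keep accumulating.
bool check_and_resync(double& incremental_cost,
                      const std::function<double()>& recompute_from_scratch) {
    double from_scratch = recompute_from_scratch();
    double rel_err = std::fabs(incremental_cost - from_scratch) / std::fabs(from_scratch);
    bool ok = (rel_err <= ERROR_TOL);
    if (!ok) {
        std::fprintf(stderr, "cost mismatch: incremental=%g from_scratch=%g rel_err=%g\n",
                     incremental_cost, from_scratch, rel_err);
    }
    incremental_cost = from_scratch; // resync to limit further accumulation
    return ok;
}

int main() {
    double true_cost = 100.0;
    double incremental = 100.0;
    for (int move = 1; move <= 10; ++move) {
        true_cost += 0.05;           // what the cost really becomes after the move
        incremental += 0.05 + 1e-4;  // incremental update with a small injected error
        if (move % MAX_MOVES_PER_RECOMPUTE == 0) {
            check_and_resync(incremental, [&] { return true_cost; });
        }
    }
    return 0;
}

With the recompute interval at 1 there is nothing left to accumulate, so any failure that still fires points at the cost updates themselves rather than round-off, which is the logic behind the suggestion above.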
Looking at the placement_inner_loop() function, there may be a problem -- with the restructured code for checking the tolerance (put in 6 months ago) we might get a timing analysis between the incremental cost calculation and the recompute_from_scratch, which could change the costs. That could lead to false failures.
I looked at the code more carefully and I don't think it's possible for the timing analysis to occur without the timing costs being recomputed, so I don't think that's what leads to the placement cost recompute error. I suggest setting MAX_MOVES_PER_RECOMPUTE to 1 and seeing if the error fires; then take a look at that move.