
Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org

Incorrect Post-ABC Clock Restoration #384

Closed kmurray closed 5 years ago

kmurray commented 6 years ago

Expected Behaviour

The VTR flow should correctly handle circuits with zero, one, or multiple clocks.

Current Behaviour

There are actually two (inter-related) bugs here:

A) Clock restoration fails in certain cases, due to a bug in the clock restoration script. This occurs when the .latches are re-ordered in the ABC output.

B) More seriously, fixing (A) reveals that clock restoration is currently incorrect in many cases, and no warning/error is generated. This has been passing regression tests because the script incorrectly uses the first clock for all latches, and we've only tested these code paths with single-clock circuits (where this coincidentally produces the correct results).

Steps to Reproduce

A) Clock restoration fails if .latches are re-ordered (thanks to Dipika Badri for the initial report!)

  1. Run the command:
    $VTR_ROOT/vtr_flow/scripts/run_vtr_flow.pl  $VTR_ROOT/vtr_flow/benchmarks/vtr_benchmarks_blif/stereovision3.blif  $ARCH/power/k6_N10_I40_Fi6_L4_frac0_ff1_22nm.xml  -power -cmos_tech $VTR_ROOT/vtr_flow/tech/PTM_22nm/22nm.xml -starting_stage abc
  2. Observe the failed output:
    k6_N10_I40_Fi6_L4_frac0_ff1_22nm/stereovision3                                   failed: to restore multi-clock latch info (took 0.31 seconds)
  3. Move to the run directory and manually run the clock restoration script:
    $VTR_ROOT/vtr_flow/scripts/restore_multiclock_latch_information.pl stereovision3.odin.blif stereovision3.raw.abc.blif stereovision3.abc.blif

    Observe the following error:

    Latch top^FF_NODE~50 not found in Odin II BLIF!
    .latch       n363 top^FF_NODE~50  0

    However, the .latch top^FF_NODE~50 does in fact exist in both the reference and post-ABC netlists.

B) Fix the file seeking behaviour in restore_multiclock_latch_information.pl. Run vtr_reg_basic and observe new clock restoration failures on previously passing circuits.

Context

ABC strips away clock information when it performs synthesis. This information is required at later stages of the design flow (e.g. VPR) so that multi-clock circuits are handled correctly.

We previously believed that simply name-matching the .latches would be sufficient to uniquely identify .latches in both the pre- and post-ABC netlists. However, this is not always the case.
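For reference, a minimal sketch of the kind of name-matched lookup the restoration script relies on (hypothetical Perl, not the actual script):

use strict;
use warnings;

# Index the pre-ABC (Odin) latches by their output signal name.
my %clock_info;
open(my $odin, '<', 'mkPktMerge.odin.blif') or die "open: $!";
while (my $line = <$odin>) {
    if ($line =~ /^\.latch\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/) {
        $clock_info{$1} = { type => $2, control => $3, init => $4 };
    }
}
close($odin);

# Each post-ABC latch is then looked up by the same output name; this
# breaks as soon as ABC renames or removes a latch output (see below).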

For instance, before ABC:

$ grep .latch mkPktMerge.odin.blif 
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4257 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~0_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~370^MUX_2~2877 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^full_r_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~357^MUX_2~2906 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^gb2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~390^MUX_2~2787 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^empty_r_FF_NODE re top^CLK 3
.latch top^MULTI_PORT_MUX~2257^MUX_2~3776 top^fi0HasPrio_FF_NODE re top^CLK 3
.latch top^MULTI_PORT_MUX~2257^MUX_2~3775 top^fi0Active_FF_NODE re top^CLK 3
.latch top^MULTI_PORT_MUX~2257^MUX_2~3777 top^fi1Active_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~327^MUX_2~2943 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^rp~0_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~327^MUX_2~2944 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^rp~1_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~327^MUX_2~2945 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^rp~2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~327^MUX_2~2946 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^rp~3_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~796^MUX_2~3519 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^rp~0_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~859^MUX_2~3363 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^empty_r_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~826^MUX_2~3482 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^gb2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~839^MUX_2~3453 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^full_r_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~796^MUX_2~3520 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^rp~1_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~796^MUX_2~3521 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^rp~2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~796^MUX_2~3522 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^rp~3_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1248^MUX_2~3951 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^wp~0_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1308^MUX_2~3810 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^full_r_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1295^MUX_2~3892 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^gb2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1328^MUX_2~3863 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^empty_r_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1248^MUX_2~3952 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^wp~1_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1248^MUX_2~3953 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^wp~2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1248^MUX_2~3954 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^wp~3_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4258 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~1_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4259 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4260 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~3_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~779^MUX_2~4289 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^wp~0_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~779^MUX_2~4290 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^wp~1_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~779^MUX_2~4291 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^wp~2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^MULTI_PORT_MUX~779^MUX_2~4292 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^wp~3_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1265^MUX_2~4311 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^rp~0_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1265^MUX_2~4312 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^rp~1_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1265^MUX_2~4313 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^rp~2_FF_NODE re top^CLK 3
.latch top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^MULTI_PORT_MUX~1265^MUX_2~4314 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^rp~3_FF_NODE re top^CLK 3

and after ABC:

$ grep .latch mkPktMerge.raw.abc.blif 
.latch       li00       lo00  0
.latch      n1703 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^full_r_FF_NODE  0
.latch      n1718 top^fi0HasPrio_FF_NODE  0
.latch      n1768 top.arSRLFIFO_b+fi1.generic_fifo_sc_b+fifo_1^full_r_FF_NODE  0
.latch      n1803 top.arSRLFIFO_c+fo.generic_fifo_sc_c+fifo_1^empty_r_FF_NODE  0
.latch       li05       lo05  0
.latch       li06       lo06  0
.latch       li07       lo07  0
.latch       li08       lo08  0
.latch       li09       lo09  0
.latch       li10       lo10  0
.latch       li11       lo11  0
.latch       li12       lo12  0
.latch       li13       lo13  0
.latch       li14       lo14  0
.latch       li15       lo15  0

indicating that ABC modifies the latch names and structure in non-trivial ways. (Investigation shows that some latches are optimized away during the ABC command ifraig.) This means that simple name matching is insufficient to correctly restore clock information.

Possible Solution

For (A) it is trivial to fix the script by ensuring it seeks back to the beginning of the file correctly. However, fixing this causes (B) to start generating errors (it was a silent error before).
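A minimal sketch of what the fix amounts to (illustrative Perl; the actual script may differ):

use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my @post_abc_latch_outputs = ('top^FF_NODE~50');    # e.g. parsed from the raw ABC BLIF

open(my $odin_fh, '<', 'stereovision3.odin.blif') or die "open: $!";
for my $latch_out (@post_abc_latch_outputs) {
    # Rewind before every lookup: the bug was that a latch appearing
    # earlier in the file than the previous match was never found.
    seek($odin_fh, 0, SEEK_SET) or die "seek: $!";
    my $found = 0;
    while (my $line = <$odin_fh>) {
        if ($line =~ /^\.latch\s+\S+\s+\Q$latch_out\E\s/) { $found = 1; last; }
    }
    warn "Latch $latch_out not found in Odin II BLIF!\n" unless $found;
}
close($odin_fh);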

As shown above, it seems that in some instances ABC completely re-names the connections to/from .latches, which breaks the fundamental assumption that the clock restoration script makes (i.e. that it can match .latches based on output signal names).

It seems to me that the best course of action is to convert all .latches to blackboxes before passing netlists through ABC, and then restoring them to .latches after ABC. This would ensure all clock information was preserved, but at the cost of disabling sequential optimizations within ABC (we already disable most sequential optimizations like re-timing anyway).

For example, given an initial .latch:

.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4257 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~0_FF_NODE re top^CLK 3

a script would convert it to a blackbox before it is sent through ABC:

.subckt latch_re_3 in=top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4257 out=top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~0_FF_NODE control=top^CLK

where the subckt type (latch_re_3) encodes that it is rising-edge triggered with an initial value set to don't care (3). Since ABC leaves blackboxes alone, a script could then convert the blackbox back into a latch without losing any information:

.latch top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^MULTI_PORT_MUX~310^MUX_2~4257 top.arSRLFIFO_a+fi0.generic_fifo_sc_a+fifo_1^wp~0_FF_NODE re top^CLK 3
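A sketch of what such a conversion could look like (hypothetical Perl helpers, not existing VTR code; assumes the 5-field .latch format shown above):

use strict;
use warnings;

# Convert a BLIF .latch line into a black-box .subckt, encoding the
# edge type and initial value in the model name (e.g. latch_re_3).
sub latch_to_blackbox {
    my ($line) = @_;
    my ($in, $out, $type, $ctrl, $init) =
        $line =~ /^\.latch\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/
        or return $line;    # not a 5-field latch; leave untouched
    return ".subckt latch_${type}_${init} in=$in out=$out control=$ctrl\n";
}

# Inverse conversion, applied after ABC (which leaves black boxes alone).
sub blackbox_to_latch {
    my ($line) = @_;
    my ($type, $init, $in, $out, $ctrl) =
        $line =~ /^\.subckt\s+latch_(\w+)_(\w+)\s+in=(\S+)\s+out=(\S+)\s+control=(\S+)/
        or return $line;
    return ".latch $in $out $type $ctrl $init\n";
}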

Your Environment

kmurray commented 6 years ago

@vaughnbetz @KennethKent @jeanlego @aarongraham9: Any thoughts?

It seems to me we have only two options:

i) Convert .latches to black-boxes as described above.

ii) Synthesize each clock domain through ABC independently, and then re-stitch the results together into one netlist later.

jeanlego commented 6 years ago

Odin already has a built-in option to convert these into black boxes, thanks to @aarongraham9: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/ODIN_II/SRC/odin_ii.cpp#L333:L337

He also already made a conversion script to get these back. The principle is that Odin puts the clock information within the latch name, and the conversion simply takes these names and converts them back. Doing so does prevent ABC from doing some optimizations, and we saw a couple of slowdowns if memory serves right; Aaron can explain that in more detail.

As for splitting by clock domain, it is possible but would require a significant amount of effort. I will leave it to others to decide which route to take, but I think using Aaron's work to get this back running ASAP makes the most sense.

kmurray commented 6 years ago

@jeanlego Thanks for the reminder that some of this is already implemented!

I agree getting Aaron's black-box code back running seems the best approach to fix this.

@aarongraham9 What would be involved in enabling the black-box latches in the VTR flow? Would you be able to look into re-evaluating the QoR/Run-time impact of converting latches to black-boxes? If I recall the last evaluation was done with the old version of ABC; it would be good to know what the impact is with the new version of ABC.

aarongraham9 commented 6 years ago

Looking into this @jeanlego and @kmurray.

I'd like to try and see if the restoration script can be improved, so we can fix it that way. I'd rather not black-box them if possible, as then ABC isn't able to optimize much of anything and the QoR values are very bad in comparison to VTR 7, let alone the updated ABC/VTR 8.

vaughnbetz commented 6 years ago
  1. Aaron, can you post the QoR impact of using black box latches? It sounds like you have this data from a prior run; can you attach a spreadsheet and summary? If you don't have complete data, I think it would be good to rerun this so we have a precise QoR impact estimate, as the black box solution definitely seems easiest. If the QoR is not too bad, maybe it will be the winning candidate.

  2. I think the options are:

A. black box the latches: simplest, but hurts QoR. Need to see by how much.

B. black box the latches only for circuits with more than one clock. This mitigates the QoR loss; all VTR circuits are single clock so they're OK. I think this is likely the best option: make a script that by default converts to black boxes only if there is more than one clock. Arguments to the script can let users override this (always black box, never black box, auto black box).

C. For circuits with more than one clock, split the circuit into multiple clock domains, and synthesize each separately in ABC. This is essentially a netlist covering/colouring problem I think. One sub-netlist per clock domain (all src and dst registers in same clock domain along with the logic between them) plus one netlist for all the inter-clock domain logic and logic reachable / reaching primary I/Os (in this netlist, all flops should be converted to primary inputs and outputs to prevent sequential optimizations, which are unsafe between clock domains).

D. Fix the current script. Kevin doesn't think this is doable, as the latches are being renamed in unpredictable ways; it's not clear how we can match them.

Kevin and I think B is probably the best mix of being robust and mitigating QoR. C is probably the best solution for QoR, but is complex and could delay the VTR 8 release.

KennethKent commented 6 years ago

We agree that the "fix the current script" solution is not viable. Given the unpredictable ways in which latches are being renamed, it is not feasible to recover the latches. I have proposed an iterative black-box approach to achieve the circuit-splitting solution: we select one clock domain to remain and black-box the rest; run through ABC; and repeat for each of the clock domains. This should achieve the reductions without the separation/stitching of clock domains.

Regardless, the approach above and the base solution of using black boxes for ALL latches need to be tested and documented for QoR impact. Both of these are currently under development and test.


vaughnbetz commented 6 years ago

With the new approach you mention above (black-boxing all but one domain at a time), it seems like you'll get duplicate synthesis of the logic between clock domains. How do you ensure you get only one copy in the end?

Interestingly, none of these approaches, except black-boxing all latches, is fully safe for cross-clock-domain paths -- the registers involved in those paths shouldn't be duplicated or retimed, as they can form synchronizers etc. Since ABC doesn't understand that, I think fixing it is beyond the scope of what we want to do, at least for the foreseeable future.

jeanlego commented 6 years ago

I've opened a PR with a way to test all three options; it also includes simulation to verify the BLIF pre- vs. post-ABC.

jeanlego commented 5 years ago

See PR #408. I've included in the flow a simulation before and after ABC.

For 1 case, Odin could not simulate: fpu_hard_block_arch: hard_fpu_arch_timing.xml/mm3.v/common (failed to include simulation) OK. This is because the Odin simulator doesn't have the hard block built for the simulator.

For 2 cases, Odin failed simulation, but that is because they have no input pins. I'll make a workaround so that it generates N cycles for the simulator, derived from the output vector file length.

Here are the culprits:
func_formal_flow: k6_frac_N10_40nm.xml/const_true.blif/common (failed: odin Simulation) OK
func_formal_flow: k6_frac_N10_40nm.xml/const_false.blif/common (failed: odin Simulation) OK

But all other cases pass Odin simulation. The good news is that the iterative approach seems to yield better results in a lot of cases!

To run the flow with the iterative approach, you can set the variables in run_vtr_flow like so:

##########
# ABC flow modifiers
# type 1 is skipping ABC
# type 2 is using only vanilla latches
# type 3 is black-boxing all latches
# type 4 is iterative black-boxing of latches per clock domain
my $abc_flow_type = 4;
my $use_new_latches_restoration_script = 1;
my $odin_run_simulation = 1;
my $disable_simulation_failure = 1;

Also, with the iterative approach I think we could enable sequential optimization within ABC, which brings me to the following (I'll explore to see if it does indeed work):

The .latch <input> <output> <type> <control> <init_val> directive in the extended BLIF can list <type> as an integer number. This number represents the class of a given register. The register class can be used to limit transformations during sequential synthesis. For example, two registers that are proved equivalent cannot be merged if they belong to different classes. When computing equivalent registers, computation is limited to one clock domain at a time while the registers of other domains are treated as primary inputs. In the example below, the second register is listed as having class 15. (https://people.eecs.berkeley.edu/~alanmi/publications/other/boxes01.pdf)
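If that description is accurate, a multi-clock netlist could be annotated along these lines (illustrative only; the exact field layout is my assumption based on the quoted text):

# hypothetical two-clock example: class 14 for latches on top^CLK1,
# class 15 for latches on top^CLK2
.latch n100 ff_a 14 top^CLK1 0
.latch n200 ff_b 15 top^CLK2 0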

The thought is to use ABC in two passes. Before the first pass, register all the clock domains in a file and assign each a class, then rewrite the BLIF file to use the classes for the latches. The first pass runs a sequential-only optimization through ABC, after which the clock domains are recovered from the classes file. The second pass runs the iterative approach with combinational optimization only.

With all that said, here are the results of vtr_reg_strong:

VTR_REG_STRONG

122 QoR test failures

0 run failures

regression_tests vtr_reg_strong strong_fpu_hard_block_arch

pass

regression_tests vtr_reg_strong strong_fracturable_luts

pass

regression_tests vtr_reg_strong strong_mcnc

pass

regression_tests vtr_reg_strong strong_no_timing

pass

regression_tests vtr_reg_strong strong_timing

pass

regression_tests vtr_reg_strong strong_titan

pass

regression_tests vtr_reg_strong strong_power

pass

regression_tests vtr_reg_strong strong_func_formal_flow

arch circuit changes golden result
k6_frac_N10_40nm always_true.blif num_pre_packed_nets: 1 7
k6_frac_N10_40nm always_true.blif num_post_packed_nets: 1 7
k6_frac_N10_40nm always_true.blif placed_CPD_est: nan 0.734653
k6_frac_N10_40nm always_true.blif placed_setup_TNS_est: 0 -0.734653
k6_frac_N10_40nm always_true.blif placed_setup_WNS_est: 0 -0.734653
k6_frac_N10_40nm always_true.blif routed_wirelength: 0 7
k6_frac_N10_40nm always_true.blif num_io: 0 6
k6_frac_N10_40nm always_false.blif num_pre_packed_nets: 1 7
k6_frac_N10_40nm always_false.blif num_post_packed_nets: 1 7
k6_frac_N10_40nm always_false.blif placed_CPD_est: nan 0.734653
k6_frac_N10_40nm always_false.blif placed_setup_TNS_est: 0 -0.734653
k6_frac_N10_40nm always_false.blif placed_setup_WNS_est: 0 -0.734653
k6_frac_N10_40nm always_false.blif routed_wirelength: 0 7
k6_frac_N10_40nm always_false.blif num_io: 0 6
k6_frac_N10_40nm false_path_mux.blif num_pre_packed_nets: 4 7
k6_frac_N10_40nm false_path_mux.blif num_pre_packed_blocks: 6 8
k6_frac_N10_40nm false_path_mux.blif num_post_packed_nets: 4 5
k6_frac_N10_40nm false_path_mux.blif placed_CPD_est: 0.708653 1.37865
k6_frac_N10_40nm false_path_mux.blif placed_setup_TNS_est: -0.708653 -1.37865
k6_frac_N10_40nm false_path_mux.blif placed_setup_WNS_est: -0.708653 -1.37865
k6_frac_N10_40nm false_path_mux.blif num_io: 3 4
k6_frac_N10_40nm mult_4x4.blif num_post_packed_nets: 21 27
k6_frac_N10_40nm mult_5x5.blif num_pre_packed_nets: 47 43
k6_frac_N10_40nm mult_5x5.blif num_pre_packed_blocks: 57 53
k6_frac_N10_40nm mult_5x5.blif logic_block_area_used: 215576 161682
k6_frac_N10_40nm mult_5x5.blif num_clb: 4 3
k6_frac_N10_40nm rca_3bit.blif num_pre_packed_nets: 12 13
k6_frac_N10_40nm rca_3bit.blif num_pre_packed_blocks: 16 17
k6_frac_N10_40nm rca_4bit.blif num_pre_packed_nets: 15 17
k6_frac_N10_40nm rca_4bit.blif num_pre_packed_blocks: 20 22
k6_frac_N10_40nm rca_4bit.blif placed_CPD_est: 1.04365 1.71365
k6_frac_N10_40nm rca_4bit.blif placed_setup_TNS_est: -4.54826 -6.55826
k6_frac_N10_40nm rca_4bit.blif placed_setup_WNS_est: -1.04365 -1.71365

regression_tests vtr_reg_strong strong_func_formal_vpr

pass

regression_tests vtr_reg_strong strong_bounding_box

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm stereovision3.v min_chan_width_routing_area_total: 69362.3 49980.0
k6_N10_mem32K_40nm stereovision3.v crit_path_routing_area_total: 90639.5 65453.8
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_breadth_first

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 262
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 292
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm stereovision3.v routing_area_total: 289172. 207176.
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_echo_files

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v routed_wirelength: 1110 650
k6_N10_mem32K_40nm stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm stereovision3.v min_chan_width_routing_area_total: 63135.7 49980.0
k6_N10_mem32K_40nm stereovision3.v crit_path_routed_wirelength: 796 465
k6_N10_mem32K_40nm stereovision3.v crit_path_routing_area_total: 82288.3 65453.8
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_constant_outputs

pass

regression_tests vtr_reg_strong strong_sweep_constant_outputs

pass

regression_tests vtr_reg_strong strong_fix_pins_pad_file

arch circuit changes golden result
k6_N10_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_fix_pins_random

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm stereovision3.v min_chan_width_routing_area_total: 69362.3 49980.0
k6_N10_mem32K_40nm stereovision3.v crit_path_routing_area_total: 90639.5 65453.8
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_global_routing

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19
k6_N10_mem32K_40nm_nonuniform stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm_nonuniform stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm_nonuniform stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm_nonuniform stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm_nonuniform stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm_nonuniform stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm_nonuniform stereovision3.v num_clb: 22 19
k6_N10_mem32K_40nm_pulse stereovision3.v num_pre_packed_nets: 287 262
k6_N10_mem32K_40nm_pulse stereovision3.v num_pre_packed_blocks: 317 292
k6_N10_mem32K_40nm_pulse stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm_pulse stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm_pulse stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm_pulse stereovision3.v min_chan_width: 12 16
k6_N10_mem32K_40nm_pulse stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm_pulse stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_manual_annealing

arch circuit changes golden result
k6_frac_N10_40nm stereovision3.v num_pre_packed_nets: 287 262
k6_frac_N10_40nm stereovision3.v num_pre_packed_blocks: 317 292
k6_frac_N10_40nm stereovision3.v num_clb: 16 13

regression_tests vtr_reg_strong strong_pack

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_pack_and_place

arch circuit changes golden result
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_nets: 287 267
k6_N10_mem32K_40nm stereovision3.v num_pre_packed_blocks: 317 297
k6_N10_mem32K_40nm stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_fc_abs

arch circuit changes golden result
k6_N10_mem32K_40nm_fc_abs stereovision3.v num_pre_packed_nets: 287 262
k6_N10_mem32K_40nm_fc_abs stereovision3.v num_pre_packed_blocks: 317 292
k6_N10_mem32K_40nm_fc_abs stereovision3.v device_width: 8 7
k6_N10_mem32K_40nm_fc_abs stereovision3.v device_height: 8 7
k6_N10_mem32K_40nm_fc_abs stereovision3.v device_grid_tiles: 64 49
k6_N10_mem32K_40nm_fc_abs stereovision3.v logic_block_area_total: 2.23746e+06 1.07788e+06
k6_N10_mem32K_40nm_fc_abs stereovision3.v min_chan_width_routing_area_total: 123323. 88828.2
k6_N10_mem32K_40nm_fc_abs stereovision3.v crit_path_routing_area_total: 144103. 104221.
k6_N10_mem32K_40nm_fc_abs stereovision3.v num_clb: 22 19

regression_tests vtr_reg_strong strong_multiclock

pass

regression_tests vtr_reg_strong strong_custom_pin_locs

pass

regression_tests vtr_reg_strong strong_custom_switch_block

pass

regression_tests vtr_reg_strong strong_custom_grid

pass

regression_tests vtr_reg_strong strong_flyover_wires

pass

regression_tests vtr_reg_strong strong_minimax_budgets

arch circuit changes golden result
k6_frac_N10_frac_chain_mem32K_40nm stereovision3.v logic_block_area_used: 970092 700622
k6_frac_N10_frac_chain_mem32K_40nm stereovision3.v num_clb: 18 13

regression_tests vtr_reg_strong strong_scale_delay_budgets

arch circuit changes golden result
k6_frac_N10_frac_chain_mem32K_40nm stereovision3.v num_clb: 18 15

regression_tests vtr_reg_strong strong_verify_rr_graph

pass

regression_tests vtr_reg_strong strong_analysis_only

pass

regression_tests vtr_reg_strong strong_route_only

pass

kmurray commented 5 years ago

Thanks @jeanlego for putting this all together, it looks like a lot of work!

Regression Failures

For 1 case, Odin could not simulate: fpu_hard_block_arch: hard_fpu_arch_timing.xml/mm3.v/common (failed to include simulation) OK. This is because the Odin simulator doesn't have the hard block built for the simulator.

Seems like disabling simulation is the right approach here.

ABC Latch Classes

The thought is to use ABC in two passes. Before the first pass, register all the clock domains in a file and assign each a class, then rewrite the BLIF file to use the classes for the latches. The first pass runs a sequential-only optimization through ABC, after which the clock domains are recovered from the classes file. The second pass runs the iterative approach with combinational optimization only.

Provided the class feature works correctly, why would we need to go through ABC in two passes? Wouldn't one pass be sufficient (we could then restore the clocks from the saved class information)?

I will note that at one point I did look into ABC's latch class feature, but when I tried it the class information was not included in ABC's output BLIF (i.e. it was dropped just like the regular clocks...).

QoR Failures

Taking a look at the QoR failures you've included, they all seem OK.

They mostly look like improvements caused by fewer netlist primitives (fewer pre-packed blocks and pre-packed nets) for the stereovision3 benchmark.

failed to include simulation

I'm seeing a large number of "failed to include simulation OK" lines when I use run_vtr_flow on this branch.

Can you clarify what this means? I think it means ODIN is not performing simulation to verify that the ODIN and ABC BLIFs are equivalent.

If that is the case, and we don't want to run simulation, then we probably should only print OK (since skipping the simulation is expected). Any thoughts?

Verification

Can you clarify what testing this branch has gone through?

How have you verified that the different flow types are functionally correct?

How have you verified that the correct clocks are being restored?

jeanlego commented 5 years ago

ABC Latch Classes

given

The register class can be used to limit transformations during sequential synthesis

I understood that I would need to separate the sequential synthesis/optimization into another pass. I have very limited knowledge of ABC, so any clarification is welcome.

Also, if classes are indeed respected throughout ABC optimization, I think it would be a good approach to not black-box the latches and simply let ABC handle this. The script included already provides a way to list all the clock domains and modify them; it would be trivial to tag .latches using it.

Failed to include simulation

This means that trying to run the simulator with a given architecture & Verilog file exited with a failure, so we skip verification using the simulator. To know what exactly failed, look at the file stored within the temp directory. This is a case-by-case issue; I'd need to know which one failed to tell you why.

I'll look into it deeper. I only had 2 failures to simulate for vtr_reg_strong, and only 1 failed to be included at all.

The simulator does have good hard block support (it supports RAMs, multipliers, etc.) but doesn't have the hard block logic for the arch hard_fpu_arch_timing.xml.

By default the run_vtr_flow script is set to run vanilla (the way we currently use it), but with the simulator now included.

Verification

I've tested the branch using vtr_reg_basic/strong. But again, I would not merge this yet, as I'd like to have more hand-written vector files and benchmarks for ODIN: I have regenerated all of the current vectors to now include the clock and display both edges of the clock. I'd like to have something ready to be merged very soon.

I haven't encountered any problems, and the simulator produces the expected output so far, but I'd like to be sure before overwriting the benchmarking suite vectors.

vaughnbetz commented 5 years ago

Thanks @jeanlego for pulling this together.

Some questions:

kmurray commented 5 years ago

Here are the results I generated from the mulipass_abc_flow branch on the VTR benchmarks.

[image: QoR results]

As mentioned, the iterative approach seems to produce fewer primitives, lower channel width, and lower wirelength, but higher critical path delay.

However, since all the VTR benchmarks (except the tiny stereovision3 benchmark) are single clock, we would expect the results to be the same as vanilla -- for the single-clock case we'd expect iterative to reduce down to vanilla.

jeanlego commented 5 years ago

The difference could be that I force an initial value on ABC using the init command, and I now force Odin to default to an initial value of 0 rather than X for .latch unless specified otherwise.

I had to do this for the simulator, to ensure an initial value that matched before and after ABC.

I defined "clock domains" to be a triplet of driver, edge sensitivity, and initial value. Each run includes a report.clk file providing the different "clock domains" that are then used to iterate. Each individual iteration includes a report of which domains are vanilla and which are black-boxed, and a file that displays the number of latches within each clock domain. Because of this, a lot of single-clock-driver circuits can have many "clock domains" for my script. I did try to use only the driver for a clock domain, but it would fail to give the right result.
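To illustrate that definition, here is a sketch of how latches could be grouped into such triplets (hypothetical Perl, not the actual script):

use strict;
use warnings;

# Group latches into "clock domains" keyed by the
# (driver, edge sensitivity, initial value) triplet.
my %domains;
while (my $line = <STDIN>) {
    my ($out, $type, $ctrl, $init) =
        $line =~ /^\.latch\s+\S+\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/
        or next;
    push @{ $domains{"$ctrl $type $init"} }, $out;
}

# Report the latch count per domain (cf. the report.clk file).
for my $triplet (sort keys %domains) {
    printf "%s : %d latches\n", $triplet, scalar @{ $domains{$triplet} };
}

(Run as: perl group_latches.pl < circuit.odin.blif)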

I also had to modify Odin: it used to only emit rising-edge latches, but it now has support for all 5 types. Some result discrepancies could come from there if you compare with the old version.

jeanlego commented 5 years ago

Vanilla is running with the patched restoration script (a typo prevented the script from rewinding the file; that is the original flow). The blanket approach is when we black-box all the latches, and iteration is one "clock domain" triplet at a time (described above). But the biggest takeaway here is that everything other than iteration fails to match initial vs. post-ABC simulation.

jeanlego commented 5 years ago

Should this be closed based on #411?

KennethKent commented 5 years ago

Yes, please :)


jeanlego commented 5 years ago

Closing, as PR #411 fixed the issue.