verilog-to-routing / vtr-verilog-to-routing

Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org

VPR Memory Error when input pin equivalence applied to DSPs and RAMs #1475

Closed helen-23 closed 4 years ago

helen-23 commented 4 years ago

When running the titan_new DLA benchmark variants with the stratixiv_arch.timing.xml architecture under the Titan flow, VPR succeeds. However, after input pin equivalence is applied to the M9K block in stratixiv_arch.timing.xml, VPR reports memory access failures during routing.

Expected Behaviour

VPR should pass when input pin equivalence is applied to the DSP and RAM blocks in stratixiv_arch.timing.xml.

Current Behaviour

When input pin equivalence is applied to the M9K block in stratixiv_arch.timing.xml:

  1. For the DLA_BSC and DLA_ELT benchmarks, VPR reported a segmentation fault during prune_route_tree_recurr() in route_tree_timing.cpp. This occurred at the beginning of routing. Please see attached VPR log files and screenshot of command line error messages for details.

  2. For the DLA_LRN benchmark, VPR aborted at assert(node) due to node being NULL in the same function mentioned above. Please see attached VPR log file and screenshot of command line error messages for details.

Possible Solution

Perhaps there are char-type variables in the router that are used to specify sizes, instead of uint16, so some values overflowed.

Steps to Reproduce

  1. Check out my branch, "vqm2bliff_one_lut_removal", which contains all changes required to run the DLA variants.
  2. Unzip the attached DLA circuits (DLA_BSC, DLA_ELT, and DLA_LRN).
  3. Unzip the modified architecture file (stratixiv_arch.timing.experiment2.xml).
  4. Run the titan_flow script with the DLA circuits and the modified architecture file. DO NOT run titan_flow.py with the sanitizer build turned on, because there is currently an integer overflow in the hash function due to a multiply-add. Be sure to turn on the --fit and --gen_post_fit_netlist options, because the DLA circuits need a post-fit netlist for VPR. An example command looks like the following:

```
/scripts/titan_flow.py \
    -q DLA_BSC/quartus2_proj/DLA.qpf \
    -a stratixiv_arch.timing.experiment2.xml \
    --fit \
    --gen_post_fit_netlist \
    --titan_dir \
    --vqm2blif_dir /build/utils/vqm2blif \
    --quartus_dir /tools/intel/install/fpga/18.1/standard/quartus/bin \
    --vpr_dir /vpr
```

  5. Unzip vpr.sdc.
  6. Now, with the sanitizer build turned on, run VPR with the post-fit BLIF and vpr.sdc. An example command looks like the following:

```
/vpr/vpr \
    stratixiv_arch.timing.experiment2.xml \
    DLA_stratixiv_post_fit.blif \
    --sdc_file vpr.sdc \
    --route_chan_width 300 \
    --max_router_iterations 400 \
    --timing_analysis on \
    --timing_report_npaths 1000
```

Context

Trying out different architecture experiments to make VPR Fmax results more comparable to those of Quartus II for large circuits that are RAM-extensive.

Your Environment

  * VTR revision used: 8.0
  * Operating System and version: Linux Ubuntu 18.04.4 LTS (Bionic Beaver)
  * Compiler version:

Files

[DLA_BSC_vpr_stdout.log.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044073/DLA_BSC_vpr_stdout.log.zip)
[DLA_ELT_vpr_stdout.log.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044072/DLA_ELT_vpr_stdout.log.zip)
![DLA_BSC_and_DLA_ELT_seg_fault_error_message](https://user-images.githubusercontent.com/25372596/89693493-93a9b980-d8dc-11ea-8657-9cd3c8a3e553.PNG)
[DLA_LRN_vpr_stdout.log.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044071/DLA_LRN_vpr_stdout.log.zip)
![DLA_LRN_assertion_error_message](https://user-images.githubusercontent.com/25372596/89693489-91dff600-d8dc-11ea-9840-11f323f6d06f.PNG)
[DLA_BSC.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044076/DLA_BSC.zip)
[DLA_ELT.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044075/DLA_ELT.zip)
[DLA_LRN.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044074/DLA_LRN.zip)
[vpr.sdc.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044070/vpr.sdc.zip)
[stratixiv_arch.timing.experiment2.xml.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044077/stratixiv_arch.timing.experiment2.xml.zip)
aman26kbm commented 4 years ago

I posted it on the VTR user group as well (https://groups.google.com/g/vtr-users/c/3Fcuo4r4wz4), in case someone else sees it. Please respond there if you can. Appreciate any help.

vaughnbetz commented 4 years ago

Adding Xifan in case he can share his partial crossbar for memory blocks, as that may help. From looking at the arch file it's not apparent to me what's causing the segfault. Alas, we need more user-friendly errors on arch file parsing for cases like this.

Best,

Vaughn


aman26kbm commented 4 years ago

Thanks, @vaughnbetz

@tangxifan , it'll be very helpful if you can share your XML arch with the crossbar description.

tangxifan commented 4 years ago

@aman26kbm You can look at line 12165 in the XML to see how I create crossbars for part of the DSP inputs. Hope it can help.

stratixiv_arch_DSP_pin_eq.timing.xml.zip

aman26kbm commented 4 years ago

Thanks, Xifan. I looked at it, and it is very similar to how I'm doing it. The weird thing is that it works for the DSP slice, but it doesn't work for memory (or for another hard block I am adding to the architecture). So somehow the crossbar on memory is triggering some specific code path in VPR that has a bug.

I am attaching a zip file that has an arch file and a design file. If you run them, you'll see the segmentation fault.

Desktop.zip

aman26kbm commented 4 years ago

@vaughnbetz , regarding this point that you had mentioned:

I think that's overly conservative; I would use either the lower delay and no pin equivalence or the higher delay and pin equivalence.

I'm a bit confused here, so I wanted to enumerate the options we have to get some clarity.

Option #1: Express the crossbar (using the "complete" keyword) + specify the delay of the local mux from COFFE (using the "delay_constant" tag) + specify pin equivalence ("equivalent = full") -> This is the right thing, but it is crashing.

Option #2: No crossbar specification (use the "direct" keyword) + specify the delay of the local mux from COFFE for these direct wires (using the "delay_constant" tag) + specify pin equivalence ("equivalent = full") -> I think this is the quick-and-dirty thing mentioned above. Haven't tried this; it may crash or may work.

Option #3: No crossbar specification (use the "direct" keyword) + specify the delay of the local mux from COFFE + no pin equivalence -> This is the overly conservative thing mentioned above.

Option #4: No crossbar specification (use the "direct" keyword) + 0 delay + no pin equivalence. A variation of option #4 (option #4.5): no crossbar specification (use the "direct" keyword) + some fraction of the local mux delay obtained from COFFE + no pin equivalence.
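For concreteness, here is roughly what options #1 and #3 look like in the VPR architecture language. The pb_type, port, and delay names below are illustrative placeholders, not the actual stratixiv_arch lines:

```xml
<!-- Option #1 (sketch): full local crossbar + pin equivalence + mux delay -->
<input name="data_in" num_pins="72" equivalent="full"/>
...
<interconnect>
  <complete name="local_xbar" input="ram_block.data_in" output="mem_slice.data_in">
    <!-- local mux delay from COFFE; the value here is a placeholder -->
    <delay_constant max="180e-12" in_port="ram_block.data_in" out_port="mem_slice.data_in"/>
  </complete>
</interconnect>

<!-- Option #3 (sketch): direct wires + mux delay, no pin equivalence -->
<input name="data_in" num_pins="72" equivalent="none"/>
...
<interconnect>
  <direct name="data_wires" input="ram_block.data_in" output="mem_slice.data_in">
    <delay_constant max="180e-12" in_port="ram_block.data_in" out_port="mem_slice.data_in"/>
  </direct>
</interconnect>
```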

Do you have a recommendation of what the reasonable thing to do is, for now, until the crash issue is fixed?

vaughnbetz commented 4 years ago

I would do #2 if it works (it used to have a bug, but Helen fixed that, so hopefully it now works). It is almost the same as #1. Note that for large blocks the lookahead is not great right now (it always tries to go to the lower-left corner of the block), so a lower astar_fac (0.75, maybe even 0.5 if the CPU time is not crazy) may be helpful.

If #2 and #1 don't work, #4 is the next best, as it is self-consistent (but harder to route). With #4 you could also leave out the local crossbar area, since you are not using one (more self-consistent still).

Vaughn


aman26kbm commented 4 years ago

Thanks. I'll try #2 tomorrow to see what happens.