
Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org

VPR Memory Error when input pin equivalence applied to DSPs and RAMs #1475

Closed helen-23 closed 4 years ago

helen-23 commented 4 years ago

When running the titan_new DLA benchmark variants with the stratixiv_arch.timing.xml architecture under the Titan flow, VPR succeeded. However, after input pin equivalence was applied to the M9K block in stratixiv_arch.timing.xml, VPR reported memory access failures during routing.

Expected Behaviour

VPR should pass when input pin equivalence is applied to the DSP and RAM blocks in stratixiv_arch.timing.xml.

Current Behaviour

When input pin equivalence is applied to the M9K block in stratixiv_arch.timing.xml:

  1. For the DLA_BSC and DLA_ELT benchmarks, VPR reported a segmentation fault during prune_route_tree_recurr() in route_tree_timing.cpp. This occurred at the beginning of routing. Please see attached VPR log files and screenshot of command line error messages for details.

  2. For the DLA_LRN benchmark, VPR aborted at assert(node) due to node being NULL in the same function mentioned above. Please see attached VPR log file and screenshot of command line error messages for details.

Possible Solution

Maybe there are char-type variables in the router that are used to store sizes, instead of uint16, so some values wrapped around.

Steps to Reproduce

  1. checkout my branch, "vqm2bliff_one_lut_removal", which contains all changes required to run the DLA variants
  2. unzip attached DLA circuits (DLA_BSC, DLA_ELT, and DLA_LRN)
  3. unzip the modified architecture file (stratixiv_arch.timing.experiment2.xml)
  4. Run the titan_flow script with the DLA circuits and the modified architecture file. DO NOT run titan_flow.py with the sanitizer build turned on, because there is currently an integer overflow in the hash function due to multiply-add. Be sure to turn on the options --fit and --gen_post_fit_netlist, because the DLA circuits need a post-fit netlist for VPR. An example command looks like the following:

    /scripts/titan_flow.py \
        -q DLA_BSC/quartus2_proj/DLA.qpf \
        -a stratixiv_arch.timing.experiment2.xml \
        --fit \
        --gen_post_fit_netlist \
        --titan_dir \
        --vqm2blif_dir /build/utils/vqm2blif \
        --quartus_dir /tools/intel/install/fpga/18.1/standard/quartus/bin \
        --vpr_dir /vpr

  6. unzip vpr.sdc
  7. Now, with the sanitizer build turned on, run VPR with the post-fit BLIF and vpr.sdc. An example command looks like the following:

    /vpr/vpr \
        stratixiv_arch.timing.experiment2.xml \
        DLA_stratixiv_post_fit.blif \
        --sdc_file vpr.sdc \
        --route_chan_width 300 \
        --max_router_iterations 400 \
        --timing_analysis on \
        --timing_report_npaths 1000

Context

Trying out different architecture experiments to make VPR Fmax results more comparable to those of Quartus II for large circuits that are RAM-intensive.

Your Environment

  * VTR revision used: 8.0
  * Operating System and version: Linux Ubuntu 18.04.4 LTS (Bionic Beaver)
  * Compiler version:

Files

[DLA_BSC_vpr_stdout.log.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044073/DLA_BSC_vpr_stdout.log.zip)
[DLA_ELT_vpr_stdout.log.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044072/DLA_ELT_vpr_stdout.log.zip)
![DLA_BSC_and_DLA_ELT_seg_fault_error_message](https://user-images.githubusercontent.com/25372596/89693493-93a9b980-d8dc-11ea-8657-9cd3c8a3e553.PNG)
[DLA_LRN_vpr_stdout.log.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044071/DLA_LRN_vpr_stdout.log.zip)
![DLA_LRN_assertion_error_message](https://user-images.githubusercontent.com/25372596/89693489-91dff600-d8dc-11ea-9840-11f323f6d06f.PNG)
[DLA_BSC.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044076/DLA_BSC.zip)
[DLA_ELT.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044075/DLA_ELT.zip)
[DLA_LRN.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044074/DLA_LRN.zip)
[vpr.sdc.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044070/vpr.sdc.zip)
[stratixiv_arch.timing.experiment2.xml.zip](https://github.com/verilog-to-routing/vtr-verilog-to-routing/files/5044077/stratixiv_arch.timing.experiment2.xml.zip)
vaughnbetz commented 4 years ago

Hi Helen, if you rerun VPR with --router_high_fanout_threshold -1, it should turn off the code that uses that function. That will let us see whether that function is the problem, or whether it is just the first location that crashes after memory has been corrupted somewhere else. Can you run that test and post the results here?

vaughnbetz commented 4 years ago

Checked CBRR, t_rr_node_storage, t_rr_node_route_inf, and t_edge_size. All can store at least short/uint16 values everywhere (pin numbers, edge counts, etc.), so there is no overflow in them with large IPIN or OPIN equivalence classes of more than 256 members. So it doesn't look like overflow, unless it's happening in some temporary structures in the rr-graph building code (which is still possible; I'll look at them next).

vaughnbetz commented 4 years ago

Checked the rr-graph building code and rr_node_indices and again don't see anything that is using char for storage of rr-related data. So I don't see an issue with overflow in the rr-graph.

vaughnbetz commented 4 years ago

Also no problems I can see in the routing data structures (t_rt_tree and t_trace). No chars in them, nothing that should overflow. @tangxifan : adding you in case you're seeing (or not seeing) a similar problem on your branch when you make M9K or other blocks have equivalent inputs. I think doing M9Ks alone is probably a good first step; they are not as large as the other blocks, so there is one less thing changing (they don't exercise as many code paths).

tangxifan commented 4 years ago

Hi @vaughnbetz I have tried on my side and see the same router failure. In my test it is an assertion error at route_tree_timing.cpp:938, which says the node id is invalid.

/research/ece/lnis/USERS/tang/github/vtr-verilog-to-routing/vpr/src/route/route_tree_timing.cpp:938 prune_route_tree_recurr: Assertion 'node' failed.

I've attached my vpr.out FYI: neuron_vpr.zip

vaughnbetz commented 4 years ago

Thanks Xifan. That's in the same routine: prune_route_tree_recurr. Suggested things to try:

  1. rerun vpr with --router_high_fanout_threshold -1 (as I mentioned to Helen above) to turn off that code and see if it completes. If it does, we've narrowed it down.

  2. If that crashes too, then I think trying an experiment with only M9Ks having input pin equivalence (assuming that's not what you did) is a good idea, to narrow down the change. This case crashes for Helen, so I expect it to crash.

  3. Then look at the rr-graph created for the M9K and see if there is something strange in the IPIN connectivity (especially to the sink, and that there is one sink for the equivalent inputs) and that net_rr_terminals is correct (all connections ending at a single M9K RAM go to that same sink).

helen-23 commented 4 years ago

Hi @vaughnbetz @tangxifan ,

I've tried running VPR with --router_high_fanout_threshold -1, but it didn't seem to turn the function off. The run still aborted in the same routine, prune_route_tree_recurr. Is there another option that turns the function off?

In the previous run, DLA_BSC and DLA_ELT aborted when trying to access the 'child' attribute of 'node', while DLA_LRN aborted with an assertion error, the same as what Xifan saw in his run. This time, however, all three designs failed when trying to access "edge->child".

I also took a look into the logic of the code. prune_route_tree_recurr() is called by prune_route_tree(), which is called from route_timing.cpp. Before each time prune_route_tree() is called, there is a function that checks whether the tree is valid: check_tree_valid

But this assertion only takes effect if the VTR assertion level is set to 3. I'm thinking of trying that out. If the assertion is never triggered but VPR keeps failing in prune_route_tree, then it's probably not the node and edge pointers themselves going out of bounds, but rather (I suspect) something odd in the structure of the tree, or in how the nodes and edges are deleted during pruning.

helen-23 commented 4 years ago

Found the option that turns route tree pruning off: --min_incremental_reroute_fanout \<int>. I set \<int> to 10000 so that it is definitely higher than the largest fanout of any net.

Turns out prune_route_tree_recurr was indeed the first location that crashed due to memory corruption elsewhere. After disabling that function, VPR failed at another location, free_route_tree(t_rt_node*), which recursively frees the nodes and edges in the route tree in a depth-first traversal. Different DLA variants failed for different reasons (same as our previous observations).

DLA_BSC (failed at trying to access attribute of NULL pointer node): min_incre_reroute_1000

DLA_ELT (failed because the address of an edge is misaligned): min_incre_reroute_1000_elt

helen-23 commented 4 years ago

Also tried adding assertions to check the validity of the node tree at various locations in the code:

    VTR_LOG("Checking if routing skeleton tree is valid...");
    VTR_ASSERT(is_valid_skeleton_tree(rt_root));

It turns out the tree was still valid before going into free_route_tree(rt_root), so there should not be any NULL or misaligned pointers in the original tree; otherwise, is_valid_skeleton_tree(rt_root) would have aborted during traversal.

(Wild guess) Is it possible that the input pin equivalence somehow introduced multiple edges between the same pair of nodes?

vaughnbetz commented 4 years ago

Nice detective work Helen!

  1. Yes, I think turning the VPR_ASSERT level up to the max and running is a very good idea.
  2. I think checking the rr-graph is a very good idea. Probably the best thing to do is to dump the rr_graph.echo file (see the developer guide for how) and check the generated rr-graph around the IPINs and SINKs of the M9K RAM (we should only turn on input pin equivalence for it). Use a small design/device so the file is not too big. Do we properly connect all the IPINs to a new SINK that represents all connections to an M9K RAM? Do we properly set that SINK in net_rr_terminals as the node we should route to when routing to an M9K input?
helen-23 commented 4 years ago

Thanks, Vaughn, sounds like a plan. I've tried it on a smaller circuit (the dot product circuit I had earlier to test the DLA designs), and it passed with M9K input pin equivalence enabled. This is probably because the design uses very few, if any, RAM blocks. I'm trying it again with DSP input pin equivalence enabled, and I'll update once that's done.

helen-23 commented 4 years ago

Also trying just enabling M9K input pin equivalence on a smaller Titan benchmark.

tangxifan commented 4 years ago

Hi @vaughnbetz

vaughnbetz commented 4 years ago

Thanks Xifan and Helen. Xifan, Helen told me today that she has a candidate bug location identified: it looks like the route_tree code does not properly remove connections that look like: IPIN -> SINK -> SINK -> SINK

and so on. I think this may never have been exercised in the code before because logic blocks (when they have input equivalence on) build a clustered netlist that just brings a signal in once (connects to one input). So there wouldn't be a reason to construct a route tree like this. I suspect the general (RAM, DSP, ...) clustered netlist building code is instead building one connection per input pin to a RAM or DSP block, despite the pins being equivalent. That is legal, so the root bug really is in the router not properly handling this case (i.e. I'd rather fix the router than change the netlist builder, although this might be fixable in the clustered netlist builder as well).

Helen is going to try fixing the route_tree bug and hopefully that will fix this.

helen-23 commented 4 years ago

Thank you, Vaughn, for explaining it all!

Another issue came up though.

For the case we've observed (as Vaughn mentioned above), here is an example of such a tree structure that failed during delete_route_tree() (printed with print_node_tree() to the log file):

print_tree_segfault

During deletion, a depth-first approach is used to delete each node from the leaves upward. Once node 33371 is deleted via the first edge that reaches it, subsequent deletions of node 33371 via the other 10 edges result in a segfault. The same thing happens if the prune_route_tree() routine is called.

I fixed the delete_route_tree() so that it only removes the first instance of a child node if there are multiple of the same child node under one parent node, and it worked:

print_tree_segfault_fixed

However, as the test went on, another type of tree structure showed up, in which a child node is associated with multiple parent nodes. This is not considered a valid tree structure, so an assertion error was reported by check_valid_skeleton_tree():

print_tree_assertion error

vaughnbetz commented 4 years ago

Thanks Helen. I think that is another bug -- a SINK should be allowed to be part of more than one tree branch when it has capacity > 1. So I think relaxing the error check for that case is fine, and hopefully fixes this bug.

helen-23 commented 4 years ago

Thanks, Vaughn,

Fixed delete_route_tree() and prune_route_tree() to only free a node once in the tree. Also modified verify_traceback_route_tree_equivalent() and collect_route_tree_connections() to use std::multiset instead of std::set, since there can now be multiple tuples containing the same values {prev_node, prev_switch, to_node}.

One more concern though: an rt_node struct has a parent_node and a parent_switch member. If a node is allowed to be part of more than one tree branch, its parent_node member now points to only one of its parents. I had to restructure the collect_route_tree_connections() logic because it relied on the node's parent_node member, which pointed at the wrong node in some cases. That was an easy fix, but I wasn't sure how to deal with parent_switch. As of now, I am assuming a node's parent_switch is the same across all of its parent nodes.

My small dot product circuit test finally passed most of routing with input pin equivalence applied to the DSP blocks, but failed with the following error:

Message: in check_sink: node 33371 does not connect to any terminal of net dot_product:generate_dsps[0].DP|dsp_block:generate_dsps[0].Dlast|datab[31] #7.
This error is usually caused by incorrectly specified logical equivalence in your architecture file.
You should try to respecify what pins are equivalent or turn logical equivalence off.

I'll run some other Titan circuits that use M9Ks, apply pin equivalence only to the M9K blocks, and see how that goes with my code modifications. I'm still worried that the parent node issue might cause problems in other parts of the logic. What would you recommend as the best way to deal with it?

helen-23 commented 4 years ago

Please see branch "allow_ram_dsp_input_equivalence" https://github.com/verilog-to-routing/vtr-verilog-to-routing/compare/allow_ram_dsp_input_equivalence

vaughnbetz commented 4 years ago

Hi Helen,

I can't see that branch. Maybe it isn't pushed to github yet (i.e. is only local on your machine)?

For the non-tree errors:

I think there are two basic ways we could attack it.

  1. We could try to special-case sinks, putting the same rt_node into the tree multiple times when we reach a SINK multiple times in a single route. I think this way is likely to become problematic: you can't have multiple parent pointers etc. from a single rt_node, so we can't really keep the rt_tree consistent.

  2. Right now, we will have two separate rt_nodes in a route tree if we go to the same sink twice. The rt_nodes will (in this logical equivalence case) have the same data (rr_node, parent_node, parent_switch), but there will be two of them. So that is fine. There is one problem I can see in that case: we have a mapping from an rr_node (index) to a (single) rt_node: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/b726e4471d1a87f3ca7f3999381ba63fcdc95cbe/vpr/src/route/route_tree_timing.cpp#L277

This mapping is only used in route_tree_timing.cpp to rapidly figure out where the rt_nodes that we are branching off of (for nets with fanout > 1) are, so we can quickly attach the routing of a new connection to the route_tree. We'll never branch off a SINK (it is the end of a connection), so storing only one of the rr_node -> rt_node mappings for a SINK (if the SINK is inserted multiple times) should be OK. But that should be documented with a comment in this file.

If that wasn't OK we could change the mapping to store a vector or list for each entry, but I think we don't need to (can just update the comment).

I think with this approach (#2) we can still use the parent pointers, parent_switch info etc. in the rt_tree so that should be OK.

vaughnbetz commented 4 years ago

I think there is a bug in check_sink. It should be marking netlist pins as done when sinks that could match them are found, but the code isn't correct if you can connect to the same sink more than once. https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/ed158403a987301be30965c2010283daf835d6ca/vpr/src/route/check_route.cpp#L205

I've fixed the code below (haven't compiled or tested though, but hopefully this works). Code is also simpler :).

// Check all the pins on this net, and find a pin that has not yet been reached that could match this SINK.
// Mark that found pin as done in the pin_done array.
     for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
           int pin_index = tile_pin_index(pin_id);
           int iclass = type->pin_class[pin_index];
           if (iclass == ptc_num) {
                    /* Could connect to same pin class on the same cluster-level block more than once.  Only   
                     * update pin_done for a pin that hasn't been reached yet.                 
                     */
                  if (pin_done[pin_id] == false) {
                      ifound++;
                      pin_done[ipin] = true;
                      break;   
                  }
            }
    }

While you are at it: the code below looks old and out of date, and should be replaced with a VTR_ASSERT(ifound <= 1);

// Delete this code and replace with VTR_ASSERT (ifound <= 1);
    if (ifound > 1 && is_io_type(type)) {
        VPR_FATAL_ERROR(VPR_ERROR_ROUTE,
                        "in check_sink: found %d terminals of net %d of pad %d at location (%d, %d).\n", ifound, size_t(net_id), ptc_num, i, j);
    }
helen-23 commented 4 years ago

Hi Vaughn,

Thank you for the detailed explanation! Yes, I can see how #2 would work best. I'm currently working on it and will report whether it works.

I think I got the link to my branch wrong, it should be the following: https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/allow_ram_dsp_input_equivalence

Now, I'm noticing a bunch of regtest failures in this branch due to items like: warning: no previous declaration for ‘void pathfinder_update_cost_from_route_tree(const t_rt_node*, int, float)’

I didn't touch that function so I'm not sure what's going on. I'll have to look into that later as well.

vaughnbetz commented 4 years ago

We have regtests that set compiler warnings as errors so somebody gets rid of the warnings. Not sure if the declaration for that function was accidentally deleted (which would trigger the warning)?

helen-23 commented 4 years ago

Hi Vaughn,

Good news: it finally worked! After investigating the code further, I realized the problem is actually simpler than I originally thought. There are two ways a route tree can be created:

  1. If a route tree is initialized from init_route_tree_to_source and is updated by add_subtree_to_route_tree, any repeated node would be added as a separate rt_node already, so if a single route hits node 33371 twice, two separate rt_nodes are made.

  2. If a route tree is initialized from a previous iteration, it is initialized through traceback_to_route_tree. This is the function that creates only one rt_node per inode ID while reading through the traceback.

So it turned out that only two functions needed to be modified: traceback_to_route_tree, so that it creates distinct rt_nodes when it hits the same SINK node more than once, and verify_traceback_route_tree_equivalent(), to use std::multiset instead of std::set. Along with the fix you provided for check_sink, VPR finally completed successfully for the dot product test circuit. Please note that I've applied input pin equivalence to the DSP blocks for this design. I'll try running the other circuits as well and check that the results make sense.

In case needed, please find my branch: allow_ram_dsp_input_equivalence https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/allow_ram_dsp_input_equivalence

vaughnbetz commented 4 years ago

Yaay! Nice work Helen; this was a tricky thing to figure out in a large code base! Great job reading code and figuring out algorithms.

tangxifan commented 4 years ago

@vaughnbetz @helen-23 Thanks for the bug fixing. I will port this to my local branch once it is merged to master, and continue my experiments.

Actually, I am curious about why the routing tree creation has such a problem only when we set pin equivalence to M9K, DSP etc. For CLBs with input pin equivalence, it works fine. Do you have any insights about this?

helen-23 commented 4 years ago

Hi @vaughnbetz @tangxifan , actually, one more thing. The Carpat design and a bunch of regtests failed in check_sink due to "node xxx does not connect to any terminal of net". I tried running the regtests locally; when the check_sink function is reverted to the original, the regtests pass. I think there is still something wrong in the check_sink function. It looks alright to me though. Vaughn, could you please take a quick look at that function in my branch and see if I did anything wrong?

helen-23 commented 4 years ago

I added back the iblk for-loop and the if statement, and it worked for both the regtests and the dot product design. So basically the only difference between my branch and master is adding the "break" in the loop. Frankly though, this is a situation where the code works but I'm not entirely sure why it does.

helen-23 commented 4 years ago

Update: the Carpat design also completed successfully. I ran this design with input pin equivalence applied to the M9K block.

vaughnbetz commented 4 years ago

The old check was a weaker check.

It would be good to get the stronger check to work. Helen, can you give some info on the net that is failing? Fanout, type of block it is connecting to (I suspect it is an IO), ...

vaughnbetz commented 4 years ago

OK, I think I've got it (well, I have an idea anyway :). I forgot that Kevin refactored the netlists so the pin_id is no longer per net, but per netlist (i.e. it no longer goes from 1 to num_sinks for a net with fanout = num_sinks). So the new usage of pin_done won't work, as it assumes ids from 1 to num_sinks. I also had a bug in it: I used ipin as the index into pin_done, but I think I removed the code that incremented ipin.

So below is a fixed check that is still a strong check. Basic change is that pin_done becomes a set (so you'll have to change its allocation to create an empty set in the caller). Other than that the logic is essentially the same.


// Check all the pins on this net, and find a pin that has not yet been reached that could match this SINK.
// Mark that found pin as done in the pin_done set.
     for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
           int pin_index = tile_pin_index(pin_id);
           int iclass = type->pin_class[pin_index];
           if (iclass == ptc_num) {
                    /* Could connect to same pin class on the same cluster-level block more than once.  Only   
                     * update pin_done for a pin that hasn't been reached yet.                 
                     */
                  if (pin_done.find(pin_id) == pin_done.end()) {
                      pin_done.insert(pin_id);
                      ifound++;
                      break;   
                  }
            }
    }
vaughnbetz commented 4 years ago

For Xifan's earlier question: why didn't we hit this before?
I'm pretty sure it's because the clustered netlist builder for logic blocks doesn't connect a net more than once to a logic block if it goes to multiple equivalent inputs; this makes the netlist a little more efficient. It seems other blocks don't have the same code (and don't need it; there are cases where you would want to connect multiple times to equivalent inputs, such as an AND-gate logic element), so they trigger this bug.

helen-23 commented 4 years ago

Hi Vaughn, I tried what you suggested, but although VPR passed for the dot product circuit, many of the regtests still failed for the same reason. An example from one of the regtests:

Message: in check_sink: node 3840 does not connect to any terminal of net diffeq_paj_convert^x_var~0_FF #36

The net has 2 fanouts.

Here's part of my code:

    // Check all the pins on this net, and find a pin that has not yet been reached that could match this SINK. 
    // Mark that found pin as done in the pin_done set. 
    for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
          int pin_index = tile_pin_index(pin_id);
          int iclass = type->pin_class[pin_index];
          if (iclass == ptc_num) {
                /* Could connect to same pin class on the same cluster-level block more than once.  Only   
                * update pin_done for a pin that hasn't been reached yet.                 
                */
                if (pin_done.find(pin_id) == pin_done.end()) {
                      pin_done.insert(pin_id);
                      ifound++;
                      break;   
                 }
           }
    }

In the caller function, I made pin_done a set and inserted the first pin of the net, which follows the logic of the original code (pin_done[0] = true;):

    std::set<vtr::StrongId<cluster_pin_id_tag>> pin_done;
    auto first_net_id = cluster_ctx.clb_nlist.net_pins(net_id).begin();
    pin_done.insert(*first_net_id);

And for checking whether the net is connected to all its pins:

 for (auto pin_id : cluster_ctx.clb_nlist.net_pins(net_id)) {
         if (pin_done.find(pin_id) == pin_done.end()) {
                int pin_index = tile_pin_index(pin_id);
                VPR_FATAL_ERROR(VPR_ERROR_ROUTE,
                                "in check_route: net %zu does not connect to pin %d.\n", size_t(net_id), pin_index);
         }
 }
tangxifan commented 4 years ago

Thanks @vaughnbetz for the explanation. I understand better where the bug comes from. Currently, we are forcing a greedy and false pin equivalence onto the M9K or DSP inputs. In a truly equivalent port, each pin should be mapped to a unique net. But right now the local routing of the M9K is only partially fully-connected, so the logic block router cannot uniquify nets and ends up mapping the same net to 2+ pins in a port. If this is the case, I can try to help Helen improve the Titan architecture by creating truly equivalent ports. This may bypass the problem we see here.

vaughnbetz commented 4 years ago

Thanks Xifan. That makes sense. In parallel, I'd like to get this bug fixed too, since I think it is still a bug (but I agree, modifying the architecture file would change the netlist so that this case doesn't happen for the architecture we're creating).

Helen, it is odd that the code doesn't work -- it looked so good :). One alternative is to not check the driver at all, by changing the check loop to only go over sinks: for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id))

But I am not sure as I thought your code above would work too!

helen-23 commented 4 years ago

Hi Vaughn, I think I might have figured out why our code didn't work. The original code loops through all the blocks associated with the SINK node's xlow() and ylow(), and through all the sink pins on the net. Only sink pins associated with those blocks are checked to see if they could match the SINK node.

The total number of sink pins on some nets is actually larger than the number associated with the appropriate blocks. So after removing this loop, for (int iblk = 0; iblk < type->capacity; iblk++), and this check, if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum), perhaps some pins that were not supposed to be taken in an earlier iteration ended up taken (pin_done = true) because they were allowed to be checked.

I tested this hypothesis on the master branch (without any of the route_tree_timing.cpp changes) and printed out the total number of sink pins vs. the number of sink pins actually checked:

Total number of sink pins: 2
Number of sink pins checked: 1

When I removed the two lines, the same failures occurred on master.

Bringing back the loop and the check, the following code actually worked for all regtests and all our tests used for testing pin equivalence:

for (int iblk = 0; iblk < type->capacity; iblk++) {
        ClusterBlockId bnum = place_ctx.grid_blocks[i][j].blocks[iblk]; /* Hardcoded to one cluster_ctx block*/
        unsigned int ipin = 1;
        for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
            if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {
                int pin_index = tile_pin_index(pin_id);
                int iclass = type->pin_class[pin_index];
                if (iclass == ptc_num) {
                    /* Could connect to same pin class on the same clb more than once.  Only   *
                     * update pin_done for a pin that hasn't been reached yet.                 */
                    if (pin_done[ipin] == false) {
                        ifound++;
                        pin_done[ipin] = true;
                        break;
                    }
                }
            }
            ipin++;
        }
    }
tangxifan commented 4 years ago

@helen-23 I just checked these codes and I think you are right.

helen-23 commented 4 years ago

Thank you @tangxifan. Ah I see. I've replaced

for (int iblk = 0; iblk < type->capacity; iblk++) {
        ClusterBlockId bnum = place_ctx.grid_blocks[i][j].blocks[iblk];

with the following: for (auto bnum : place_ctx.grid_blocks[i][j].blocks) {

vaughnbetz commented 4 years ago

I think another factor in the code above working is that it isn't really checking that we find a sink matching each pin in the net; it just checks that we find N sinks for a fanout-N net, and that each sink appears to connect to a reasonable block (one matching some pin on this net). That's a weaker check than the one I was trying to code: match a sink to exactly one pin on the net, and then don't match a future sink to that same pin.

The ipin code in both the above and the original degenerates to a simple count of sinks, I believe, as it just fills in the next entry in pin_done each time. It would be better to replace pin_done and ipin with a set over pin_id, as that would verify that each sink implements a different net connection. Unfortunately, I think that's the code that's failing, for reasons I don't fully understand (maybe the iclass check is failing sometimes? That check is a bit less fundamental).

So I suggest before leaving this one:

  1. Try the set with this code -- if it works, great! If not, perhaps commenting out the iclass == ptc_num check works?
  2. If we can't make the set work, we can simplify the code above by getting rid of pin_done and just counting how many sinks are connected, ensuring it's the right number. The ipin/pin_done code is a slower and more complicated bit of code that degenerates to a count (it didn't originally, but it has clearly been changed to adapt to the more complex device grid and new netlist, and the check degenerated to a count).
helen-23 commented 4 years ago

Hi @vaughnbetz Ohh okay, right, I see what you mean about ipin being an issue. Yes, I agree that changing pin_done to a set makes more sense. I guess while trying different things I accidentally reverted the set I had before.

I've changed pin_done back to a set and it worked for the regtests and the two small test designs (dot product & Carpat).

for (auto bnum : place_ctx.grid_blocks[i][j].blocks) {
    for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
        if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {
            int pin_index = tile_pin_index(pin_id);
            int iclass = type->pin_class[pin_index];
            if (iclass == ptc_num) {
                /* Could connect to the same pin class on the same clb more  *
                 * than once.  Only update pin_done for a pin that hasn't    *
                 * been reached yet.                                         */
                if (pin_done.find(pin_id) == pin_done.end()) {
                    ifound++;
                    pin_done.insert(pin_id);
                    break;
                }
            }
        }
    }
}
vaughnbetz commented 4 years ago

OK, I like that code better, thanks :). Just to double check, though: does the test fail if you take out the for loop over bnum and the if statement testing bnum? I.e., deleting these two lines:

for (auto bnum : place_ctx.grid_blocks[i][j].blocks) {
if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {

If so, baffling, but I'll take it :). Without those lines I think I understand everything; I still can't figure out why they are needed.

tangxifan commented 4 years ago

@vaughnbetz No worries. I agree that this is a bug that should be fixed as well. FYI, I have adapted the Titan architecture XML to grant pin equivalence on DSP inputs and M144K outputs. You can find detailed results in pull request #1448

helen-23 commented 4 years ago

Hi @vaughnbetz

Yes, confirming (double-checked to make sure) that with the modified code, removing those two lines still makes the test fail.

Regarding the issue we discussed on Friday, where VPR failed to route the DLA circuits due to some OPIN-type nodes having a net delay of 0: the problem is actually not in the route tree net delay updates. In fact, when checking for SINK nodes that have a net delay of 0, OPIN nodes shouldn't even show up as part of the check.

The vector remaining_targets stores the IDs of the remaining SINK nodes waiting to be reached during pruning. All node IDs in the vector are then converted into the corresponding SINK pin IDs through connections_inf.convert_sink_nodes_to_net_pins(remaining_targets) in route_timing.cpp.

It then loops through all remaining targets (each target_pin in remaining_targets) and updates the rt_node_of_sink array through:

rt_node_of_sink[target_pin] = update_route_tree(&cheapest, ((high_fanout) ? &spatial_rt_lookup : nullptr));

The problem is that all repeated nodes have the same node ID, so when node IDs are converted into pin IDs, the same pin ID appears multiple times in remaining_targets (e.g. remaining_targets = [1,1,3,3,5,5]). Consequently, when rt_node_of_sink[target_pin] is updated, the same element can be updated multiple times while others are not updated at all. As a result, the elements that are never updated can hold stale, essentially random values. This is why we reach some random OPIN nodes when checking through all the SINK pins:

for (unsigned ipin = 1; ipin < cluster_ctx.clb_nlist.net_pins(net_id).size(); ++ipin) {
    if (net_delay[ipin] == 0) { // should be SOURCE->OPIN->IPIN->SINK
        VTR_ASSERT(device_ctx.rr_nodes[rt_node_of_sink[ipin]->parent_node->parent_node->inode].type() == OPIN);
    }
}

This means that although the smaller circuits passed, they likely passed by chance (i.e. the elements that were never updated happened to hold SINK nodes, so even though the result was wrong, it did not trigger the assertion). Currently trying out a fix for this.

@tangxifan Thank you for the fix in the architecture! This is super helpful as it would bypass any issues that could arise with multiple copies of the same nodes.

helen-23 commented 4 years ago

Actually, the smaller circuits passed because they have very low fanout (less than min_incremental_reroute_fanout), so no pruning is required. In that case, remaining_targets is populated simply through:

for (unsigned int sink_pin = 1; sink_pin <= num_sinks; ++sink_pin)
        connections_inf.toreach_rr_sink(sink_pin);

No node-to-pin conversion is needed in this case, so there's no problem. For larger circuits like DLA, however, we want to prune the tree. During pruning, remaining_targets is first populated with node IDs and then converted into pin IDs. This is where the problem occurs.

vaughnbetz commented 4 years ago

Update: Helen is working on a fix where she stores the net_pin (connection id) of each connection on the route tree and traceback, so there is no ambiguity. Xifan is working on a more complete partial crossbar architecture that will hopefully also avoid this issue (won't connect to the same sink twice), but both fixes are useful and proceeding. Also, Aman at UT Austin appears to be hitting a similar issue.

aman26kbm commented 4 years ago

I see a segmentation fault when routing starts, when I add a sparse 50% populated input crossbar to a hardblock in the architecture. This is the same observation that Helen and Xifan have, as discussed in this issue.

There are 4 main blocks in my arch: CLB, DSP, BRAM, and matmul (a custom hardblock I've added). The CLB already had the crossbar; I added a crossbar to the other 3. The crossbar on the DSP works; I don't see the segmentation fault with it. But when I add a crossbar to either the BRAM or the matmul, I get the segmentation fault.

I don't understand a lot of the VPR internals that are discussed here. But having different cases that fail and pass may be helpful in the debug process. So, I am attaching my arch files and the design.

issue_1475_aman_utaustin.zip

The key for the arch file names: "_yes" means that block has a local crossbar specified; "_no" means it doesn't. The arch "stratix10.matmul_no.dsp_yes.bram_no.xml" is the one that does not run into the segmentation fault; the other two do.

aman26kbm commented 4 years ago

A couple more things...

Since there's some discussion on different ways of expressing the crossbar: if you want to see the crossbars in my arch files, the best thing to do is to diff the stratix10.matmul_no.dsp_yes.bram_no.xml file with the other two.

And... if you have a temporary fix, please let me know. I can pull it into my fork, build, run, and see how it goes.

aman26kbm commented 4 years ago

I have a question about working around this issue until it is fixed.

For now, I am specifying the connections using "direct" instead of "complete", and specifying the local crossbar delay (obtained from COFFE) in those "direct" interconnect statements. That feels like paying a double penalty: the router will have difficulty routing because there is no local crossbar, AND the specified delay is reasonably high.

@vaughnbetz, is there a better way to model this until the issue is fixed, or is what I am doing okay?

vaughnbetz commented 4 years ago

Hi Aman,

I think that's overly conservative; I would use either the lower delay with no pin equivalence, or the higher delay with pin equivalence. I have just merged the fix for pin equivalence on DSP and RAM blocks. This makes the quick-and-dirty approach work, where you simply mark pins as equivalent without putting the crossbar details in the arch file. Putting the full crossbar details in the arch file should already have worked.

For large blocks we have found another issue, though: the router lookahead always heads to the lower-left corner of the block, making routing to other pins suboptimal. Marking pins at distinct (x, y) locations on a large block as equivalent seems to exacerbate this by giving the router more flexibility. Xifan is working on a fix, but it will take a while. In the interim, if you see poor routing to large blocks you can use a lower astar_fac (e.g. 0.75, or perhaps even 0.5), but it will slow the router down. See https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1508 for some discussion on this (near the end).
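As a concrete sketch of that workaround, the invocation might look like the following. The file names are placeholders; --route and --astar_fac are standard VPR command-line options:

```shell
# Hypothetical example: lower astar_fac so the router explores more
# alternatives when routing to large-block pins looks poor (costs runtime).
vpr my_arch.xml my_circuit.blif --route --astar_fac 0.75
```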

aman26kbm commented 4 years ago

Thanks, @vaughnbetz !

aman26kbm commented 4 years ago

Coming back to this.. I am still seeing a segmentation fault. I tried on the latest master today. Note that I am providing the full details of the crossbar.

I am pasting here snapshots of the differences between two arch files. The left one has no local crossbar on the memory block; the right one has a local crossbar on the memory block.

When I use the left one, things work. When I use the right one, I see a seg fault. Am I doing something incorrectly?

In these arch files, I have local crossbar on the DSP block in the same way and that works fine.

(screenshots attached: first, second, third, fourth)