verilog-to-routing / vtr-verilog-to-routing

Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org

Seemingly incorrect optimization of multipliers in ODIN #2008

Closed aman26kbm closed 2 years ago

aman26kbm commented 2 years ago

Expected Behaviour

There should be >4096 multipliers in the design.

Current Behaviour

I'm running a design with the ODIN+Yosys flow. I was expecting at least 4096 multipliers in the design. In the Yosys log, I do see this at the end, in the statistics section:

$mul 4160

In the odin log also, I see this: Total # of multipliers = 4160

But then towards the end of the odin log, it shows: Number of node: 272

And then the VPR log also shows 272 multipliers.

Somehow the number of multipliers seems to have changed from 4160 to 272. And it seems to have happened in the ODIN stage.

Possible Solution

Steps to Reproduce

Arch file: k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml
Design file: all.zip
Additional args: --const_gen_inference comb --timing_report_detail aggregated --route_chan_width 300 -elaborator yosys -fflegalize

Context

This design is one of the new Koios benchmarks.

Your Environment

sdamghan commented 2 years ago

@aman26kbm - I have checked the design; the number of multipliers is 4160 in the Yosys output and after partial mapping. However, it is reduced to 272 after the unused-logic-removal phase in Odin-II. This phase traverses the netlist from the output nodes and marks every node seen on a path from an output node to an input node. So, technically, nodes that do not lie on any path ending at an output node are considered unused logic. I would recommend double-checking the design and making sure all multiplication results are connected to at least one output of the top module.
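The sweep described above can be illustrated with a minimal sketch. This is not Odin-II's actual code, and the netlist representation (a dict of fan-in lists plus an output set) is hypothetical; it only shows the idea of marking everything reachable backwards from the outputs and treating the rest as unused:

```python
# Minimal sketch of output-reachability dead-logic detection, modeled after
# the behaviour described above (not Odin-II's real implementation).
def find_unused(netlist):
    """netlist: {"outputs": [...], "fanin": {node: [fan-in nodes]}}."""
    live = set()
    stack = list(netlist["outputs"])
    while stack:
        node = stack.pop()
        if node in live:
            continue
        live.add(node)                      # node lies on a path to an output
        stack.extend(netlist["fanin"].get(node, []))
    # Anything never reached while walking back from an output is unused.
    return set(netlist["fanin"]) - live

# Example: mul2 drives nothing that reaches the output, so it is swept away.
net = {
    "outputs": ["out"],
    "fanin": {"out": ["mul1"], "mul1": ["a", "b"], "mul2": ["a", "b"],
              "a": [], "b": []},
}
print(find_unused(net))  # {'mul2'}
```

This is why a multiplier whose result never reaches a top-level output, directly or indirectly, is dropped even though it is perfectly well-formed on its own.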

aman26kbm commented 2 years ago

I see. Thanks for checking, @sdamghan . Let me try to find out if something isn't connected correctly. This benchmark is from a tool called DNNWeaver from UCSD. The GitHub page mentions that they ran it through Xilinx Vivado.

sdamghan commented 2 years ago

Oh, I didn't know that. So we should dig deeper into this, since this optimization is actually an old part of Odin-II. We may need to check everything from where the Yosys coarse-grained netlist is generated up through the unused-logic-removal phase.

@alirezazd - can you please have a look at this as well? We got 4160 hard multipliers once Yosys generated the coarse-grained netlist. During the Odin-II partial mapping we still had all of these multipliers; however, Odin-II detects 3888 of the 16x16 multipliers as unused logic in the unused-logic-removal phase, so in the final output BLIF file we have only 272 of the 16x16 hard multipliers. The main point here is to investigate why those multipliers are recognized as unused logic. Technically, Odin-II marks logic as unused based on its presence on a path that ends at an actual output node, which means a node that is not connected (directly or indirectly) to an output node is considered unused, since it has no effect on the circuit output. You may get help from @poname on the Yosys techmap side, which generates the coarse-grained netlist; he has been working on that part recently.

alirezazd commented 2 years ago

@sdamghan Sure, I'll check the source and post it here if I pinpoint the problem.

aman26kbm commented 2 years ago

Hi @alirezazd , were you able to find something?

alirezazd commented 2 years ago

@aman26kbm hello, I'm working on it and have been going through the source since the day this issue was created. Rest assured, I'll inform you if I find the bug.

aman26kbm commented 2 years ago

Thanks, @alirezazd

aman26kbm commented 2 years ago

I'm seeing another issue that might be related to this.

Running this benchmark: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vtr_flow/benchmarks/verilog/koios/tpu_like.small.v

With this architecture: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vtr_flow/arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml

When I run the ODIN-only flow, I see in odin.out: Total # of multipliers = 288 Number of node: 288

And then I see 288 multipliers in vpr.out. This is expected.

However, when I run the ODIN+Yosys flow, I see in odin.out: Total # of multipliers = 33 Number of node: 6

And then I see 6 multipliers in vpr.out. Something is getting optimized out in the Yosys+ODIN flow.

You can use this design for debugging: it is relatively small, and it works with the ODIN-only flow, so you will have a baseline to compare against.

sdamghan commented 2 years ago

@aman26kbm - I have just noticed that the address width and data width have not been specified for the single/dual_port_rams in the Koios benchmarks. While Odin-II automatically assigns the address/data width of a memory block in the AST elaboration phase, these values must be specified explicitly for hard memory block instantiation when using the Yosys+Odin-II synthesizer. I added the following lines above the dual-port instantiation in the tpu_like.small.v benchmark and synthesized it with the COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml architecture file using Yosys+Odin-II. The number of multipliers is 288 at the end of both the Odin-II and VTR flows, matching the ODIN-only result.

defparam u_dual_port_ram.ADDR_WIDTH = `AWIDTH;
defparam u_dual_port_ram.DATA_WIDTH = `DESIGN_SIZE*`DWIDTH;

So, I suspect the all.v benchmark has hit the same problem. I will let you know once I verify that; in any case, we need to add the lines above to the Verilog files of the Koios benchmarks.

aman26kbm commented 2 years ago

Aah. Interesting. Good find, Seyed. Thanks a lot.

aman26kbm commented 2 years ago

The all.v benchmark infers RAMs behaviorally. So this is likely not the issue in that one.

sdamghan commented 2 years ago

@alirezazd - do you have any update on this issue?

sdamghan commented 2 years ago

@aman26kbm - according to our previous discussion, I remember you said you saw a decrease in the number of multipliers once you connected memory instances to the design. I tried to use dual_port_ram hard memory blocks instead of the implicit memory defined in the all.v benchmark, mainly because we map Yosys memory blocks to single_port_ram/dual_port_ram in the Odin-II partial mapping phase. I did this manually in the Verilog design; however, the result was the same as before, ending up with 272 multipliers. I have attached all results in the zip file below. I suspect the issue is coming from the HLS design. Please let me know your thoughts. all-with-DPRAM.zip

aman26kbm commented 2 years ago

I made some more modifications and ran the design through Xilinx Vivado. Just to reduce runtime, I changed the systolic array size to 16x16. I see 272 DSPs in the synthesis report:

Report Cell Usage:
+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     |  1337|
|2     |DSP48E1_3  |    16|
|3     |DSP48E1_4  |   185|
|4     |DSP48E1_5  |    71|
|5     |LUT1       |   327|
....

When I run this design with VTR, the same value shows up in the yosys log: $mul 272

And also in the ODIN log in the beginning: Total # of multipliers = 272

But later in the ODIN log, it reduces to: Number of <MULTIPLY> node: 80

So, it seems like the design is okay, but the issue is in the optimization code in ODIN. Is it possible to figure out which multipliers it is optimizing out and why?

Here's the updated file btw: all_from_seyed.modif.sv.zip

alirezazd commented 2 years ago

@sdamghan @aman26kbm Hello, this is what I have found so far: there is a function in Odin which marks the "unused" nodes for cleanup. First, the function marks the top output nodes and recursively traverses backwards until there are no adjacent nodes and it reaches the ends of the netlist. Then, it starts from the top input nodes and recursively traverses forward, checking for the markings made by the backward traversal. During this check, if a node was not marked by the backward traversal, it is marked for removal; this is based on the assumption that a node with no path to an output node is unused. I added a similar function right after the Yosys elaboration, before any optimizations by Odin, to count the unused multiplier nodes at that point, and I can confirm that it marked 3888 nodes as unused. I also applied some optimization passes on all.v in Yosys alone, and none of the 4160 multipliers were removed during these passes. I'm planning to write a function that extracts properties of the marked unused multiplier nodes to analyze them further. I will inform you once I come up with results.

aman26kbm commented 2 years ago

Thanks, @alirezazd . I suggest using the design file attached in my last message for any further experiments. It is a smaller design (and has some other fixes to ensure it runs on Vivado as well). It will be easier to use.

alirezazd commented 2 years ago

@aman26kbm you're welcome, noted.

sdamghan commented 2 years ago

@alirezazd - any update on this issue? Have you found the reason for the discontinuity?

alirezazd commented 2 years ago

@sdamghan Sorry for the delay, I'm working on it and I think I'm getting close. For now, I have found nets connecting to input nodes without fanouts (before any optimizations by Odin), which makes them useless. I'll post the results once I have something you might find useful.

alirezazd commented 2 years ago

@aman26kbm @sdamghan I have added a function to extract the netlist info right after elaboration by Yosys (link to the branch: https://github.com/CAS-Atlantic/vtr-verilog-to-routing/tree/issue-2008). I'm feeding in the "all_from_seyed.modif.sv" design and detecting multiple nets with no fan-outs, or with fan-outs not connecting to any node. Here is the .csv file that contains the list of these nets and their nodes, with some additional info: debug.csv This might be the reason some nodes get deleted. I have used Odin's cleanup approach for traversing the netlist.

sdamghan commented 2 years ago

Thanks @alirezazd - this debugging information sounds interesting, but have you come up with a solution to resolve the issue yet? According to what Aman mentioned, this benchmark works fine with Xilinx Vivado; therefore, the chance of a broken benchmark is negligible! If there are nets without fan-outs, first check whether they are on the paths of the removed multipliers; you can figure this out using the names of the nets and pins. Once you are sure they are related to the removed multipliers, you should look for a reason. If you see something unconnected, there should be a cause for that, either in the Yosys output BLIF or on the Odin-II side. There is always some unused logic in a netlist, and that's why the unused-logic-removal function exists; but you should figure out why these multipliers are not connected to the output! Please make a new PR and push your proposed solution.

sdamghan commented 2 years ago

@aman26kbm - Since the design file is rather large, figuring out the design schematic is a bit time-consuming. It would be great if you could provide us with a high-level schematic of the design: something that gives us an idea of the connectivity of the processing elements in the systolic array and the logic inside these units, in addition to how other high-level modules, like memory blocks, are connected in the design.

aman26kbm commented 2 years ago

Ok lemme try.

alirezazd commented 2 years ago

@sdamghan As you mentioned, the design is large, and manually checking all multiplier nodes along long paths is hard. What I'm gonna do as the next step is to stick to one of the multiplier nodes that gets removed and dig deeper into the issue. I checked every removed multiplier using the above approach, and they did not have any unconnected nets or outputs. I'll make a PR once I come up with a solution.

aman26kbm commented 2 years ago

@alirezazd , just to confirm... you've been able to identify multipliers that were removed by ODIN/Yosys, but these multipliers have no unconnected outputs/nets, right? As in, these multipliers feed the output of the design but are still being removed by ODIN/Yosys? If my understanding is right, this is great. Then we are quite close to identifying what's going on, right?

alirezazd commented 2 years ago

@aman26kbm Yes, the removed multiplier nodes seem normal and have connections on both inputs and output. I think the problem is that somewhere along their path there might be unconnected nodes (or nodes with some other issue) that result in the multipliers getting marked as useless. I also think we are close to identifying the reason behind this issue.

aman26kbm commented 2 years ago

I am able to see the schematic in Vivado. Maybe I can share the Vivado project folder with you guys, or you can run Vivado at your end on this design and open the schematic. However, I do see several synthesis warnings in Vivado. I'll look through them carefully to see if they point to some issues.

aman26kbm commented 2 years ago

Here's the Vivado project just in case you need it. It was generated in Vivado version 2019.2: https://utexas.sharepoint.com/:u:/s/LCATeam3-VTR-ML-Benchmarks/EfltJxa5-4lOrkjY7fAu-DYBUfiVCO6sccEaqvSnYHF4xw?e=dB5bEh

aman26kbm commented 2 years ago

I went through the warnings shown by Vivado and made some modifications to the design. I fixed the warnings that seemed a little serious. After these changes the warnings in Vivado are only related to dead code (port not used, reg not used, etc).

Unfortunately, after running the updated file through Vivado and VTR, the results stay the same (Vivado shows 272 DSPs, but the ODIN log has 80). I'm attaching the updated file, just in case you want to look.

all_from_seyed.modif.after_vivado.sv.zip

aman26kbm commented 2 years ago

Btw, please take a look at line 8929 of the file attached in the previous message.

assign mem_read_data[(m%GROUP_SIZE)*DATA_WIDTH+:DATA_WIDTH] = local_mem_read_buf_id_dly == buf_id ? local_mem_read_data : 'bz;

This is inside a generate block and is on the output path. There is tristating happening here; maybe it needs to be replaced by a mux. When I replaced 'bz with 'b0, I got warnings pointing to multiple drivers, so I think the muxing structure is supposed to span the iterations of the generate loop.

Anyway, I'm not sure how this line of code gets handled in Yosys. This could be causing some hardware to get optimized out.
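For reference, the tristate-bus idiom and an equivalent explicit-mux form look roughly like this. This is a simplified illustration, not the actual all.v code; all module, parameter, and signal names here are made up:

```verilog
// Simplified sketch (hypothetical names, not the all.v source).
// Tristate style, one driver per generate iteration:
//   assign bus = (sel == ID) ? local_data : 'bz;
// Explicit-mux style: mask each candidate with its select, then OR-reduce,
// which avoids multiple drivers and any 'z' values entirely.
module shared_bus #(parameter N = 4, W = 8) (
    input  wire [N*W-1:0]       local_data,  // N candidate sources, packed
    input  wire [$clog2(N)-1:0] sel,
    output wire [W-1:0]         bus
);
    wire [W-1:0] masked [0:N-1];
    wire [W-1:0] acc    [0:N];
    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : g_mask
            // Zero out every candidate except the selected one.
            assign masked[i] = (sel == i) ? local_data[i*W +: W] : {W{1'b0}};
        end
    endgenerate
    assign acc[0] = {W{1'b0}};
    generate
        for (i = 0; i < N; i = i + 1) begin : g_or
            // OR-reduce; exactly one masked term is nonzero at a time.
            assign acc[i+1] = acc[i] | masked[i];
        end
    endgenerate
    assign bus = acc[N];
endmodule
```

Internal tristates generally don't map to FPGA fabric, so synthesis tools must rewrite them; how aggressively each tool does that (and what it does with the 'z' branches) can differ, which is consistent with Vivado and Yosys diverging here.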

sdamghan commented 2 years ago

Thanks @aman26kbm; these are good clues that will make finding the main issue easier. @alirezazd - FYI, as we discussed, line 8929 is a good starting point to determine whether the disconnectivity comes from memories/registers or not. Please track the output path up to this point and keep an eye on the number of removed multipliers. This could give us an overview of why specific chains of components are removed from the output path connected to each processing element. NOTE: usually in such cases a chain of components starting with a multiplexer is removed because the mux selector is tied to a constant value.

alirezazd commented 2 years ago

@aman26kbm Thanks for the info. I'll be looking into it.

alirezazd commented 2 years ago

@aman26kbm I have extracted a list of removed and not-removed multipliers (identified from their output net names), and I'm checking the systolic array to see if I can find a pattern. I will attach the list so you can also take a look at it. list.txt Update: I get the same results after optimization and before it (i.e., right after elaboration).

sdamghan commented 2 years ago

@alirezazd - do we have any update on this thread?

alirezazd commented 2 years ago

@sdamghan Yes, I started to analyze the coarse-grained netlist, which is available right after Yosys elaboration. Here is what I have found:

  1. There are many unconnected buffer nodes (which might be normal).
  2. I'm detecting 12 MUX nodes that have a discontinuity. The interesting thing about these MUX nodes is that they share the same indices as the multipliers that are not being removed from the systolic array.

I think we are really close to finding the cause of the removed chain of nodes. I have traced the Vivado schematic but did not find anything abnormal. I'm now tracing the MUX selectors in the coarse-grained BLIF file, which, due to its huge size, is taking some time. I will attach the list of discontinuities that I have found after Yosys elaboration. dbg.csv

UPDATE: I traced the coarse-grained BLIF file, but I could not find multiplexers that have the same indices as the removed MUL nodes. (E.g., I can find the MUX related to u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out[0], but when I search for u_bf_wrap.obuf_mem.buf_ram.LOOP_M[1].buf_ram.read_a_mux.data_out[0] there are no results.) The issue is probably related to the Yosys elaboration. My next step is to feed the design to Yosys and manually check the BLIF file after each pass to pinpoint the pass that is removing the chain.

alirezazd commented 2 years ago

@aman26kbm @sdamghan Here is a detailed description of what I have found: the problem of nodes getting removed does not originate in Odin. Odin just removes chains of nodes that have no path to an output (which is normal). After checking the coarse-grained netlist acquired from Yosys, I detected a couple of unconnected multiplexers. Since these muxes themselves did not have traceable names, I extracted their output net names:

cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[11].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[0].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[11].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[0].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[0].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[11].buf_ram.read_a_mux.data_out~0

As you can see, there is a pattern in the LOOP indices (13, 11, 10, 0). These numbers are also present in the list of multipliers that were not removed from the systolic array (attached in a previous post as "list.txt"). I checked the coarse-grained netlist for multiplexers other than the ones with indices 13, 11, 10, and 0 (for example, I searched for u_bf_wrap.obuf_mem.buf_ram.LOOP_M[1].buf_ram.read_a_mux.data_out[0]) and there was not a single result (if you replace M[1] with M[0], M[10], M[11], or M[13], you get 9 results for each). After checking the Yosys elaboration log of the design, I found that a couple of multiplexer cells with similar names got removed (check lines 22855~22950 of elaboration.yosys.log). I then traced the problem through the Yosys synthesis script, checking for deleted multiplexers after each pass, and discovered that after the opt_expr -mux_undef -mux_bool -fine;;; pass, the mux nodes are removed from the BLIF netlist. IMHO the design should be re-checked to see if it has any issues with Yosys.

alirezazd commented 2 years ago

@aman26kbm @sdamghan

I checked the multiplexer nodes with issues; here is the detailed explanation: after the design is elaborated with Yosys, we get a netlist in which multiple multiplexers have the same output names (there are 4 duplicate names for each of the output pins). This results in all of the corresponding output pins getting attached to the first target node, leaving the rest of the outputs dangling. These multiplexers are in the output path of the systolic array, and since the dangling ones have no path to an output, the chains of multipliers on those paths get removed. Here is a sample that I extracted for the first pins of the problematic multiplexers:

$mux~1273 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
$mux~1305 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
$mux~1338 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
$mux~1381 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]

As you can see, we have duplicate output names for different multiplexers. You can also verify this by searching for the output names in the generated "coarsen_netlist.yosys.blif". Moreover, the flatten pass of the Yosys techmap fails with this design, which is another clue that the design is problematic or incompatible. I will attach the Yosys log.

yosys.out.log

aman26kbm commented 2 years ago

Thanks a lot, @alirezazd . I'll take a look at the design.

aman26kbm commented 2 years ago

I looked at the design to understand what was going on, and the issue is caused by the following code (around line 8929):

assign mem_read_data[(m%GROUP_SIZE)*DATA_WIDTH+:DATA_WIDTH] = local_mem_read_buf_id_dly == buf_id ? local_mem_read_data : 'bz;

There is a complicated temporal and spatial muxing structure built with generate statements. I believe Yosys doesn't infer it correctly, but Xilinx Vivado is able to handle it.

I made some manual changes to the code. I've attached it. With that, I see 272 multipliers in odin.out and 144 DSPs used in vpr.out.

all_from_seyed.modif.after_vivado.mod_jun29.sv.zip

aman26kbm commented 2 years ago

Closing this issue. Thanks a lot, @sdamghan and @alirezazd for debugging this.

alirezazd commented 2 years ago

@aman26kbm Glad to see it's resolved. You're welcome.