Closed: aman26kbm closed this issue 2 years ago.
@aman26kbm - I have checked the design; the number of multipliers is 4160 in the Yosys output and after partial mapping. However, it is reduced to 272 after the unused-logic removal phase in Odin-II. This phase traverses the netlist from the output nodes and marks every node seen on a path from an output node to an input node. So, technically, nodes that are not on any path ending at an output node are considered unused logic. I would recommend double-checking the design and making sure all multiplication results are connected to at least one output of the top module.
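(For readers unfamiliar with this phase: the idea can be sketched in a few lines. This is not Odin-II's actual code, just a minimal Python illustration assuming the netlist is a dict mapping each node to the list of nodes driving it.)

```python
# Minimal sketch (not Odin-II's actual implementation) of unused-logic removal:
# walk backwards from every output node, marking everything reachable; any node
# left unmarked has no path to an output and is reported as unused.

def find_unused(fanins, outputs):
    """fanins: dict mapping each node to the list of nodes driving it."""
    marked = set()
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)
        stack.extend(fanins.get(node, []))
    return set(fanins) - marked

# A multiplier whose result never reaches an output counts as unused:
fanins = {
    "out": ["add1"],
    "add1": ["mul1", "in_a"],
    "mul1": ["in_a", "in_b"],
    "mul2": ["in_a", "in_b"],   # result goes nowhere
}
print(find_unused(fanins, ["out"]))  # -> {'mul2'}
```

This is why a multiplier can survive Yosys and partial mapping yet still disappear here: the pass only cares about reachability from the outputs.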
I see. Thanks for checking, @sdamghan . Let me try to find if something isn't connected correctly. This benchmark is from a tool called DNNWeaver from UCSD. The github page mentions that they ran it through Xilinx Vivado.
Oh, I didn't know that. So we should dive deeper into this, since this optimization is actually an old part of Odin-II. We may need to check everything from where the Yosys coarse-grained netlist is generated through the unused-logic removal.
@alirezazd - can you please have a look at this as well? We got 4160 hard multipliers once Yosys generated the coarse-grained netlist. During the Odin-II partial mapping we still had all of these multipliers; however, Odin-II detects 3888 of the 16x16 multipliers as unused logic in the unused-logic removal phase, and the final output BLIF file contains only 272 16x16 hard multipliers. The main point here is to investigate why those multipliers are recognized as unused logic. Technically, Odin-II classifies logic as unused based on its presence on a path that ends at an actual output node, which means a node that is not connected (directly or indirectly) to an output node is considered unused, since it has no effect on the circuit output. You may get help from @poname on the Yosys techmap side, which generates the coarse-grained netlist; he has been working on that part recently.
@sdamghan Sure, I'll check the source and post it here if I pinpoint the problem.
Hi @alirezazd , were you able to find something?
@aman26kbm hello, I'm working on it and have been going through the source since the day this issue was created. Rest assured that I'll inform you if I find the bug.
Thanks, @alirezazd
I'm seeing another issue that might be related to this.
Running this benchmark: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vtr_flow/benchmarks/verilog/koios/tpu_like.small.v
With this architecture: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vtr_flow/arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml
When I run with ODIN only flow, I see odin.out:
Total # of multipliers = 288
Number of <MULTIPLY> node:
And then I see 288 multipliers in vpr.out. This is expected.
However, when I run with ODIN+yosys flow, I see in odin.out:
Total # of multipliers = 33
Number of <MULTIPLY> node:
And then I see 6 multipliers in vpr.out. Something is getting optimized out in the yosys+odin flow.
You can use this design for debug, because this is a relatively smaller design, and also this design works with the ODIN only flow so you will have something to compare to.
@aman26kbm - I have just noticed that the address width and data width have not been specified for the single_port_ram/dual_port_ram instances in the Koios benchmarks. While Odin-II automatically assigns the address/data width of a memory block in the AST elaboration phase, we must explicitly specify these values for hard memory block instantiation when using the Yosys+Odin-II synthesizer. I added the following lines above the dual-port instantiation in the tpu_like.small.v benchmark and synthesized it with the COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml architecture file using Yosys+Odin-II. The number of multipliers is 288, equal to Odin-II, at the end of both the Odin-II and VTR flows.
defparam u_dual_port_ram.ADDR_WIDTH = `AWIDTH;
defparam u_dual_port_ram.DATA_WIDTH = `DESIGN_SIZE*`DWIDTH;
So, I suspect the all.v benchmark faces the same problem. I will let you know once I verify the issue; however, we need to add the lines mentioned above to the Verilog files of the Koios benchmarks.
Aah. Interesting. Good find, Seyed. Thanks a lot.
The all.v benchmark infers RAMs behaviorally. So this is likely not the issue in that one.
@alirezazd - do you have any update on this issue?
@aman26kbm - according to our previous discussion, I remember you said you would see a decrease in the number of multipliers once you connect memory instances to the design. I tried to use dual_port_ram hard memory blocks instead of the implicit memory defined in the all.v benchmark, mainly because we map Yosys memory blocks to single_port_ram/dual_port_ram in the Odin-II partial mapping phase. So I tried to do this manually in the Verilog design. However, the result was again the same as before: it ended up with 272 multipliers.
I have attached all results in the zip file below. I suspect the issue is coming from the HLS design. Please let me know your thoughts.
all-with-DPRAM.zip
I made some more modifications and ran the design through Xilinx Vivado. Just to reduce runtime, I changed the systolic array size to 16x16. I see 272 DSPs in the synthesis report:
Report Cell Usage:
+------+-----------+------+
|      |Cell       |Count |
+------+-----------+------+
|1     |CARRY4     | 1337 |
|2     |DSP48E1_3  |   16 |
|3     |DSP48E1_4  |  185 |
|4     |DSP48E1_5  |   71 |
|5     |LUT1       |  327 |
....
When I run this design with VTR, the same value shows up in the yosys log:
$mul 272
And also in the ODIN log in the beginning:
Total # of multipliers = 272
But later in the ODIN log, it reduces to:
Number of <MULTIPLY> node: 80
So, it seems like the design is okay, but the issue is in the optimization code in ODIN. Is it possible to figure out which multipliers it is optimizing out and why?
Here's the updated file btw: all_from_seyed.modif.sv.zip
@sdamghan @aman26kbm Hello, this is what I have found so far: there is a function in Odin which marks the "unused" nodes for cleanup. First, the function starts by marking the top output nodes and recursively traverses backwards until there are no adjacent nodes and it reaches the ends of the netlist. Then, it starts from the top input nodes and recursively checks forwards for the markings left by the backward traversal. During this check, if a node was not marked during the backward traversal, it is marked for removal. This is based on the assumption that if a node does not have a path to an output node, it is unused. I added a similar function right after the elaboration by Yosys, before any optimizations by Odin, to count the unused multiplier nodes right after elaboration. I can confirm that it marked 3888 nodes as unused. I also applied some optimization passes on all.v solely in Yosys, and none of the 4160 multipliers were removed during these passes. I'm planning to write a function that extracts properties of the marked unused multiplier nodes to further analyze them. I will inform you once I come up with results.
Thanks, @alirezazd . I suggest using the design file attached in my last message for any further experiments. It is a smaller design (and has some other fixes to ensure it runs on Vivado as well). It will be easier to use.
@aman26kbm you're welcome, noted.
@alirezazd - any update on this issue? have you found the discontinuity reason?
@sdamghan Sorry for the delay, I'm working on it and I think I'm getting close... For now, I have found nets connecting to input nodes without fanouts (before any optimizations by Odin), which makes them useless. I'll post the results once I have something that you might find useful.
@aman26kbm @sdamghan I have added a function to extract the netlist info right after elaboration by Yosys (link to the branch: https://github.com/CAS-Atlantic/vtr-verilog-to-routing/tree/issue-2008). I'm feeding in the "all_from_seyed.modif.sv" design and detecting multiple nets with no fan-outs, or with fan-outs not connecting to any node. Here is the .csv file that contains the list of these nets and their nodes, with some additional info: debug.csv This might be the reason that some nodes get deleted. I have used Odin's cleanup approach for traversing the netlist.
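(The dangling-net check described above is simpler than the full reachability pass. A hypothetical Python sketch, assuming each net records its driver and fanout pins; the data-structure names here are for illustration only, not Odin's actual internals:)

```python
# Hypothetical sketch of the check described above: given a netlist where each
# net records its driver node and its fanout pins, flag nets whose fanout list
# is empty -- their driver's result can never reach an output.

def dangling_nets(nets):
    """nets: dict mapping net name -> {'driver': node, 'fanouts': [nodes]}."""
    return [name for name, net in nets.items() if not net["fanouts"]]

nets = {
    "n1": {"driver": "mul_a", "fanouts": ["add_1"]},
    "n2": {"driver": "mul_b", "fanouts": []},  # no fanout: dangling
}
print(dangling_nets(nets))  # -> ['n2']
```

A net flagged here is a local symptom; whether it explains the removed multipliers still depends on whether it sits on their output path, as discussed below.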
Thanks @alirezazd - this debugging information sounds interesting, but have you come up with any solution to resolve this issue yet? According to what Aman mentioned, this benchmark works fine with Xilinx Vivado; therefore, the chance of a broken benchmark is negligible! If there are nets without fan-outs, first check whether they are on the paths of the removed unused multipliers; you can figure this out using the names of the nets and pins. Then, once you are sure they are related to the removed multipliers, you should look for a reason. If you see something unconnected, there should be a reason for it, either in the Yosys output BLIF or on the Odin-II side. There is always some unused logic in every netlist, and that's why the unused-logic removal function exists. But you should figure out why these multipliers are not connected to the output! Please make a new PR and push your proposed solution.
@aman26kbm - Since the design file is a bit large, figuring out the schematic design is a little bit time-consuming. It would be great if you could provide us with a high-level schematic of the design. Something that can give us an idea about the connectivity of the processing elements in the systolic array and the logic included in these units, in addition to how other high-level modules like memory blocks are connected in the design.
Ok lemme try.
@sdamghan As you mentioned, the design is large, and manually checking all multiplier nodes on long paths is hard. What I'm going to do as the next step is stick to one of the multiplier nodes that gets removed and dig deeper into the issue. I checked every removed multiplier using the above approach, and they did not have any unconnected nets or outputs. I'll make a PR once I come up with the solution.
@alirezazd , just to confirm... you've been able to identify multipliers that were removed by odin/yosys, but these multipliers have no unconnected outputs/nets, right? As in.. these multipliers are feeding to the output of the design, but still being removed by odin/yosys? If my understanding is right, this is great. Then, we are quite close to identifying what's going on, right?
@aman26kbm Yes, the removed multiplier nodes seem normal and have connections both on inputs and output. I think the problem is that somewhere in their path, there might be unconnected nodes (or with some other issue) that results in the multipliers getting marked as useless. I also think that we are close to identifying the reason behind this issue.
I am able to see the schematic in Vivado. Maybe I can share the Vivado project folder with you guys, or you can run Vivado at your end on this design and open the schematic. However, I do see several synthesis warnings in Vivado. I'll look through them carefully to see if they point to some issues.
Here's the Vivado project just in case you need it. It was generated in Vivado version 2019.2: https://utexas.sharepoint.com/:u:/s/LCATeam3-VTR-ML-Benchmarks/EfltJxa5-4lOrkjY7fAu-DYBUfiVCO6sccEaqvSnYHF4xw?e=dB5bEh
I went through the warnings shown by Vivado and made some modifications to the design. I fixed the warnings that seemed a little serious. After these changes the warnings in Vivado are only related to dead code (port not used, reg not used, etc).
Unfortunately, after running the updated file in Vivado and VTR, the results stay the same (Vivado shows 272 DSPs, but the ODIN log has 80). I'm attaching the updated file, just in case you want to take a look.
Btw, please take a look at line 8929 of the file attached in the previous message.
assign mem_read_data[(m%GROUP_SIZE)*DATA_WIDTH+:DATA_WIDTH] = local_mem_read_buf_id_dly == buf_id ? local_mem_read_data : 'bz;
This is inside a generate block and is on the output path. There is a tristating happening. Maybe this needs to be replaced by a mux. When I replaced 'bz with 'b0, then I got warnings pointing to multiple drivers. So, I think the muxing structure is supposed to be across iterations of the generate loop.
Anyway, I'm not sure how this line of code gets handled in Yosys. This could be causing some hardware to get optimized out.
Thanks @aman26kbm; these are good clues that can make it easier to find the main issue. @alirezazd - FYI, as we discussed, line 8929 is a good starting point to determine whether the disconnectivity comes from memories/registers or not. Please track the output path up to this point and keep an eye on the number of removed multipliers. This could give us an overview of why the specific chains of components are removed from the output path connected to each processing element. NOTE: usually in such cases a chain of components starting with a multiplexer is removed because the mux selector is connected to a constant value.
@aman26kbm Thanks for the info. I'll be looking into it.
@aman26kbm I have extracted a list of removed and not-removed multipliers (extracted from their output net names), and I'm checking the systolic array to see if I can find a pattern. I will attach the list so you can also take a look at it. list.txt Update: I get the same results after optimization as before it (i.e., right after elaboration).
@alirezazd - do we have any update on this thread?
@sdamghan Yes, I started to analyze the coarse-grained netlist which is available right after Yosys elaboration. Here is what I have found:
I think we are really close to finding the cause of the removed chain of nodes. I have traced the Vivado schematic but did not find anything abnormal. I'm tracing MUX selectors in the coarse-grained BLIF file, which, due to its huge size, is taking some time. I will attach the list of discontinuities that I have found after Yosys elaboration. dbg.csv
UPDATE: I traced the coarse-grained BLIF file, but I could not find multiplexers that have the same indices as the removed MUL nodes (e.g., I can find the MUX related to u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out[0], but when I search for u_bf_wrap.obuf_mem.buf_ram.LOOP_M[1].buf_ram.read_a_mux.data_out[0] there are no results). The issue is probably related to Yosys elaboration. My next step is to feed Yosys the design and manually check the BLIF file after each pass to pinpoint the pass that is removing the chain.
@aman26kbm @sdamghan Here is a detailed description of what I have found: the problem of nodes getting removed does not arise from Odin. Odin just removes a chain of nodes that do not have a path to an output (which is normal). After checking the coarse-grained netlist acquired from Yosys, I detected a couple of unconnected multiplexers. Since these muxes themselves did not have traceable names, I extracted their output net names:
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[11].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[0].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[11].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[0].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[13].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[0].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out~0
cl_wrapper^u_bf_wrap.obuf_mem.buf_ram.LOOP_M[11].buf_ram.read_a_mux.data_out~0
As you can see, there is a pattern in the LOOP indices (13, 11, 10, 0). These numbers were also present in the list of multipliers that were not removed from the systolic array (attached in a previous post as "list.txt"). I checked the coarse-grained netlist for multiplexers other than the ones with indices 13, 11, 10, and 0 (for example, I searched for u_bf_wrap.obuf_mem.buf_ram.LOOP_M[1].buf_ram.read_a_mux.data_out[0]) and there was not a single result (if you replace M[1] with M[0], M[10], M[11], or M[13], you get 9 results for each). After checking the Yosys elaboration log of the design, I found that a couple of multiplexer cells with similar names got removed (check lines 22855~22950 of elaboration.yosys.log). I then tried to trace the problem in the Yosys synthesis script by checking for deleted multiplexers after each pass. I discovered that after running the "opt_expr -mux_undef -mux_bool -fine" pass, the mux nodes get removed from the BLIF netlist. IMHO the design should get re-checked to see if it has any issues with Yosys.
@aman26kbm @sdamghan
I checked the multiplexer nodes with issues; here is the detailed explanation: after the design is elaborated with Yosys, we get a design with multiple multiplexers having the same output names (there are 4 duplicate names for each of the output pins). This results in all of the corresponding output pins getting attached to the first target node, leaving the rest of the outputs dangling. These multiplexers are on the output path of the systolic array, and since they have no path to an output, the chains of multipliers on these paths get removed. Here is a sample that I extracted for the first pins of the problematic multiplexers:
$mux~1273 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
$mux~1305 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
$mux~1338 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
$mux~1381 -> Y[0]=u_bf_wrap.obuf_mem.buf_ram.LOOP_M[10].buf_ram.read_a_mux.data_out[0]
As you can see, we have duplicate output names for different multiplexers. You can also verify this by searching for the output names in the generated "coarsen_netlist.yosys.blif". Moreover, the FLATTEN pass of the Yosys techmap fails with this design, which is another clue that the design is problematic/incompatible. I will attach the Yosys log.
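(A duplicate-driver scan like the one described above is easy to automate. The sketch below is illustrative only: it assumes ".subckt" lines with "pin=net" tokens and output pins named "Y[...]", which may not match the exact format of coarsen_netlist.yosys.blif.)

```python
# Rough sketch of the duplicate-output check: scan a coarse-grained BLIF for
# ".subckt" lines, collect every net assigned to an output pin (assumed here
# to be named Y[...]), and report nets driven by more than one cell.
from collections import Counter

def duplicate_outputs(blif_lines):
    drivers = Counter()
    for line in blif_lines:
        if line.startswith(".subckt"):
            for token in line.split()[2:]:      # skip ".subckt" and cell name
                pin, _, net = token.partition("=")
                if pin.startswith("Y["):        # assumed output-pin naming
                    drivers[net] += 1
    return [net for net, n in drivers.items() if n > 1]

blif = [
    ".subckt $mux A[0]=a S=s Y[0]=read_a_mux.data_out[0]",
    ".subckt $mux A[0]=b S=t Y[0]=read_a_mux.data_out[0]",  # duplicate driver
    ".subckt $mux A[0]=c S=u Y[0]=other.data_out[0]",
]
print(duplicate_outputs(blif))  # -> ['read_a_mux.data_out[0]']
```

In a well-formed BLIF each net has exactly one driver, so any hit from a scan like this points at exactly the symptom reported here: later consumers attach to only one of the duplicates and the rest dangle.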
Thanks a lot, @alirezazd . I'll take a look at the design.
I looked at the design to understand what was going on and this issue is because of the following code (around line 8929):
assign mem_read_data[(m%GROUP_SIZE)*DATA_WIDTH+:DATA_WIDTH] = local_mem_read_buf_id_dly == buf_id ? local_mem_read_data : 'bz;
There is a complicated temporal and spatial muxing structure generated using generate statements. I believe Yosys doesn't infer it correctly, but Xilinx Vivado is able to handle it.
I made some manual changes to the code. I've attached it. With that, I see 272 multipliers in odin.out and 144 DSPs used in vpr.out.
Closing this issue. Thanks a lot, @sdamghan and @alirezazd for debugging this.
@aman26kbm Glad to see it's resolved. You're welcome.
Expected Behaviour
There should be >4096 multipliers in the design.
Current Behaviour
I'm running a design with the ODIN+Yosys flow. I was expecting at least 4096 multipliers in the design. In the yosys log, I do see this at the end in the statistics section:
$mul 4160
In the odin log also, I see this: Total # of multipliers = 4160
But then towards the end of the odin log, it shows: Number of <MULTIPLY> node: 272
And then the VPR log also shows 272 multipliers.
Somehow the number of multipliers seems to have changed from 4160 to 272. And it seems to have happened in the ODIN stage.
Possible Solution
Steps to Reproduce
Arch file: k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml Design file: all.zip Additional args: --const_gen_inference comb --timing_report_detail aggregated --route_chan_width 300 -elaborator yosys -fflegalize
Context
This design is one of the new Koios benchmarks.
Your Environment