Parmys fails to properly handle multipliers with unequal input widths

WhiteNinjaZ commented 4 months ago

It appears that Pamys is unable to properly handle multipliers with unequal port widths (25x18, 13x18, etc.) in some situations.

Current Behavior

When dealing with architectures that have multipliers with input ports of unequal widths (i.e. 25x18) and running the VTR benchmarks, the benchmarks are consistently failing synthesis either by an abort or by a segmentation fault. Further, some benchmarks are able to run through the flow while others fail (i.e. a 25x18 multiplier will succeed on all benchmarks except diffeq2). The following are the two errors I have found while running multipliers of different sizes:

error[AST]:  tried adding input pins for width 0 <= 0 $mul~6-2-1-0

vtr-verilog-to-routing/parmys/parmys-plugin/netlist/netlist_utils.cc:132: void allocate_more_input_pins(nnode_t*, int): Fatal error
Command terminated by signal 6
    Command being timed: "vtr-verilog-to-routing/build/bin/yosys -c synthesis.tcl"

The above usually happens if the top level multiplier block contains only one multiplier whose inputs are of unequal widths (i.e. 9x3).

24.9. Executing OPT_EXPR pass (perform const folding).
Optimizing module diffeq_f_systemC.

24.10. Finished OPT passes. (There is nothing left to do.)

25. Starting parmys pass.
Command terminated by signal 11

The above usually occurs when the top level multiplier block has two or more multipliers inside.

I have tried the following multiplier sizes and got one the two above errors:

25x18
18x25
10x18
18x10
8x12
7x9
4x8
3x9
9x3

Expected Behavior

Multipliers with an unequal number of input ports should be properly synthesized.

Steps to Reproduce

The test_multiplier_size.xml is a modified version of k6_frac_N10_frac_chain_mem32K_40nm.xml with two 25x18 multipliers per block.
Run this architecture on diffeq2.v benchmark using the most recent VPR commit with $VTR_ROOT/vtr_flow/scripts/run_vtr_flow.py diffeq2.v test_multiplier_size.xml -temp_dir $PWD/temp --route_chan_width 190
Observe the output
Change the multiplier sizes within architecture to different widths to see different errors discussed above.

vaughnbetz commented 4 months ago

@Amirarjmand93

WhiteNinjaZ commented 4 months ago

Here is a little more detail into what I have found: Using Valgrind and several print statements, it looks like the issue stems from split_multiplier() function in multiplier.cc. From what I can tell, the problem is that when this function tries to map a 32x32 multiply from diffeq2.v onto the 18x25 multiplier. In this scenario, node->num_output_pins=32+32=64, b0=25 and addbig->num_output_pins=addsmall->num_output_pins=a1b0->num_output_pins + 1=(32-18)+25+1=40. In the final for loop that remaps the pins to the new node the elements b0-1 to b0-1+addbig->num_output_pins are accessed. Since b0-1+addbig->num_output_pins = 64 > node->num_output_pins - 1 we exceed our array bounds by one. If I did things right, I believe the root of the problem is that the assumption (stated in the header for the split_multiplier function) that the multiplication can be balanced to remove an addition only holds if the multiplier to be mapped to contains inputs with equal port widths. Removing this assumption by inserting an if statement and an extra multiplier when the multiplier ports where unequal, I was able to successfully compile diffeq2.v. However, I think I may have connected the nodes together incorrectly as some of the larger benchmarks (i.e. mcml.v) fail the parmys pass. @amirarjmand93 if you could take a look that would be great.

Here are my changes to the multplier.cc file. Basically all my changes are in split_multiplier funciton: multiplier.txt

amirarjmand93 commented 4 months ago

I am exactly approaching split_multiplier function as well as you. also, I have found if the bit width of the desired multiplier (verilog file) becomes lower than the minimum bit width of the designed hard block(architecture file), it turns out correct synthesis. for example, when trying to map 16x16 multiplied (modified diffeq2.v) by an 18x25 multiplier (architecture file), it is OK. but the error turns out when trying a greater number like 20x20 or 32x32 (modified diffeq2.v) into an 18x25 multiplier (architecture file).

my other concern falls in the modified arch file. I tried any manipulation in the test_multiplier_size.xml but I didn't get any error related to modified architecture. Also, Is the modified version of 'k6_frac_N10_frac_chain_mem32K_40nm.xml' a valid one?

amirarjmand93 commented 4 months ago

As a quick update, I want to clarify some parts of the issue:

The core of problems fall in *_iterate_multipliers(netlist_t netlist)_** in multiplier.cc file.

Arch file : test_multiplier_size.xml (fixed 25x18)

Design file: diffeq2.v (modified)

Success with Large Multipliers:

I checked the multiplier.txt. I think we have been approaching with the same idea on replacing "Concatinating" methodology((a1 b1) . (a0 b0)) with "addsmall2" ((a1 b1) + (a0 b0)). Now, calling _splitmultiplier(node, a0, b0, a1, b1, netlist) function can handle Multiplicand greater than max hard block bit width (25). good job. so 26x26, 27x27, ... , 35x35 multipliers can be passed well. (35x35 is maximum allowed -> see section 3 ) Blank diagram

Issues with Smaller Multipliers:

The problem still persssit for multiplier bit width less than 25(max hard block bit width). so designs with multipliers' bit widths lesser than 25 like 24x24,23x23,...,18x18, cannot be passed through the 25*18 hard block correctly. this error stems from calling _padmultiplier(node, netlist); . inside this function, we have a variable by the name "diffb" which goes to a negative value and the oassert function terminates the program by error.

Issues with Handling Very Large Multipliers:

I think the mcml.v file contains 64x64 multiplication . this large bit width cannot be positioned correctly inside a 25*18 hard block. I think the reason is the Splitter split inputs at once( just for one time) and the split Multiplicants(inputs) must be lesser or equal to min hard block bit width(18). unless Spliiter would have a recursive method to break down new Multiplicants(inputs) again to get fitted in the hard block (D & Q algorithm). So it should follow: min hard block bit width > multiplier bit width / 2. (here, in mcml.v and 25x18 arch, min hard block bit width is 18 and multiplier bit width / 2 is 32. so 35x35 is maximum allowance value for input multiplier )

amirarjmand93 commented 3 months ago

Issues with Smaller Multipliers:

I've made some progress in handling multiplier input size less than "max hard block bit width"(25) and greater "min hard block bit width(18)". The changes are located within the _padmultiplier function. Now it can pass a variety of multipliers of any input size in the design file(diffeq2.v) and turns out OK. but i'm not sure about Pin mapping and node connection functionality. please take a look @WhiteNinjaZ . multiplier_v1.txt

WhiteNinjaZ commented 3 months ago

I have run the changes through several different hard multiplier sizes with unequal input widths and all the ones I ran worked on diffeq2.v. Nice work! As you mentioned mcml is still broken because of the 64x64multiplier. Looking at parmys's generated netlist it also looks like a few of the multipliers input pins from the split multiplier a/b functions are completely unconnected. I am currently looking into this and will let you know what I find.

amirarjmand93 commented 3 months ago

Thank you Joshua, I have some suggestions that may be helpful. please ignore changes in the Padding function and work on your code. next, test on verilog mul design (<18) and (>25). care about the max boundary(not more than 35). in other words, refuse middle range (18~25) multiplier size. keep the arch file intact(1825). see the netlist connection status. ( I think the mcml works on baseline arch (k6...) because of mono 36 36 and dual of 18 * 18 mul hard blocks and satisfies the mentioned inequality. maybe!)

WhiteNinjaZ commented 1 month ago

@amirarjmand93 I know we talked about handing this back off to you in the VTR meeting a few weeks ago. Any updates?

WhiteNinjaZ commented 1 month ago

@vaughnbetz @amirarjmand93 Any updates on the multiplier issue?

amirarjmand93 commented 1 month ago

I haven't done it yet. I'm still working on updating Yosys.

verilog-to-routing / vtr-verilog-to-routing