About running COFFE with local optimization instead of global

yc2367 commented 2 years ago

Hi,

I am currently using coffe to run automatic sizing for a full adder. I only care about the area and delay trade-off for this subcircuit without considering the global mode. Hence, I specify "-o local" when running coffee. However, it continuously popped up an error saying that the _area_list in transizing.py, line 1386 was empty. I looked into this issue and found that the reason was that the _get_eval_area() function in tran_sizing.py, line 237_ doesn't return anything when the opt_type is "local":

https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L237-L243

And the same problem exists for the _get_final_area() function in transizing.py, line 254:

https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L254-L257

Could you take a look at this?

vaughnbetz commented 2 years ago

You're right. I think in both cases the area that was calculated should be returned (it is put in a local variable area, but not returned). If you make a pull request with that change and test it gives reasonable results we can merge it in.

Thanks for flagging this.

yc2367 commented 2 years ago

Hi,

After carefully looking at the whole tran_sizing.py code. I found additional issues regarding the cost function for area-delay trade-offs:

For the sizing code for all components, the _costfunction() function is invoked with hard-coded global and _fpga_inst.sbmux parameters for all circuit blocks. Please see the following codes as examples, but this issue appears for all blocks: https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2330 https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2363 https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2397

This won't be a problem for the "global" mode. If you look at the _get_evalarea() function invoked within the _costfunction() and follow the _get_evalarea() definition: https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L237

When the _opttype parameter is hard-coded as "global", and the _is_ramcomponent and _is_cccomponent parameters are 0 as in the case for most logic tile components, the _get_evalarea() function will return _fpga_inst.areadict["tile"] which is okay. But I think a better practice is that in the sizing code for each component as shown in the above three examples, the _fpga_inst.sbmux parameter should be replaced by the name of each component being sized. For example, in this example: https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2363 The code is trying to size the connection block mux, then it's better to put _fpga_inst.cbmux instead of _fpga_inst.sbmux. This allows the user to run COFFE in local mode without using the wrong area (which is always the area of _fpga_inst.sbmux in the current COFFE).

The second issue is that for some components, the current COFFE is trying to assign the _currentcost parameter twice. Please see the following three examples (there are more than three): https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2575-L2586 https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2611-L2622 https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2646-L2657 https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L2681-L2692 In the first example which sizes general ble output transistors, _current_cost = cost_function(get_eval_area(fpga_inst, "global", fpga_inst.sb_mux, 1, 1), get_current_delay(fpga_inst, 0), area_opt_weight, delay_optweight) is trying to assign the _currentcost using _get_current_delay(fpgainst, 0). However, in the second assignment, _current_cost = cost_function(get_eval_area(fpga_inst, "global", fpga_inst.sb_mux, 1, 1), get_current_delay(fpga_inst, 1), area_opt_weight, delay_optweight) is trying to assign the _currentcost using _get_current_delay(fpgainst, 1). And this second wrong assignment will be used as the _pastcost in the next iteration.

If you look at the _get_currentdelay() function: https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L406 The second parameter _is_ramcomponent specifies whether this component is a RAM component, this parameter should be 0 when sizing the above three examples, but COFFE assigns the cost_function twice with the second assignment using _is_ramcomponent = 1. According to the _get_currentdelay() function, if _is_ramcomponent = 1, the function will return the delay of RAM which gives a wrong cost. https://github.com/vaughnbetz/COFFE/blob/09fb0d420d16b14c9a95be266c894ad70f7fcb7b/coffe/tran_sizing.py#L483-L486

After modifying the code with my understanding based on the above two points, I reran the test_top_level.py file to test whether I did something wrong. The tests passed for most components since I am running in the global mode. But for some carry chains and lut drivers, the tests failed with 5% ~ 15% error in delay. But I think this is because that the current COFFE uses a wrong current cost for these components as described in my second point (assigning current_cost twice).

If you want to look at my modifications, you can refer to my branch here (only changed the tran_sizing.py file): https://github.com/yc2367/COFFE/tree/Yuzong_Test

Please let me know if you prefer me to create a pull request for comparison. Thanks!

vaughnbetz commented 2 years ago

Thanks for the detailed analysis! @sadegh68 @aman26kbm : any thoughts? These look like good things to fix, but getting your opinion on the code in question and what tests should be run would be great.

aman26kbm commented 2 years ago

Good observations, @yc2367. Thanks for the details.

I agree that these are all bugs we should fix. Most likely they are copy-paste errors, and because we usually don't run the local mode.

I went through the changes in your branch. I agree with all changes. There are a couple things I want to mention:

On line 3125 in tran_sizing.py, looks like you missed makig the change. Instead of sb_mux, it should be configurable decoder
On line 2250 in tran_sizing.py, I am not sure why we're using sb_mux as the component. It seems to be a call to just initialize the cost, so may be it's okay.

I do see you've added a new test file as well, which runs the existing 4 tests with the local switch. That's good. Please also add the reference results files for these tests. You can generate them using the generate_reference method in tests.py.

sadegh68 commented 2 years ago

I agree with the changes. These bugs exist due to the fact that the local mode was not tested during parts of COFFE's development. A few tests focusing on the local mode should reduce the likelihood of such future bugs in the future.

yc2367 commented 2 years ago

Hi @vaughnbetz, @aman26kbm and @sadegh68 ,

Thanks a lot for your reply.

You are right that I didn't change line 3125 in tran_sizing.py, since the decoder has some if-else conditions for different fun-ins, e.g., nand2, nand3. I didn't modify this since I was not sure whether my current modifications for other blocks are correct. It seems that my understanding is correct based on your reply, so I will modify it soon.
For line 2250 in tran_sizing.py, I agree that sb_mux is used to initialize the cost. I copied the examples wrongly.

Another question, I actually reran the original COFFE with tests_top_level.py. It also gave some delay errors for components like lut_driver, carry_chain etc. Could you help me clarify this point?

I will try to add a reference file for the local mode soon. I think the benefit of local mode is that if some users want to develop additional components like computing in-memory blocks, e.g. an adder. They may want to reduce the area overhead by running the local mode just for that computing in-memory block, and this will be much faster since the rest of the sizing code can even be commented out when running the local mode for just one block.

aman26kbm commented 2 years ago

Aren't these failures expected? We are changing the code that affects the sizing of these components (lut driver and carry chain), even in the global mode. No?

yc2367 commented 2 years ago

Thanks for the reply! I mean the original COFFE without my changes, which is the current COFFFE on this repository, also failed some tests using the current test reference file.

aman26kbm commented 2 years ago

Ah I see. Could be because of the HSPICE version? The original reference files may have been generated with an older version of HSPICE.

yc2367 commented 2 years ago

Might be, I am using the 2022 version Hspice. But just want to clarify if you will have get some 5%~15% errors using yiur Hspice?

vaughnbetz commented 2 years ago

I'm not sure of the reason. The changes are pretty small, thankfully. Adding @StephenMoreOSU in case he has any ideas.

yc2367 commented 1 year ago

Hi, May I ask if someone has helped me verify if they also had some small error when running the tests_top_level.py? Thanks a lot!

aman26kbm commented 1 year ago

Sorry I won't have the time to help with the verification.

vaughnbetz / COFFE

About running COFFE with local optimization instead of global #40