tancheng / CGRA-Flow

CGRA-Flow is an integrated framework for CGRA compilation, exploration, synthesis, and development.

How to calculate critical path in heterogeneous CGRA #23

Open 1bing2 opened 1 year ago

1bing2 commented 1 year ago

Hi Tan, I recently read the paper "AURORA: Automated Refinement of Coarse-Grained Reconfigurable Accelerators". The paper describes a performance model, and the calculation of Tcomp in that model confuses me a bit. If I map a task onto a heterogeneous CGRA, how do I get Tcomp? You state that Tcomp = (II × #iter) ÷ timing, but how do I obtain the timing in this formula? I ask because I mapped the same task onto several different heterogeneous CGRAs, and the mapped II was the same every time. Thank you! Have a nice day!

tancheng commented 1 year ago

Hi 1bing2,

The timing can be estimated from your synthesis technology, through some pre-experiments on each basic operation before modeling the entire CGRA. Say, at 45nm, a mul might take 0.6ns and an add maybe 0.3ns, so a fused MAC takes 0.9ns. The Xbar adds 0.3ns, which gives a critical path of 1.2ns, so a tile could run at (1/1.2) GHz. We can roughly estimate it this way.
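The estimate above can be sketched in a few lines of Python. This is only a rough illustration, not code from CGRA-Flow: the delay numbers are the example values from this reply, the helper names (`tile_period_ns`, `cgra_frequency_ghz`, `t_comp_ns`) are made up for the sketch, and two assumptions are baked in: the whole array shares one clock, so the slowest tile sets the frequency, and "timing" in Tcomp = (II × #iter) ÷ timing is read as the clock frequency. The II of 4 and the 1000 iterations are hypothetical inputs.

```python
# Illustrative delays in ns from the 45nm example above; real values come from synthesis.
DELAY_NS = {"mul": 0.6, "add": 0.3}
XBAR_NS = 0.3


def tile_period_ns(chains):
    """Clock period of one tile.

    `chains` lists the operation chains the tile executes within one cycle,
    e.g. [["mul", "add"]] for a fused MAC or [["mul"], ["add"]] for separate
    units. The longest chain plus the crossbar sets the critical path.
    """
    longest = max(sum(DELAY_NS[op] for op in chain) for chain in chains)
    return longest + XBAR_NS


def cgra_frequency_ghz(tiles):
    """Assumption: one shared clock, so the slowest tile limits the whole array."""
    return 1.0 / max(tile_period_ns(chains) for chains in tiles)


def t_comp_ns(ii, num_iter, freq_ghz):
    """Tcomp = (II x #iter) / timing, reading 'timing' as frequency in GHz."""
    return ii * num_iter / freq_ghz


mac_tile = [["mul", "add"]]      # fused MAC: mul and add chained in one cycle
plain_tile = [["mul"], ["add"]]  # separate multiplier and adder

f_mac = cgra_frequency_ghz([mac_tile, plain_tile, plain_tile, plain_tile])
f_plain = cgra_frequency_ghz([plain_tile] * 4)
print(f"with a MAC tile: {f_mac:.3f} GHz, without: {f_plain:.3f} GHz")
print(f"Tcomp (II=4, 1000 iterations): "
      f"{t_comp_ns(4, 1000, f_mac):.0f} ns vs {t_comp_ns(4, 1000, f_plain):.0f} ns")
```

Under these assumptions, a single MAC tile lowers the frequency of the whole array, so a fused operation only pays off if it reduces the II enough to compensate for the longer clock period.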

As for the II not changing: are you using docker? Probably something is wrong with the mapper in the docker image, or maybe your configuration in the GUI is not correct. Let me know how you model the heterogeneous CGRA and I will take a look. Thanks!

1bing2 commented 1 year ago

Hi, thank you for your quick reply. I use the docker image to model the heterogeneous CGRA, and I build it through the GUI. For a newly initialized tile, all of its functional units are checked. So to construct a heterogeneous CGRA, I unchecked certain functional units in certain tiles and mapped a task to get the mapped II. But the mapped II is the same as the initial one. Additionally, does a generated Data Flow Graph (DFG) include the computational tasks that must be completed within a single clock cycle (Map II)?

1bing2 commented 1 year ago

So a MAC takes 0.9ns, while an add takes 0.3ns and a mul takes 0.6ns. But in the formula above, Tcomp = (II × #iter) ÷ timing, how specifically does the timing change? Can we think of this timing as a clock period, so that the frequency is lower when a MAC is used and higher when separate adders and multipliers are used? And how can I tell from CGRA-Flow which operations are integrated?

1bing2 commented 1 year ago

If I remove the adders and multipliers from each tile and keep just the MAC, the task seems to become unmappable? In addition, for a heterogeneous CGRA, how do I get information about how its performance changes?

tancheng commented 1 year ago

Can you pull the latest mapper in the docker and try it again? You can make a 2x2 CGRA and then try to uncheck some functionality. Then the performance change might be more obvious. Let me know whether it works.
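One possible reason the change shows up more clearly on a small array is the resource-constrained lower bound on II from modulo scheduling (often called ResMII). The sketch below is a simplified version of that bound and is not the CGRA-Flow mapper: `res_mii` is a hypothetical helper, the DFG is a made-up list of operations, and the real mapper also has to respect routing and recurrence constraints, so its II can be higher than this bound.

```python
from collections import Counter
from math import ceil


def res_mii(dfg_ops, tiles):
    """Simplified resource-constrained lower bound on II.

    dfg_ops: operation names in the loop body, one entry per DFG node.
    tiles:   one set of supported operation names per tile.
    Returns None if some operation is supported by no tile (unmappable).
    """
    # Every DFG node needs one tile slot per iteration.
    bound = ceil(len(dfg_ops) / len(tiles))
    for op, count in Counter(dfg_ops).items():
        capable = sum(1 for supported in tiles if op in supported)
        if capable == 0:
            return None
        # Nodes of this type must share the tiles that can execute them.
        bound = max(bound, ceil(count / capable))
    return bound


# Hypothetical loop body: 4 multiplies, 4 adds, 2 loads.
dfg = ["mul"] * 4 + ["add"] * 4 + ["load"] * 2
full = {"mul", "add", "load"}
no_mul = {"add", "load"}

print("4x4, all units checked:      ", res_mii(dfg, [full] * 16))
print("4x4, mul removed from 8:     ", res_mii(dfg, [full] * 8 + [no_mul] * 8))
print("2x2, all units checked:      ", res_mii(dfg, [full] * 4))
print("2x2, mul removed from 3:     ", res_mii(dfg, [full] + [no_mul] * 3))
```

On the 16-tile array there are so many spare tiles that unchecking units leaves the bound unchanged, while on the 2x2 array the same kind of change raises it. The `None` case corresponds to a DFG containing an operation that no remaining tile supports.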

MAC and other fused operations (executed within a single cycle) cannot be enabled through the GUI. You can follow the code here and play with it in the terminal to enable that feature.