Error during placement with hard blocks with chains

aman26kbm commented 2 years ago

(Moving the discussion from email with @MohamedElgammal to a GitHub issue.) In a design where DSP blocks are instantiated and connected using chains, we see an error during placement. The placer can't find an empty dsp_top location for multiple mult_add_int blocks, although the packer earlier in the flow sized the FPGA to have enough sites. Initial debug suggests a bug in the initial placement related to a recent change.

Expected Behaviour

The placement error shouldn't happen.

Current Behaviour

I'm working on updating an existing Koios benchmark to add cascade chain connections between DSP slices. These are scanin-scanout connections (input chains), instead of the common chainin-chainout connections (output chains), but for VPR, it doesn't matter.

I'm seeing a failure at the placement stage. Here's the error:

Type: Placement
File: /home/data1/aman/vtr_aman/vtr-verilog-to-routing/vpr/src/place/initial_placement.cpp
Line: 123
Message: 12 blocks could not be placed during initial placement, no spaces were available for them on the grid.
If VPR was run with floorplan constraints, the constraints may be too tight.

Possible Solution

Steps to Reproduce

I'm attaching a zip file that has the benchmark, the architecture file and the task config file, that can be used to reproduce the issue. for_mohamed.zip

Context

We've come across this issue while working on Koios 2.0 benchmarks.

Your Environment

VTR revision used: Master
Operating System and version:
Compiler version:

MohamedElgammal commented 2 years ago

@aman26kbm Thanks for raising this issue. I have looked into it: VPR fails to find a location for multiple DSP blocks in this design because the DSP utilization is 100%. It might be related to the exhaustive search function in initial_placement. We might have a shift of one column, for example, and the problem only appeared in this design because it has full utilization. @sfkhalid Can you have a look at this function?

aman26kbm commented 2 years ago

Hey @MohamedElgammal , based on our discussion earlier today, I ran this design with an architecture that had a bigger grid such that the DSP usage is < 100% and it passed. Just wanted to let you know.

aman26kbm commented 2 years ago

Hey @sfkhalid, did you get a chance to look into this?

saitama0300 commented 2 years ago

Hello, I am working with Aman on a new benchmark and I am coming across an error at the placement stage in VTR:

Line: 123
Message: 184 blocks could not be placed during initial placement, no spaces were available for them on the grid.
If VPR was run with floorplan constraints, the constraints may be too tight.

Please find the design file, vpr.out, arch file and config file below: brainwave_issue.zip

@sfkhalid @MohamedElgammal

aman26kbm commented 2 years ago

Adding a few points to @saitama0300 's comment above...

The error is the same as we have been discussing in this issue.
The resource usage of any resource is not 100% though.
The design has many DSP chains of length 8.
The layout is auto. So, the FPGA does get sized based on the resource requirement.

aman26kbm commented 2 years ago

Hi @sfkhalid , @MohamedElgammal , were you able to get time to look into this?

MohamedElgammal commented 2 years ago

@saaramahmoudi

aman26kbm commented 2 years ago

Pasting the conversation from email here to keep the issue updated:

On Mon, Jul 18, 2022 at 4:26 PM Mohamed Elgammal [mohamed.elgammal@mail.utoronto.ca](mailto:mohamed.elgammal@mail.utoronto.ca) wrote:

Hi Aman,

We can still use the auto layout which is automatically sizing the device based on the limiting block type (the one with the highest utilization) and reports the overall device utilization (search for Device Utilization in vpr_stdout.log)

Hence, after the first run for a failing design, we can check the output device utilization that the tool reported. Then, rerun this design with a lower device utilization using the option --target_utilization when running vpr.

This option will enforce the tool to increase the device size to achieve the lower target utilization allowing more empty spaces for the placer to place all the chains. This has nothing to do with the architecture file as VPR is able to generate different device sizes from the same architecture description file.

I have already done this experiment for the brainwave design, and the highest utilization that ran successfully was 0.28.
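A rerun with a lower target could be scripted roughly as follows. This is only a sketch: the file names are placeholders, the helper `vpr_command` is hypothetical, and the positional arguments are assumed to follow the usual `vpr <arch> <circuit>` form. Only `--target_utilization` and the 0.28 value come from this thread.

```python
def vpr_command(arch_file, circuit_file, target_utilization):
    """Build a VPR command line that requests a lower device utilization.

    Hypothetical helper; arch_file and circuit_file are placeholder paths.
    """
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target utilization must be in (0, 1]")
    return [
        "vpr",
        arch_file,
        circuit_file,
        "--target_utilization", str(target_utilization),
    ]


# 0.28 is the value reported above for the brainwave design.
cmd = vpr_command("arch.xml", "brainwave.blif", 0.28)
print(" ".join(cmd))
```

The value has to be found per design: start from the Device Utilization reported by the failing run and lower it until initial placement passes.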

Just for reference, the designs that have these issues are: tpu_like.small, brainwave, and DLA-large.

Please let me know if any other designs had this issue as well.

Best, Mohamed

From: Aman Arora [aman.kbm@utexas.edu](mailto:aman.kbm@utexas.edu)
Sent: 18 July 2022 17:01
To: Mohamed Elgammal [mohamed.elgammal@mail.utoronto.ca](mailto:mohamed.elgammal@mail.utoronto.ca)
Cc: Andrew Boutros [andrew.boutros@mail.utoronto.ca](mailto:andrew.boutros@mail.utoronto.ca)
Subject: Re: DLA-large

Thanks, Mohamed and Sara, for looking into this.

We can go with the first solution. I don't think we need to wait for the second solution for Koios 2.0.

But I have a question.. for Koios 1.0, we ran all the benchmarks with auto layout. Will using the first solution mean that we'll need to use a fixed layout instead? For this, we need to figure out how to handle the problem for all benchmarks. Different benchmarks have different resource usage. For uniform results, we will need to run all benchmarks on a really large fixed layout that fits all our designs, right? This means even smaller designs will take quite a long time to run.

Alternatively, we could run each design with auto layout and find the minimum grid size. Then we scale up the grid size to say 1.25x of the minimum required for that design. And run again with this grid size. This will need a separate arch file for each design though. This has manual steps involved, but maybe there is an option to tell VPR to pick a larger "auto layout" than is required?

Please let us know what you think.

Thanks, Aman

PS: We have seen this issue in the past for a different project as well, but it showed up slightly differently. When we would have smaller chains (like chains of less than 4 blocks), we'd see VPR succeed. But when we had longer chains (longer than 4 AFAIR), VPR would fail with an error. The workaround we chose at the time was to break the longer chains into smaller chains. We were using a fixed layout at the time (all comparisons used the same fixed layout and we chose the grid size such that it was slightly larger than the minimum required size for all benchmarks).

On Mon, Jul 18, 2022 at 3:37 PM Mohamed Elgammal [mohamed.elgammal@mail.utoronto.ca](mailto:mohamed.elgammal@mail.utoronto.ca) wrote:

Hi Aman,

Sara and I have found the root cause of this issue, which is chaining. The initial placement in VPR is done randomly, and if the random placement fails, we try to find a location by exhaustive search.

Let's say we have a device of a 3x3 grid (9 locations)

[Figure: 3x3_grid]

and the netlist is 4 chains, each of 2 blocks (8 blocks in total), and the first 3 chains were initially placed randomly in the locations shown:

[Figure: 3x3_chains]

Although there are still 3 empty locations to place only 2 blocks, the placer will not be able to find enough locations to place the last chain either randomly or exhaustively by iterating through the device location by location.
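The situation can be reproduced with a small sketch. It assumes the blocks of a chain must occupy vertically adjacent locations in one column, and it uses hypothetical positions for the first three chains, since the original figure is not reproduced here. `find_chain_slot` is an illustrative stand-in for the exhaustive search, not VPR's actual function.

```python
def find_chain_slot(occupied, width, height, chain_len):
    """Exhaustively scan the grid for chain_len free, vertically adjacent cells."""
    for x in range(width):
        for y in range(height - chain_len + 1):
            cells = [(x, y + k) for k in range(chain_len)]
            if all(cell not in occupied for cell in cells):
                return cells
    return None


# Hypothetical positions (x, y) for the three chains that were placed first.
placed_chains = [
    [(0, 0), (0, 1)],
    [(1, 1), (1, 2)],
    [(2, 0), (2, 1)],
]
occupied = {cell for chain in placed_chains for cell in chain}

free_cells = 3 * 3 - len(occupied)
slot = find_chain_slot(occupied, width=3, height=3, chain_len=2)
print(free_cells, slot)  # 3 free cells, but no slot for the fourth chain
```

Counting free locations says the last chain fits, while the search for two adjacent free locations comes back empty.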

The fast easy solution is to increase the device size to give more space for the placer to find enough spaces for all the chains. For example, the brainwave was placed successfully at device utilization of (0.28) while the automatic sizing led to utilization of (0.38).

We are currently working on a new initial placement algorithm that places chains densely so that chained designs fit in smaller devices. While this is the better solution, it might take more time, so for now we can get the designs running with the first solution (specifying a lower target utilization with --target_utilization for each failing design).

Best regards, Mohamed

saaramahmoudi commented 2 years ago

The initial placement issue has been resolved. All the mentioned designs can now pass without increasing device size.

aman26kbm commented 2 years ago

Awesome. Thanks, @saaramahmoudi.

aman26kbm commented 2 years ago

Hi @saaramahmoudi , I'm still seeing this issue in one of the designs: https://github.com/aman26kbm/vtr-verilog-to-routing/blob/master/vtr_flow/benchmarks/verilog/koios/brainwave_like.fixed.large.v

Here's the arch file: https://github.com/aman26kbm/vtr-verilog-to-routing/blob/master/vtr_flow/arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml

Here's the task config file: https://github.com/aman26kbm/vtr-verilog-to-routing/blob/master/vtr_flow/tasks/koios/exp1a_yosodin_hard/agilex.brainwave_like.fixed.large/config/config.txt

Can you please take a look?

saaramahmoudi commented 2 years ago

Hi @aman26kbm, the design has 64 macros of dsp_top type, and each macro has length 8. (The design also has some dsp_type blocks that are not part of any macro, so they do not create any problems.) In total, the design is reported to have 562 dsp_top blocks. The architecture has 12 columns of dsp_top type, and each column can hold 47 blocks. Overall we have 12 * 47 = 564 locations for 562 blocks, so by the numbers we should be able to place them all.

This figure shows what happens in the placement algorithm. Each column places five macros (40 locations), and the remaining 7 locations in the column stay unused because our macros have length 8. Overall, 12 * 5 = 60 macros are placed successfully, and 64 - 60 = 4 macros cannot be placed anywhere in the grid because each needs eight consecutive locations.

Initial placement will report that it has failed with 32 unplaced blocks (which is exactly four macros), and I guess there is nothing to do except increase the device size to pass this design.
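The arithmetic above can be checked with a few lines. The sketch assumes a macro cannot span two columns and that macros are packed back to back within a column; all numbers are the ones quoted in this comment.

```python
columns = 12          # dsp_top columns in the architecture
column_height = 47    # dsp_top locations per column
macro_len = 8         # blocks per macro (chain)
num_macros = 64
total_dsp_blocks = 562

total_sites = columns * column_height                # 12 * 47
macros_per_column = column_height // macro_len       # whole macros that fit in one column
leftover_per_column = column_height % macro_len      # locations too few to hold a macro
placeable_macros = columns * macros_per_column
unplaced_macros = max(0, num_macros - placeable_macros)
unplaced_blocks = unplaced_macros * macro_len

print(total_sites >= total_dsp_blocks)               # the count-based check passes
print(macros_per_column, leftover_per_column)        # per-column capacity for macros
print(placeable_macros, unplaced_macros, unplaced_blocks)
```

The count-based check passes with 564 sites for 562 blocks, yet only 60 of the 64 macros have a column to go into, which leaves 32 blocks unplaced.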

aman26kbm commented 2 years ago

Thanks, Sara. I thought in such cases, the device size will get increased automatically to fit the requirements, since auto_layout is used.. For now, I'm working around this by providing the target_utilization from the command line though.

saaramahmoudi commented 2 years ago

I just checked with @MohamedElgammal about your question. The automatic sizing process only checks whether there are enough locations for the blocks of each type. In this case, 564 locations would be enough to place 562 blocks if the design had no chains, so the device size is not increased any further.

aman26kbm commented 2 years ago

Got it.

Perhaps this could be an enhancement for the future. The auto sizing should take into account initial placement results as well. It should iteratively increase the size until initial placement is successful.
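One way to picture the suggested enhancement is a loop that walks through growing device sizes and stops at the first one a chain-aware check accepts. Everything here is hypothetical: the candidate sizes are made up, and `chain_aware_fit` is a simple capacity model standing in for "initial placement succeeds", not VPR's sizing code.

```python
def count_only_fit(columns, height, total_blocks):
    """Count-based check: enough locations of the type for the blocks of that type."""
    return columns * height >= total_blocks


def chain_aware_fit(columns, height, total_blocks, num_macros, macro_len):
    """Stand-in for a successful initial placement: each macro needs
    macro_len consecutive locations inside a single column."""
    macro_slots = columns * (height // macro_len)
    return count_only_fit(columns, height, total_blocks) and macro_slots >= num_macros


def first_fitting_size(candidate_sizes, fits):
    """Return the first (columns, height) candidate accepted by fits, else None."""
    for columns, height in candidate_sizes:
        if fits(columns, height):
            return (columns, height)
    return None


# Hypothetical growth sequence; a real device grows according to the architecture's layout rules.
candidates = [(12, 47), (12, 48), (13, 52)]
design = dict(total_blocks=562, num_macros=64, macro_len=8)

by_count = first_fitting_size(
    candidates, lambda c, h: count_only_fit(c, h, design["total_blocks"]))
by_chains = first_fitting_size(
    candidates, lambda c, h: chain_aware_fit(c, h, **design))
print(by_count, by_chains)
```

With the numbers from this thread, the count-based check stops at 12 columns of 47, while the chain-aware check moves on to the next candidate.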

saaramahmoudi commented 2 years ago

Yes, the algorithm should be improved as you said. I am closing this issue and opening another one that targets the sizing algorithm directly as a future improvement. Please let me know if you find a new design with a similar issue.

aman26kbm commented 2 years ago

Wonderful. Thanks, Sara.
