google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.
http://cfu-playground.rtfd.io/
Apache License 2.0

hps_accel sometimes fails to build with Oxide #246

Open alanvgreen opened 3 years ago

alanvgreen commented 3 years ago

Sometimes the Oxide build fails with nextpnr failing to complete. In these cases it is stuck in "Running main router loop..." for many hours.

The attached file hps-build.tar.gz contains the files needed to reproduce the failure.

tcal-x commented 3 years ago

Hi @gatecat , can you take a look when you get a chance? Are we getting too close to 100% utilization of some resource, or is there something you can address, and/or is there a workaround for the time being? Thanks!

gatecat commented 3 years ago

I will investigate, but unfortunately this is a bit of an ongoing issue in nextpnr in general with routing congestion - particularly nextpnr-nexus - so I can't promise a quick fix.

alanvgreen commented 3 years ago

Running O(50) nextpnr runs in parallel seems to work, although it takes a while: https://github.com/google/CFU-Playground/pull/251
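
For illustration, here is a minimal sketch of such a sweep, assuming each parallel run differs only in its --seed value (the script in #251 is the authoritative version; the file names and device follow the nextpnr command quoted later in this thread, and the seed range is arbitrary):

# Hedged sketch: launch many nextpnr-nexus runs in parallel, each with a
# different seed, and wait for them all; in practice one would kill the
# stragglers once a run succeeds.
for seed in $(seq 1 50); do
    nextpnr-nexus \
        --json hps_proto2_platform.json \
        --pdc hps_proto2_platform.pdc \
        --fasm "hps_proto2_platform.seed${seed}.fasm" \
        --device LIFCL-17-8UWG72C \
        --seed "${seed}" \
        > "nextpnr.seed${seed}.log" 2>&1 &
done
wait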

alanvgreen commented 3 years ago

Here's a graph showing the "overuse" per iteration for each run: (graph image)

alanvgreen commented 3 years ago

I have merged #251 but the underlying issues remain.

alanvgreen commented 3 years ago

@kgugala Could you please take a look at this? There are probably no easy fixes, but anything to speed up builds or make them more likely to succeed would be helpful.

alanvgreen commented 3 years ago

See also https://github.com/google/CFU-Playground/issues/281

kgugala commented 3 years ago

@mkurc-ant can you grab this one?

mkurc-ant commented 3 years ago

Sure, I'll take this one

mkurc-ant commented 3 years ago

I started by reproducing the result. Indeed, nextpnr failed to finish routing with the attached files and command.

For nexus it uses the "router2" algorithm. I collected some data to investigate what is going on.

By adding the --placed-svg placement.svg option I managed to dump the design placement. The whole fabric is filled quite uniformly, which may lead to congestion issues: (placement image)
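
For context, the option just slots into the usual nextpnr-nexus invocation (a sketch; the file names and device follow the command quoted later in this thread):

# Hedged sketch: write an SVG view of the placement for inspection.
nextpnr-nexus \
    --json hps_proto2_platform.json \
    --pdc hps_proto2_platform.pdc \
    --device LIFCL-17-8UWG72C \
    --placed-svg placement.svg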

I wrote some code to dump the wire overuse map to check whether there are any bottlenecks where the router struggles. I created a "video clip" covering ~90 iterations. It looks like there isn't a single bottleneck point - the congestion is high throughout the whole fabric: https://user-images.githubusercontent.com/47315577/134305207-2e373a48-efd4-4462-8fb1-fc8913a4b45d.mp4

I have also looked into the code of router2. In principle it tries to route each arc (a source to sink pair) starting from both the source and the sink until the two searches meet somewhere in between. This happens inside a bounding box that limits the search range. Each time a net fails to route for 3 consecutive iterations the bounding box is enlarged. This can be seen in the video: at the beginning the congestion is concentrated in the middle of the fabric, then it spreads out as more and more nets are routed with larger bounding boxes. Once the bounding box covers the whole fabric, nothing more can be done.

I made a plot of how the number of overused wires evolves over time. It drops rapidly at the beginning and reaches its minimum, then slowly increases: (plot image)

I tried a few different seeds but didn't get significantly different results. I also tried switching to the "router1" algorithm, but it didn't converge either within roughly the same runtime as router2.

mkurc-ant commented 3 years ago

I tried to play with various placer options but they didn't seem to have any significant effect on the result.

Then I tried removing bounding-box expansion, as it is disabled for mistral (Cyclone V). With it disabled the overused wire count does not increase as much toward the end, but a solution still cannot be found. See the graph below.

(graph image)

mkurc-ant commented 3 years ago

I managed to implement the same design using Radiant. To do this I had to convert the JSON to Verilog using Yosys:

# Load the JSON netlist that was fed to nextpnr
read_json hps_proto2_platform.json
# Run the Nexus synthesis flow up to the coarse stage only
synth_nexus -run :coarse
# Write out Verilog for Radiant
write_verilog radiant/top.v
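
For anyone repeating this, the three commands above can be saved to a script file and run non-interactively (a sketch; the script file name convert_to_radiant.ys is my own choice):

# Hedged sketch: run the Yosys commands above from a script file.
mkdir -p radiant                  # write_verilog needs the output directory to exist
yosys -s convert_to_radiant.ys    # -s executes the commands from the given script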

And then build it in Radiant:

prj_create -name "design" -impl "impl" -dev LIFCL-17-8UWG72C -synthesis "synplify" 
prj_set_impl_opt {include path} {""}
prj_add_source "top.v" -work work
prj_add_source "../hps_proto2_platform.pdc" -work work
prj_set_impl_opt top "hps_proto2_platform" 
prj_save
prj_run Synthesis -impl impl -forceOne
prj_run Map -impl impl
prj_run PAR -impl impl
prj_run Export -impl impl -task Bitgen
prj_close
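
Presumably these commands were run through the Radiant Tcl console; something along the following lines should work (a sketch; the launcher name radiantc and the script file name are assumptions and may differ between Radiant versions and platforms):

# Hedged sketch: drive the Tcl commands above from a script file.
radiantc build_radiant.tcl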

This proves that the design that cannot be implemented by nextpnr-nexus can actually be implemented using Radiant.

Radiant reports the following usage:

Device utilization summary:

   VHI                   1/1           100% used
   SIOLOGIC              2/72            2% used
   EBR                  24/24          100% used
   MULT9                24/48           50% used
   MULT18                4/24           16% used
   REG18                24/48           50% used
   PREADD9              24/48           50% used
   LRAM                  5/5           100% used
   SEIO18                2/48            4% used
                         2/24            8% bonded
   SEIO33                5/71            7% used
                         5/15           33% bonded
   OSC                   1/1           100% used
   SLICE              5931/6912         85% used
     LUT             10703/13824        77% used
     REG              4764/13824        34% used

while nextpnr reports:

Info: Device utilisation:
Info:                 OXIDE_FF:  4771/13824    34%
Info:               OXIDE_COMB: 10642/13824    76%
Info:                     RAMW:    36/ 1728     2%
Info:              SEIO33_CORE:     7/   23    30%
Info:              SEIO18_CORE:     2/   44     4%
Info:            DIFFIO18_CORE:     0/   22     0%
Info:                 OSC_CORE:     1/    1   100%
Info:                OXIDE_EBR:    24/   24   100%
Info:             PREADD9_CORE:    24/   48    50%
Info:               MULT9_CORE:    24/   48    50%
Info:              MULT18_CORE:     4/   24    16%
Info:               REG18_CORE:    24/   48    50%
Info:           MULT18X36_CORE:     0/   12     0%
Info:              MULT36_CORE:     0/    6     0%
Info:               ACC54_CORE:     0/   12     0%
Info:                      DCC:     1/   62     1%
Info:                  VCC_DRV:     1/   74     1%
Info:                LRAM_CORE:     5/    5   100%
Info:                  IOLOGIC:     0/   44     0%
Info:                 SIOLOGIC:     0/   22     0%

This is more or less consistent (Radiant can perform some optimizations during its flow).

mkurc-ant commented 3 years ago

Radiant placement. It looks similar to nextpnr's. Of course there is no way to tell which cells go where: (placement image)

mkurc-ant commented 3 years ago

I tested router1 again with --router router1. It managed to successfully route the design but it took about 1h 15min to complete.
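
For reference, the switch simply gets added to the usual invocation (a sketch; the file names and device follow the command quoted later in this thread):

# Hedged sketch: use the older "router1" algorithm instead of the
# nexus default "router2".
nextpnr-nexus \
    --json hps_proto2_platform.json \
    --pdc hps_proto2_platform.pdc \
    --fasm hps_proto2_platform.fasm \
    --device LIFCL-17-8UWG72C \
    --router router1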

gatecat commented 3 years ago

Some of the router1 slowness is probably around Vcc routing on the nexus, which is a specific bug I'll look at once I've fixed the DSP placement one.

I'm also planning to add LUT permutation to the nexus arch soon which should reduce congestion on certain wire types.

gatecat commented 3 years ago

https://github.com/gatecat/prjoxide/pull/15 and https://github.com/YosysHQ/nextpnr/pull/822 get the router1 runtime down to about 40min. I'm looking at some further tweaks for router1 performance though, as this has never been optimised for nexus before.

gatecat commented 3 years ago

There might be scope for more fine-tuning, but https://github.com/YosysHQ/nextpnr/pull/823 on top of those gets router1 down to 20min. It might now be possible to do some comparison between router1 and router2 to find out why the latter gets stuck.

mkurc-ant commented 3 years ago

@gatecat The problem is that router2 didn't find a solution. From my observation it fights the congestion but constantly rips up and re-routes nets in the same area. I'll try again with the recent changes you've made (regarding VCC); maybe they will help.

gatecat commented 3 years ago

I'm still working on a functionally correct implementation, but it looks like LUT permutation should help a lot, although we might have to switch nexus back to router1 for a bit (that should get the route time down to 5 minutes or so). I'll hopefully have something usable in the next day or two.

gatecat commented 3 years ago

I need to sort out the existing patches before I make it into a PR, and also fix a separate issue that has been triggered in router2 (this also sets the nexus router back to router1), but if you want to try my LUT permutation support, it's the following branches:

This gets the route time down to about 10 minutes, although I think there should be some considerable room for improvement with some further fixes and getting router2 working again.

tcal-x commented 3 years ago

@gatecat -- maybe related, maybe not, let me know if I should open another issue. We have an "easy" Litex/VexRiscv design (just using a trivial CFU), which often completed in less than two minutes. But occasionally it seems to get stuck routing a single net -- it will be at "overuse=1" (with an occasional "overuse=2") for thousands of iterations. Usually it does complete, but after up to 10,000 iterations.

danc86 commented 3 years ago

I hit a strange error when I was trying to build proj/hps_accel + PR#298 + --cpu-variant=slimperf+cfu:

14:09:45 ERROR: Carry cell 'Cfu.pp.ppp.rdbpot.$109_CCU2_S0_7_A0_LUT4_Z_A_LUT4_Z_B_LUT4_Z_D_CCU2_S1$ccu2_comb[1]$' has multiple fanout on FCO

This was with seed 4213. The other seeds didn't seem to hit the problem, so it seems to be somewhat random. My nextpnr and prjoxide versions were from a few months ago though, and I can't reproduce the problem with newer nextpnr and prjoxide. It's probably just some old bug that has already been fixed, but I wanted to paste the error just in case it gives someone clues as to what might be causing these problems. I note that the "carry cell" it refers to is generated from the hps_accel post-processing logic.

In other news, I have tried the lutperm patches from this comment:
https://github.com/google/CFU-Playground/issues/246#issuecomment-925066129
and they seem to be really promising. Routing takes 5-10 minutes and completes successfully, and I am seeing fmax of 77MHz. A huge improvement over what we were getting before.

mkurc-ant commented 3 years ago

@tcal-x @danc86 Could you attach the exact files and nextpnr seeds that cause the problem?

gatecat commented 3 years ago

lutperm has now been merged; however I'd recommend keeping --router router1 on this project for now for more reliable timing and convergence (at the expense of increased runtime).

danc86 commented 3 years ago

Regarding the has multiple fanout on FCO problem: I've been changing too much stuff and can no longer reproduce the exact configuration that triggered that error. Probably best to just ignore that problem unless it reappears.

piotr-binkowski commented 3 years ago

I've encountered the same has multiple fanout on FCO problem on my side. Here are the files that were used by nextpnr: hps.hps_accel.tar.gz. The run that gave that error used seed 1207.
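
For anyone trying to reproduce this, re-running nextpnr on the attached files with the same seed should be enough (a sketch; the file names inside the tarball are assumptions based on the command quoted later in this thread):

# Hedged sketch: replay the failing run with the reported seed.
tar xf hps.hps_accel.tar.gz
nextpnr-nexus \
    --json hps_proto2_platform.json \
    --pdc hps_proto2_platform.pdc \
    --fasm hps_proto2_platform.fasm \
    --device LIFCL-17-8UWG72C \
    --seed 1207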

Finally, I'm using the following tool revisions:

gatecat commented 3 years ago

Unfortunately the problem here is, I think, an abc9 bug which I have seen on occasion with the Yosys Cyclone V support, where abc9 touches a carry chain that it's supposed to leave alone. Generating Verilog with yosys -p "write_verilog -norename hps_proto2_syn.v" hps_proto2_platform.json, you can see the problem:

  CCU2 #(
    .INIT0("0x96AA"),
    .INIT1("0x96AA"),
    .INJECT("NO")
  ) \VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT  (
    .A0(vns_shared_err),
    .A1(vns_shared_err),
    .B0(\VexRiscv.dataCache_1_io_mem_cmd_payload_address [9]),
    .B1(\VexRiscv.dataCache_1_io_mem_cmd_payload_address [10]),
    .C0(vns_shared_err),
    .C1(vns_shared_err),
    .CIN(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT_CIN ),
    .COUT(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D ),
    .D0(soc_vexriscv_cfu_bus_rsp_payload_response_ok),
    .D1(soc_vexriscv_cfu_bus_rsp_payload_response_ok),
    .S0(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT_S0 ),
    .S1(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT_S1 )
  );

...

  (* module_not_derived = 32'd1 *)
  (* src = "/usr/local/bin/../share/yosys/nexus/cells_map.v:91.44-92.44" *)
  LUT4 #(
    .INIT("0xa0fd")
  ) \VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z  (
    .A(\VexRiscv.dataCache_1.stageB_flusher_valid ),
    .B(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D ),
    .C(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_C ),
    .D(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_D ),
    .Z(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D [1])
  );
  (* module_not_derived = 32'd1 *)
  (* src = "/usr/local/bin/../share/yosys/nexus/cells_map.v:91.44-92.44" *)
  LUT4 #(
    .INIT("0xa0fd")
  ) \VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_1  (
    .A(\VexRiscv.dataCache_1.stageB_flusher_valid ),
    .B(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D ),
    .C(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_1_C ),
    .D(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_1_D ),
    .Z(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D [2])
  );

The COUT net \VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D is only supposed to drive the CIN of another CCU2, yet abc9's optimisations occasionally break this constraint, in this case feeding several LUTs with it.

Unfortunately I've never been able to reproduce this on smaller designs so it's very difficult to suggest a good way to proceed here.

tcal-x commented 3 years ago

Hi @gatecat, it looks like #306 might be another case of what you describe above, a COUT with illegal fanout. The assertion this time is:

Info: Packing carries...
terminate called after throwing an instance of 'nextpnr_nexus::assertion_failure'
  what():  Assertion failure: fco->users.at(0).port == id_CIN (/media/tim/GIT/YosysHQ/nextpnr/nexus/pack.cc:1364)
build_hps_proto2_platform.sh: line 4: 824243 Aborted                 (core dumped) nextpnr-nexus --json hps_proto2_platform.json --pdc hps_proto2_platform.pdc --fasm hps_proto2_platform.fasm --device LIFCL-17-8UWG72C --seed 1

Also, we don't see the assertion if we remove -abc9 from the yosys options.

We started seeing this after a change to the CFU, but unfortunately for analysis, the CFU is more than half of the overall design (larger than the VexRiscv). I'll send you a zip file of the build directory including Verilog source in case it's useful.