alanvgreen opened 3 years ago
Hi @gatecat , can you take a look when you get a chance? Are we getting too close to 100% utilization of some resource? Is there something you can address, or is there a workaround for the time being? Thanks!
I will investigate, but unfortunately this is a bit of an ongoing issue in nextpnr in general with routing congestion - particularly nextpnr-nexus - so I can't promise a quick fix.
Running O(50) nextpnrs in parallel seems to work, albeit takes a while: https://github.com/google/CFU-Playground/pull/251
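For reference, here is a minimal sketch of such a seed sweep. It is hypothetical, not the actual script from that PR; the nextpnr-nexus arguments are taken from the command quoted later in this thread.

# Hypothetical seed-sweep sketch; not the script used in the PR above.
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_seed(seed):
    # One nextpnr-nexus run; arguments mirror the command quoted later in this thread.
    cmd = ["nextpnr-nexus",
           "--json", "hps_proto2_platform.json",
           "--pdc", "hps_proto2_platform.pdc",
           "--fasm", f"hps_proto2_platform_seed{seed}.fasm",
           "--device", "LIFCL-17-8UWG72C",
           "--seed", str(seed)]
    with open(f"seed-{seed}.log", "w") as log:
        return seed, subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode

if __name__ == "__main__":
    # Launch ~50 runs and report the first seed that converges.
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(run_seed, s) for s in range(1, 51)]
        for fut in as_completed(futures):
            seed, rc = fut.result()
            if rc == 0:
                print(f"seed {seed} succeeded")
                break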
Here's a graph showing the "overuse" by iteration for each run:
I have merged #251 but the underlying issues remain.
@kgugala Could you please take a look at this? There are probably no easy fixes, but anything to speed up builds or make them more likely to succeed would be helpful.
@mkurc-ant can you grab this one?
Sure, I'll take this one
I started by reproducing the result. Indeed, nextpnr failed to finish routing with the given command and attached files.
For nexus it uses the "router2" algorithm. I collected some data to investigate what is going on.
By adding the --placed-svg placement.svg option I managed to dump the design placement. The whole fabric is filled quite uniformly, which may lead to issues with congestion.
I wrote some code to dump the wire overuse map to check whether there are any bottlenecks where the router struggles. I created a "video clip" spanning ~90 iterations. It looks like there isn't a single bottleneck point; the congestion is high throughout the whole fabric: https://user-images.githubusercontent.com/47315577/134305207-2e373a48-efd4-4462-8fb1-fc8913a4b45d.mp4
I have also looked into the code of router2. In principle it tries to route each arc (a source-to-sink pair) starting from both the source and the sink until the two searches meet somewhere in between. This happens inside a bounding box that limits the search range. Each time a net fails to route for 3 consecutive iterations the bounding box is enlarged. This is visible in the video: immediately after the beginning the congestion is focused in the middle of the fabric, then it spreads out as more and more nets are routed with larger bounding boxes. Once the bounding box covers the whole fabric nothing more can be done.
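A rough sketch of that bounding-box behaviour is shown below. This is illustrative Python only, written from the description above (growth step, per-net failure counting and fabric size are assumptions), not the actual router2 implementation.

# Illustrative sketch of the bounding-box expansion described above; not the
# actual nextpnr router2 code. A net's box starts around its pins and grows
# after 3 consecutive failed iterations, eventually covering the whole fabric.

FABRIC_W, FABRIC_H = 100, 100  # placeholder fabric size

class NetState:
    def __init__(self, pins):
        self.pins = pins          # [(x, y), ...] of the source and sinks
        self.margin = 0           # current bounding-box expansion
        self.fails = 0            # consecutive failed iterations

    def bbox(self):
        xs = [x for x, _ in self.pins]
        ys = [y for _, y in self.pins]
        return (max(0, min(xs) - self.margin), max(0, min(ys) - self.margin),
                min(FABRIC_W - 1, max(xs) + self.margin),
                min(FABRIC_H - 1, max(ys) + self.margin))

def router_iteration(nets, arc_routable):
    """arc_routable(src, sink, bbox) stands in for the bidirectional search
    that expands from both ends until the two wavefronts meet inside bbox."""
    failed = 0
    for net in nets:
        src, sinks = net.pins[0], net.pins[1:]
        if all(arc_routable(src, sink, net.bbox()) for sink in sinks):
            net.fails = 0
        else:
            net.fails += 1
            failed += 1
            if net.fails >= 3:    # expand the search area after 3 failures
                net.margin += 1
                net.fails = 0
    return failed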
I made a plot of how the number of overused wires evolves over time. It drops rapidly at the beginning, reaches its minimum, then slowly increases.
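For anyone who wants to reproduce such a plot, here is a small sketch that scrapes overuse figures out of a nextpnr log. The "overused=<N>" pattern is an assumption about the log-line format; adjust the regex to whatever your nextpnr build prints.

# Sketch: extract per-iteration overuse counts from a nextpnr router log and
# plot them. The "overused=" pattern is an assumed log format.
import re
import matplotlib.pyplot as plt

overuse = []
with open("seed-1.log") as f:
    for line in f:
        m = re.search(r"overused=(\d+)", line)
        if m:
            overuse.append(int(m.group(1)))

plt.plot(overuse)
plt.xlabel("router iteration")
plt.ylabel("overused wires")
plt.savefig("overuse.png")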
I tried a few different seeds but didn't get significantly different results. I also tried switching to the "router1" algorithm, but it also failed to converge in approximately the same runtime as router2.
I tried to play with various placer options but they didn't seem to have any significant effect on the result.
Then I tried removing bounding-box expansion, as it is disabled for mistral (Cyclone V). With it disabled the overused wire count does not grow as much towards the end, but a solution still cannot be found. See the graph.
I managed to implement the same design using Radiant. To do this I had to convert the JSON netlist to Verilog using Yosys:
read_json hps_proto2_platform.json
synth_nexus -run :coarse
write_verilog radiant/top.v
And then build it in Radiant:
prj_create -name "design" -impl "impl" -dev LIFCL-17-8UWG72C -synthesis "synplify"
prj_set_impl_opt {include path} {""}
prj_add_source "top.v" -work work
prj_add_source "../hps_proto2_platform.pdc" -work work
prj_set_impl_opt top "hps_proto2_platform"
prj_save
prj_run Synthesis -impl impl -forceOne
prj_run Map -impl impl
prj_run PAR -impl impl
prj_run Export -impl impl -task Bitgen
prj_close
This proves that the design that cannot be implemented by nextpnr-nexus can actually be implemented using Radiant.
Radiant reports the following usage:
Device utilization summary:
  VHI          1/1       100% used
  SIOLOGIC     2/72        2% used
  EBR         24/24      100% used
  MULT9       24/48       50% used
  MULT18       4/24       16% used
  REG18       24/48       50% used
  PREADD9     24/48       50% used
  LRAM         5/5       100% used
  SEIO18       2/48        4% used
               2/24        8% bonded
  SEIO33       5/71        7% used
               5/15       33% bonded
  OSC          1/1       100% used
  SLICE     5931/6912     85% used
  LUT      10703/13824    77% used
  REG       4764/13824    34% used
nextpnr, in turn, reports:
Info: Device utilisation:
Info: OXIDE_FF: 4771/13824 34%
Info: OXIDE_COMB: 10642/13824 76%
Info: RAMW: 36/ 1728 2%
Info: SEIO33_CORE: 7/ 23 30%
Info: SEIO18_CORE: 2/ 44 4%
Info: DIFFIO18_CORE: 0/ 22 0%
Info: OSC_CORE: 1/ 1 100%
Info: OXIDE_EBR: 24/ 24 100%
Info: PREADD9_CORE: 24/ 48 50%
Info: MULT9_CORE: 24/ 48 50%
Info: MULT18_CORE: 4/ 24 16%
Info: REG18_CORE: 24/ 48 50%
Info: MULT18X36_CORE: 0/ 12 0%
Info: MULT36_CORE: 0/ 6 0%
Info: ACC54_CORE: 0/ 12 0%
Info: DCC: 1/ 62 1%
Info: VCC_DRV: 1/ 74 1%
Info: LRAM_CORE: 5/ 5 100%
Info: IOLOGIC: 0/ 44 0%
Info: SIOLOGIC: 0/ 22 0%
This is more or less consistent (Radiant may perform some optimizations during its flow).
The Radiant placement looks similar to nextpnr's, although of course it is not possible to tell which cells go where:
I tested router1 again with --router router1. It managed to route the design successfully, but took about 1h 15min to complete.
Some of the router1 slowness is probably around Vcc routing on the nexus which is a specific bug I'll look at once I've fixed the DSP placement one.
I'm also planning to add LUT permutation to the nexus arch soon which should reduce congestion on certain wire types.
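For background, LUT permutation exploits the fact that a k-input LUT can present its logical inputs on any physical pins as long as the INIT bits are permuted to match, which lets the router pick whichever input pin wire is least congested. Below is a generic sketch of the INIT remapping for a 4-input LUT; the bit ordering and all specifics are assumptions for illustration, not the nextpnr/nexus implementation.

# Generic illustration of LUT input permutation, not the nextpnr implementation.
# Assumes INIT bit index = A + 2*B + 4*C + 8*D for inputs (A, B, C, D).

def permute_lut4_init(init, perm):
    """perm[logical_input] = physical_pin; return the INIT for the permuted LUT."""
    new_init = 0
    for phys_addr in range(16):
        log_addr = 0
        for logical in range(4):
            # Each logical input now reads the value present on its new physical pin.
            bit = (phys_addr >> perm[logical]) & 1
            log_addr |= bit << logical
        if (init >> log_addr) & 1:
            new_init |= 1 << phys_addr
    return new_init

# Example: a LUT computing A & ~B has INIT 0x2222. Swapping the A and B pins
# gives 0x4444, i.e. the same function with the two pins exchanged.
assert permute_lut4_init(0x2222, [1, 0, 2, 3]) == 0x4444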
https://github.com/gatecat/prjoxide/pull/15 and https://github.com/YosysHQ/nextpnr/pull/822 get the router1 runtime down to about 40min. I'm looking at some further tweaks for router1 performance though, as this has never been optimised for nexus before.
There might be scope for more fine-tuning, but https://github.com/YosysHQ/nextpnr/pull/823 on top of those gets router1 down to 20min. It might now be possible to do some comparison between router1 and router2 to find out why the latter gets stuck.
@gatecat The problem is that router2 didn't find a solution. From my observation it fights the congestion but constantly rips up and re-routes nets in the same area. I'll try again with the recent changes you've made (regarding VCC); maybe they will help.
I'm still working on an actually functionally correct implementation, but it looks like LUT permutation should help a lot, although we might have to switch nexus back to router1 for a bit (that should still get the route time down to 5 minutes or so). I'll hopefully have something usable in the next day or two.
I need to sort out the existing patches before I make it into a PR, and also fix a separate issue that has been triggered in router2 (this also sets the nexus router back to router1), but if you want to try my LUT permutation support, it's the following branches:
This gets the route time down to about 10 minutes, although I think there should be some considerable room for improvement with some further fixes and getting router2 working again.
@gatecat -- maybe related, maybe not; let me know if I should open another issue. We have an "easy" LiteX/VexRiscv design (just using a trivial CFU) which usually completes in less than two minutes. But occasionally it seems to get stuck routing a single net: it sits at "overuse=1" (with an occasional "overuse=2") for thousands of iterations. Usually it does complete, but only after up to 10,000 iterations.
I hit a strange error when I was trying to build proj/hps_accel + PR#298 + --cpu-variant=slimperf+cfu:
14:09:45 ERROR: Carry cell 'Cfu.pp.ppp.rdbpot.$109_CCU2_S0_7_A0_LUT4_Z_A_LUT4_Z_B_LUT4_Z_D_CCU2_S1$ccu2_comb[1]$' has multiple fanout on FCO
This was with seed 4213. The other seeds didn't seem to hit the problem, so it appears to be somewhat random. My nextpnr and prjoxide versions were from a few months ago though, and I can't reproduce the problem with newer nextpnr and prjoxide. It's probably just some old bug that has already been fixed, but I wanted to paste the error in case it gives someone a clue as to what might be causing these problems. I note that the "carry cell" it refers to is generated from the hps_accel post-processing logic.
In other news, I have tried the lutperm patches from this comment:
https://github.com/google/CFU-Playground/issues/246#issuecomment-925066129
and they seem really promising. Routing takes 5-10 minutes and completes successfully, and I am seeing an fmax of 77 MHz, a huge improvement over what we were getting before.
@tcal-x @danc86 Could you attach the exact files and nextpnr seeds that cause the problem?
lutperm has now been merged; however, I'd recommend keeping --router router1 on this project for now for more reliable timing and convergence (at the expense of increased runtime).
Regarding the "has multiple fanout on FCO" problem: I've been changing too much stuff and can no longer reproduce the exact configuration that triggered that error. It's probably best to just ignore it unless it reappears.
I've encountered the same "has multiple fanout on FCO" problem on my side.
Here are the files that were used by nextpnr: hps.hps_accel.tar.gz
The run that gave the error used seed 1207.
Finally, I'm using the following tool revisions:
d9a71083e1a081f89e1aa4c357bc3b828eea6709 (with the write-timing-report branch rebased on top of it)
7d56d6392d859e18155e43818a62ea3f5789946a
50be8fd0c23856be7afa28527fe4f30dcc975c87
Unfortunately the problem here is an abc9 bug, I think, which I have seen on occasion for the Yosys Cyclone V support, where abc9 touches a carry chain that it's supposed to leave alone. Generating Verilog with yosys -p "write_verilog -norename hps_proto2_syn.v" hps_proto2_platform.json, you can see the problem:
CCU2 #(
.INIT0("0x96AA"),
.INIT1("0x96AA"),
.INJECT("NO")
) \VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT (
.A0(vns_shared_err),
.A1(vns_shared_err),
.B0(\VexRiscv.dataCache_1_io_mem_cmd_payload_address [9]),
.B1(\VexRiscv.dataCache_1_io_mem_cmd_payload_address [10]),
.C0(vns_shared_err),
.C1(vns_shared_err),
.CIN(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT_CIN ),
.COUT(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D ),
.D0(soc_vexriscv_cfu_bus_rsp_payload_response_ok),
.D1(soc_vexriscv_cfu_bus_rsp_payload_response_ok),
.S0(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT_S0 ),
.S1(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D_CCU2_COUT_S1 )
);
...
(* module_not_derived = 32'd1 *)
(* src = "/usr/local/bin/../share/yosys/nexus/cells_map.v:91.44-92.44" *)
LUT4 #(
.INIT("0xa0fd")
) \VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z (
.A(\VexRiscv.dataCache_1.stageB_flusher_valid ),
.B(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D ),
.C(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_C ),
.D(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_D ),
.Z(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D [1])
);
(* module_not_derived = 32'd1 *)
(* src = "/usr/local/bin/../share/yosys/nexus/cells_map.v:91.44-92.44" *)
LUT4 #(
.INIT("0xa0fd")
) \VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_1 (
.A(\VexRiscv.dataCache_1.stageB_flusher_valid ),
.B(\VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D ),
.C(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_1_C ),
.D(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D_LUT4_Z_1_D ),
.Z(\VexRiscv.dataCache_1.stageB_mmuRsp_physicalAddress_FD1P3IX_Q_D [2])
);
The COUT net \VexRiscv.dataCache_1.stageB_flusher_valid_FD1P3IX_Q_D_LUT4_Z_D is only supposed to drive the CIN of another CCU2, yet abc9's optimisations occasionally break this constraint; in this case the net is feeding several LUTs.
Unfortunately I've never been able to reproduce this on smaller designs so it's very difficult to suggest a good way to proceed here.
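It might be possible to spot the condition directly in the JSON netlist before running nextpnr. Below is a hypothetical checker sketch, assuming the usual Yosys JSON layout ("modules" -> "cells" -> "connections") and the CCU2 port names shown above; it is not an existing tool.

# Sketch: scan a Yosys JSON netlist for CCU2 COUT nets that drive anything
# other than the CIN of another CCU2. Hypothetical helper, not an existing tool.
import json, sys

with open(sys.argv[1]) as f:
    design = json.load(f)

for mod_name, mod in design["modules"].items():
    cells = mod["cells"]
    # Map each COUT net bit to the CCU2 cell that drives it.
    cout_drivers = {}
    for cell_name, cell in cells.items():
        if cell["type"] == "CCU2":
            for bit in cell["connections"].get("COUT", []):
                cout_drivers[bit] = cell_name
    # Report every other port that a COUT net feeds, except a CCU2 CIN.
    for cell_name, cell in cells.items():
        for port, bits in cell["connections"].items():
            for bit in bits:
                if bit in cout_drivers and cell_name != cout_drivers[bit]:
                    if not (cell["type"] == "CCU2" and port == "CIN"):
                        print(f"{mod_name}: COUT of {cout_drivers[bit]} feeds "
                              f"{cell_name}.{port} (illegal fanout?)")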
Hi @gatecat , it looks like #306 might be another case of what you describe above, a COUT with illegal fanout. The assertion this time is:
Info: Packing carries...
terminate called after throwing an instance of 'nextpnr_nexus::assertion_failure'
what(): Assertion failure: fco->users.at(0).port == id_CIN (/media/tim/GIT/YosysHQ/nextpnr/nexus/pack.cc:1364)
build_hps_proto2_platform.sh: line 4: 824243 Aborted (core dumped) nextpnr-nexus --json hps_proto2_platform.json --pdc hps_proto2_platform.pdc --fasm hps_proto2_platform.fasm --device LIFCL-17-8UWG72C --seed 1
Also, we don't see the assertion if we remove -abc9 from the Yosys options.
We started seeing this after a change to the CFU, but unfortunately for analysis, the CFU is more than half of the overall design (larger than the VexRiscv). I'll send you a zip file of the build directory including Verilog source in case it's useful.
Sometimes the Oxide build fails because nextpnr never completes. In these cases it is stuck in "Running main router loop..." for many hours.
The attached file hps-build.tar.gz contains:
- The json and pdc input files used to run this command:
- The log file seed-1.log, which was still running after ~10.5 hours
- The log file seed-122.log, which is for the same command, except using --seed 122