google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.
http://cfu-playground.rtfd.io/
Apache License 2.0

High-fanout nets in HPS design #331

Open tcal-x opened 3 years ago

tcal-x commented 3 years ago

I used the following script to print out high-fanout nets in the HPS design. It is used as the --pre-place script in nextpnr.

# Print every net with more than 100 users, along with its driver cell and port.
for net, netinfo in ctx.nets:
    fanout = len(netinfo.users)
    if fanout > 100:
        print("%d fanout:  net %s driver %s:%s" % (fanout, net, netinfo.driver.cell.name, netinfo.driver.port))

This is what it produced (I sorted the list afterwards):

4708 fanout:  net por_clk$glb_clk driver por_clk$glb_clk$drv_DCC:CLKO
3402 fanout:  net memdat_1[8] driver Cfu.filter_flow_restrictor.initial_VLO_Z:F
3198 fanout:  net $CONST_VCC_DRV_NET_ driver $CONST_VCC_DRV_DRV_:Z
2335 fanout:  net sys_rst driver FD1P3BX_1:Q
484 fanout:  net Cfu.macc_operands__ready driver Cfu.operands_buffer.output__ready_LUT4_Z:F
394 fanout:  net VexRiscv.CfuPlugin_bus_rsp_valid_WIDEFN9_A0_Z driver VexRiscv.CfuPlugin_bus_rsp_valid_WIDEFN9_A0$widefn_comb[0]$:OFX
213 fanout:  net VexRiscv._zz_dBus_rsp_valid_LUT4_B_Z_LUT4_C_Z driver VexRiscv._zz_dBus_rsp_valid_LUT4_B_Z_LUT4_C:F
194 fanout:  net VexRiscv.decode_to_execute_MEMORY_MANAGMENT_LUT4_B_Z_WIDEFN9_SEL_Z_LUT4_C_1_Z driver VexRiscv.decode_to_execute_MEMORY_MANAGMENT_LUT4_B_Z_WIDEFN9_SEL_Z_LUT4_C_1:F
145 fanout:  net soc_reset_re_LUT4_B_Z driver soc_reset_re_LUT4_B:F
131 fanout:  net Cfu.pp.fifo.wrapped.unbuffered.consume[2] driver Cfu.pp.fifo.wrapped.unbuffered.consume_FD1P3IX_Q_3:Q
131 fanout:  net Cfu.pp.fifo.wrapped.unbuffered.consume[1] driver Cfu.pp.fifo.wrapped.unbuffered.consume_FD1P3IX_Q_4:Q
131 fanout:  net Cfu.pp.fifo.wrapped.unbuffered.consume[0] driver Cfu.pp.fifo.wrapped.unbuffered.consume_FD1P3IX_Q_5:Q
131 fanout:  net Cfu.operands_buffer.buffering_inputs driver Cfu.operands_buffer.buffering_inputs_FD1P3IX_Q:Q
131 fanout:  net Cfu.operands_buffer.buffering_filters driver Cfu.operands_buffer.buffering_filters_FD1P3IX_Q:Q
130 fanout:  net Cfu.pp.fifo.wrapped.unbuffered.consume[3] driver Cfu.pp.fifo.wrapped.unbuffered.consume_FD1P3IX_Q_2:Q
130 fanout:  net Cfu.macc.operands__valid_LUT4_C_Z_LUT4_B_Z driver Cfu.macc.operands__valid_LUT4_C_Z_LUT4_B:F
130 fanout:  net Cfu.macc.operands__valid_LUT4_C_Z_LUT4_B_1_Z driver Cfu.macc.operands__valid_LUT4_C_Z_LUT4_B_1:F
129 fanout:  net VexRiscv.IBusCachedPlugin_fetchPc_booted_LUT4_D_C_LUT4_D_C_LUT4_B_Z driver VexRiscv.IBusCachedPlugin_fetchPc_booted_LUT4_D_C_LUT4_D_C_LUT4_B:F
128 fanout:  net Cfu.input_store.data_output__valid_LUT4_D_Z_LUT4_C_Z driver Cfu.input_store.data_output__valid_LUT4_D_Z_LUT4_C:F
128 fanout:  net Cfu.filter_store.data_output__valid_LUT4_D_Z_LUT4_C_Z driver Cfu.filter_store.data_output__valid_LUT4_D_Z_LUT4_C:F
124 fanout:  net soc_vexriscv_cfu_bus_cmd_payload_function_id[0] driver VexRiscv.decode_to_execute_INSTRUCTION_FD1P3IX_Q_17:Q
104 fanout:  net VexRiscv.memory_arbitration_isValid driver VexRiscv.memory_arbitration_isValid_FD1P3IX_Q:Q
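
The sorting step mentioned above could also be folded into the script itself. The following is only a sketch, not the script that produced the list: `high_fanout_report` and `fake_net` are hypothetical names, the tie-break by net name is an arbitrary choice, and the stand-in objects exist only so the function can be tried outside nextpnr.

```python
from types import SimpleNamespace


def high_fanout_report(nets, threshold=100):
    """Return one report line per net whose fanout exceeds threshold, largest first."""
    rows = []
    for name, netinfo in nets:
        fanout = len(netinfo.users)
        if fanout > threshold:
            rows.append((fanout, str(name), netinfo.driver.cell.name, netinfo.driver.port))
    # Sort by descending fanout; ties are broken by net name (an arbitrary choice).
    rows.sort(key=lambda row: (-row[0], row[1]))
    return ["%d fanout:  net %s driver %s:%s" % row for row in rows]


def fake_net(fanout, cell, port):
    """Hypothetical stand-in for a nextpnr net object, for use outside nextpnr."""
    driver = SimpleNamespace(cell=SimpleNamespace(name=cell), port=port)
    return SimpleNamespace(users=[None] * fanout, driver=driver)


demo = [
    ("sys_rst", fake_net(2335, "FD1P3BX_1", "Q")),
    ("small_net", fake_net(3, "some_lut", "F")),
    ("por_clk$glb_clk", fake_net(4708, "por_clk$glb_clk$drv_DCC", "CLKO")),
]
report = high_fanout_report(demo)
for line in report:
    print(line)
```

Inside nextpnr, the function would presumably be called as `high_fanout_report(ctx.nets)`, where `ctx` is the same context object the original script iterates over.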

tcal-x commented 3 years ago

3402 fanout:  net memdat_1[8] driver Cfu.filter_flow_restrictor.initial_VLO_Z:F

This net seems to be providing 1'b0, i.e. constant zero (GND).

It's driven by an OXIDE_COMB with INIT = 32'h00000000.
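
As a quick illustration of why an all-zero INIT means constant zero: a LUT's inputs select one bit of its INIT truth table, so if every bit is 0 the output is 0 for any input. This is a generic sketch, not the OXIDE_COMB model; `lut_output` is a hypothetical helper, and the 5-input width and input bit ordering are assumptions (with INIT = 0 the ordering does not affect the result).

```python
from itertools import product


def lut_output(init, inputs):
    """Evaluate a LUT: the input bits form an index into the init truth table."""
    index = 0
    for bit_position, value in enumerate(inputs):
        # Assumed ordering: inputs[0] is the least significant index bit.
        index |= (value & 1) << bit_position
    return (init >> index) & 1


# A 32-bit INIT corresponds to 5 inputs; check all 32 input combinations.
all_zero = all(
    lut_output(0x00000000, combo) == 0 for combo in product((0, 1), repeat=5)
)
print(all_zero)
```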

danc86 commented 3 years ago

Oh, nice find! It's good to know that one is not really specific to the FlowRestrictor module in spite of its name.

Does it mean it's working as intended -- we expect the design to have some global constant zero with high fan-out, whatever its name happens to be?

Or does this mean it's not properly using the global routing and we need to fix something?

tcal-x commented 3 years ago

I don't know the details yet about constant handling. I've dealt with it before on other FPGA architectures. One would think it's trivial, but it's not. Sometimes routing muxes or LUT input muxes have inputs hardcoded to 1'b0 and/or 1'b1 to provide sources of constants. Sometimes you need to burn a LUT to generate a constant. It might be cheaper to route a constant from the closest net providing that value. It doesn't make sense to partition the constant nets before placement. I guess you could special-case the router: provide a number of constant sources spread around the fabric, then have the router connect them to sinks as needed using minimum spanning trees. I was just about to dig into what nextpnr does...
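
To make the "constant sources plus minimum spanning trees" guess concrete, here is a toy sketch. It is not what nextpnr does: it treats sources and sinks as (x, y) tile coordinates, uses Manhattan distance as a stand-in for routing cost, and all function names are hypothetical. Each sink is assigned to its nearest source, then Prim's algorithm builds one tree per source.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) locations; a stand-in for routing cost."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def group_sinks_by_nearest_source(sources, sinks):
    """Assign every sink to the closest constant source."""
    groups = {source: [] for source in sources}
    for sink in sinks:
        nearest = min(sources, key=lambda source: manhattan(source, sink))
        groups[nearest].append(sink)
    return groups


def spanning_tree(root, sinks):
    """Prim's algorithm: return (from, to) edges connecting root to all sinks."""
    in_tree = [root]
    remaining = list(sinks)
    edges = []
    while remaining:
        # Pick the cheapest edge from any node already in the tree to a new sink.
        _, parent, child = min(
            (manhattan(p, c), p, c) for p in in_tree for c in remaining
        )
        edges.append((parent, child))
        in_tree.append(child)
        remaining.remove(child)
    return edges


# Two hypothetical constant sources and four sinks on a small grid.
sources = [(0, 0), (10, 10)]
sinks = [(1, 0), (2, 0), (9, 10), (10, 8)]
groups = group_sinks_by_nearest_source(sources, sinks)
trees = {source: spanning_tree(source, group) for source, group in groups.items()}
for source, edges in trees.items():
    print(source, edges)
```

Because the grouping happens on placed coordinates, a scheme like this would have to run after placement, which fits the point above that partitioning constant nets before placement doesn't make sense.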

tcal-x commented 3 years ago

Here's some discussion relating to constant nets with FPGA interchange, between gatecat and litghost: https://github.com/YosysHQ/nextpnr/pull/591

And here's the current code related to packing constant nets for Nexus: https://github.com/YosysHQ/nextpnr/blob/master/nexus/pack.cc#L302-L321

And here's an interesting comment in router2 related to high-fanout const nets on Nexus: https://github.com/YosysHQ/nextpnr/blob/master/common/router2.cc#L516-L517

The takeaway seems to be that high-fanout constant nets have been considered.