hukenovs / intfftk

Fully pipelined Integer Scaled / Unscaled Radix-2 Forward/Inverse Fast Fourier Transform (FFT) IP core for the newest Xilinx FPGAs (source language: VHDL / Verilog). GNU GPL 3.0.

Utilization and Fmax numbers #2

Closed: gabriel-tenma-white closed this issue 4 years ago

gabriel-tenma-white commented 5 years ago

Would you mind publishing some FPGA resource usage and Fmax numbers and the configuration/part number used? I tried synthesizing (in Vivado) for various 7 series and Ultrascale parts, but could never seem to get timing closure above about 320MHz on Ultrascale (NFFT=12, DATA_WIDTH=24, TWDL_WIDTH=16, truncation mode, XSERIES set to the correct value).

I also tried your other implementation, intfft_spdf (posting issues seems to be disabled on that repository), but none of its RAM blocks ever synthesize to BRAM; they all end up implemented as LUTRAM. Looking at the RTL, it appears you are using two read ports and a write port, which as far as I know is only supported on UltraScale, but targeting UltraScale/UltraScale+ didn't help and I was still seeing 11k+ LUT utilization for a 4096-point FFT. What device part number are you targeting, and are any special constraints needed?

gabriel-tenma-white commented 5 years ago

These are what I'm getting on Vivado 2018.3:

* int_fft_single_path
* 24 bit data
* 16 bit twiddle
* truncation mode
* scaled
* XSERIES="OLD"

intfftk 1024, xc7z010-1
    5060 lut
    3952 ff
    2.5 bram
    52 dsp
    289 MHz

intfftk 4096, xc7z010-1
    4968 lut
    4856 ff
    18 bram
    66 dsp
    283 MHz

intfftk 1024, xc7k160tfbg676-1
    5058 lut
    3964 ff
    2.5 bram
    52 dsp
    373 MHz

intfftk 4096, xc7k160tfbg676-1
    5207 lut
    4714 ff
    17.5 bram
    66 dsp
    346 MHz

intfftk 8192, xc7k160tfbg676-1
    5455 lut
    5268 ff
    32.5 bram
    74 dsp
    350 MHz

About half the LUT usage is LUTRAM. The timing bottleneck seems to be the block RAM in xIN_BUF, and Vivado doesn't seem to be able to absorb an output register into the BRAM. I think this could be fixed with more register stages at the BRAM output.

hukenovs commented 5 years ago

Hi @gabriel-tenma-white! Here are my implementation results and some conclusions:

FPGA parts:

All of them (except xc7k325tfbv900-1) give the same utilization report after implementation (see below).

Input parameters:
    DATA_WIDTH      = 24
    TWIDDLE_WIDTH   = 16
    MODE            = SCALED, TRUNCATE
    NFFT            = 1024
    VIVADO          = 2017.4

Resource usage:

FFT block:
    ARITHMETIC/DSP     52
    BLOCKRAM/BRAM      10
    CLB/LUT            2553
    CLB/LUTRAM         512
    CLB/SRL            440
    REGISTER/SDR       3938

Input buffer block:
    BLOCKRAM/BRAM      2
    CLB/LUT            51
    REGISTER/SDR       101

Output buffer block:
    BLOCKRAM/BRAM      1
    CLB/LUT            99
    REGISTER/SDR       138

Total usage:
    ARITHMETIC/DSP     52
    BLOCKRAM/BRAM      15
    CLB/LUT            2747
    CLB/LUTRAM         512
    CLB/SRL            440
    REGISTER/SDR       4262

Timing report: I tested all 5 FPGA parts with only a single constraint (Freq = 333.333 MHz): create_clock -period 3.000 -name CLK -waveform {0.000 1.500} [get_ports CLK]

Timing report: Failed routes = 0, timings met.

But for the Kintex-7 (xc7k325tfbv900-1) part I received strange utilization results: RAMB usage decreased and LUTRAM usage increased.

FFT block:
    ARITHMETIC    52
    BLOCKRAM      6
    CLB           4213
    REGISTER      4131

Input buffer block:
    CLB           1414
    REGISTER      238

Output buffer block has the same results.

So only xc7k325tfbv900-1 gives different results (see the picture below). And Artix has bad timings :(

Solution: I tried using the attribute ram_style = "block" on the input buffer component. It works! So you can use the ram_style attribute if you want to control synthesis behaviour independently of the FPGA part. I suppose I can also change the RAM style per FFT stage: for example, if (stage > 9) the RAM style would be "block", otherwise "distributed".

And you were right about the internal registers of the BRAMs! For maximum performance we need to use the additional internal registers. I'm going to fix this in the next release!

Thanks for feedback!


BR, Alexander

hukenovs commented 5 years ago

@gabriel-tenma-white I'd also like to share some useful info about the intfft_spdf project. It is an experimental design with a high resource cost. If you want to improve it, you need to read the article "Parallel Extensions to Single-Path Delay-Feedback FFT Architectures". You would need to change my butterfly (BF) architecture for DIT and DIF. I used a simple butterfly without double multiplexing, but it can be improved by using the BF1 and BF2 architectures (see Fig. 3 and Fig. 4 in the article) and replacing the twiddle multipliers.

DSP resource usage for several FFT lengths (Data width = 16, Twiddle width = 16, Unscaled):

Add/Sub (BF math logic):

N              16    32    64    128    256    512    1K    2K    4K
  SingleBF      4     5     6      7      9    11     13    15    17
  BF1-BF2       8    10    12     16     18    22    24     28    32

Twiddle multipliers:

N              16    32    64    128    256    512    1K    2K    4K
  SingleBF      8    12    16     20     24     28    32    36    44
  BF1-BF2       4     8     8     12     12     16    16    20    20

For example, at N = 4096 the total DSP48 usage is 52 (BF1-BF2 architecture) versus 61 (my project).

The Xilinx IP core uses the BF1-BF2 architecture. But it doesn't have a decimation-in-time option, so you have to use bit-reverse (digit-reverse) converters after all FFT components when you map FFT+IFFT cores together (for example, in fast convolution processing).
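The reorder stage mentioned above is just the bit-reversed index permutation. A minimal sketch (plain Python, function name is mine) of the converter's address mapping:

```python
def bit_reverse_indices(nfft):
    """Return the bit-reversed index permutation for an nfft-point radix-2 FFT.

    nfft must be a power of two; index i maps to i with its log2(nfft)
    address bits reversed.
    """
    bits = nfft.bit_length() - 1
    return [int(format(i, f'0{bits}b')[::-1], 2) for i in range(nfft)]

# For NFFT = 8 the permutation is [0, 4, 2, 6, 1, 5, 3, 7].
perm = bit_reverse_indices(8)
```

With a DIT inverse FFT available, a DIF forward core's bit-reversed output can feed it directly and the two permutations cancel, which is why the missing DIT option in the Xilinx core forces the extra converters.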

P.S. I'm working on a Radix-4 FFT project right now. It will also be an open-source solution. It has some advantages for parallel calculation: for example, if you have a complex signal from an ADC at a 1600 MHz sampling frequency, you can downsample it in the FPGA from one 1600 MHz data stream into 4 parallel streams at 400 MHz. You could also use a 200 MHz clock with 8 parallel streams, but a Radix-8 scheme isn't simple on an FPGA architecture ;)


BR, Alexander

gabriel-tenma-white commented 5 years ago

Artix does have weird timings, the LUTRAMs are slower than the block ram (!) which implies the LUTs are quite slow. I just tried synthesizing my FFT implementation (owocomm-0/fpga-fft) on xc7a200tfbg484-3, and hit the LUTRAM frequency limit which is 476 MHz. (In case you are wondering, my design uses an excessive number of flip-flops ;) )

I think I know why Vivado is sometimes implementing the delay buffer in LUTRAM. The delay lines used are of size N/2, N/4, N/8, ..., so there are bound to be some sizes that are too small for BRAM (a lot of space would be wasted) yet quite large for LUTRAM. I think I lucked out on this one in my implementation by using sqrt(N) sub-FFTs.
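A quick way to see the mismatch is to enumerate the delay-line depths of a radix-2 single-path delay-feedback pipeline and mark which fall below the point where a block RAM stops paying off (illustrative Python; the 512-word threshold is my own round number, not a vendor figure):

```python
def sdf_delay_depths(nfft):
    """Delay-line depths N/2, N/4, ..., 1 of a radix-2 SDF FFT pipeline."""
    depths = []
    d = nfft // 2
    while d >= 1:
        depths.append(d)
        d //= 2
    return depths

def classify(depths, bram_min=512):
    """Assign a RAM style per depth.

    bram_min is an illustrative cutoff: below it most of a block RAM
    would sit empty, so distributed (LUT) RAM is the cheaper choice.
    """
    return {d: ('block' if d >= bram_min else 'distributed') for d in depths}

# e.g. a 4096-point SDF FFT has 12 delay lines, 2048 down to 1;
# under this cutoff only the four deepest would map to BRAM.
styles = classify(sdf_delay_depths(4096))
```

This is also essentially the stage-dependent ram_style selection suggested earlier in the thread.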

I will look into the SPDF architecture. I'm looking to implement FFTs with the minimal amount of multiplier usage because I have applications where a Spartan 6 has enough logic resources but lacks enough multipliers. I'm also looking into implementing the burst I/O radix-2 architecture, since most of my DSP will be on 30Msps data while the FFT core can run at 300MHz.

I don't have the luxury of 1.6Gsps ADCs (or anything above 100Msps for that matter) and the best FPGA I have access to is a Zynq 7010. Ultrascale/Ultrascale+ are way beyond imagination ;)

I'm also looking to do large FFTs (size 1M to 16M) on a Zynq in DDR3 memory using Bailey's 4-step algorithm, which decomposes a 16M-point FFT into size-4096 FFTs. The main bottleneck there will be reading/writing DRAM in transposed order. Do you have any experience with other algorithms for large FFTs in DRAM?
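For reference, the 4-step decomposition can be checked numerically. This NumPy sketch is my own rendering of the standard algorithm (column FFTs, inter-pass twiddles, row FFTs, transposed readout), not code from either project:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Four-step FFT of length n1*n2, built from n1- and n2-point FFTs."""
    n = n1 * n2
    a = x.reshape(n1, n2)                     # a[i, j] = x[i*n2 + j]
    f = np.fft.fft(a, axis=0)                 # pass 1: n1-point column FFTs
    k1 = np.arange(n1).reshape(-1, 1)
    j = np.arange(n2).reshape(1, -1)
    f = f * np.exp(-2j * np.pi * k1 * j / n)  # inter-pass twiddle factors
    f = np.fft.fft(f, axis=1)                 # pass 2: n2-point row FFTs
    return f.flatten(order='F')               # transposed readout: X[k1 + n1*k2]

# A 16M-point FFT would use n1 = n2 = 4096; kept small here for checking.
X = four_step_fft(np.arange(32, dtype=complex), 4, 8)
```

The transposed readout at the end is exactly the DRAM access-pattern problem discussed below: the data leaves the second pass in column-major order.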

hukenovs commented 5 years ago

@gabriel-tenma-white Freq = 476 MHz after implementation? Wow, that looks really good. Did you try it in an FPGA? How does it work at that frequency? (I haven't tested my projects above 400 MHz.)

I suppose the ram_style attribute in the HDL code can solve the problem of choosing LUT vs. RAM. I used this trick in another project for N > 9 (fp23fftk repo; see the fp_delay_line.vhd component).

I am going to add Radix-2 burst I/O to this project as a generic parameter. Unfortunately I don't have much time for open source, so you may see burst I/O in this project, but not right now, sorry. Or you can help me and we can do it together :)

Yep! I've done an ultra-long FFT project with a flexible generic parameter from 256K to 16M points on Xilinx Kintex UltraScale+ (XCKU11P and XCKU15P). Total NFFT = N1 x N2, so a 1M FFT is built from 1024 x 1024-point FFTs, and a 16M FFT from 4K x 4K. The algorithm is similar to a 2-D FFT, but with some differences: the main one is that you need additional twiddle multipliers after the N1 FFT pass, followed by a second shuffler. I used 3 independent DDR4 SDRAM controllers (each running at 2400 MHz) connected to the FPGA. My ultra-long FFT core is fully pipelined with a continuous streaming architecture; the input signal frequency is 250-300 MHz. The most complex part of the ultra-long FFT is the buffering that shuffles data after the DDR controllers (I have 100% UltraRAM and ~80% BRAM utilization). Sorry, but I can't say more about the details of this project because of an NDA :)


BR, Alexander

gabriel-tenma-white commented 5 years ago

Yeah the Fmax numbers are after implementation by looking at WNS. I usually constrain the clock to the maximum clock of the BRAM from the device datasheet. I have tested the numerical accuracy in test benches with random data but not on hardware yet. I'm still in the process of designing the AXI DMA interface to the Zynq. I think I'll design the burst IO FFT after that.

So it looks like you already discovered the algorithm I'm using :+1: I recognize the data shuffling you are describing: the transpose step is best done by reading bursts of X values from each row of the matrix, so you end up with the first X columns of the matrix in BRAM/UltraRAM; then you read the columns out of the BRAM and perform a streaming FFT on each column (after applying the twiddles). For improved performance I thought about reordering the address bits of the DRAM so that each DRAM page spans a square sub-block of the matrix rather than one or more rows (as in row-major order), which maximizes the number of operations done on each opened page.
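That address-bit reordering amounts to a tiled (block-major) address map. A hypothetical Python sketch, where `tile` and `row_tiles` are placeholder parameters I chose for illustration:

```python
def tiled_address(row, col, tile=64, row_tiles=64):
    """Map a matrix coordinate to a linear address where each tile x tile
    sub-block is contiguous, so one DRAM page spans a square block of the
    matrix instead of a fragment of a row.

    row_tiles is the number of tiles per matrix row.
    """
    tr, ro = divmod(row, tile)          # which tile row, offset inside it
    tc, co = divmod(col, tile)          # which tile column, offset inside it
    tile_index = tr * row_tiles + tc    # tiles laid out row-major
    return tile_index * tile * tile + ro * tile + co
```

Both row bursts and column bursts then touch only tile-many pages per tile-length run, instead of the column case opening a new page on every beat as plain row-major addressing does.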

Unfortunately the Zynq DDR controller only seems to be able to do about 2GB/s total throughput when reading and writing simultaneously, so I'm limited to fairly low sample rates doing large FFTs in realtime. I might be designing a Kintex board soon so might be able to play with a bit more memory bandwidth.

Have you played with the new Xilinx FPGAs with HBM? It's an integrated 1024-bit wide DDR4 chip and apparently gets you over 400GB/s bandwidth to 8GB of memory. It's way out of my budget of course but since you are already dealing with Ultraexpensive+ FPGAs maybe you can convince your employer to let you play with one of these :)

hukenovs commented 5 years ago

@gabriel-tenma-white Unfortunately, complex FPGA designs can't run at high frequencies because of logic timing. For example, the achievable frequency is lower than expected when a design has 80+% resource utilization and several big independent logic blocks in one FPGA: an ADC receiver, some SDRAM controllers, DSP cores (FFTs, FIRs, CIC, DDS, etc.), PCIe or Ethernet. A single FFT will run at 400+ MHz, but the total project solution cannot run at that frequency. Also, the max frequencies of logic and DSPs differ (a DSP can run at about 2x the max frequency of LUT/FF logic).

BTW, have you ever seen this article? https://www.xilinx.com/support/answers/68595.html I am trying to run some FIRs with this method. It really works and can help you save some DSP48 primitives.

You are right about collecting data into DRAM. I did it the same way:

> For improved performance I thought about reordering address bits of the DRAM so that each DRAM page spans a sub-block (square) of the matrix rather than spanning one or more rows of the matrix (as is the case in row major order), which will maximize the number of operations done on each opened page.

Ooooh no! HBM and RFSoC Xilinx FPGAs are my dream! I really want to use them in my work but I can't, because we have customs restrictions on these FPGA parts. Unfortunately the US imposed sanctions on Russia, so we can't import hi-tech FPGAs and some advanced chips :(


BR Alexander

gabriel-tenma-white commented 5 years ago

That's true. I do wonder what kind of Signals Intelligence work you are doing that manages to fill up 80% of a large Kintex Ultrascale though ;)

I think the closest thing I've done to that is a multi-channel CIC filter that runs at N times the sample rate. It uses shift register "rings" to store integrator state and uses the same number of adders (no multipliers) as a single channel filter, and no muxers.
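One way to picture that time-multiplexed integrator stage (an illustrative behavioral model in Python, not the actual RTL; the ring is a deque standing in for the shift-register ring):

```python
from collections import deque

def multichannel_integrator(samples, channels):
    """One CIC integrator stage shared across channels.

    A single adder serves all channels; per-channel accumulator state
    circulates in a shift-register ring of length `channels`, so no
    multiplexers are needed. Input is channel-interleaved:
    ch0, ch1, ..., ch[C-1], ch0, ...
    """
    ring = deque([0] * channels)
    out = []
    for s in samples:
        acc = ring.popleft() + s   # the one shared adder
        out.append(acc)
        ring.append(acc)           # updated state rotates back into place
    return out

# Two interleaved channels: each output lane is that channel's running sum.
y = multichannel_integrator([1, 10, 2, 20, 3, 30], 2)
```

By the time a channel's state reaches the front of the ring again, exactly `channels` samples have passed, which is what lets the ring replace per-channel registers and muxes.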

> Unfortunately US imposed sanctions on Russia so we can't import Hi-tech FPGAs and some advanced chips :(

This is the shit that makes my blood boil. I avoided working in the US because I refuse to help develop technology that will be kept secret and export restricted. The US government is the ultimate bully and the only one that tries to undermine the entire world's technological development in order to stay number 1. I'm hoping China can get its FPGA business in shape (they are still about a decade behind) and come up with a cheaper DRAM+FPGA solution that at least beats external DDR4.

hukenovs commented 5 years ago

@gabriel-tenma-white Do you use HLS in your projects? It is really fast and gives you the same results for some DSP applications (FIRs, DDS, CORDIC, etc.). Vivado HLS provides pragmas and directives that can be used to optimize your design: you can reduce latency, improve throughput, or cut device resource utilization. I've made some DSP applications using HLS (FFTs, FIRs, CORDIC); it is a really easy method and allows fast testing of your applications. But the main advantage is that with HLS you don't deal with clocks when you write your code in C/C++. Try it if you haven't already.

I saw a BMTI (China) presentation about radiation-hardened FPGAs. It says we will see microchips like the Virtex-7 / Kintex-7 in September 2020. The chips would be fully compatible with the Xilinx developer tools. I suppose we can buy and test them.


BR, Alexander

gabriel-tenma-white commented 5 years ago

I haven't tried HLS yet. I do enough C++ work that I'm tired of the unsafe behavior of the language which makes it easy to make mistakes and for the code to behave in non-obvious ways. It's the same reason I prefer VHDL over Verilog. Maybe I'll try it for more complex projects. Usually I like to have more control over exactly what hardware gets generated because it makes timing closure easier. I have usually resorted to code generation for things that would be tedious to write in VHDL.

That would be nice, do you also know of any Russian FPGAs? ;)