This PR adds a new implementation for FPGA BDT inference: the conifer Forest Processing Unit (FPU).
Summary from the docs:
The conifer Forest Processing Unit (FPU) is a flexible, fast architecture for BDT inference on FPGAs.
The key difference between the FPU and the other conifer backends, such as HLS and VHDL, is that one FPU bitfile can perform inference for many BDTs, reconfigurable at runtime.
An FPU backend is added to build and interact with FPUs (bitfiles are also made available on the conifer website). The FPU architecture is implemented in HLS, with Xilinx pynq providing the runtime. It has currently been built with Xilinx tools (Vitis HLS, Vitis, Vivado) version 2022.2. Execution has been tested on pynq-z2 and Alveo U200 (with XRT version 2.14.354, platform xilinx_u200_gen3x16_xdma_2_202110_1).
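A minimal sketch of how using the new backend might look, assuming a `conifer.backends.fpu` module with an `auto_config()` helper analogous to the existing HLS backend; the exact names and workflow in this PR may differ, and the scikit-learn model here is only for illustration:

```python
# Hypothetical usage sketch: the exact FPU backend API in this PR may differ.
# convert_from_sklearn and decision_function are standard conifer entry points;
# the 'fpu' backend and its auto_config() are assumed by analogy with the
# existing xilinxhls backend.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import conifer

# Train a toy BDT to convert
X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, max_depth=4).fit(X, y)

# Configure the FPU backend (assumed helper name, mirroring other backends)
cfg = conifer.backends.fpu.auto_config()

# Convert the trained BDT into a conifer model targeting the FPU
model = conifer.converters.convert_from_sklearn(clf, cfg)

# At runtime the model is loaded onto an already-programmed FPU bitfile
# (e.g. on a pynq-z2 or Alveo U200), so no new synthesis is needed
y_fpu = model.decision_function(X)
```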
On an Alveo U200, the measured inference time takes the form `t = m * N + c` for batch size N, where c is around 100 μs and m is around 1 μs per sample. This is corroborated by cosimulation, where the inference time inside the FPU is around 1 μs, with the rest coming from overheads and data movement.
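As a back-of-the-envelope illustration of that latency model (the batch sizes here are arbitrary, not measurements from the PR):

```python
# Worked example of the latency model t = m * N + c using the figures
# quoted above for Alveo U200 (m ≈ 1 μs per sample, c ≈ 100 μs overhead).
m_us = 1.0    # per-sample inference time (microseconds)
c_us = 100.0  # fixed overhead from data movement etc. (microseconds)

for N in (1, 100, 10_000):
    t_us = m_us * N + c_us
    print(f"batch size {N:>6}: ~{t_us:.0f} us total, ~{t_us / N:.2f} us per sample")
```

The fixed overhead c dominates for small batches and is amortised away at large batch sizes.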
This is the first PR, bringing the core functionality; after it, many further developments are planned, including but not limited to:
- multi-class implementation
- multiple trees per Tree Engine
- building much larger FPUs (more TEs) for Alveo
- performance improvements (optimising data transfers)
- 'hyper-threading' (loop optimisations)
- implementation optimised for many (>100) features and classes