This project contains a library of math-related hardware units.
Right now, it contains only "Fpxx" units: floating point with a user-programmable number of exponent and mantissa bits.
This library uses SpinalHDL. If you want to run any of the code here, you first need to install that. Installation instructions can be found here.
Once done, run ./run.sh to generate whichever unit you want to test. Edit this file if you want to run a different test.
(All of this could be streamlined with a better Makefile...)
Then run make sim to run a test.
The Fpxx library supports floating point operations for which the exponent and mantissa sizes can be specified at compile time.
The primary use of this library is for FPGA projects that need floating point, but don't necessarily need all the features and precision of standard 32-bit floating point. By reducing the size of the mantissa and exponent, some floating point operations can be made to map directly onto the hardware multipliers of the DSPs that are often present in today's FPGAs, and the maximum clock speed can be increased significantly.
For example, many FPGAs support 18x18 bit multiplications. By restricting the size of the mantissa, a single hardware multiplier may be sufficient to implement the core operation of a floating point multiplier: a 17-bit stored mantissa plus the implied leading one results in two 18-bit operands, as sketched below.
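As a rough illustration (the names and exact structure here are assumptions, not the library's actual code), this is what the mantissa product at the heart of a floating point multiply looks like in SpinalHDL:

```scala
import spinal.core._

// Sketch only: the mantissa product at the core of a floating point
// multiply. With a 17-bit stored mantissa, prepending the implied
// leading 1 of a normalized float gives two 18-bit operands, so the
// product below maps onto a single 18x18 DSP multiplier.
class MantissaMul(mantSize: Int = 17) extends Component {
  val io = new Bundle {
    val mantA   = in  UInt(mantSize bits)
    val mantB   = in  UInt(mantSize bits)
    val product = out UInt(2 * (mantSize + 1) bits)
  }
  val opA = U"1" @@ io.mantA    // 18 bits: implied leading one + mantissa
  val opB = U"1" @@ io.mantB
  io.product := opA * opB       // 36-bit product, one DSP on many FPGAs
}
```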
Goals:
SpinalHDL
The code is written in SpinalHDL instead of Verilog or VHDL. This makes it much easier to write generic code with programmable widths and pipeline stages. It also cuts back on boilerplate code.
That said, it's almost trivial to generate the Verilog or VHDL for use in your own project. And if that's too much effort, a number of configurations are pre-generated and stored as Verilog and VHDL in the repository, so they can be copied straight into your own project.
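For example, generating the Verilog for one configuration takes only a few lines of Scala. (FpxxMul and FpxxConfig are assumed names here; check the repository for the exact classes and parameters.)

```scala
import spinal.core._

// Sketch of a generator main; FpxxMul/FpxxConfig are assumed names.
object GenFpxx {
  def main(args: Array[String]): Unit = {
    // 8-bit exponent, 23-bit mantissa, matching IEEE fp32 field sizes.
    SpinalVerilog(new FpxxMul(FpxxConfig(8, 23)))
    // Use SpinalVhdl(...) instead to emit VHDL.
  }
}
```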
Floating point support for all basic operations
At a minimum, add, multiply, and divide should work with acceptable accuracy, whatever that means.
For additional operations (e.g. sqrt and 1/sqrt), accuracy may very well be completely unacceptable: depending on my use cases, a small lookup table could be sufficient, and the library won't have a better solution.
User-programmable mantissa and exponent size
There are some limitations. For example, FpxxDiv currently requires a mantissa with an odd number of bits.
User-programmable size of various lookup tables or internal results
The user may want to specify a particular mantissa size, but still restrict the precision of select operations when it's clear that the full precision won't be needed.
For example, one may want to use a 20-bit mantissa in general, but restrict multiplications to 17 or 18 bits to map onto a single FPGA DSP multiplier.
Similarly, the divide operation uses a lookup table. For certain input ranges, this lookup table may not need to be as large and precise as recommended for maximum precision.
Where possible, the library provides knobs to play with this; a hypothetical example is sketched below.
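Such a configuration could look like this (illustrative names only, not the library's actual API):

```scala
object ConfigExample {
  // Illustrative only: a config carrying both the user-visible mantissa
  // size and a reduced internal multiply width.
  case class FpxxMulConfigSketch(
    expSize     : Int,  // exponent bits
    mantSize    : Int,  // user-visible mantissa bits
    mulMantSize : Int   // mantissa bits actually fed to the multiplier
  )

  // 20-bit mantissa overall, but multiplications truncated to 17 bits so
  // the (17+1)x(17+1)-bit product still fits one 18x18 DSP multiplier.
  val cfg = FpxxMulConfigSketch(expSize = 6, mantSize = 20, mulMantSize = 17)
}
```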
Support for NaN, Infinity, and sign checking
It's important that NaN and Infinity values get propagated through the pipeline, to avoid cases where these kinds of values alias into a real value. A NaN should be generated for operations such as taking the square root of a negative number. Overflows or division by zero result in Infinity.
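A sketch of what such special case detection can look like, assuming an IEEE-style encoding where an all-ones exponent marks NaN and Infinity (illustrative, not the library's actual code):

```scala
import spinal.core._

// Sketch: detect NaN/Infinity on two operands so they can be propagated
// to the result instead of aliasing into an ordinary value. Assumes an
// IEEE-style encoding: all-ones exponent with zero mantissa => Infinity,
// all-ones exponent with non-zero mantissa => NaN.
class SpecialCases(expSize: Int, mantSize: Int) extends Component {
  val io = new Bundle {
    val expA, expB   = in UInt(expSize bits)
    val mantA, mantB = in UInt(mantSize bits)
    val isNan, isInf = out Bool()
  }
  val expMax = (1 << expSize) - 1
  val aNan = io.expA === expMax && io.mantA =/= 0
  val bNan = io.expB === expMax && io.mantB =/= 0
  val aInf = io.expA === expMax && io.mantA === 0
  val bInf = io.expB === expMax && io.mantB === 0
  io.isNan := aNan || bNan               // NaN wins over Infinity
  io.isInf := (aInf || bInf) && !io.isNan
}
```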
One result per cycle
The library is initially designed for use cases where one result is needed per clock cycle.
User-programmable pipeline depth
For each instance, the user can control the number of intermediate pipeline stages. This makes it possible to trade off between clock speed and pipeline latency.
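One common way to implement this in SpinalHDL is to fold a programmable number of RegNext stages over a signal; a minimal sketch (the library's actual mechanism may differ):

```scala
import spinal.core._

// Sketch: insert `stages` register stages on any signal. Instantiating
// a unit with stages = 0 gives a purely combinatorial path (low latency,
// low clock speed); more stages raise both latency and achievable clock.
object PipeSketch {
  def pipe[T <: Data](sig: T, stages: Int): T =
    (0 until stages).foldLeft(sig)((s, _) => RegNext(s))
}
```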
C++ model
There is a C++ template class with an implementation of the Fpxx modules.
This can be very useful to first create a C++ proof of concept of your design before implementing it in hardware.
The goal is for the C++ model and the hardware model to be bit exact (though this might not always be the case.)
Testbench
A testbench with directed and random vectors is provided that compares a configuration with a 23-bit mantissa and 8-bit exponent against the standard IEEE fp32 operations of your PC.
The testbench ignores differences that are due to the limitations of the library (e.g. denormals, rounding differences, etc.).
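The host-side reference values for such a comparison can be computed directly with native floats; a small helper along these lines (illustrative, not necessarily how the actual testbench is written):

```scala
// Illustrative helper: reinterpret raw 32-bit words as IEEE fp32 so
// hardware results can be checked against the PC's native operations.
object Fp32Ref {
  def bitsToFloat(bits: Int): Float = java.lang.Float.intBitsToFloat(bits)
  def floatToBits(f: Float): Int    = java.lang.Float.floatToRawIntBits(f)

  // Expected bit pattern for a multiply test vector, computed on the host.
  def expectedMul(aBits: Int, bBits: Int): Int =
    floatToBits(bitsToFloat(aBits) * bitsToFloat(bBits))
}
```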
Non-goals:
Support for denormals
Denormals require quite a bit of additional logic for often little benefit. Support for them may be added later, but it's not there at this time.
When a denormal is encountered on an input, it is immediately clamped to zero. Denormal results are replaced by a zero as well.
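A sketch of that flush-to-zero behavior (illustrative, assuming an IEEE-style encoding where a zero exponent field marks a denormal):

```scala
import spinal.core._

// Sketch: flush a denormal input to a true zero. A zero exponent with a
// non-zero mantissa marks a denormal; clearing the mantissa (the
// exponent is already zero) yields zero.
class FlushDenormal(expSize: Int, mantSize: Int) extends Component {
  val io = new Bundle {
    val expIn   = in  UInt(expSize bits)
    val mantIn  = in  UInt(mantSize bits)
    val expOut  = out UInt(expSize bits)
    val mantOut = out UInt(mantSize bits)
  }
  val isDenormal = io.expIn === 0 && io.mantIn =/= 0
  io.expOut  := io.expIn
  io.mantOut := Mux(isDenormal, U(0, mantSize bits), io.mantIn)
}
```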
(Correct) rounding
Rounding is a surprisingly expensive operation and hard to get really right. At this moment, it is not supported at all. This definitely has an impact on precision.
Correct handling of negative and positive zeros
For some operations, negative and positive zeros are dealt with correctly, but not for all of them.
References:

Simplified Floating Point for DSP
Cornell student project with C code and Verilog.
Create custom VHDL floating point cores of variable size.
Old user guide. Two's complement floating point section starts at page 4-4.
Links to conversion code.
Division

Variable Precision Floating Point Division and Square Root
Very interesting presentation on how to implement division and square root on an FPGA.
The thesis behind this presentation can be found here.
Another thesis implementing this kind of divider, with (bad) source code.
A Pipelined Divider with a Small Lookup Table
Paper that describes a divider similar to the one in the presentation above, but with a smaller lookup table and more multipliers.
Fast Division Algorithm with a Small Lookup Table
Paper that is referenced by the two papers above as the main inspiration for the LUT + 2 multipliers division operation; the core identity is sketched below.
Includes a detailed mathematical derivation and error analysis.
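As a sketch of that idea, following the usual presentation of this method: split the divisor Y into its upper bits Y_h and lower bits Y_l (Y = Y_h + Y_l, with Y_l much smaller than Y_h), then

```latex
\frac{X}{Y}
  = \frac{X\,(Y_h - Y_l)}{(Y_h + Y_l)(Y_h - Y_l)}
  = \frac{X\,(Y_h - Y_l)}{Y_h^2 - Y_l^2}
  \approx \frac{X\,(Y_h - Y_l)}{Y_h^2}
```

The lookup table stores 1/Y_h^2, and the two multipliers compute X(Y_h - Y_l) and its product with the table value. Since Y_l is small relative to Y_h, the neglected Y_l^2 term keeps the relative error on the order of (Y_l/Y_h)^2.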
Square Root

Matlab - Implement Fixed-Point Square Root Using Lookup Table
Matlab code for a fixed-point square root lookup table.
Variable Precision Floating Point Division and Square Root
Uses a combination of table lookup and a number of multipliers for square root. See the same paper under the 'Division' section for related information.
Implementation of Single Precision Floating Point Square Root on FPGAs
Shows a simple iterative implementation and a pipelined version, both for integer-only and floating point.
The FP32 version requires 15 pipeline stages instead of 24, because some stages are so small that they can be collapsed.
Does not use a lookup table or multiplier, just a bunch of adders.
Parallel-Array Implementations of A Non-Restoring Square Root Algorithm
An Optimized Square Root Algorithm for Implementation in FPGA Hardware
Seems to be equivalent to the previous one.
An Efficient Implementation of the Non-Restoring Square Root Algorithm in Gate Level
Paper that is referenced by the papers above as their main inspiration.
Has a detailed mathematical derivation of how things work.
Methods of Computing Square Roots
Wikipedia.
Not very useful.
Simple Seed Architectures for Reciprocal and Square Root Reciprocal
Not very useful.
Fixed-Point Implementations of the Reciprocal, Square Root and Reciprocal Square Root Functions
Leading Zero Counting

Modular Design of Fast Leading Zeros Counting Circuit
Very fast and low-area regular leading zero counting implementation. (For comparison, a naive behavioral counter is sketched at the end of this section.)
Stack Exchange Hierarchical Solution
Neat implementation, but apparently not nearly as area- and speed-efficient as the implementation of the previous bullet point. (See also this video.)
Leading-Zero Anticipatory Logic for High-Speed Floating Point Addition (1995)
Leading Zero Anticipation and Detection - A Comparison of Methods (2001)
Hybrid LZA: A Near Optimal Implementation of the Leading Zero Anticipator (2009)
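For comparison with the optimized circuits in the papers above, a naive behavioral leading-zero counter in SpinalHDL looks like this (the synthesizer infers a priority encoder; the papers describe faster, more regular structures):

```scala
import spinal.core._

// Naive leading-zero counter: scan from the MSB down. In SpinalHDL the
// last matching `when` wins, so the result corresponds to the highest
// set bit; an all-zero input yields `width`.
class Lzc(width: Int) extends Component {
  val io = new Bundle {
    val din = in  Bits(width bits)
    val lz  = out UInt(log2Up(width + 1) bits)
  }
  io.lz := U(width, log2Up(width + 1) bits)  // default: input is all zeros
  for (i <- 0 until width) {
    when(io.din(i)) {
      io.lz := U(width - 1 - i, log2Up(width + 1) bits)
    }
  }
}
```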