Open perlindgren opened 7 months ago
An implementation of Zmmul has made its way to master
. As mentioned here, the mul implementation did indeed end up as a couple of DSP slices, so the timing/size hit is reasonable.
LLVM does indeed support Zmmul, this is exposed by Rustc through a generic flag passed directly to LLVM. Unfortunately, Rustc itself does not recognize the extension, meaning any inline assembly containing mul instructions is rejected, however, experiments have shown that multiplication in Rust code itself does result in mul instructions being emitted.
Something interesting here, the compiler seems to really try its best not to emit a mul instruction. This makes sense in most implementations, since mul is often multi-cycle. On Hippo, however, mul is just another single-cycle instruction, so it would be nice if the compiler treated it as one.
Hippomenes currently supports only RVI (integer base ISA).
Likely Rust+LLVM will do a fair job at implementing both integer multiplication and division in software, so maybe we should start by benchmarking the software solutions to get a baseline. Supporting Zmmul might offer a reasonable tradeoff. Could be that Rust+LLVM is actually outperforming an interative div in many cases, who knows?
Alternatively as a non-normative extension we could consider supporting the M (mul/div) or the more FPGA friendly Zmmul (mul only) extension.
Resources:
(As listed by LLVM, M/Zmmul are supported, but I have not tried making a custom Rust target for the Zmmul.)
The muldiv experiments gives some early evaluation using SystemVerilog+Vivado.
Point here is that we should not overthink the mul implementation, we cannot possible do better than Vivado in this case. (The four big blobs are DSP slices).
Well there is a bit more to it of course:
Timing: As the test did not set any timing constraints, there is no timing reported, but it may well be that we need to pipeline the multiplier. It looks like the critical path passes two of the slices, thus a single added stage might suffice. Not exactly sure how this will be best done, perhaps looking at the synthesized output, and re-write the HDL accordingly, or if there is an automated way to do this (hinting Vivado to inject the pipeline stage).
Instruction integration: Should be pretty straightforward. There is a bit of juggling for the MUL,
unsigned*unsigned
,signed*signed
,unsigned*signed
or something along those lines, but nothing alarming.The resource utilization is close to zero as the 4 DSP slices does all the heavy lifting.
Looking at the div on the other hand:
As expected this turns out like a rats nest, Vivado is not capable to leverage DSP slices, and we waste 10% of the ARTY FGPA directly (about the same size as the whole Hippo).
One could think about a bitwise iterative multi-cycle implementation (e.g., 32 iterations or taking a few steps each iteration, yielding 16, 8, 4 etc. with a bit clever design we should be able to infer reasonable slice logic. (There will be some input/output conditioning for handling signed/unsigned cases but nothing dramatic.)
An alternative, yet more complex, design could be to adopt Newton Raphson division. This amounts to normalizing the range, perform iteration until fixed point, restore range. One can reduce the number of iterations required by a look up table (giving an initial guess).
A more detailed explanation of NR is given here.
This seems however quite an effort, and proving correctness might be challenging. Also since Hippomenes primary goal is simplicity, this might overkill.
We can use this issue and the muldiv repository to further discuss/experiment with muldiv.