Supporting M/Zmmul

Hippomenes currently supports only RVI (integer base ISA).

Rust+LLVM will likely do a fair job of implementing both integer multiplication and division in software, so maybe we should start by benchmarking the software solutions to get a baseline. Supporting Zmmul might offer a reasonable tradeoff. It could even be that Rust+LLVM outperforms an iterative divider in many cases; who knows?
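For the software baseline, here is a rough Rust model of what a soft multiply looks like on RV32I: a shift-and-add loop that uses only shifts, adds and masks, and returns the low 32 bits like `MUL`. This is an illustrative assumption, not the code LLVM actually emits (without M the toolchain normally calls a runtime helper such as `__mulsi3`); `soft_mul` is a hypothetical name.

```rust
/// Shift-and-add multiply using only RV32I-style operations (shift, add, and).
/// Returns the low 32 bits of a * b, matching MUL / wrapping multiplication.
pub fn soft_mul(mut a: u32, mut b: u32) -> u32 {
    let mut acc: u32 = 0;
    while b != 0 {
        // If the current bit of b is set, add the shifted multiplicand.
        if b & 1 != 0 {
            acc = acc.wrapping_add(a);
        }
        a <<= 1; // bits shifted out of the top are discarded (mod 2^32)
        b >>= 1;
    }
    acc
}
```

The loop runs once per bit of `b` up to its highest set bit, so the cost is data dependent; a benchmark would need operands of realistic magnitude to give a fair baseline.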

Alternatively, as a non-normative extension, we could consider supporting either the M extension (mul/div) or the more FPGA-friendly Zmmul extension (mul only).

Resources:

(LLVM lists both M and Zmmul as supported, but I have not tried making a custom Rust target for Zmmul.)

The muldiv experiments give an early evaluation using SystemVerilog and Vivado.

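```systemverilog
// 32x32 -> 64-bit unsigned multiply (logic vectors are unsigned);
// Vivado maps the product onto DSP slices.
module multest(
    input  logic [31:0] x,
    input  logic [31:0] y,
    output logic [63:0] res
);
    assign res = x * y;
endmodule
```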

The point here is that we should not overthink the mul implementation; we cannot possibly do better than Vivado in this case. (The four big blobs in the synthesized schematic are DSP slices.)

Well, there is a bit more to it, of course:

- Timing: As the test did not set any timing constraints, no timing is reported, but we may well need to pipeline the multiplier. It looks like the critical path passes through two of the slices, so a single added stage might suffice. I am not exactly sure how this is best done: perhaps by looking at the synthesized output and rewriting the HDL accordingly, or perhaps there is an automated way (hinting Vivado to insert the pipeline stage).
- Instruction integration: Should be pretty straightforward. There is a bit of juggling for the MUL variants (unsigned*unsigned, signed*signed, signed*unsigned, or something along those lines), but nothing alarming.
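To make the MUL juggling concrete, here is a minimal Rust model, assuming the standard RV32M semantics: sign- or zero-extending both operands to 33 bits lets one signed multiplier serve `MUL`, `MULH`, `MULHSU` and `MULHU`, and the instruction only selects the extension and which half of the product to return. `MulOp`, `ext33` and `rv_mul` are hypothetical names for illustration.

```rust
/// The four RV32M multiply instructions.
#[derive(Clone, Copy, Debug)]
pub enum MulOp {
    Mul,    // low 32 bits
    Mulh,   // high 32 bits, signed x signed
    Mulhsu, // high 32 bits, signed rs1 x unsigned rs2
    Mulhu,  // high 32 bits, unsigned x unsigned
}

/// Sign- or zero-extend a 32-bit operand to 33 bits (held in an i128).
fn ext33(x: u32, signed: bool) -> i128 {
    if signed {
        x as i32 as i128
    } else {
        x as i128
    }
}

/// One 33x33 signed multiply serves all four instructions; the opcode only
/// selects the operand extension and which half of the product to return.
pub fn rv_mul(rs1: u32, rs2: u32, op: MulOp) -> u32 {
    let (s1, s2, high) = match op {
        MulOp::Mul => (false, false, false), // low half is independent of signedness
        MulOp::Mulh => (true, true, true),
        MulOp::Mulhsu => (true, false, true),
        MulOp::Mulhu => (false, false, true),
    };
    let p = ext33(rs1, s1) * ext33(rs2, s2); // exact product, fits in 66 bits
    if high {
        (p >> 32) as u32 // bits 63..32
    } else {
        p as u32 // bits 31..0
    }
}
```

In hardware this amounts to one extra extension bit per operand in front of a shared multiplier, plus a mux on the output half.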

The resource utilization is close to zero, as the 4 DSP slices do all the heavy lifting.
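As a rough illustration of why several DSP slices show up, and where a pipeline stage could go, this Rust sketch builds the 64-bit product from four 16x16 partial products and then sums them. The 16-bit split is an assumption chosen for readability: the DSP48E1 multipliers on the Artix-7 are 25x18, so Vivado's actual decomposition differs. `mul_partial_products` is a hypothetical name.

```rust
/// 32x32 -> 64 unsigned multiply built from four 16x16 partial products.
pub fn mul_partial_products(x: u32, y: u32) -> u64 {
    let (xh, xl) = ((x >> 16) as u64, (x & 0xFFFF) as u64);
    let (yh, yl) = ((y >> 16) as u64, (y & 0xFFFF) as u64);

    // Stage 1: four independent 16x16 products (each < 2^32),
    // one per hypothetical multiplier block.
    let hh = xh * yh;
    let hl = xh * yl;
    let lh = xl * yh;
    let ll = xl * yl;

    // A pipeline register here would cut the path between the
    // multipliers and the final adder.

    // Stage 2: align and sum (no overflow, since the result is x * y < 2^64).
    (hh << 32) + ((hl + lh) << 16) + ll
}
```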

Looking at the div on the other hand:

```systemverilog
// Unsigned 32-bit division and remainder, purely combinational.
module divtest(
    input  logic [31:0] x,
    input  logic [31:0] y,
    output logic [31:0] q,
    output logic [31:0] r
);

    assign q = x / y;
    assign r = x % y;

endmodule
```

As expected, this turns out like a rat's nest: Vivado cannot leverage the DSP slices, and we directly waste 10% of the Arty FPGA (about the same size as the whole Hippo).
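Independently of area, the `divtest` module leaves the corner cases open: in SystemVerilog a division by zero yields X, whereas RV32M specifies the results explicitly. Division by zero returns all ones as the quotient and the dividend as the remainder; the signed overflow case -2^31 / -1 returns -2^31 with remainder 0. A small Rust reference model of these semantics, usable as a golden model when testing any divider design, could look like this (the function names are hypothetical):

```rust
/// Golden model for RV32M division semantics (operands as raw register bits).
pub fn rv_divu(x: u32, y: u32) -> u32 {
    if y == 0 { u32::MAX } else { x / y }
}

pub fn rv_remu(x: u32, y: u32) -> u32 {
    if y == 0 { x } else { x % y }
}

pub fn rv_div(x: u32, y: u32) -> u32 {
    if y == 0 {
        u32::MAX // -1
    } else {
        // Truncating division; i32::MIN / -1 wraps to i32::MIN (overflow case).
        (x as i32).wrapping_div(y as i32) as u32
    }
}

pub fn rv_rem(x: u32, y: u32) -> u32 {
    if y == 0 {
        x
    } else {
        // Remainder takes the dividend's sign; i32::MIN % -1 gives 0.
        (x as i32).wrapping_rem(y as i32) as u32
    }
}
```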

One could think about a bitwise iterative multi-cycle implementation (e.g., 32 iterations, or taking a few steps per iteration, yielding 16, 8, 4, etc. iterations). With a bit of clever design, we should be able to infer reasonable slice logic. (There will be some input/output conditioning for handling the signed/unsigned cases, but nothing dramatic.)
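A minimal Rust model of such a divider, assuming plain radix-2 restoring division: each step shifts one dividend bit into the partial remainder and performs a trial subtraction. The hypothetical parameter `steps_per_cycle` chains several steps into one clock cycle, so 1, 2 or 4 steps give 32, 16 or 8 cycles. The signed wrapper sketches the input/output conditioning: divide the magnitudes, then fix the signs, with division by zero handled as in RV32M. Function names are illustrative only.

```rust
/// Radix-2 restoring division on unsigned operands.
/// `steps_per_cycle` shift/subtract steps are chained per clock cycle, so the
/// divider takes 32 / steps_per_cycle cycles. Returns (quotient, remainder, cycles).
pub fn restoring_divu(n: u32, d: u32, steps_per_cycle: u32) -> (u32, u32, u32) {
    assert!(steps_per_cycle > 0 && 32 % steps_per_cycle == 0);
    if d == 0 {
        return (u32::MAX, n, 0); // RV32M divide-by-zero result, short-circuited here
    }
    let mut rem: u64 = 0; // partial remainder (needs one bit more than d)
    let mut q: u32 = 0;
    let mut bit: u32 = 32;
    let mut cycles: u32 = 0;
    while bit > 0 {
        // One clock cycle: `steps_per_cycle` steps of combinational logic.
        for _ in 0..steps_per_cycle {
            bit -= 1;
            // Shift the next dividend bit into the partial remainder.
            rem = (rem << 1) | ((n >> bit) & 1) as u64;
            q <<= 1;
            // Trial subtraction: keep it (and set the quotient bit) if it fits.
            if rem >= d as u64 {
                rem -= d as u64;
                q |= 1;
            }
        }
        cycles += 1;
    }
    (q, rem as u32, cycles)
}

/// Signed DIV/REM via input/output conditioning around the unsigned core.
pub fn restoring_div(n: i32, d: i32, steps_per_cycle: u32) -> (i32, i32) {
    if d == 0 {
        return (-1, n); // RV32M: quotient all ones, remainder = dividend
    }
    // Divide magnitudes (unsigned_abs also handles i32::MIN).
    let (q, r, _) = restoring_divu(n.unsigned_abs(), d.unsigned_abs(), steps_per_cycle);
    // Quotient is negative if the signs differ; the remainder follows the dividend.
    // A magnitude quotient of 2^31 (only from i32::MIN / 1 or / -1) reinterprets
    // as i32::MIN, which is the correct result in both cases.
    let q = if (n < 0) != (d < 0) { (q as i32).wrapping_neg() } else { q as i32 };
    let r = if n < 0 { (r as i32).wrapping_neg() } else { r as i32 };
    (q, r)
}
```

Chaining more steps per cycle trades cycles for a longer combinational path, which is the same knob as the 32/16/8/4-iteration options above.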

An alternative, yet more complex, design could be to adopt Newton-Raphson division. This amounts to normalizing the range, iterating until a fixed point is reached, and restoring the range. The number of iterations required can be reduced by using a lookup table to provide the initial guess.
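For comparison, a Newton-Raphson sketch in Rust under some stated assumptions: unsigned operands only, fixed-point values with 32 fractional bits, and the linear initial guess 48/17 - 32/17 * D in place of a lookup table. That guess has a relative error of at most 1/17, and each iteration roughly squares the error, so three iterations reach about 32 bits. A final correction step then fixes the small residual error in the quotient estimate, which is also where a correctness argument would have to focus. `nr_divu` is a hypothetical name.

```rust
/// Unsigned 32-bit division via Newton-Raphson reciprocal refinement.
/// Returns (quotient, remainder); d == 0 follows RV32M DIVU/REMU semantics.
pub fn nr_divu(n: u32, d: u32) -> (u32, u32) {
    if d == 0 {
        return (u32::MAX, n);
    }
    // 1. Normalize: shift d so its MSB is set, i.e. D = dn / 2^32 is in [0.5, 1).
    let s = d.leading_zeros();
    let dn = ((d as u64) << s) as u128;

    // 2. Initial guess for 1/D (32 fractional bits): 48/17 - 32/17 * D.
    let mut r: u128 = ((48u128 << 32) - 32 * dn) / 17;

    // 3. Newton-Raphson: r <- r * (2 - D * r); each step roughly squares
    //    the relative error, so 3 steps give about 32 correct bits.
    let two: u128 = 2u128 << 32;
    for _ in 0..3 {
        let dr = (dn * r) >> 32; // D * r, 32 fractional bits (always < 2)
        r = (r * (two - dr)) >> 32;
    }

    // 4. Restore range: n / d = n * 2^s / dn, and r is about 2^64 / dn.
    let mut q = ((n as u128) * r) >> (64 - s);

    // 5. Truncation can leave the estimate slightly off; correct it to exact.
    let (n128, d128) = (n as u128, d as u128);
    while q * d128 > n128 {
        q -= 1;
    }
    while (q + 1) * d128 <= n128 {
        q += 1;
    }
    (q as u32, (n128 - q * d128) as u32)
}
```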

A more detailed explanation of NR is given here.

This seems like quite an effort, however, and proving correctness might be challenging. Also, since Hippomenes' primary goal is simplicity, this might be overkill.

We can use this issue and the muldiv repository to further discuss/experiment with muldiv.

perlindgren/hippomenes, issue #12: Supporting M/Zmmul