perlindgren / hippomenes

In love with Atalanta
9 stars 4 forks source link

Supporting M/Zmmul #12

Open perlindgren opened 7 months ago

perlindgren commented 7 months ago

Hippomenes currently supports only RVI (integer base ISA).

Likely Rust+LLVM will do a fair job at implementing both integer multiplication and division in software, so maybe we should start by benchmarking the software solutions to get a baseline. Supporting Zmmul might offer a reasonable tradeoff. Could be that Rust+LLVM is actually outperforming an interative div in many cases, who knows?

Alternatively as a non-normative extension we could consider supporting the M (mul/div) or the more FPGA friendly Zmmul (mul only) extension.

Resources:

(As listed by LLVM, M/Zmmul are supported, but I have not tried making a custom Rust target for the Zmmul.)

The muldiv experiments gives some early evaluation using SystemVerilog+Vivado.

module multest(
    input logic [31:0] x,
    input logic [31:0] y,
    output logic [63:0] res
 );
    assign res = x * y;
endmodule

image

Point here is that we should not overthink the mul implementation, we cannot possible do better than Vivado in this case. (The four big blobs are DSP slices).

Well there is a bit more to it of course:

The resource utilization is close to zero as the 4 DSP slices does all the heavy lifting.

Looking at the div on the other hand:

module divtest(
    input logic [31:0] x,
    input logic [31:0] y,
    output logic [31:0] q,
    output logic [31:0] r
);

    assign q = x / y;
    assign r = x % y;

endmodule

image

As expected this turns out like a rats nest, Vivado is not capable to leverage DSP slices, and we waste 10% of the ARTY FGPA directly (about the same size as the whole Hippo).

One could think about a bitwise iterative multi-cycle implementation (e.g., 32 iterations or taking a few steps each iteration, yielding 16, 8, 4 etc. with a bit clever design we should be able to infer reasonable slice logic. (There will be some input/output conditioning for handling signed/unsigned cases but nothing dramatic.)

An alternative, yet more complex, design could be to adopt Newton Raphson division. This amounts to normalizing the range, perform iteration until fixed point, restore range. One can reduce the number of iterations required by a look up table (giving an initial guess).

A more detailed explanation of NR is given here.

This seems however quite an effort, and proving correctness might be challenging. Also since Hippomenes primary goal is simplicity, this might overkill.

We can use this issue and the muldiv repository to further discuss/experiment with muldiv.

onsdagens commented 6 months ago

An implementation of Zmmul has made its way to master. As mentioned here, the mul implementation did indeed end up as a couple of DSP slices, so the timing/size hit is reasonable.

LLVM does indeed support Zmmul, this is exposed by Rustc through a generic flag passed directly to LLVM. Unfortunately, Rustc itself does not recognize the extension, meaning any inline assembly containing mul instructions is rejected, however, experiments have shown that multiplication in Rust code itself does result in mul instructions being emitted.

Something interesting here, the compiler seems to really try its best not to emit a mul instruction. This makes sense in most implementations, since mul is often multi-cycle. On Hippo, however, mul is just another single-cycle instruction, so it would be nice if the compiler treated it as one.