c.mulh instruction - Githubissues

gsmecher commented 1 year ago

A c.mulh instruction to mirror c.mul in zcb would allow the following to be expressed in purely compressed instructions:

32-bit DSP (dot product/convolution/correlation)
wider multiplies (64 bits and upwards)

There is room in the instruction encodings directly below c.mul (see Table 16-6, RISC-V Unprivileged ISA) - there is a natural spot for c.mulh and it seems a shame not to take advantage of the symmetry here.

Related: #9. The MAC proposal was (reasonably) excluded as out-of-scope here. A c.mulh bookend to c.mul would partly address the same need but fits better within the mandate of this WG.

gsmecher commented 1 year ago

Here's a little additional colour: I develop the minimax RISC-V implementation (https://github.com/gsmecher/minimax), which directly executes only 16-bit instructions and emulates 32-bit instructions in microcode. Because of this architecture, there is an ~100x performance difference between compressed and uncompressed instructions. The ZC* extensions are critical because more code in the 16-bit space is a massive performance improvement for minimax.

More specifically: Minimax is currently useless for DSP because all multiplies must be emulated. A c.mul instruction brings high-performance 16-bit convolutions within reach. A c.mulh instruction would make performant 32-bit DSP possible in this core. This would open the core up to an entirely new type of application.

I understand that a ~100x performance differential in a weird implementation is a Me problem, not a You problem. I also understand it's late in the cycle for a new instruction. However: there is a perfect encoding candidate right below c.mul (see Table 16.6, RISC-V Unprivileged ISA). Since c.mul implies M, which mandates a 32x32 -> 64-bit multiplier anyways, the implementation cost is basically the multiplexer to select between low and high words of the 64-bit product. From this perspective: is the lack of c.mulh a deliberate decision, or just something nobody has requested yet?

Again, thanks for all your efforts - I am excited about this WG's work, with or without c.mulh.

jnk0le commented 1 year ago

The c.mulh would be usually paired with c.mul and in this case, there needs to be an explicit move due to destructive operand. Therefore c.mulh doesn't offer enough size reduction

gsmecher commented 1 year ago

@jnk0le - that is an excellent point. A synthetic c.mulh using Zcb instructions would be pretty short, and it looks like Zcb has all the right primitives already. Thank you - closing.

gsmecher commented 1 year ago

On reflection, @jnk0le, I closed this issue too soon. Your comment is predicated on paired c.mul/c.mulh, which is true for wider multiplies. It's not true for DSP algorithms (FFTs, FIRs, correlations, etc.). Because these are not general-purpose algorithms, they don't show up on general-purpose benchmarks.

I am, unfortunately, flatly wrong about the "obvious" encoding space for c.mulh. The slot below c.mulh in Table 16-6 is already consumed in Zc* and it does not look like there's an easy place to put c.mulh even if it were justified.

tovine commented 1 year ago

If you have any suggested benchmarks (preferably open source but IIRC ELF files is all you need) then I guess it should be possible/easy to add support for analyzing the impact of a c.mulh encoding to the hca script in this repository.

gsmecher commented 1 year ago

Thanks, @tovine: for DSP applications, ARM's CMSIS DSP library 0 seems like a good reference point. CMSIS is Apache licensed and should be directly portable to RISC-V (see e.g. 1).

The following is an example convolution kernel adapted from 2. The loop is unrolled by a factor of 4, but here's the compact version first:

#include <stdint.h>

typedef int32_t q31_t;
typedef int64_t q63_t;

int conv(int k, const q31_t * px, const q31_t * py) {
    q63_t sum;
    while(k > 0u) {
      sum = (q31_t) ((((q63_t) sum << 32) +
                      ((q63_t) * px++ * (*py--))) >> 32);

      k--;
    }
    return sum;
}

I get:

$ clang --target=riscv32 -march=rv32gc -Oz -c -o conv.o conv.c
$ riscv64-unknown-elf-objdump -d conv.o
00000000 <conv>:
   0:   c911                    beqz    a0,14 <conv+0x14>
   2:   4198                    lw      a4,0(a1)
   4:   421c                    lw      a5,0(a2)
   6:   0591                    addi    a1,a1,4
   8:   1671                    addi    a2,a2,-4
   a:   02e79733                mulh    a4,a5,a4
   e:   96ba                    add     a3,a3,a4
  10:   157d                    addi    a0,a0,-1
  12:   f965                    bnez    a0,2 <conv+0x2>
  14:   8536                    mv      a0,a3
  16:   8082                    ret

There are 8 instructions in the non-unrolled loop, of which mulh is the only uncompressed one. Pretty tough to make a case that c.mulh would improve code density here. I am not sure if there's a strong microarchitectural argument, and/or whether that's relevant to this WG anyway.

If I use the 4-way unrolled version of the kernel instead, I get something the following:

$ clang --target=riscv32 -march=rv32gc -Oz -c -o conv4.o conv4.c
$ riscv64-unknown-elf-objdump -d conv4.o
00000000 <conv>:
   0:   587d                    li      a6,-1
   2:   c529                    beqz    a0,4c <conv+0x4c>
   4:   419c                    lw      a5,0(a1)
   6:   4218                    lw      a4,0(a2)
   8:   02f71733                mulh    a4,a4,a5
   c:   0045a283                lw      t0,4(a1)
  10:   ffc62303                lw      t1,-4(a2)
  14:   98ba                    add     a7,a7,a4
  16:   0085a383                lw      t2,8(a1)
  1a:   ff862683                lw      a3,-8(a2)
  1e:   02531733                mulh    a4,t1,t0
  22:   0108f7b3                and     a5,a7,a6
  26:   973e                    add     a4,a4,a5
  28:   027698b3                mulh    a7,a3,t2
  2c:   45dc                    lw      a5,12(a1)
  2e:   ff462683                lw      a3,-12(a2)
  32:   01077733                and     a4,a4,a6
  36:   9746                    add     a4,a4,a7
  38:   01077733                and     a4,a4,a6
  3c:   02f696b3                mulh    a3,a3,a5
  40:   00d708b3                add     a7,a4,a3
  44:   157d                    addi    a0,a0,-1
  46:   05c1                    addi    a1,a1,16
  48:   1641                    addi    a2,a2,-16
  4a:   fd4d                    bnez    a0,4 <conv+0x4>
  4c:   8546                    mv      a0,a7
  4e:   8082                    ret

Here, there are 24 instructions in the loop (13 uncompressed, of which 4 are mulh.) There are RVC equivalents to everything except c.mulh - though it looks like it would take a different compiler or handwritten kernel to make a 16-bit c.mulh stand out.

An FFT implementation would be nearly as general-purpose as a convolution, but CMSIS doesn't have one. I'll poke around and see if I can put together something similar.

Ultimately, though, this is somewhat application-specific and I don't know if a benchmark belongs in your tree or not.

gsmecher commented 1 year ago

Here's an FFT, taken brutally and somewhat arbitrarily from cmsis-dsp.

#include <stdint.h>

typedef int32_t q31_t;
typedef int64_t q63_t;

#define multAcc_32x32_keep32_R(a, x, y) a = (q31_t) (((((q63_t) a) << 32) + ((q63_t) x * y) + 0x80000000LL ) >> 32)
#define multSub_32x32_keep32_R(a, x, y) a = (q31_t) (((((q63_t) a) << 32) - ((q63_t) x * y) + 0x80000000LL ) >> 32)
#define mult_32x32_keep32_R(a, x, y) a = (q31_t) (((q63_t) x * y + 0x80000000LL ) >> 32)
#define multAcc_32x32_keep32(a, x, y) a += (q31_t) (((q63_t) x * y) >> 32)
#define multSub_32x32_keep32(a, x, y) a -= (q31_t) (((q63_t) x * y) >> 32)
#define mult_32x32_keep32(a, x, y) a = (q31_t) (((q63_t) x * y ) >> 32)

void arm_split_rifft_q31(
        q31_t * pSrc,
        uint32_t fftLen,
  const q31_t * pATable,
  const q31_t * pBTable,
        q31_t * pDst,
        uint32_t modifier) {       
        q31_t outR, outI;
  const q31_t *pCoefA, *pCoefB;
        q31_t CoefA1, CoefA2, CoefB1;
        q31_t *pIn1 = &pSrc[0], *pIn2 = &pSrc[2 * fftLen + 1];

  pCoefA = &pATable[0];
  pCoefB = &pBTable[0];

  while (fftLen > 0U) {
     CoefA1 = *pCoefA++;
     CoefA2 = *pCoefA;

     mult_32x32_keep32_R (outR, *pIn1, CoefA1);
     mult_32x32_keep32_R (outI, *pIn1++, -CoefA2);
     multAcc_32x32_keep32_R (outR, *pIn1, CoefA2);
     multAcc_32x32_keep32_R (outI, *pIn1++, CoefA1);
     multAcc_32x32_keep32_R (outR, *pIn2, CoefA2);
     CoefB1 = *pCoefB;

     multSub_32x32_keep32_R (outI, *pIn2--, CoefB1);
     multAcc_32x32_keep32_R (outR, *pIn2, CoefB1);
     multAcc_32x32_keep32_R (outI, *pIn2--, CoefA2);

     *pDst++ = outR;
     *pDst++ = outI;

     pCoefB = pCoefB + (modifier * 2);
     pCoefA = pCoefA + (modifier * 2 - 1);

     fftLen--;
  }
}

I get the following:

$ clang --target=riscv32 -march=rv32gc -Oz -c -o cmsis_fft.o cmsis_fft.c
$ riscv64-unknown-elf-objdump -d cmsis_fft.o
00000000 <arm_split_rifft_q31>:
   0:   1141                    addi    sp,sp,-16
   2:   c622                    sw  s0,12(sp)
   4:   c426                    sw  s1,8(sp)
   6:   c24a                    sw  s2,4(sp)
   8:   c04e                    sw  s3,0(sp)
   a:   4801                    li  a6,0
   c:   58fd                    li  a7,-1
   e:   800002b7            lui t0,0x80000
  12:   00359313            slli    t1,a1,0x3
  16:   932a                    add t1,t1,a0
  18:   0311                    addi    t1,t1,4
  1a:   00379393            slli    t2,a5,0x3
  1e:   c9e5                    beqz    a1,10e <arm_split_rifft_q31+0x10e>
  20:   01060eb3            add t4,a2,a6
  24:   000eae03            lw  t3,0(t4)
  28:   411c                    lw  a5,0(a0)
  2a:   01068f33            add t5,a3,a6
  2e:   004eae83            lw  t4,4(t4)
  32:   03c79fb3            mulh    t6,a5,t3
  36:   03c78433            mul s0,a5,t3
  3a:   005404b3            add s1,s0,t0
  3e:   0084b433            sltu    s0,s1,s0
  42:   9fa2                    add t6,t6,s0
  44:   41d00433            neg s0,t4
  48:   02879933            mulh    s2,a5,s0
  4c:   028787b3            mul a5,a5,s0
  50:   00578433            add s0,a5,t0
  54:   4144                    lw  s1,4(a0)
  56:   00f437b3            sltu    a5,s0,a5
  5a:   993e                    add s2,s2,a5
  5c:   011fffb3            and t6,t6,a7
  60:   03d499b3            mulh    s3,s1,t4
  64:   03d48433            mul s0,s1,t4
  68:   005407b3            add a5,s0,t0
  6c:   0087b7b3            sltu    a5,a5,s0
  70:   97ce                    add a5,a5,s3
  72:   9fbe                    add t6,t6,a5
  74:   01197933            and s2,s2,a7
  78:   03c497b3            mulh    a5,s1,t3
  7c:   03c484b3            mul s1,s1,t3
  80:   00548433            add s0,s1,t0
  84:   009434b3            sltu    s1,s0,s1
  88:   00032403            lw  s0,0(t1)
  8c:   97a6                    add a5,a5,s1
  8e:   01278e33            add t3,a5,s2
  92:   011fffb3            and t6,t6,a7
  96:   03d41933            mulh    s2,s0,t4
  9a:   03d404b3            mul s1,s0,t4
  9e:   005487b3            add a5,s1,t0
  a2:   0097b7b3            sltu    a5,a5,s1
  a6:   000f2483            lw  s1,0(t5)
  aa:   97ca                    add a5,a5,s2
  ac:   01f78f33            add t5,a5,t6
  b0:   011e7e33            and t3,t3,a7
  b4:   02849fb3            mulh    t6,s1,s0
  b8:   02848433            mul s0,s1,s0
  bc:   0082b433            sltu    s0,t0,s0
  c0:   ffc32783            lw  a5,-4(t1)
  c4:   947e                    add s0,s0,t6
  c6:   408e0e33            sub t3,t3,s0
  ca:   011f7f33            and t5,t5,a7
  ce:   02979fb3            mulh    t6,a5,s1
  d2:   029784b3            mul s1,a5,s1
  d6:   00548433            add s0,s1,t0
  da:   00943433            sltu    s0,s0,s1
  de:   008f84b3            add s1,t6,s0
  e2:   9f26                    add t5,t5,s1
  e4:   011e7e33            and t3,t3,a7
  e8:   03d794b3            mulh    s1,a5,t4
  ec:   03d787b3            mul a5,a5,t4
  f0:   00578433            add s0,a5,t0
  f4:   00f437b3            sltu    a5,s0,a5
  f8:   97a6                    add a5,a5,s1
  fa:   97f2                    add a5,a5,t3
  fc:   01e72023            sw  t5,0(a4)
 100:   c35c                    sw  a5,4(a4)
 102:   15fd                    addi    a1,a1,-1
 104:   981e                    add a6,a6,t2
 106:   0521                    addi    a0,a0,8
 108:   1361                    addi    t1,t1,-8
 10a:   0721                    addi    a4,a4,8
 10c:   f991                    bnez    a1,20 <arm_split_rifft_q31+0x20>
 10e:   4432                    lw  s0,12(sp)
 110:   44a2                    lw  s1,8(sp)
 112:   4912                    lw  s2,4(sp)
 114:   4982                    lw  s3,0(sp)
 116:   0141                    addi    sp,sp,16
 118:   8082                    ret

This implementation looks like a dead end for c.mulh: most of the loop consists of 32-bit instructions (mul/mulh pairs not least among them, per @jnk0le's point above).

gsmecher commented 1 year ago

Just to be 100% clear: I think this idea is DOA for non-technical reasons (schedule, scope) and technical reasons outside the scope of all my analysis above (lack of available encoding space for c.mulh). It is probably not worth discussing the motivational benchmarks alone if it's moot.

abukharmeh commented 1 year ago

In the initial analysis phase, we generated frequency count of all instructions used in all benchmarks we have used, and before trying to fit to any compressed encoding, the saving from c.mulh were way too low to consider. Like you said this would be more applicable in DSP type application , but would need to prove some how that the code size saving per encoding bit spent is worth while, and I think that would be fairly hard.

Zc* extension is frozen at its current state, so its very unlikely that any new instructions would be added. But if you would like to make a better case for such instruction, I would encourage you to find opensource benchmarks that you think would benefit from it, and PR their compilation instructions for RISC-V and link to their source here thus we can evaluate using them when we work on the next iteration of the extension !

gsmecher commented 1 year ago

Thanks - this context is helpful, and I understand and agree with what you say. I will keep my eyes open for a justifiable benchmark that supports c.mulh.

riscvarchive / riscv-code-size-reduction

c.mulh instruction #208