Open gsmecher opened 1 year ago
Here's a little additional colour: I develop the minimax RISC-V implementation (https://github.com/gsmecher/minimax), which directly executes only 16-bit instructions and emulates 32-bit instructions in microcode. Because of this architecture, there is an ~100x performance difference between compressed and uncompressed instructions. The ZC* extensions are critical because more code in the 16-bit space is a massive performance improvement for minimax.
More specifically: Minimax is currently useless for DSP because all multiplies must be emulated. A c.mul instruction brings high-performance 16-bit convolutions within reach. A c.mulh instruction would make performant 32-bit DSP possible in this core. This would open the core up to an entirely new type of application.
I understand that a ~100x performance differential in a weird implementation is a Me problem, not a You problem. I also understand it's late in the cycle for a new instruction. However: there is a perfect encoding candidate right below c.mul (see Table 16.6, RISC-V Unprivileged ISA). Since c.mul implies M, which mandates a 32x32 -> 64-bit multiplier anyways, the implementation cost is basically the multiplexer to select between low and high words of the 64-bit product. From this perspective: is the lack of c.mulh a deliberate decision, or just something nobody has requested yet?
Again, thanks for all your efforts - I am excited about this WG's work, with or without c.mulh.
The c.mulh
would be usually paired with c.mul
and in this case, there needs to be an explicit move due to destructive operand.
Therefore c.mulh
doesn't offer enough size reduction
@jnk0le - that is an excellent point. A synthetic c.mulh using Zcb instructions would be pretty short, and it looks like Zcb has all the right primitives already. Thank you - closing.
On reflection, @jnk0le, I closed this issue too soon. Your comment is predicated on paired c.mul/c.mulh, which is true for wider multiplies. It's not true for DSP algorithms (FFTs, FIRs, correlations, etc.). Because these are not general-purpose algorithms, they don't show up on general-purpose benchmarks.
I am, unfortunately, flatly wrong about the "obvious" encoding space for c.mulh. The slot below c.mulh in Table 16-6 is already consumed in Zc* and it does not look like there's an easy place to put c.mulh even if it were justified.
If you have any suggested benchmarks (preferably open source but IIRC ELF files is all you need) then I guess it should be possible/easy to add support for analyzing the impact of a c.mulh encoding to the hca script in this repository.
Thanks, @tovine: for DSP applications, ARM's CMSIS DSP library 0 seems like a good reference point. CMSIS is Apache licensed and should be directly portable to RISC-V (see e.g. 1).
The following is an example convolution kernel adapted from 2. The loop is unrolled by a factor of 4, but here's the compact version first:
#include <stdint.h>
typedef int32_t q31_t;
typedef int64_t q63_t;
int conv(int k, const q31_t * px, const q31_t * py) {
q63_t sum;
while(k > 0u) {
sum = (q31_t) ((((q63_t) sum << 32) +
((q63_t) * px++ * (*py--))) >> 32);
k--;
}
return sum;
}
I get:
$ clang --target=riscv32 -march=rv32gc -Oz -c -o conv.o conv.c
$ riscv64-unknown-elf-objdump -d conv.o
00000000 <conv>:
0: c911 beqz a0,14 <conv+0x14>
2: 4198 lw a4,0(a1)
4: 421c lw a5,0(a2)
6: 0591 addi a1,a1,4
8: 1671 addi a2,a2,-4
a: 02e79733 mulh a4,a5,a4
e: 96ba add a3,a3,a4
10: 157d addi a0,a0,-1
12: f965 bnez a0,2 <conv+0x2>
14: 8536 mv a0,a3
16: 8082 ret
There are 8 instructions in the non-unrolled loop, of which mulh is the only uncompressed one. Pretty tough to make a case that c.mulh would improve code density here. I am not sure if there's a strong microarchitectural argument, and/or whether that's relevant to this WG anyway.
If I use the 4-way unrolled version of the kernel instead, I get something the following:
$ clang --target=riscv32 -march=rv32gc -Oz -c -o conv4.o conv4.c
$ riscv64-unknown-elf-objdump -d conv4.o
00000000 <conv>:
0: 587d li a6,-1
2: c529 beqz a0,4c <conv+0x4c>
4: 419c lw a5,0(a1)
6: 4218 lw a4,0(a2)
8: 02f71733 mulh a4,a4,a5
c: 0045a283 lw t0,4(a1)
10: ffc62303 lw t1,-4(a2)
14: 98ba add a7,a7,a4
16: 0085a383 lw t2,8(a1)
1a: ff862683 lw a3,-8(a2)
1e: 02531733 mulh a4,t1,t0
22: 0108f7b3 and a5,a7,a6
26: 973e add a4,a4,a5
28: 027698b3 mulh a7,a3,t2
2c: 45dc lw a5,12(a1)
2e: ff462683 lw a3,-12(a2)
32: 01077733 and a4,a4,a6
36: 9746 add a4,a4,a7
38: 01077733 and a4,a4,a6
3c: 02f696b3 mulh a3,a3,a5
40: 00d708b3 add a7,a4,a3
44: 157d addi a0,a0,-1
46: 05c1 addi a1,a1,16
48: 1641 addi a2,a2,-16
4a: fd4d bnez a0,4 <conv+0x4>
4c: 8546 mv a0,a7
4e: 8082 ret
Here, there are 24 instructions in the loop (13 uncompressed, of which 4 are mulh.) There are RVC equivalents to everything except c.mulh - though it looks like it would take a different compiler or handwritten kernel to make a 16-bit c.mulh stand out.
An FFT implementation would be nearly as general-purpose as a convolution, but CMSIS doesn't have one. I'll poke around and see if I can put together something similar.
Ultimately, though, this is somewhat application-specific and I don't know if a benchmark belongs in your tree or not.
Here's an FFT, taken brutally and somewhat arbitrarily from cmsis-dsp.
#include <stdint.h>
typedef int32_t q31_t;
typedef int64_t q63_t;
#define multAcc_32x32_keep32_R(a, x, y) a = (q31_t) (((((q63_t) a) << 32) + ((q63_t) x * y) + 0x80000000LL ) >> 32)
#define multSub_32x32_keep32_R(a, x, y) a = (q31_t) (((((q63_t) a) << 32) - ((q63_t) x * y) + 0x80000000LL ) >> 32)
#define mult_32x32_keep32_R(a, x, y) a = (q31_t) (((q63_t) x * y + 0x80000000LL ) >> 32)
#define multAcc_32x32_keep32(a, x, y) a += (q31_t) (((q63_t) x * y) >> 32)
#define multSub_32x32_keep32(a, x, y) a -= (q31_t) (((q63_t) x * y) >> 32)
#define mult_32x32_keep32(a, x, y) a = (q31_t) (((q63_t) x * y ) >> 32)
void arm_split_rifft_q31(
q31_t * pSrc,
uint32_t fftLen,
const q31_t * pATable,
const q31_t * pBTable,
q31_t * pDst,
uint32_t modifier) {
q31_t outR, outI;
const q31_t *pCoefA, *pCoefB;
q31_t CoefA1, CoefA2, CoefB1;
q31_t *pIn1 = &pSrc[0], *pIn2 = &pSrc[2 * fftLen + 1];
pCoefA = &pATable[0];
pCoefB = &pBTable[0];
while (fftLen > 0U) {
CoefA1 = *pCoefA++;
CoefA2 = *pCoefA;
mult_32x32_keep32_R (outR, *pIn1, CoefA1);
mult_32x32_keep32_R (outI, *pIn1++, -CoefA2);
multAcc_32x32_keep32_R (outR, *pIn1, CoefA2);
multAcc_32x32_keep32_R (outI, *pIn1++, CoefA1);
multAcc_32x32_keep32_R (outR, *pIn2, CoefA2);
CoefB1 = *pCoefB;
multSub_32x32_keep32_R (outI, *pIn2--, CoefB1);
multAcc_32x32_keep32_R (outR, *pIn2, CoefB1);
multAcc_32x32_keep32_R (outI, *pIn2--, CoefA2);
*pDst++ = outR;
*pDst++ = outI;
pCoefB = pCoefB + (modifier * 2);
pCoefA = pCoefA + (modifier * 2 - 1);
fftLen--;
}
}
I get the following:
$ clang --target=riscv32 -march=rv32gc -Oz -c -o cmsis_fft.o cmsis_fft.c
$ riscv64-unknown-elf-objdump -d cmsis_fft.o
00000000 <arm_split_rifft_q31>:
0: 1141 addi sp,sp,-16
2: c622 sw s0,12(sp)
4: c426 sw s1,8(sp)
6: c24a sw s2,4(sp)
8: c04e sw s3,0(sp)
a: 4801 li a6,0
c: 58fd li a7,-1
e: 800002b7 lui t0,0x80000
12: 00359313 slli t1,a1,0x3
16: 932a add t1,t1,a0
18: 0311 addi t1,t1,4
1a: 00379393 slli t2,a5,0x3
1e: c9e5 beqz a1,10e <arm_split_rifft_q31+0x10e>
20: 01060eb3 add t4,a2,a6
24: 000eae03 lw t3,0(t4)
28: 411c lw a5,0(a0)
2a: 01068f33 add t5,a3,a6
2e: 004eae83 lw t4,4(t4)
32: 03c79fb3 mulh t6,a5,t3
36: 03c78433 mul s0,a5,t3
3a: 005404b3 add s1,s0,t0
3e: 0084b433 sltu s0,s1,s0
42: 9fa2 add t6,t6,s0
44: 41d00433 neg s0,t4
48: 02879933 mulh s2,a5,s0
4c: 028787b3 mul a5,a5,s0
50: 00578433 add s0,a5,t0
54: 4144 lw s1,4(a0)
56: 00f437b3 sltu a5,s0,a5
5a: 993e add s2,s2,a5
5c: 011fffb3 and t6,t6,a7
60: 03d499b3 mulh s3,s1,t4
64: 03d48433 mul s0,s1,t4
68: 005407b3 add a5,s0,t0
6c: 0087b7b3 sltu a5,a5,s0
70: 97ce add a5,a5,s3
72: 9fbe add t6,t6,a5
74: 01197933 and s2,s2,a7
78: 03c497b3 mulh a5,s1,t3
7c: 03c484b3 mul s1,s1,t3
80: 00548433 add s0,s1,t0
84: 009434b3 sltu s1,s0,s1
88: 00032403 lw s0,0(t1)
8c: 97a6 add a5,a5,s1
8e: 01278e33 add t3,a5,s2
92: 011fffb3 and t6,t6,a7
96: 03d41933 mulh s2,s0,t4
9a: 03d404b3 mul s1,s0,t4
9e: 005487b3 add a5,s1,t0
a2: 0097b7b3 sltu a5,a5,s1
a6: 000f2483 lw s1,0(t5)
aa: 97ca add a5,a5,s2
ac: 01f78f33 add t5,a5,t6
b0: 011e7e33 and t3,t3,a7
b4: 02849fb3 mulh t6,s1,s0
b8: 02848433 mul s0,s1,s0
bc: 0082b433 sltu s0,t0,s0
c0: ffc32783 lw a5,-4(t1)
c4: 947e add s0,s0,t6
c6: 408e0e33 sub t3,t3,s0
ca: 011f7f33 and t5,t5,a7
ce: 02979fb3 mulh t6,a5,s1
d2: 029784b3 mul s1,a5,s1
d6: 00548433 add s0,s1,t0
da: 00943433 sltu s0,s0,s1
de: 008f84b3 add s1,t6,s0
e2: 9f26 add t5,t5,s1
e4: 011e7e33 and t3,t3,a7
e8: 03d794b3 mulh s1,a5,t4
ec: 03d787b3 mul a5,a5,t4
f0: 00578433 add s0,a5,t0
f4: 00f437b3 sltu a5,s0,a5
f8: 97a6 add a5,a5,s1
fa: 97f2 add a5,a5,t3
fc: 01e72023 sw t5,0(a4)
100: c35c sw a5,4(a4)
102: 15fd addi a1,a1,-1
104: 981e add a6,a6,t2
106: 0521 addi a0,a0,8
108: 1361 addi t1,t1,-8
10a: 0721 addi a4,a4,8
10c: f991 bnez a1,20 <arm_split_rifft_q31+0x20>
10e: 4432 lw s0,12(sp)
110: 44a2 lw s1,8(sp)
112: 4912 lw s2,4(sp)
114: 4982 lw s3,0(sp)
116: 0141 addi sp,sp,16
118: 8082 ret
This implementation looks like a dead end for c.mulh: most of the loop consists of 32-bit instructions (mul/mulh pairs not least among them, per @jnk0le's point above).
Just to be 100% clear: I think this idea is DOA for non-technical reasons (schedule, scope) and technical reasons outside the scope of all my analysis above (lack of available encoding space for c.mulh). It is probably not worth discussing the motivational benchmarks alone if it's moot.
In the initial analysis phase, we generated frequency count of all instructions used in all benchmarks we have used, and before trying to fit to any compressed encoding, the saving from c.mulh were way too low to consider. Like you said this would be more applicable in DSP type application , but would need to prove some how that the code size saving per encoding bit spent is worth while, and I think that would be fairly hard.
Zc* extension is frozen at its current state, so its very unlikely that any new instructions would be added. But if you would like to make a better case for such instruction, I would encourage you to find opensource benchmarks that you think would benefit from it, and PR their compilation instructions for RISC-V and link to their source here thus we can evaluate using them when we work on the next iteration of the extension !
Thanks - this context is helpful, and I understand and agree with what you say. I will keep my eyes open for a justifiable benchmark that supports c.mulh.
A c.mulh instruction to mirror c.mul in zcb would allow the following to be expressed in purely compressed instructions:
There is room in the instruction encodings directly below c.mul (see Table 16-6, RISC-V Unprivileged ISA) - there is a natural spot for c.mulh and it seems a shame not to take advantage of the symmetry here.
Related: #9. The MAC proposal was (reasonably) excluded as out-of-scope here. A c.mulh bookend to c.mul would partly address the same need but fits better within the mandate of this WG.