SIMD instruction for loading values into the register at same time.

riscv / riscv-p-spec

RISC-V Packed SIMD Extension

https://jira.riscv.org/browse/RVG-129

Creative Commons Attribution 4.0 International

138 stars 38 forks source link

SIMD instruction for loading values into the register at same time. #38

Open subhajit26 opened 3 years ago

subhajit26 commented 3 years ago

Hi, Do we have any support for loading multiple data from memory to a big register at once with single load instruction. Let say loading 4 (8-bit data) to 32-bit register. Is there any work around in the packed extension to allow this sort of SIMD operation or is there any intrinsics that I can use for riscv.

Right now I am trying the to analyse the timing of 2 programs with and without SIMD implemenation:

NON_SIMD:

#include <stdio.h>
#include <time.h>       // for clock_t, clock(), 
#include <unistd.h>

int main() {
    int a[4] = {1,2,3,4};
    int b[4] = {1,2,3,4};
    int c[4];

    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];

    printf ("Result_vector=%d\n%d\n%d\n%d\n",c[0],c[1],c[2],c[3]);

    double time_spent = 0.0;

    clock_t begin = clock();

    // do some stuff here
    sleep(3);

    clock_t end = clock();

    // calculate elapsed time by finding difference (end - begin) and
    // dividing the difference by CLOCKS_PER_SEC to convert to seconds
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;

    printf("Time elpased is %f seconds", time_spent);

    return 0;
}

WITH SIMD:

#include <stdio.h>
#include <time.h>       // for clock_t, clock(), 
#include <unistd.h>

int main() {
    int a[4] = {1,2,3,4};
    int b[4] = {1,2,3,4};
    int c[4];
    int i;

    for (i=0; i<4; i++){
        c[i]=a[i]+b[i];
        printf ("Result_vector=%d\n",c[i]);
    }

    double time_spent = 0.0;

    clock_t begin = clock();

    // do some stuff here
    sleep(3);

    clock_t end = clock();

    // calculate elapsed time by finding difference (end - begin) and
    // dividing the difference by CLOCKS_PER_SEC to convert to seconds
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;

    printf("Time elpased is %f seconds", time_spent);

    return 0;
}

The SIMD implemnation is taking more time than non-SIMD implementation in bot software as well as in hardware. I think if we can load the values from memory at same time we can save lot of cycles and according to me will be proper implementation of >SIMD. Please suggest as I am very new to it and I am trying to implement SIMD to accelerate the FFT algo for my master thesis. My hardware Target is Ibex core.

rdolbeau commented 3 years ago

@subhajit26 The proposed "P" (packed simd) extension to RISC-V is different in this regard to most other SIMD instruction sets; it does not use specific registers (such as the XMM registers for SSE) , but instead does SIMD in the 'normal' (general-purpose) registers. So there is no need for specific load/store operations; the regular instructions from I are enough. The "V" proposal, on the other hand, uses specific registers.

You can see an example of using "P" here: https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/pbsad.c , which implement the Sum of Absolute Difference operations in regular C and "P" intrinsics (the reference C function is taken from the FFmpeg library, where many SIMD implementations are available). The same repository also has a Sum of Squared Errors implementation in pbsse.c.

Once compiled to assembly (as it's test code, it's using assembly macros rather than the proper syntax), the inner loop of function pix_abs8_r5vp looks like this:

.L9:
    lw  a1,0(a3)
    lw  a0,0(a2)
    addi    a3,a3,32
    PBSADA reg_a4, reg_a1, reg_a0
    addi    a2,a2,32
    lw  a1,-28(a3)
    lw  a0,-28(a2)
    PBSADA reg_a4, reg_a1, reg_a0
    bne a3,t5,.L9

As you can see, registers a0 and a1 are loaded with 'regular' loads and used in the pbsada instruction from P (see section 5.78 in the P proposal). Accumulation of the sum is done in regular register a4.

HTH

subhajit26 commented 3 years ago

Hi, Is it possible to use kmxda and smds for multiplication of 2 complex numbers. I want to implement the cross multiplication and normal multiplication with add and subtract operation. I am trying to implement this SIMD instruction inline assembly in my FFT algorithm to speed it up. But I am not getting the desired results. Do I need to need to change the functionality according to the below mentioned implementation. Please suggest.

static i16_complex_t mul_i16_complex(i16_complex_t a, i16_complex_t b)
{
    int32_t real = (int32_t)a.real * (int32_t)b.real - (int32_t)a.imag * (int32_t)b.imag;//SMDS
    int32_t imag = (int32_t)a.real * (int32_t)b.imag + (int32_t)a.imag * (int32_t)b.real; //KMXDA

subhajit26 commented 3 years ago

Hi, I see the implementation defined in decode.h . It is automatically deciding the bits for 16 bit LO and 16 bit HI to get the 32 bit result?

#define P_MUL_CROSS_PARAMS(BIT) \
  auto pd = P_FIELD(rd_tmp, i, BIT * 2); \
  auto ps1 = P_FIELD(rs1, i, BIT); \
  auto ps2 = P_FIELD(rs2, (i ^ 1), BIT);

rdolbeau commented 3 years ago

@subhajit26 TL;DR: yes :-)

Instructions like kmxda are indeed made for complex arithmetic. But you need to know how the data are organized in your i16_complex_t̀ type. Assuming it's a pair of 16 bits value packed in 32 bits and that the 'real' part is in the low order 16 bits and the 'imaginary' part is in the high order 16 bits, then this: (int32_t)a.real * (int32_t)b.imag + (int32_t)a.imag * (int32_t)b.real is equivalent to this: (int32_t)a[low] * (int32_t)b[high] + (int32_t)a[high] * (int32_t)b[low] which is exactly what kmxda does. If 'real' and 'imaginary' are reversed in 'high' and 'low', it doesn't change anything (addition is commutative).

but this: (int32_t)a.real * (int32_t)b.real - (int32_t)a.imag * (int32_t)b.imag; is (int32_t)a[low] * (int32_t)b[low] - (int32_t)a[high] * (int32_t)b[high]; which is actually smdrs. If 'real' and 'imaginary' are reversed in 'high' and 'low', then the operation would actually be smds. Both instructions exist because subtraction isn't commutative (but the operands are because they are used symmetrically in the commutative multiplications - smds r,a,b is the same as smds r,b,a; smxds has only one variant, because you can permute operands to reverse the operations).

So you need to know exactly how your data are laid out in memory/registers so you can apply the proper operation to both 16 bits halves in the 32 bits register(s).

subhajit26 commented 3 years ago

Hi, I am trying to implement a custom instruction called absd almost similar to kabsw I want absd instruction to perform the absolute value of the difference between 2 signed 32 bit values. The functionality I have implemented to verify in spike is:

if(sext_xlen(RS1) >= sext_xlen(RS2))
  WRITE_RD(sext_xlen(sext_xlen(RS1) - sext_xlen(RS2)));
else
  WRITE_RD(sext_xlen(sext_xlen(RS2) - sext_xlen(RS1)));

But it says illegal instruction on executing. Is there any way to get this function with p-type extension same as kabsw ? Can I modify the kabsw functionality with something sort of this:

require_extension('P');
int32_t rs1 = P_W(RS1, 0);
int32_t rs2 = P_W(RS2, 0);
if( rs1 >= rs2)
  WRITE_RD(sext_xlen( rs1 - rs2));
else
  WRITE_RD(sext_xlen( rs2 - rs1));

Do I need to mention P_SET_OV(1);, INT32_MAX; andINT32_MIN;s given in the file decode.h ? It will of great help if you can suggest me this functionality without using P extension.

rdolbeau commented 3 years ago

@subhajit26 Can't help with spike. For 32-bits data manipulation, you may want to look at the B extension (bitmanip) instead of P, it has 32-bits MAX[U]/MIN[U] for instance.

subhajit26 commented 3 years ago

Hi, I have certain function which holds 32 bit composite value as arguments and I want to assign them directly to the 32 bit register so that I make the packed instructions such as smdrs and kmxda work for the complex multiplication. Can you please me how to pass the composite data type through inline assembly. Please see the function below: As of now I had to use union to combine the real and imaginary part to 32 bit register to make the packed instruction work. But if you see the inline assembly, I want to directly assign a and b to the input register operands which as of now is throwing me error in tool-chain that the operands does not match the constraint.. Any suggestion will be of great help

static i16_complex_t mul_i16_complex(i16_complex_t a, i16_complex_t b)
{

combine_t combine;
comb_t comb;
combine.a = a;
comb.b = b;

int32_t c = combine.value1;
int32_t d = comb.value2;
int32_t real, imag;

    //int32_t real = (int32_t)a.real * (int32_t)b.real - (int32_t)a.imag * (int32_t)b.imag;
    //int32_t real = (int16_t)a.real * (int16_t)b.real - (int16_t)a.imag * (int16_t)b.imag;

    //Custom Instruction:
     asm volatile
    (
    "smdrs   %[z], %[x], %[y]\n\t"
    : [z] "=r" (real)
    : [x] "r" (c), [y] "r" (d)
    );

    //int32_t imag = (int32_t)a.real * (int32_t)b.imag + (int32_t)a.imag * (int32_t)b.real; 
    //int32_t imag = (int16_t)a.real * (int16_t)b.imag + (int16_t)a.imag * (int16_t)b.real; 

    //Custom Instruction:
    asm volatile
    (
    "kmxda   %[z], %[x], %[y]\n\t"
    : [z] "=r" (imag)
    : [x] "r" (c), [y] "r" (d)
    );

    // Shift with rounding
    real = (real + 0x4000) >> 15;
    imag = (imag + 0x4000) >> 15;

    return (i16_complex_t){real, imag};
}

rdolbeau commented 3 years ago

@subhajit26 You need to tell us how you define your structure for this code to be understood.

Typically you want to end up with something like that, where you use a.cplx.real and a.cplx.imag to access the 16-bits element separately, and a.comb to access the 32-bits combined value:

#include <stdint.h>

typedef union {
    struct {
        uint16_t real;
        uint16_t imag;
    } cplx;
    uint32_t comb;
} i16_complex_t;

i16_complex_t mul_i16_complex(i16_complex_t a, i16_complex_t b)
{
    int32_t real, imag;
    i16_complex_t r;

    //int32_t real = (int32_t)a.real * (int32_t)b.real - (int32_t)a.imag * (int32_t)b.imag;
    //int32_t real = (int16_t)a.real * (int16_t)b.real - (int16_t)a.imag * (int16_t)b.imag;

    //Custom Instruction:
     asm volatile
    (
    "smdrs   %[z], %[x], %[y]\n\t"
    : [z] "=r" (real)
    : [x] "r" (a.comb), [y] "r" (b.comb)
    );

    //int32_t imag = (int32_t)a.real * (int32_t)b.imag + (int32_t)a.imag * (int32_t)b.real; 
    //int32_t imag = (int16_t)a.real * (int16_t)b.imag + (int16_t)a.imag * (int16_t)b.real; 

    //Custom Instruction:
    asm volatile
    (
    "kmxda   %[z], %[x], %[y]\n\t"
    : [z] "=r" (imag)
    : [x] "r" (a.comb), [y] "r" (b.comb)
    );

    // Shift with rounding
    r.cplx.real = (real + 0x4000) >> 15;
    r.cplx.imag = (imag + 0x4000) >> 15;

    return r;
}

Edit: if you have the B extension enabled in the cmpiler, the last step (combining the value in the struct) should use the pack instruction. With P, there's the pkbb16 instruction.

subhajit26 commented 3 years ago

Hi , The structure is defined as below:

typedef struct {
    int16_t real;
    int16_t imag;
} i16_complex_t;

i16_complex_t mul_i16_complex(i16_complex_t a, i16_complex_t b)

So , basically it is 32 bit composite value stored in the form ofi16_complex_t a and i16_complex_t b in the above function. For the packed instruction to work I has to use union but I was searching for an alternative whether I can straight way assign the composite value to the 32 bit register instead of defining the union to combine the value.

rdolbeau commented 3 years ago

@subhajit26 I recommend using a union for this particular use case. It's quite readable, and there's no runtime overhead, the compiler will directly use the general purpose registers.

BTW, for the 'shift with rounding', it seems like P's SRLI32 could be useful. It will save you the constant set-up and the two adds.

Edit: srli32 not 16