Add a section with examples

rofirrim commented 5 months ago

The ARC review suggests that we can improve readability by adding some examples.

My suggestion is make intrinsic versions of the examples in the V extension specification (found in https://github.com/riscv/riscv-v-spec/tree/master/example ).

rofirrim commented 4 months ago

I wrote once a base64 encoder to autovectorize it, but I think it might be an interesting showcase for RVV intrinsics and RVV features (gather, segment store, masked operations).

(Note: the scalar code should be correct but the RVV I have not tested it yet, the autovectorized version worked fine though)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

const char table[] = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K',
                      'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V',
                      'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
                      'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
                      's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2',
                      '3', '4', '5', '6', '7', '8', '9', '+', '/'};

char* tobase64_ref(const char* input, size_t N) {
    assert(N > 0 && "Empty size is not allowed");

    size_t B = (N + 2) / 3;  // floor(N / 3)
    size_t output_size = B * 4;
    char* output = malloc(output_size + 1);
    output[output_size] = '\0';

    size_t j = 0;
    size_t r = N % 3;
    r = r > 0 ? 3 - r : 0;

    for (size_t i = 0; i < N + r; i += 3) {
        uint8_t a = input[i];
        uint8_t b = i + 1 < N ? input[i + 1] : 0;
        uint8_t c = i + 2 < N ? input[i + 2] : 0;

        uint8_t i0 = a >> 2;
        uint8_t i1 = ((a & 0x3) << 4) | (b >> 4);
        uint8_t i2 = (b & 0xf) << 2 | (c >> 6);
        uint8_t i3 = c & 0x3f;

        output[j] = table[i0];
        output[j + 1] = table[i1];
        output[j + 2] = i + 1 < N ? table[i2] : '=';
        output[j + 3] = i + 2 < N ? table[i3] : '=';
        j += 4;
    }

    return output;
}

#include <riscv_vector.h>

char* tobase64_rvv(const char* input, size_t N) {
    assert(N > 0 && "Empty size is not allowed");

    size_t B = (N + 2) / 3;  // floor(N / 3)
    size_t output_size = B * 4;
    char* output = malloc(output_size + 1);
    output[output_size] = '\0';

    size_t vlmax = __riscv_vsetvlmax_e8m1();
    vuint8m1_t vout_pad = __riscv_vmv_v_x_u8m1('=', vlmax);

    size_t i = 0;
    do {
        size_t vl = __riscv_vsetvl_e8m1(B);

        // uint8_t a = input[3*i];
        vuint8m1_t va = __riscv_vlse8_v_u8m1((uint8_t*)(input + 3 * i), 3, vl);

        vuint8m1_t ix3 = __riscv_vmul_vx_u8m1(__riscv_vid_v_u8m1(vl), 3, vl);
        vuint8m1_t ix3plus1 = __riscv_vadd_vx_u8m1(ix3, 1, vl);
        vuint8m1_t ix3plus2 = __riscv_vadd_vx_u8m1(ix3, 2, vl);

        // bool plus1 = 3*i + 1 < N;
        vbool8_t plus1 = __riscv_vmsltu_vx_u8m1_b8(ix3plus1, N, vl);

        // uint8_t b = plus1 ? input[3*i + 1] : 0;
        vuint8m1_t vb =
            __riscv_vlse8_v_u8m1_m(plus1, (uint8_t*)(input + 3 * i + 1), 3, vl);

        // bool plus2 = 3*i + 2 < N;
        vbool8_t plus2 = __riscv_vmsltu_vx_u8m1_b8(ix3plus2, N, vl);

        // uint8_t c = plus2 < N ? input[3*i + 2] : 0;
        vuint8m1_t vc =
            __riscv_vlse8_v_u8m1_m(plus2, (uint8_t*)(input + 3 * i + 2), 3, vl);

        // uint8_t i0 = a >> 2;
        vuint8m1_t vi0 = __riscv_vsrl_vx_u8m1(va, 2, vl);
        // uint8_t i1 = ((a & 0x3) << 4) | (b >> 4);
        vuint8m1_t vi1 = __riscv_vor_vv_u8m1(
            __riscv_vsll_vx_u8m1(__riscv_vand_vx_u8m1(va, 0x3, vl), 4, vl),
            __riscv_vsrl_vx_u8m1(vb, 4, vl), vl);
        // uint8_t i2 = (b & 0xf) << 2 | (c >> 6);
        vuint8m1_t vi2 = __riscv_vor_vv_u8m1(
            __riscv_vsll_vx_u8m1(__riscv_vand_vx_u8m1(vb, 0xf, vl), 2, vl),
            __riscv_vsrl_vx_u8m1(vc, 6, vl), vl);
        // uint8_t i3 = c & 0x3f;
        vuint8m1_t vi3 = __riscv_vand_vx_u8m1(vc, 0x3f, vl);

        // output[4*i] = table[i0];
        // output[4*i + 1] = table[i1];
        // output[4*i + 2] = plus1 ? table[i2] : '=';
        // output[4*i + 3] = plus2 ? table[i3] : '=';
        vuint8m1_t vout0 = __riscv_vluxei8_v_u8m1((uint8_t*)table, vi0, vl);
        vuint8m1_t vout1 = __riscv_vluxei8_v_u8m1((uint8_t*)table, vi1, vl);
        vuint8m1_t vout2 = __riscv_vluxei8_v_u8m1_mu(plus1, vout_pad,
                                                     (uint8_t*)table, vi2, vl);
        vuint8m1_t vout3 = __riscv_vluxei8_v_u8m1_mu(plus2, vout_pad,
                                                     (uint8_t*)table, vi3, vl);
        __riscv_vsseg4e8_v_u8m1x4(
            (int8_t*)(output + 4 * i),
            __riscv_vcreate_v_u8m1x4(vout0, vout1, vout2, vout3), vl);

        i += vl;
        B -= vl;
    } while (B > 0);

    return output;
}

Let me know if this could be an interesting example to add so I make sure no bugs remain.

camel-cdr commented 4 months ago

@rofirrim I don't think this problem is a good fit for example code, since there are a lot of possible ways to implement it, and there is no one that will obviously perform best/good on most hardware.

We explored a few implementation strategies here: https://github.com/WojciechMula/base64simd/issues/9

A slightly better example could be the one mentioned as an example of something that would be hard to implement scalable, in this critique of arm SVE:

Vector based designs, such as SVE, work well for problems like SAXPY, but if your problem involves fixed width units, or even requires different algorithms depending on the width, how feasible is an SVE implementation when you don’t know the vector width? Some problems may be solvable with some rethinking, others may be frustrating to deal with, and others just may not work well at all (for example, the despacer problem may not scale well under SVE, at least without non-trivial redesign).

size_t despace(char *str, size_t len) {
    size_t pos = 0;
    for(size_t i = 0; i < len; i++) {
        char c = str[i];
        if (c == '\r' || c == '\n' || c == ' ') {
            continue;
        }
        str[pos++] = c;
    }
    return pos;
}

size_t despace_rvv(char *str, size_t len) {
    uint8_t *dest = (uint8_t*)str;
    uint8_t *src = (uint8_t*)str;
    for (size_t vl, VL; len > 0; dest += vl, src += VL, len -= VL) {
        VL = __riscv_vsetvl_e8m8(len);
        vuint8m8_t v = __riscv_vle8_v_u8m8(src, VL);
        vbool1_t m1 = __riscv_vmsne_vx_u8m8_b1(v, ' ', VL);
        vbool1_t m2 = __riscv_vmsne_vx_u8m8_b1(v, '\n', VL);
        vbool1_t m3 = __riscv_vmsne_vx_u8m8_b1(v, '\r', VL);
        vbool1_t m = __riscv_vmand_mm_b1(m1, __riscv_vmand_mm_b1(m2, m3, VL), VL);
        v = __riscv_vcompress_vm_u8m8(v, m, VL);
        vl = __riscv_vcpop_m_b1(m, VL);
        __riscv_vse8_v_u8m8(dest, v, vl);
    }
    return dest - (uint8_t*)str;
}

Test program I used for validation: https://godbolt.org/z/Eqf499MsY

In practice the above isn't quite optimal on current hardware, because the current implementations have a disproportionately slow LMUL=8 vcompress.vv implementation. (LMUL=4 will likely be optimal on current hardware) I don't think this trend will continue with new hardware though, because you can easily implement LMUL=8 vcompress.vv using just in lane LMUL=1 primitives as I argue here: https://gist.github.com/camel-cdr/f2cc9cdf6ac9499f069357784f53b324

Something like this, showcasing the segmented load/stores, would probably be a good to include as well.

BTW, should the examples use the overloaded intrinsics?

rofirrim commented 4 months ago

@rofirrim I don't think this problem is a good fit for example code, since there are a lot of possible ways to implement it, and there is no one that will obviously perform best/good on most hardware.

I think the number of ways to implement a problem may not be an ideal metric on its suitabilty as an example. Otherwise we may risk constraining ourselves to very simple examples.

I am happy, though, not adding this example. It was just a suggestion.

A slightly better example could be the one mentioned as an example of something that would be hard to implement scalable, in this critique of arm SVE:

(This is off-topic so I'm not sure we have to discuss it here: I am not much versed in SVE but I don't see what is the fundamental limitation SVE has with this example. It seems to me than rather than using a vector length like we can do in RVV, SVE could use a mask to control the execution of the loop (using something like the whilelt / whilelo I think it is called) and then combining this mask into the mask that controls the execution of the continue.)

Something like this, showcasing the segmented load/stores, would probably be a good to include as well.

Fair, I think some AoS/SoA like algorithm can be useful here? Maybe converting RGB to XYZ or similar thing?

BTW, should the examples use the overloaded intrinsics?

We can add an example using them.

My opinion is that if examples are to be used by someone learning about RVV, an explicit interface may make the example easier to grasph as it involves less "automagic" features. (One could argue, though, that the extra syntactic noise can be confusing but at least the examples with explict interfaces show how the convention looks like).

Thanks for the comments @camel-cdr

riscv-non-isa / rvv-intrinsic-doc

Add a section with examples #319