llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.19k stars 11.14k forks source link

RISC-V vector intrinsics construct unshared constant-zero temporaries #78647

Open sh1boot opened 6 months ago

sh1boot commented 6 months ago

If you need two vectors of zeroes with different types, you might write something like:

int f(vuint8m1_t v8, vuint16m1_t v16) {
    const vuint8m1_t z8 = __riscv_vmv_v_x_u8m1(0, __riscv_vsetvlmax_e8m1());
    const vuint16m1_t z16 = __riscv_vmv_v_x_u16m1(0, __riscv_vsetvlmax_e16m1());

    return __riscv_vmv_x(__riscv_vredsum(v8, z8, __riscv_vsetvlmax_e8m1()))
         + __riscv_vmv_x(__riscv_vredsum(v16, z16, __riscv_vsetvlmax_e16m1()));
}

Or you might write the equivalent expressions inline in other code. Either way this results in separate temporaries -- a wasted register and an unnecessary vsetvli/vmv.vx pair.

You can work around it with this:

int f(vuint8m1_t v8, vuint16m1_t v16) {
    const vuint8m1_t z8 = __riscv_vmv_v_x_u8m1(0, __riscv_vsetvlmax_e8m1());
    const vuint16m1_t z16 = __riscv_vreinterpret_u16m1(z8);

    return __riscv_vmv_x(__riscv_vredsum(v8, z8, __riscv_vsetvlmax_e8m1()))
         + __riscv_vmv_x(__riscv_vredsum(v16, z16, __riscv_vsetvlmax_e16m1()));
}

It'd be helpful to fold these together if possible. I'm not sure what all the constraints are.

llvmbot commented 6 months ago

@llvm/issue-subscribers-backend-risc-v

Author: SH (sh1boot)

If you need two vectors of zeroes with different types, you might write something like: ```C int f(vuint8m1_t v8, vuint16m1_t v16) { const vuint8m1_t z8 = __riscv_vmv_v_x_u8m1(0, __riscv_vsetvlmax_e8m1()); const vuint16m1_t z16 = __riscv_vmv_v_x_u16m1(0, __riscv_vsetvlmax_e16m1()); return __riscv_vmv_x(__riscv_vredsum(v8, z8, __riscv_vsetvlmax_e8m1())) + __riscv_vmv_x(__riscv_vredsum(v16, z16, __riscv_vsetvlmax_e16m1())); } ``` Or you might write the equivalent expressions inline in other code. Either way this results in separate temporaries -- a wasted register and an unnecessary `vsetvli`/`vmv.vx` pair. You can work around it with this: ```C int f(vuint8m1_t v8, vuint16m1_t v16) { const vuint8m1_t z8 = __riscv_vmv_v_x_u8m1(0, __riscv_vsetvlmax_e8m1()); const vuint16m1_t z16 = __riscv_vreinterpret_u16m1(z8); return __riscv_vmv_x(__riscv_vredsum(v8, z8, __riscv_vsetvlmax_e8m1())) + __riscv_vmv_x(__riscv_vredsum(v16, z16, __riscv_vsetvlmax_e16m1())); } ``` It'd be helpful to fold these together if possible. I'm not sure what all the constraints are.
wangpc-pp commented 5 months ago

This may be related to https://github.com/llvm/llvm-project/issues/80392.

sh1boot commented 5 months ago

This seems to introduce risks of conflict with #80099 if two operands turn out to be immediate 0 with different sizes -- though I cannot think of an instruction which is still worth issuing at that point, so hopefully any bugs introduced with this optimisation will evaporate before the instruction is emitted.

topperc commented 5 months ago

Some microarchitectures may store different EEWs in a different internal format and may have a penalty for reading them with a different EEW than they were written with.

sh1boot commented 5 months ago

Yeah, I figured it could be like that (at least with floating-point it makes some sense). I'm not sure if the register pressure implied here could ever get to the point of justifying the complexity of switching this sort of thing on and off depending on whether or not it works for the given target.

Conversely, it does imply that my vreinterpret workaround is actually non-zero-cost for those targets. Do they need a separate optimisation to fix that for them?