
[RISCV] Missing opportunities to optimize RVV instructions #80392

Open · wangpc-pp opened 9 months ago

wangpc-pp commented 9 months ago

At the SelectionDAG level, we have several code paths that generate RVV pseudos:

  1. RVV intrinsics -> RVV pseudos.
  2. ISD nodes -> RVV pseudos.
  3. RISCVISD nodes -> RVV pseudos.
  4. RVV intrinsics -> RISCVISD nodes -> RVV pseudos.
  5. ISD nodes -> RISCVISD nodes -> RVV pseudos.
  6. etc.

Most of the optimizations for RVV are based on RISCVISD nodes, so we may miss opportunities to optimize some code. For example (https://godbolt.org/z/f1jWEfhG7):

vuint8m1_t dup(uint8_t* data) {
    return __riscv_vmv_v_x_u8m1(*data, __riscv_vsetvlmax_e8m1());
}

vuint8m1_t dup2(uint8_t* data) {
    return __riscv_vlse8_v_u8m1(data, 0, __riscv_vsetvlmax_e8m1());
}
dup:
        vsetvli a1, zero, e8, m1, ta, ma
        vlse8.v v8, (a0), zero
        ret
dup2:
        vsetvli a1, zero, e8, m1, ta, ma
        vlse8.v v8, (a0), zero
        ret

These two snippets compile to the same assembly because we lower the vmv.v.x intrinsic to RISCVISD::VMV_V_X first, and can then optimize it into a zero-stride load when profitable.
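To make the mechanism concrete, here is a minimal sketch of that kind of splat-of-load fold at the DAG level. It is hypothetical and heavily simplified: the function name is invented, RV64 is assumed, and the real in-tree peephole does more checking and chain fixup than shown here.

// Hypothetical sketch: splat of a scalar load -> zero-stride load, i.e.
//   (VMV_V_X_VL passthru, (load p), vl)  ==>  vlse from p with stride 0.
// Simplified: the real peephole also fixes up the chain uses of the
// original load and bails out unless Subtarget.hasOptimizedZeroStrideLoad().
static SDValue foldSplatOfLoad(SDNode *N, SelectionDAG &DAG) {
  SDValue Passthru = N->getOperand(0);
  SDValue Scalar = N->getOperand(1);
  SDValue VL = N->getOperand(2);

  // Only fold a plain (non-volatile, non-atomic) load that dies here.
  auto *Ld = dyn_cast<LoadSDNode>(Scalar);
  if (!Ld || !Ld->isSimple() || !Scalar.hasOneUse())
    return SDValue();

  SDLoc DL(N);
  MVT VT = N->getSimpleValueType(0);
  MVT XLenVT = MVT::i64; // Assume RV64 for brevity.

  // Re-emit the splat as a vlse with stride 0; stride rs2 = x0 means the
  // core only needs a single scalar memory access.
  SDValue IntID = DAG.getTargetConstant(Intrinsic::riscv_vlse, DL, XLenVT);
  SDValue Ops[] = {Ld->getChain(), IntID, Passthru, Ld->getBasePtr(),
                   DAG.getConstant(0, DL, XLenVT), VL};
  return DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL,
                                 DAG.getVTList(VT, MVT::Other), Ops,
                                 Ld->getMemoryVT(), Ld->getMemOperand());
}

But this pattern is the exception rather than the rule; most intrinsics never take such a RISCVISD detour: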

vuint16m2_t vadd(vuint16m2_t a, vuint8m1_t b) {
    int vl = __riscv_vsetvlmax_e8m1();
    vuint16m2_t c = __riscv_vzext_vf2_u16m2(b, vl);
    return __riscv_vadd_vv_u16m2(a, c, vl);
}

vuint16m2_t vwaddu(vuint16m2_t a, vuint8m1_t b) {
    return __riscv_vwaddu_wv_u16m2(a, b, __riscv_vsetvlmax_e16m2());
}
vadd:
        vsetvli a0, zero, e16, m2, ta, ma
        vzext.vf2       v12, v10
        vadd.vv v8, v8, v12
        ret
vwaddu:
        vsetvli a0, zero, e8, m1, ta, ma
        vwaddu.wv       v8, v8, v10
        ret

We can't optimize vzext.vf2+vadd.vv into vwaddu.wv because we lower these intrinsics to RVV pseudos directly. Of course, the same problem exists for the ISD -> RVV pseudos path:

typedef vuint8m1_t v16xi8 __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));
typedef vuint16m2_t v16xi32 __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen * 2)));

v16xi32 add(v16xi32 a, v16xi8 b) {
    v16xi32 c = __riscv_vzext_vf2_u16m2(b, 16);
    return a + c;
}
add:
        vsetivli        zero, 16, e16, m2, ta, ma
        vzext.vf2       v12, v10
        vadd.vv v8, v12, v8
        ret
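If these intrinsics were lowered through RISCVISD nodes first, a DAG combine of roughly the following shape could catch the vzext.vf2+vadd.vv pattern. This is a hypothetical sketch: the function name is invented, the operand positions of the _VL nodes are simplified, and passthru/policy operands are glossed over; the real widening combines on the ISD path are considerably more general.

// Hypothetical sketch: (ADD_VL x, (VZEXT_VL y, mask, vl), mask, vl)
//                  ==> (VWADDU_W_VL x, y, passthru, mask, vl)
// i.e. fold a zero-extend that feeds an add with the same mask/VL into a
// single widening add.
static SDValue combineAddOfZExt(SDNode *N, SelectionDAG &DAG) {
  assert(N->getOpcode() == RISCVISD::ADD_VL && "expects an ADD_VL node");
  SDValue LHS = N->getOperand(0);
  SDValue RHS = N->getOperand(1);
  SDValue Mask = N->getOperand(3);
  SDValue VL = N->getOperand(4);

  // Canonicalize the zero-extended operand to the RHS.
  if (LHS.getOpcode() == RISCVISD::VZEXT_VL)
    std::swap(LHS, RHS);
  if (RHS.getOpcode() != RISCVISD::VZEXT_VL || !RHS.hasOneUse())
    return SDValue();

  // The extend must be a vf2 (exactly doubling the element width) and must
  // agree with the add on mask and VL.
  SDValue Narrow = RHS.getOperand(0);
  MVT WideVT = N->getSimpleValueType(0);
  if (WideVT.getScalarSizeInBits() !=
          2 * Narrow.getSimpleValueType().getScalarSizeInBits() ||
      RHS.getOperand(1) != Mask || RHS.getOperand(2) != VL)
    return SDValue();

  SDLoc DL(N);
  return DAG.getNode(RISCVISD::VWADDU_W_VL, DL, WideVT,
                     {LHS, Narrow, DAG.getUNDEF(WideVT), Mask, VL});
}

Nothing like this can fire on the intrinsic path today, because those RISCVISD nodes are never created in the first place.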

I think we need a universal representation (RISCVISD?) on which to do these optimizations. But once GISel is supported, will we need to redo all of these optimizations on generic MIR? Or should we move all the optimizations to later MIR passes?

llvmbot commented 9 months ago

@llvm/issue-subscribers-backend-risc-v

Author: Wang Pengcheng (wangpc-pp)

topperc commented 9 months ago

The last example can be optimized with full use of ISD nodes instead of mixing in intrinsics:

v16xi32 add(v16xi32 a, v16xi8 b) {
    v16xi32 c = __builtin_convertvector(b, v16xi32);
    return a + c;
}
lukel97 commented 9 months ago

> But once GISel is supported, will we need to redo all of these optimizations on generic MIR? Or should we move all the optimizations to later MIR passes?

RISCVFoldMasks and #71764 are efforts to move some of the SelectionDAG code out into MIR passes.
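(For readers unfamiliar with that direction: these are ordinary MachineFunction passes that run after instruction selection and pattern-match RVV pseudos in MIR, so they apply no matter how the pseudos were produced. A bare-bones skeleton, with an invented name and none of RISCVFoldMasks' actual folding logic, looks like this:)

#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"

using namespace llvm;

namespace {
// Hypothetical late RVV peephole; RISCVFoldMasks has this overall shape.
class RVVPeephole : public MachineFunctionPass {
public:
  static char ID;
  RVVPeephole() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    bool Changed = false;
    for (MachineBasicBlock &MBB : MF)
      for (MachineInstr &MI : make_early_inc_range(MBB))
        Changed |= foldInstr(MI);
    return Changed;
  }

private:
  // Because this matches on MIR, it sees the same pseudos whether they came
  // from intrinsics, generic ISD nodes, or (eventually) GlobalISel.
  bool foldInstr(MachineInstr &MI) {
    // ... opcode checks and BuildMI rewrites would go here ...
    return false;
  }
};
} // end anonymous namespace

char RVVPeephole::ID = 0;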

> We can't optimize vzext.vf2+vadd.vv into vwaddu.wv because we lower these intrinsics to RVV pseudos directly.

I can't remember where I first heard this argument, but I think there was a question as to whether or not intrinsics should be optimised away, since there might be an expectation that if the user writes __riscv_vzext_vf2_u16m2 then there should be a vzext.vf2 in the resulting code.

wangpc-pp commented 9 months ago

> The last example can be optimized with full use of ISD nodes instead of mixing in intrinsics:
>
> v16xi32 add(v16xi32 a, v16xi8 b) {
>     v16xi32 c = __builtin_convertvector(b, v16xi32);
>     return a + c;
> }

Thanks! I think my being unaware of this just demonstrates the potential for missed optimizations. 😄

wangpc-pp commented 9 months ago

> But once GISel is supported, will we need to redo all of these optimizations on generic MIR? Or should we move all the optimizations to later MIR passes?

> RISCVFoldMasks and #71764 are efforts to move some of the SelectionDAG code out into MIR passes.

Yeah! Thanks for mentioning this work!

> We can't optimize vzext.vf2+vadd.vv into vwaddu.wv because we lower these intrinsics to RVV pseudos directly.

> I can't remember where I first heard this argument, but I think there was a question as to whether or not intrinsics should be optimised away, since there might be an expectation that if the user writes __riscv_vzext_vf2_u16m2 then there should be a vzext.vf2 in the resulting code.

As my example shows, we have already broken this convention for the vmv.v.x intrinsic.

topperc commented 9 months ago

Not directly related to this, but I'm not sure HasOptimizedZeroStrideLoad should default to true.

wangpc-pp commented 9 months ago

> Not directly related to this, but I'm not sure HasOptimizedZeroStrideLoad should default to true.

See https://reviews.llvm.org/D137699. cc @preames
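(For context: that setting gates whether the backend will form the zero-stride loads shown at the top of this issue. A fold is expected to consult it along these lines; the accessor name is just the usual pattern generated from the feature definition:)

// Sketch: skip splat-of-load -> zero-stride-load folds on cores that do
// not implement a stride of x0 as a single scalar memory access.
if (!Subtarget.hasOptimizedZeroStrideLoad())
  return SDValue();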