llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[SelectionDAG] Add a new ISD Node for vector saturating truncation #85903

Closed sun-jacobi closed 2 months ago

sun-jacobi commented 7 months ago

Based on #73424. There have been several patches and issues about the vector saturating truncation operation on different platforms.

For instance, #68466 for x86 and #75145 for the RISC-V vector extension. I think it would be better to do a target-independent combine for this operation, which could be expressed as a sequence of LLVM IR like:

define void @trunc_sat_i8i16(ptr %x, ptr %y) {
  %1 = load <8 x i16>, ptr %x, align 16
  %2 = tail call <8 x i16> @llvm.smax.v8i16(<8 x i16> %1, <8 x i16> <i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128, i16 -128>)
  %3 = tail call <8 x i16> @llvm.smin.v8i16(<8 x i16> %2, <8 x i16> <i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127, i16 127>)
  %4 = trunc <8 x i16> %3 to <8 x i8>
  store <8 x i8> %4, ptr %y, align 8
  ret void
}

In particular, I think this combine could be done in SelectionDAG, like the other saturating operations such as ISD::SADDSAT. This could remove complex TableGen pattern matching and simplify each platform's CodeGen path for this operation.
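For illustration, a sketch of what such a target-independent combine could look like (the `ISD::TRUNCATE_SSAT` opcode and the helper name are hypothetical placeholders, and only one of the two smin/smax orderings is matched):

```cpp
// Hypothetical DAGCombiner helper (illustration only): fold
//   trunc(smin(smax(x, dst_signed_min), dst_signed_max))
// into a single saturating-truncate node.
static SDValue combineTruncToSSat(SDNode *N, SelectionDAG &DAG) {
  assert(N->getOpcode() == ISD::TRUNCATE && "expected a truncate node");
  SDValue Src = N->getOperand(0);
  EVT DstVT = N->getValueType(0);
  if (!DstVT.isVector() || Src.getOpcode() != ISD::SMIN)
    return SDValue();

  SDValue SMax = Src.getOperand(0);
  if (SMax.getOpcode() != ISD::SMAX)
    return SDValue();

  // Both clamp limits must be splats of the destination type's signed range.
  APInt MaxC, MinC;
  if (!ISD::isConstantSplatVector(Src.getOperand(1).getNode(), MaxC) ||
      !ISD::isConstantSplatVector(SMax.getOperand(1).getNode(), MinC))
    return SDValue();

  unsigned DstBits = DstVT.getScalarSizeInBits();
  unsigned SrcBits = Src.getScalarValueSizeInBits();
  if (MaxC.getBitWidth() != SrcBits || MinC.getBitWidth() != SrcBits ||
      MaxC != APInt::getSignedMaxValue(DstBits).sext(SrcBits) ||
      MinC != APInt::getSignedMinValue(DstBits).sext(SrcBits))
    return SDValue();

  // Replace the whole clamp-and-truncate chain with one node.
  return DAG.getNode(ISD::TRUNCATE_SSAT, SDLoc(N), DstVT, SMax.getOperand(0));
}
```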

llvmbot commented 7 months ago

@llvm/issue-subscribers-backend-x86

llvmbot commented 7 months ago

@llvm/issue-subscribers-backend-risc-v

llvmbot commented 7 months ago

@llvm/issue-subscribers-backend-aarch64

dtcxzyw commented 7 months ago

cc @topperc @RKSimon @preames @asb

RKSimon commented 7 months ago

I'm sure this has come up before as a phab review but I can't find it right now (@davemgreen can you remember?).

What would be awesome is if there were a way to handle the two SSE PACK instruction signed saturations:

/// PACKSS (normal signed sat)
/// (truncate (smin ((smax (x, signed_min_of_dest_type)),
///                  signed_max_of_dest_type)) to dest_type)
/// or:
/// (truncate (smax ((smin (x, signed_max_of_dest_type)),
///                  signed_min_of_dest_type)) to dest_type).
///
/// PACKUS (signed sat of unsigned range) - NOT the same as USAT
/// the smax/smin range is [0, unsigned_max_of_dest_type].
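For a scalar illustration of the difference (an assumption for clarity, not the actual x86 lowering code): PACKUS still clamps with signed comparisons, just to the unsigned range of the destination, which differs from a plain unsigned saturate whenever the input is negative.

```cpp
#include <algorithm>
#include <cstdint>

// PACKUS-style saturation: signed clamp to [0, 255] before narrowing.
static uint8_t packusSat(int16_t X) {
  return static_cast<uint8_t>(std::clamp<int16_t>(X, 0, 255));
}

// Plain unsigned saturate (USAT-style): unsigned clamp to 255.
static uint8_t unsignedSat(uint16_t X) {
  return static_cast<uint8_t>(std::min<uint16_t>(X, 255));
}
```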
sun-jacobi commented 7 months ago

@RKSimon Thank you for the information. Do you mean we also need to add a new ISD node for the signed saturation of an unsigned range?

RKSimon commented 7 months ago

I don't know yet - I think what you were proposing was that we'd have an ISD::TRUNCSSAT node that just used the signed smin/smax clamp limits of the destination type (and an ISD::TRUNCUSAT that had a umax limit?).

What I'm wondering is how useful it would be to add extra args describing the limits - I just don't want to end up having a target node that has to match the generic nodes, especially if we end up with more complex folds for cases such as sat-trunc-store patterns.

davemgreen commented 7 months ago

For MVE this is spelled ARMISD::VQMOVNs/ARMISD::VQMOVNu, but as it uses the top/bottom instructions the new ISD might not be a good fit.

It would be better for AArch64/Neon (the SQXTN/UQXTN instructions) - there I think I added tablegen patterns to keep the nodes generic as long as possible, but so long as the new nodes handled known/demanded bits and whatnot then it should be a good replacement.

ParkHanbum commented 3 months ago

Can a beginner take over this issue? If so, I'd be very glad to take it.

topperc commented 3 months ago

The RISC-V code was further modified in #93596 #93728 #93756 #93752. There are no longer complex tablegen patterns, but there is a bunch of target-specific DAG combine code.

ParkHanbum commented 3 months ago

@topperc Thanks for letting me know. If there are any architectures that have not been patched yet, I will refer to the link you provided.

topperc commented 3 months ago

> @topperc Thanks for letting us know. If there are any architectures that have not been patched yet, I will refer to the link you provided.

I think the RISC-V specific DAGCombine code could be removed if we had a generic saturation node. We'd still need custom lowering to split something like a vXi32->vXi8 saturating truncate into multiple vnclip instructions since we can only truncate vXi32->vXi16 or vXi16->vXi8 in a single instruction.
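For illustration, a rough sketch of that splitting (the `ISD::TRUNCATE_SSAT` opcode and the function name are hypothetical placeholders): custom lowering can halve the element width one step at a time, so vXi32 -> vXi8 becomes vXi32 -> vXi16 -> vXi8, matching what a single vnclip can do.

```cpp
// Illustration only: split a multi-step saturating truncate into a chain of
// halving steps, since each step can only narrow by one element width.
static SDValue lowerTruncateSSat(SDValue Op, SelectionDAG &DAG) {
  SDLoc DL(Op);
  EVT DstVT = Op.getValueType();
  SDValue Src = Op.getOperand(0);
  EVT SrcVT = Src.getValueType();

  // Narrow by halving the element width until a single final step remains.
  while (SrcVT.getScalarSizeInBits() > 2 * DstVT.getScalarSizeInBits()) {
    EVT MidVT = SrcVT.changeVectorElementType(
        MVT::getIntegerVT(SrcVT.getScalarSizeInBits() / 2));
    Src = DAG.getNode(ISD::TRUNCATE_SSAT, DL, MidVT, Src);
    SrcVT = MidVT;
  }
  return DAG.getNode(ISD::TRUNCATE_SSAT, DL, DstVT, Src);
}
```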

sun-jacobi commented 3 months ago

> @topperc Thanks for letting me know. If there are any architectures that have not been patched yet, I will refer to the link you provided.

@ParkHanbum What I proposed is actually a generic saturation node (just as @topperc said). The proposed node is somewhat difficult to make work for x86, but AArch64/Neon and RISC-V would benefit.

If you are interested in this, feel free to try it :)

RKSimon commented 3 months ago

Starting with adding ISD::TRUNCSSAT would make sense - cheers.

ParkHanbum commented 3 months ago

I'm taking this on. As @RKSimon suggested, I implemented truncate_ssat first (is it correct?).

Legalized selection DAG: %bb.0 'saturate2:'
SelectionDAG has 7 nodes:
  t0: ch,glue = EntryToken
      t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
    t13: v8i8 = truncate_ssat t2
  t11: ch,glue = CopyToReg t0, Register:v8i8 $d0, t13
  t12: ch = AArch64ISD::RET_GLUE t11, Register:v8i8 $d0, t11:1

https://github.com/ParkHanbum/llvm-project/commit/f88709cb39098d8cc82aac7b83b8d7807259600a#diff-d7065626b3d269e24241429ce037d51fc91d5ead5896d67fcc038aefc1111fd2

I should start implementing architecture-specific DAGToDAGISel now, right? 😄

inclyc commented 3 months ago

Hi @ParkHanbum,

> I should start implementing architecture-specific DAGToDAGISel now, right?

I think modifications to the llvm/lib/Target/*.td files may be sufficient. You can start looking there.

ParkHanbum commented 3 months ago

Hi @inclyc, thanks for the advice. I'll figure out what I can do!

RKSimon commented 3 months ago

@ParkHanbum You might want to use something like 2887f1463930044a6093f111dc8eba5594144c33 as a reference for introducing a new ISD node, although in that case it was a conversion of an existing AArch64 opcode.

ParkHanbum commented 3 months ago

@RKSimon thanks!! it is very helpful!!

ParkHanbum commented 3 months ago

Can I ask a question? Is it the case that we cannot support the i64 type for this issue on AArch64 and x86?

I ask because I found that it is not supported on AArch64 and x86, but it is supported on RISC-V.

In this test case:

define <2 x i32> @vqmovni64_smaxmin(<2 x i64> %s0) {
entry:
  %c1 = icmp slt <2 x i64> %s0, <i64 2147483647, i64 2147483647>
  %s1 = select <2 x i1> %c1, <2 x i64> %s0, <2 x i64> <i64 2147483647, i64 2147483647>
  %c2 = icmp sgt <2 x i64> %s1, <i64 -2147483648, i64 -2147483648>
  %s2 = select <2 x i1> %c2, <2 x i64> %s1, <2 x i64> <i64 -2147483648, i64 -2147483648>
  %t = trunc <2 x i64> %s2 to <2 x i32>
  ret <2 x i32> %t
}

Result for AArch64:

vqmovni64_smaxmin:                      // @vqmovni64_smaxmin
        mov     w8, #2147483647                 // =0x7fffffff
        dup     v1.2d, x8
        mov     x8, #-2147483648                // =0xffffffff80000000
        cmgt    v2.2d, v1.2d, v0.2d
        bif     v0.16b, v1.16b, v2.16b
        dup     v1.2d, x8
        cmgt    v2.2d, v0.2d, v1.2d
        bif     v0.16b, v1.16b, v2.16b
        xtn     v0.2s, v0.2d
        ret

Result for RISC-V:

vqmovni64_smaxmin:                      # @vqmovni64_smaxmin
        vsetivli        zero, 2, e32, mf2, ta, ma
        vnclip.wi       v8, v8, 0
        ret

link : https://godbolt.org/z/7W6baE83c

And as my test shows, the trunc in @vqmovni64_smaxmin is combined into a saturating truncate:


Optimized legalized selection DAG: %bb.0 'vqmovni64_smaxmin:entry'
SelectionDAG has 7 nodes:
  t0: ch,glue = EntryToken
      t2: v2i64,ch = CopyFromReg t0, Register:v2i64 %0
    t17: v2i32 = truncate_ssat t2
  t15: ch,glue = CopyToReg t0, Register:v2i32 $d0, t17
  t16: ch = AArch64ISD::RET_GLUE t15, Register:v2i32 $d0, t15:1
RKSimon commented 3 months ago

I expect you will have to add legalizer support as well - not just handle known legal instructions.

Search for ISD::TRUNCATE in the llvm/lib/CodeGen/SelectionDAG/Legalize* files - it's likely you will just have to duplicate some of this code with a suitable smin/smax clamping before truncation.
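For illustration, a hedged sketch of that fallback expansion (again with a hypothetical `ISD::TRUNCATE_SSAT` placeholder): if the saturating truncate is not legal for a type, the legalizer can rebuild the explicit smin/smax clamp and then emit an ordinary ISD::TRUNCATE.

```cpp
// Illustration only: clamp the source to the destination type's signed range,
// then truncate normally.
static SDValue expandTruncateSSat(SDNode *N, SelectionDAG &DAG) {
  SDLoc DL(N);
  SDValue Src = N->getOperand(0);
  EVT SrcVT = Src.getValueType();
  EVT DstVT = N->getValueType(0);
  unsigned SrcBits = SrcVT.getScalarSizeInBits();
  unsigned DstBits = DstVT.getScalarSizeInBits();

  SDValue MinC = DAG.getConstant(
      APInt::getSignedMinValue(DstBits).sext(SrcBits), DL, SrcVT);
  SDValue MaxC = DAG.getConstant(
      APInt::getSignedMaxValue(DstBits).sext(SrcBits), DL, SrcVT);
  SDValue Clamped = DAG.getNode(
      ISD::SMIN, DL, SrcVT,
      DAG.getNode(ISD::SMAX, DL, SrcVT, Src, MinC), MaxC);
  return DAG.getNode(ISD::TRUNCATE, DL, DstVT, Clamped);
}
```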

ParkHanbum commented 3 months ago

Thanks @RKSimon, the commits you shared earlier showed me what I have to do.

The reason I asked is that I wanted to make sure that the cases that currently appear to be unsupported really cannot be supported.

ParkHanbum commented 3 months ago

I've implemented truncate_ssat/truncate_usat for AArch64 and RISC-V. x86 has a complicated implementation and I think I need to do more work on it; can I send a PR with what I've implemented first?

RKSimon commented 3 months ago

Please raise a PR; keep it as a draft if you want (but people will comment either way :))