Closed sun-jacobi closed 2 months ago
@llvm/issue-subscribers-backend-x86
Author: Chia (sun-jacobi)
@llvm/issue-subscribers-backend-risc-v
Author: Chia (sun-jacobi)
@llvm/issue-subscribers-backend-aarch64
Author: Chia (sun-jacobi)
cc @topperc @RKSimon @preames @asb
I'm sure this has come up before as a phab review but I can't find it right now (@davemgreen can you remember?).
What would be awesome is if there was a way to handle the 2 SSE PACK instruction signed saturations:
/// PACKSS (normal signed sat)
/// (truncate (smin ((smax (x, signed_min_of_dest_type)),
/// signed_max_of_dest_type)) to dest_type)
/// or:
/// (truncate (smax ((smin (x, signed_max_of_dest_type)),
/// signed_min_of_dest_type)) to dest_type).
///
/// PACKUS (signed sat of unsigned range) - NOT the same as USAT
/// the smax/smin range is [0, unsigned_max_of_dest_type].
@RKSimon Thank you for the information. Do you mean we also need to add a new ISD Node for signed sat of unsigned range?
I don't know yet - I think what you were proposing was we'd have a ISD::TRUNCSSAT node that just used the signed smin/smax clamp limits of the destination type (and ISD::TRUNCUSAT that had a umax limit?).
What I'm wondering is how useful it would be to add extra args describing the limits - I just don't want to end up having a target node that has to match the generic nodes, especially if we end up with more complex folds for cases such as sat-trunc-store patterns.
For MVE this is spelled ARMISD::VQMOVNs/ARMISD::VQMOVNu, but as it uses the top/bottom instructions the new ISD might not be a good fit.
It would be better for AArch64/Neon (the SQXTN/UQXTN instructions) - there I think I added tablegen patterns to keep the nodes generic as long as possible, but so long as the new nodes handled known/demanded bits and whatnot then it should be a good replacement.
Can a beginner take over this issue? if may, I'm very glad to take.
The RISC-V code was further modified in #93596 #93728 #93756 #93752. There are no longer complex tablegen patterns, but there is a bunch of target specific DAG combine code.
@topperc Thanks for letting me know. If there are any architectures that have not been patched yet, I will refer to the link you provided.
@topperc Thanks for letting us know. If there are any architectures that have not been patched yet, I will refer to the link you provided.
I think the RISC-V specific DAGCombine code could be removed if we had a generic saturation node. We'd still need custom lowering to split something like a vXi32->vXi8 saturating truncate into multiple vnclip instructions since we can only truncate vXi32->vXi16 or vXi16->vXi8 in a single instruction.
@topperc Thanks for letting me know. If there are any architectures that have not been patched yet, I will refer to the link you provided.
@ParkHanbum What I proposed is actually a generic saturation node (just as @topperc said). The proposed node is somewhat difficult to work with x86, but AArch64/Neon and RISCV-V would benefit.
If you are interested in this, feel free to try it :)
Starting with adding ISD::TRUNCSSAT would make sense - cheers.
I'm on duty for this. as @RKSimon guided, implemented (is it correctly?) truncate_ssat first.
Legalized selection DAG: %bb.0 'saturate2:'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: v8i16,ch = CopyFromReg t0, Register:v8i16 %0
t13: v8i8 = truncate_ssat t2
t11: ch,glue = CopyToReg t0, Register:v8i8 $d0, t13
t12: ch = AArch64ISD::RET_GLUE t11, Register:v8i8 $d0, t11:1
I should start implementing architecture-specific DAGToDAGISel now, right? 😄
Hi @ParkHanbum,
I should start implementing architecture-specific DAGToDAGISel now, right?
I think maybe modifications on llvm/lib/Target/*.td
files would be sufficient. You can start looking there.
Hi @inclyc thanks for advice. I'll find what can I do!
@ParkHanbum You might want to use something like 2887f1463930044a6093f111dc8eba5594144c33 as reference about introducing a new ISD node, although in that case this was a conversion of an existing aarch64 opcode
@RKSimon thanks!! it is very helpful!!
can I ask a question? is we cannot support i64 type for this issue in aarch64 and x86?
because I found it does not supported in aarch64&x86 but supported in riscv.
in this testcase :
define <2 x i32> @vqmovni64_smaxmin(<2 x i64> %s0) {
entry:
%c1 = icmp slt <2 x i64> %s0, <i64 2147483647, i64 2147483647>
%s1 = select <2 x i1> %c1, <2 x i64> %s0, <2 x i64> <i64 2147483647, i64 2147483647>
%c2 = icmp sgt <2 x i64> %s1, <i64 -2147483648, i64 -2147483648>
%s2 = select <2 x i1> %c2, <2 x i64> %s1, <2 x i64> <i64 -2147483648, i64 -2147483648>
%t = trunc <2 x i64> %s2 to <2 x i32>
ret <2 x i32> %t
}
result of aarch64:
vqmovni64_smaxmin: // @vqmovni64_smaxmin
mov w8, #2147483647 // =0x7fffffff
dup v1.2d, x8
mov x8, #-2147483648 // =0xffffffff80000000
cmgt v2.2d, v1.2d, v0.2d
bif v0.16b, v1.16b, v2.16b
dup v1.2d, x8
cmgt v2.2d, v0.2d, v1.2d
bif v0.16b, v1.16b, v2.16b
xtn v0.2s, v0.2d
ret
result of riscv:
vqmovni64_smaxmin: # @vqmovni64_smaxmin
vsetivli zero, 2, e32, mf2, ta, ma
vnclip.wi v8, v8, 0
ret
link : https://godbolt.org/z/7W6baE83c
and as I tested result, that trunc at @vqmovni64_smaxmin is saturated.
Optimized legalized selection DAG: %bb.0 'vqmovni64_smaxmin:entry'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: v2i64,ch = CopyFromReg t0, Register:v2i64 %0
t17: v2i32 = truncate_ssat t2
t15: ch,glue = CopyToReg t0, Register:v2i32 $d0, t17
t16: ch = AArch64ISD::RET_GLUE t15, Register:v2i32 $d0, t15:1
I expect you will have to add legalizer support as well - not just handle known legal instructions.
Search for ISD::TRUNCATE in the llvm\lib\CodeGen\SelectionDAG\Legalize* files - its likely you will just have duplicate some of this code with a suitable smin/smax clamping before truncation.
Thanks @RKSimon, the commits you shared earlier gave me what I have to do.
The reason I asked is that I wanted to make sure that the cases that currently appear to be unsupported are still unsupportable.
I've implemented truncate_[us]at for aarch64, riscv. x86 has a complicated implementation and I think I need to do more work on it, can I send a PR with the implemented one first?
Please raise a PR, keep it as draft if you want (but people will comment whatever :))
Based on the #73424. There were several patches and issues on the vector saturating truncation operation on different platform.
For instance, #68466 for x86 and #75145 for RISC-V vector extension. I think it is better for us to do a target-independent combine for this operation, which could be expressed as a sequence of LLVM IR like:
Especially, I think this combine could be done at the SelectionDAG, like the other saturating operation such as
ISD::SADDSAT
. This could remove complex TableGen pattern matching and refactor each platform's CodeGen path for this operation.