[Issue closed by adinn 9 months ago]
The subgraph patterns in the AArch64 ad file that are of major interest include the following.
These are mostly used to manage address match inputs.
Normal oops: (basic reg index/offset is omitted)
operand indIndexScaledI2L(iRegP reg, iRegI ireg, immIScale scale)
match(AddP reg (LShiftL (ConvI2L ireg) scale));
operand indIndexScaled(iRegP reg, iRegL lreg, immIScale scale)
match(AddP reg (LShiftL lreg scale));
operand indIndexI2L(iRegP reg, iRegI ireg)
match(AddP reg (ConvI2L ireg));
operand indIndexScaledI2LN(iRegN reg, iRegI ireg, immIScale scale)
match(AddP (DecodeN reg) (LShiftL (ConvI2L ireg) scale));
Narrow oops: (each one analogous to a corresponding normal oop rule)
operand indIndexScaledN(iRegN reg, iRegL lreg, immIScale scale)
match(AddP (DecodeN reg) (LShiftL lreg scale));
operand indIndexI2LN(iRegN reg, iRegI ireg)
match(AddP (DecodeN reg) (ConvI2L ireg));
operand indIndexN(iRegN reg, iRegL lreg)
match(AddP (DecodeN reg) lreg);
operand indOffIN(iRegN reg, immIOffset off)
match(AddP (DecodeN reg) off);
operand indOffLN(iRegN reg, immLoffset off)
match(AddP (DecodeN reg) off);
The above rules (along with a couple of direct operand matches) enable complex memory address inputs to be subsumed into one 'memory' operand class. These are, for the most part, already implemented by the AArch64AddressLowering phase; 'mostly' because of the following wrinkle.
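As an illustration (the class and method names here are mine, not from the ad file), a plain Java int-array access produces exactly the AddP/LShiftL/ConvI2L shape that an operand such as indIndexScaledI2L matches, so the whole address computation should fold into the addressing mode of a single load:

```java
// Illustrative only: shows the Java source shape behind the address
// operand rules. The address arithmetic for a[i] is
//   AddP base (LShiftL (ConvI2L i) 2)
// which indIndexScaledI2L subsumes, so C2 can emit one
//   ldr w0, [base, w_i, sxtw #2]
// instead of separate sign-extend, shift, add and load instructions.
public class AddrMode {
    static int elem(int[] a, int i) {
        return a[i];
    }
}
```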
The C2 match rules (including those for address arithmetic) profit from being able to elide long and int inputs using this opclass:
opclass iRegIorL2I(iRegI, iRegL2I);
Opclass iRegIorL2I is used throughout aarch64.ad to enable patterns to accept a 64-bit input and use it as a 32-bit input, avoiding the need for many equivalent rules which explicitly manage type conversions for the corresponding input or output values.
Note that each pattern is annotated with the target AArch64 instruction employed to translate the match (look for the ==> marker).
Loads: zero/sign-extend elimination at load
match(Set dst (ConvI2L (LoadB mem))); ==> ldrsb
match(Set dst (ConvI2L (LoadUB mem))); ==> ldrb
match(Set dst (ConvI2L (LoadS mem))); ==> ldrsh
match(Set dst (ConvI2L (LoadUS mem))); ==> ldrh
match(Set dst (ConvI2L (LoadI mem))); ==> ldrsw
match(Set dst (AndL (ConvI2L (LoadI mem)) 0xffffffff)); ==> ldrw
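A hypothetical Java example of the source shapes behind these load rules (method names are mine): loading a byte or char and widening it to long produces the ConvI2L-over-Load subgraphs above, so the extension should be done by the load instruction itself rather than by a separate instruction.

```java
// Illustrative only: Java loads whose widening conversions the load
// rules absorb.
public class LoadExt {
    // (ConvI2L (LoadB mem)) matches ldrsb: the sign extension to long
    // is performed by the load itself.
    static long sByte(byte[] a, int i) { return a[i]; }

    // (ConvI2L (LoadUS mem)) matches ldrh: char loads zero-extend
    // for free.
    static long uChar(char[] a, int i) { return a[i]; }
}
```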
Casts
match(Set dst (ConvL2I (CastP2X src))); ==> movw # P --> int
# only when Universe::narrow_oop_shift() == 0
match(Set dst (ConvL2I (CastP2X (DecodeN src)))); ==> mov # N -> int
Cond Move for X in {I, L, P, N, F, D} -- the zero/one match is only for I; n.b. cmp is cmpOp or cmpOpU
match(Set dst (CMoveX (Binary cmp cr) (Binary src1 src2))); ==> csel/fcsel
match(Set dst (CMoveX (Binary cmp cr) (Binary zero src))); ==> csel/fcsel
match(Set dst (CMoveX (Binary cmp cr) (Binary src zero))); ==> csel/fcsel
# special case for creating a boolean
match(Set dst (CMoveI (Binary cmp cr) (Binary one zero))); ==> csinc
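A Java-level sketch of the csinc special case (names are illustrative): materialising a boolean from a comparison yields the CMoveI-of-one/zero subgraph, which should compile branch-free to a compare plus a single csinc.

```java
// Illustrative only: the ternary below becomes
//   CMoveI (Binary cmp cr) (Binary one zero)
// which the csinc rule matches, so no branch is needed.
public class CondMove {
    static int isNegative(int x) {
        return (x < 0) ? 1 : 0;
    }
}
```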
Arithmetic, Logical and Shifts
# elide l2i
match(Set dst (AddI (ConvL2I src1) src2)); ==> addw
# elide i2l and/or shift in address arithmetic
match(Set dst (AddP src1 (ConvI2L src2))); ==> add sxtw
match(Set dst (AddP src1 (LShiftL src2 scale))); ==> lea # aka add LSL
match(Set dst (AddP src1 (LShiftL (ConvI2L src2) scale))); ==> lea
# bitfield extract for address arithmetic?
match(Set dst (LShiftL (ConvI2L src) scale)); ==> sbfiz
# fused mul add/sub
match(Set dst (MulL (ConvI2L src1) (ConvI2L src2))); ==> smull
match(Set dst (AddI src3 (MulI src1 src2))); ==> maddw
match(Set dst (SubI src3 (MulI src1 src2))); ==> msubw
match(Set dst (AddL src3 (MulL src1 src2))); ==> madd
match(Set dst (SubL src3 (MulL src1 src2))); ==> msub
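The Java source shapes behind the widening-multiply and multiply-accumulate rules look like this (class and method names are mine, for illustration):

```java
// Illustrative only: fused multiply patterns as written in Java.
public class MulAdd {
    // (AddI src3 (MulI src1 src2)) matches maddw: multiply and add
    // fuse into one multiply-accumulate instruction.
    static int mulAcc(int a, int b, int c) { return c + a * b; }

    // (MulL (ConvI2L src1) (ConvI2L src2)) matches smull: a full
    // 64-bit product of two 32-bit inputs in a single instruction.
    static long widen(int a, int b) { return (long) a * (long) b; }
}
```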
# sign extract div1 == div2 == 31
match(Set dst (URShiftI (RShiftI src1 div1) div2)); ==> lsrw 31
# div 2 round div1 == div2 == 31
match(Set dst (AddI src (URShiftI (RShiftI src div1) div2))); ==> addw LSR
# sign extract div1 == div2 == 63
match(Set dst (URShiftL (RShiftL src1 div1) div2)); ==> lsr 63
# div 2 round div1 == div2 == 63
match(Set dst (AddL src (URShiftL (RShiftL src div1) div2))); ==> add LSR
# special case for card table stores - no cast; ANDs src2 with 0x3f
match(Set dst (URShiftL (CastP2X src1) src2)); ==> lsr
# bit twiddling, m1 == -1
# merge negate into logic op
match(Set dst (AndI src1 (XorI src2 m1))); ==> bicw
match(Set dst (AndL src1 (XorL src2 m1))); ==> bic
match(Set dst (OrI src1 (XorI src2 m1))); ==> ornw
match(Set dst (OrL src1 (XorL src2 m1))); ==> orn
match(Set dst (XorI m1 (XorI src2 src1))); ==> eonw
match(Set dst (XorL m1 (XorL src2 src1))); ==> eon
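In Java source the m1 == -1 operand is how `~x` appears to C2 (Xor with -1), so the and-not/or-not idioms below produce exactly these subgraphs (method names are illustrative):

```java
// Illustrative only: ~mask is (XorI mask -1), so these merge the
// complement into the logic op.
public class BitOps {
    // (AndI src1 (XorI src2 -1)) matches bicw: and-with-complement
    // in one instruction, no separate negate.
    static int clear(int x, int mask) { return x & ~mask; }

    // (OrI src1 (XorI src2 -1)) matches ornw: or-with-complement.
    static int setInv(int x, int mask) { return x | ~mask; }
}
```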
# merge negated shift into And
match(Set dst (AndI src1 (XorI(URShiftI src2 src3) m1))); ==> bicw LSR
match(Set dst (AndL src1 (XorL(URShiftL src2 src3) m1))); ==> bic LSR
match(Set dst (AndI src1 (XorI(RShiftI src2 src3) m1))); ==> bicw ASR
match(Set dst (AndL src1 (XorL(RShiftL src2 src3) m1))); ==> bic ASR
match(Set dst (AndI src1 (XorI(LShiftI src2 src3) m1))); ==> bicw LSL
match(Set dst (AndL src1 (XorL(LShiftL src2 src3) m1))); ==> bic LSL
# merge shifted Or into negate
match(Set dst (XorI m1 (XorI(URShiftI src2 src3) src1))); ==> eonw LSR
match(Set dst (XorL m1 (XorL(URShiftL src2 src3) src1))); ==> eon LSR
match(Set dst (XorI m1 (XorI(RShiftI src2 src3) src1))); ==> eonw ASR
match(Set dst (XorL m1 (XorL(RShiftL src2 src3) src1))); ==> eon ASR
match(Set dst (XorI m1 (XorI(LShiftI src2 src3) src1))); ==> eonw LSL
match(Set dst (XorL m1 (XorL(LShiftL src2 src3) src1))); ==> eon LSL
# merge negated shift into Or
match(Set dst (OrI src1 (XorI(URShiftI src2 src3) src4)));==> ornw LSR
match(Set dst (OrL src1 (XorL(URShiftL src2 src3) src4)));==> orn LSR
match(Set dst (OrI src1 (XorI(RShiftI src2 src3) src4))); ==> ornw ASR
match(Set dst (OrL src1 (XorL(RShiftL src2 src3) src4))); ==> orn ASR
match(Set dst (OrI src1 (XorI(LShiftI src2 src3) src4))); ==> ornw LSL
match(Set dst (OrL src1 (XorL(LShiftL src2 src3) src4))); ==> orn LSL
# merge shift into And
match(Set dst (AndI src1 (URShiftI src2 src3))); ==> andw LSR
match(Set dst (AndL src1 (URShiftL src2 src3))); ==> and LSR
match(Set dst (AndI src1 (RShiftI src2 src3))); ==> andw ASR
match(Set dst (AndL src1 (RShiftL src2 src3))); ==> and ASR
match(Set dst (AndI src1 (LShiftI src2 src3))); ==> andw LSL
match(Set dst (AndL src1 (LShiftL src2 src3))); ==> and LSL
# merge shift into XOr
match(Set dst (XorI src1 (URShiftI src2 src3))); ==> eorw LSR
match(Set dst (XorL src1 (URShiftL src2 src3))); ==> eor LSR
match(Set dst (XorI src1 (RShiftI src2 src3))); ==> eorw ASR
match(Set dst (XorL src1 (RShiftL src2 src3))); ==> eor ASR
match(Set dst (XorI src1 (LShiftI src2 src3))); ==> eorw LSL
match(Set dst (XorL src1 (LShiftL src2 src3))); ==> eor LSL
# merge shift into Or
match(Set dst (OrI src1 (URShiftI src2 src3))); ==> orrw LSR
match(Set dst (OrL src1 (URShiftL src2 src3))); ==> orr LSR
match(Set dst (OrI src1 (RShiftI src2 src3))); ==> orrw ASR
match(Set dst (OrL src1 (RShiftL src2 src3))); ==> orr ASR
match(Set dst (OrI src1 (LShiftI src2 src3))); ==> orrw LSL
match(Set dst (OrL src1 (LShiftL src2 src3))); ==> orr LSL
# merge shift into Add
match(Set dst (AddI src1 (URShiftI src2 src3))); ==> addw LSR
match(Set dst (AddL src1 (URShiftL src2 src3))); ==> add LSR
match(Set dst (AddI src1 (RShiftI src2 src3))); ==> addw ASR
match(Set dst (AddL src1 (RShiftL src2 src3))); ==> add ASR
match(Set dst (AddI src1 (LShiftI src2 src3))); ==> addw LSL
match(Set dst (AddL src1 (LShiftL src2 src3))); ==> add LSL
# merge shift into Sub
match(Set dst (SubI src1 (URShiftI src2 src3))); ==> subw LSR
match(Set dst (SubL src1 (URShiftL src2 src3))); etc
match(Set dst (SubI src1 (RShiftI src2 src3)));
match(Set dst (SubL src1 (RShiftL src2 src3)));
match(Set dst (SubI src1 (LShiftI src2 src3)));
match(Set dst (SubL src1 (LShiftL src2 src3)));
# merge shifts for i2b -- requires l/rshift_count <= 63 or 31
match(Set dst (RShiftL (LShiftL src lshift_count) rshift_count)); ==> sbfm
match(Set dst (RShiftI (LShiftI src lshift_count) rshift_count)); ==> sbfmw
match(Set dst (URShiftL (LShiftL src lshift_count) rshift_count)); ==> ubfm
match(Set dst (URShiftI (LShiftI src lshift_count) rshift_count)); ==> ubfmw
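The i2b/i2s bytecodes are lowered by C2 to exactly such left-shift/right-shift pairs, so a narrowing cast in Java source should end up as a single signed bitfield move (illustrative names):

```java
// Illustrative only: (byte) x is lowered to
//   RShiftI (LShiftI x 24) 24
// which the sbfmw rule matches as one signed bitfield move.
public class Narrow {
    static int toByte(int x) { return (byte) x; }
    static int toShort(int x) { return (short) x; }
}
```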
# shift and mask bit extract -- mask == 2**k - 1, k <= 29 or 61
match(Set dst (AndI (URShiftI src rshift) mask)); ==> ubfxw
match(Set dst (AndL (URShiftL src rshift) mask)); ==> ubfx
# zero-extended shift and mask bit extract
match(Set dst (ConvI2L (AndI (URShiftI src rshift) mask))); ==> ubfx
# left shift mask bit extract lshift <= 32 or 64, mask as above
match(Set dst (LShiftI (AndI src mask) lshift)); ==> ubfizw
match(Set dst (LShiftL (AndL src mask) lshift)); ==> ubfiz
# left shift zero-extended mask bit extract
match(Set dst (LShiftL (ConvI2L(AndI src mask)) lshift)); ==> ubfiz
# rotations (rshift + lshift) & 63 == 0 (& 31 for I)
match(Set dst (OrL (LShiftL src1 lshift) (URShiftL src2 rshift))); ==> extr
match(Set dst (OrI (LShiftI src1 lshift) (URShiftI src2 rshift)));==> extrw
# rotated add
match(Set dst (AddL (LShiftL src1 lshift) (URShiftL src2 rshift))); ==> extr
match(Set dst (AddI (LShiftI src1 lshift) (URShiftI src2 rshift))); ==> extrw
# rol/ror
match(Set dst (OrL (LShiftL src shift) (URShiftL src (SubI 64 shift)))); ==> subw ; rorv
match(Set dst (OrL (LShiftL src shift) (URShiftL src (SubI 0 shift)))); ==> subw ; rorv
match(Set dst (OrI (LShiftI src shift) (URShiftI src (SubI 32 shift)))); ==> subw ; rorvw
match(Set dst (OrI (LShiftI src shift) (URShiftI src (SubI 0 shift)))); ==> subw ; rorvw
match(Set dst (OrL (URShiftL src shift) (LShiftL src (SubI 64 shift)))); ==> rorv
match(Set dst (OrL (URShiftL src shift) (LShiftL src (SubI 0 shift)))); ==> rorv
match(Set dst (OrI (URShiftI src shift) (LShiftI src (SubI 32 shift)))); ==> rorvw
match(Set dst (OrI (URShiftI src shift) (LShiftI src (SubI 0 shift)))); ==> rorvw
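The or-of-opposed-shifts idiom (which is also what Integer.rotateRight compiles down to) produces the rol/ror subgraphs above; in Java the `-n` shift count is how the SubI 0 shift form arises, since shift counts are taken mod 32/64 (names illustrative):

```java
// Illustrative only: a variable rotate-right written by hand.
public class Rotate {
    // (OrI (URShiftI src n) (LShiftI src (SubI 0 n))) matches rorvw.
    // -n is equivalent to 32 - n because Java masks int shift counts
    // to 5 bits.
    static int rotr(int x, int n) {
        return (x >>> n) | (x << -n);
    }
}
```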
# merge sign-extend (i2l) into add/sub
match(Set dst (AddL src1 (ConvI2L src2))); ==> add sxtw
match(Set dst (SubL src1 (ConvI2L src2))); ==> sub sxtw
# merge zero/sign-extend (shift pair) into add/sub
match(Set dst (AddI src1 (RShiftI (LShiftI src2 16) 16))); ==> add sxth
match(Set dst (AddI src1 (RShiftI (LShiftI src2 24) 24))); ==> add sxtb
match(Set dst (AddI src1 (URShiftI (LShiftI src2 24) 24))); ==> add uxtb
match(Set dst (AddL src1 (RShiftL (LShiftL src2 48) 48))); ==> add sxth
match(Set dst (AddL src1 (RShiftL (LShiftL src2 32) 32))); ==> add sxtw
match(Set dst (AddL src1 (RShiftL (LShiftL src2 56) 56))); ==> add sxtb
match(Set dst (AddL src1 (URShiftL (LShiftL src2 56) 56))); ==> add uxtb
# merge integral downcast (mask) into add/sub
match(Set dst (AddI src1 (AndI src2 0xff))); ==> addw uxtb
match(Set dst (AddI src1 (AndI src2 0xffff))); ==> addw uxth
match(Set dst (AddL src1 (AndL src2 0xff))); ==> add uxtb
match(Set dst (AddL src1 (AndL src2 0xffff))); ==> add uxth
match(Set dst (AddL src1 (AndL src2 0xffffffff))); ==> add uxtw
match(Set dst (SubI src1 (AndI src2 0xff))); ==> subw uxtb
match(Set dst (SubI src1 (AndI src2 0xffff))); ==> subw uxth
match(Set dst (SubL src1 (AndL src2 0xff))); ==> sub uxtb
match(Set dst (SubL src1 (AndL src2 0xffff))); ==> sub uxth
match(Set dst (SubL src1 (AndL src2 0xffffffff))); ==> sub uxtw
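A common Java idiom that exercises the downcast-merge rules: adding a masked byte or short into an accumulator. The mask should be folded into the add as an extending-register operand (names illustrative):

```java
// Illustrative only: the byte/halfword masks fold into the add.
public class ExtAdd {
    // (AddI src1 (AndI src2 0xff)) matches addw uxtb.
    static int addByte(int acc, int b) { return acc + (b & 0xff); }

    // (AddI src1 (AndI src2 0xffff)) matches addw uxth.
    static int addHalf(int acc, int h) { return acc + (h & 0xffff); }
}
```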
# merge shifted zero/sign-extend (via shift) into shifted add/sub
# 0 <= lshift <= 4
match(Set dst (AddL src1 (LShiftL (RShiftL (LShiftL src2 56) 56) lshift))); ==> add sxtb #lshift
match(Set dst (AddL src1 (LShiftL (RShiftL (LShiftL src2 48) 48) lshift))); ==> add sxth #lshift
match(Set dst (AddL src1 (LShiftL (RShiftL (LShiftL src2 32) 32) lshift))); ==> add sxtw #lshift
match(Set dst (SubL src1 (LShiftL (RShiftL (LShiftL src2 56) 56) lshift))); ==> sub sxtb #lshift
match(Set dst (SubL src1 (LShiftL (RShiftL (LShiftL src2 48) 48) lshift))); ==> sub sxth #lshift
match(Set dst (SubL src1 (LShiftL (RShiftL (LShiftL src2 32) 32) lshift))); ==> sub sxtw #lshift
match(Set dst (AddI src1 (LShiftI (RShiftI (LShiftI src2 24) 24) lshift))); ==> addw sxtb #lshift
match(Set dst (AddI src1 (LShiftI (RShiftI (LShiftI src2 16) 16) lshift))); ==> addw sxth #lshift
match(Set dst (SubI src1 (LShiftI (RShiftI (LShiftI src2 24) 24) lshift))); ==> subw sxtb #lshift
match(Set dst (SubI src1 (LShiftI (RShiftI (LShiftI src2 16) 16) lshift))); ==> subw sxth #lshift
# merge shifted zero/sign-extend (i2l or mask) into add/sub
# 0 <= lshift <= 4
match(Set dst (AddL src1 (LShiftL (ConvI2L src2) lshift))); ==> add sxtw #lshift
match(Set dst (SubL src1 (LShiftL (ConvI2L src2) lshift))); ==> sub sxtw #lshift
match(Set dst (AddL src1 (LShiftL (AndL src2 0xff) lshift))); ==> add uxtb #lshift
match(Set dst (AddL src1 (LShiftL (AndL src2 0xffff) lshift))); ==> add uxth #lshift
match(Set dst (AddL src1 (LShiftL (AndL src2 0xffffffff) lshift))); ==> add uxtw #lshift
match(Set dst (SubL src1 (LShiftL (AndL src2 0xff) lshift))); ==> sub uxtb #lshift
match(Set dst (SubL src1 (LShiftL (AndL src2 0xffff) lshift))); ==> sub uxth #lshift
match(Set dst (SubL src1 (LShiftL (AndL src2 0xffffffff) lshift))); ==> sub uxtw #lshift
match(Set dst (AddI src1 (LShiftI (AndI src2 0xff) lshift))); ==> addw uxtb #lshift
match(Set dst (AddI src1 (LShiftI (AndI src2 0xffff) lshift))); ==> addw uxth #lshift
match(Set dst (SubI src1 (LShiftI (AndI src2 0xff) lshift))); ==> subw uxtb #lshift
match(Set dst (SubI src1 (LShiftI (AndI src2 0xffff) lshift))); ==> subw uxth #lshift
# fused multiply add variants
match(Set dst (FmaF src3 (Binary src1 src2))); ==> fmadds
match(Set dst (FmaD src3 (Binary src1 src2))); ==> fmaddd
match(Set dst (FmaF src3 (Binary (NegF src1) src2))); ==> fmsubs
match(Set dst (FmaF src3 (Binary src1 (NegF src2)))); ditto
match(Set dst (FmaD src3 (Binary (NegD src1) src2))); ==> fmsubd
match(Set dst (FmaD src3 (Binary src1 (NegD src2)))); ditto
match(Set dst (FmaF (NegF src3) (Binary (NegF src1) src2))); ==> fnmadds
match(Set dst (FmaF (NegF src3) (Binary src1 (NegF src2)))); ditto
match(Set dst (FmaD (NegD src3) (Binary (NegD src1) src2))); ==> fnmaddd
match(Set dst (FmaD (NegD src3) (Binary src1 (NegD src2)))); ditto
match(Set dst (FmaF (NegF src3) (Binary src1 src2))); ==> fnmsubs
match(Set dst (FmaD (NegD src3) (Binary src1 src2))); ==> fnmsubd
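The FmaF/FmaD nodes arise from the Math.fma intrinsic (on hardware where C2 enables it), so the Java-level trigger for these rules is simply (illustrative wrapper):

```java
// Illustrative only: Math.fma is intrinsified to FmaD, which the
// fmaddd rule matches: one instruction, one rounding step. The
// fmsub/fnmadd/fnmsub variants arise when operands are negated.
public class Fma {
    static double mulAdd(double a, double b, double c) {
        return Math.fma(a, b, c);
    }
}
```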
# float sqrt via double sqrt
match(Set dst (ConvD2F (SqrtD (ConvF2D src)))); ==> fsqrts
# 32-bit masked i2l i.e. ui2l -- appears in bignum math
match(Set dst (AndL (ConvI2L src) 0xffffffff)); ==> ubfm
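In Java source this is the standard unsigned int-to-long conversion idiom (also what Integer.toUnsignedLong does), which should become a single unsigned bitfield move (names illustrative):

```java
// Illustrative only: (AndL (ConvI2L x) 0xffffffff) matches ubfm,
// so the sign-extend-then-mask pair collapses to one instruction.
public class UI2L {
    static long toUnsignedLong(int x) { return x & 0xffffffffL; }
}
```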
Overflow checks for X in {I, L}, op2 in {reg, immediate}
match(Set cr (OverflowAddX op1 op2)); ==> cmn/cmnw
match(Set cr (OverflowSubX op1 op2)); ==> cmp/cmpw
# this has several translations into 5-6 instructions
match(Set cr (OverflowMulX op1 op2)); ==> . . .
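The OverflowAddX/OverflowSubX/OverflowMulX nodes come from the Math.*Exact intrinsics, so the Java trigger is (illustrative wrapper):

```java
// Illustrative only: Math.addExact is intrinsified to an
// OverflowAddI check, which the cmn/cmnw rule matches so the add
// and the overflow test share one flag-setting instruction; the
// overflow path throws ArithmeticException.
public class Overflow {
    static int addExact(int a, int b) {
        return Math.addExact(a, b);
    }
}
```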
Branches
# merge oop decode into compare against null
match(If cmp (CmpP (DecodeN oop) zero)); ==> cbzw/cbnzw
# use bit test branch for 2^k literal mask
match(If cmp (CmpL (AndL op1 2^k) 0)); ==> tbz k/tbnz k
match(If cmp (CmpI (AndI op1 2^k) 0)); ==> tbz k/tbnz k
# use bit test for 2^k literal mask
match(Set cr (CmpL (AndL op1 2^k) 0)); ==> tst k
match(Set cr (CmpI (AndI op1 2^k) 0)); ==> tstw k
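Testing a single bit and branching on it is the Java-level shape behind the tbz/tbnz rules: the mask must be a power-of-two constant so the bit index k can be encoded directly (names illustrative):

```java
// Illustrative only: (If cmp (CmpI (AndI x 2^k) 0)) matches
// tbz/tbnz when the mask is a compile-time power-of-two constant:
// one test-bit-and-branch instruction instead of tst + b.cond.
public class BitTest {
    static boolean bit3Set(int x) {
        return (x & 0b1000) != 0;
    }
}
```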
Subtypecheck
match(Set result (PartialSubtypeCheck sub super));
match(Set cr (CmpP (PartialSubtypeCheck sub super) zero));
Non-complex matches e.g.
match(Set dst (LoadB mem));
match(Set dst (AddL src1 src2));
match(Set dst (LShiftL src1 src2));
match(Set dst (CountLeadingZerosI src));
. . .
String Intrinsics These come in many flavours
match(Set result (StrComp (Binary str1 cnt1) (Binary str2 cnt2)));
match(Set result (StrIndexOf (Binary str1 cnt1) (Binary str2 cnt2)));
match(Set result (StrEquals (Binary str1 str2) cnt));
. . .
Float Vector Ops There is a vast array of such ops, but none of them uses complex matches
Of course, the rules in the previous comment may not apply to Graal, whether directly or in suitably modified form, because of differences in the Graal front-end abstract instructions. Conversely, there may be alternative Graal front-end abstract instructions which require new rules to translate them to the available AArch64 back-end instructions. The C2 rules were merely provided as a starting point to identify opportunities for better AArch64 code generation.
Hi @adinn , we have added most of the missed rules for AArch64. And we will continue to add more if there are new rules added in C2 in future. Thanks!
Hi @XiaohongGong Yes, I have been following your work with interest. Thank you very much for your very helpful contribution to this project. It is important that support for AArch64 does not lag behind support for x86_64, and you have done a great deal to improve the situation.
AArch64 makes heavy use of C2 pattern rules to transform individual abstract high-level instructions, and subgraphs comprising several linked high-level instructions, into a variety of architecture-specific instructions. The subgraph translations avoid generation of multiple instructions in the back end, improving execution time and, more importantly, code density. These transforms should be considered for implementation in the Graal AArch64 implementation, where possible using MatchRule translations.