oracle / graal

Implement Graal equivalents for AArch64 C2 pattern rules #323

Closed. adinn closed this issue 9 months ago.

adinn commented 6 years ago

The AArch64 port makes heavy use of C2 pattern rules to transform individual abstract high-level instructions, and subgraphs comprising several linked high-level instructions, into a variety of architecture-specific instructions. The subgraph translations avoid generating multiple instructions in the back end, improving execution time and, more importantly, code density. These transforms should be considered for implementation in the Graal AArch64 back end, where possible using MatchRule translations.
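To make the kind of fusion at stake concrete, here is a hypothetical Java method (not from the original report) whose ideal graph contains one of the subgraphs listed below; in Graal the same fusion would be expressed as a match rule over the equivalent Graal nodes rather than as an .ad pattern.

  // Naively, x + (y << 3) is a shift instruction followed by an add; the C2
  // pattern  match(Set dst (AddL src1 (LShiftL src2 src3))) ==> add LSL
  // collapses the subgraph into a single "add x0, x1, x2, lsl #3".
  static long addShifted(long x, long y) {
      return x + (y << 3);
  }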

adinn commented 6 years ago

The subgraph patterns in the AArch64 .ad file that are of major interest include the following:

1) Instruction operand class rules

These are mostly used to manage address match inputs.

Normal oops: (basic reg index/offset is omitted)

operand indIndexScaledI2L(iRegP reg, iRegI ireg, immIScale scale)
  match(AddP reg (LShiftL (ConvI2L ireg) scale));
operand indIndexScaled(iRegP reg, iRegL lreg, immIScale scale)
  match(AddP reg (LShiftL lreg scale));
operand indIndexI2L(iRegP reg, iRegI ireg)
  match(AddP reg (ConvI2L ireg));
operand indIndexScaledI2LN(iRegN reg, iRegI ireg, immIScale scale)
  match(AddP (DecodeN reg) (LShiftL (ConvI2L ireg) scale));

Narrow oops: (each one analogous to a corresponding normal oop rule)

operand indIndexScaledN(iRegN reg, iRegL lreg, immIScale scale)
  match(AddP (DecodeN reg) (LShiftL lreg scale));
operand indIndexI2LN(iRegN reg, iRegI ireg)
  match(AddP (DecodeN reg) (ConvI2L ireg));
operand indIndexN(iRegN reg, iRegL lreg)
  match(AddP (DecodeN reg) lreg);
operand indOffIN(iRegN reg, immIOffset off)
  match(AddP (DecodeN reg) off);
operand indOffLN(iRegN reg, immLoffset off)
  match(AddP (DecodeN reg) off);

The above rules (along with a couple of direct operand matches) enable complex memory address inputs to be subsumed into a single 'memory' operand class. These are already, for the most part, implemented by the AArch64AddressLowering phase -- only 'for the most part' because of the following wrinkle.
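As a concrete (hypothetical) illustration, an ordinary array access is the sort of Java source that produces the AddP/LShiftL/ConvI2L address subgraph these operand rules fold away:

  // The load below yields an address subgraph of roughly the shape
  // (AddP base (LShiftL (ConvI2L i) 3)), ignoring the array header offset.
  // The operand rules above -- and, in Graal, AArch64AddressLowering --
  // fold it into a single scaled, sign-extended register-offset addressing
  // mode, e.g. ldr x0, [base, w_index, sxtw #3].
  static long elementAt(long[] array, int i) {
      return array[i];
  }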

The C2 match rules (including those for address arithmetic) also profit from being able to elide long-to-int conversions on inputs using this opclass:

  opclass iRegIorL2I(iRegI, iRegL2I);

Opclass iRegIorL2I is used throughout aarch64.ad to let patterns accept a 64-bit input and use it as a 32-bit input, avoiding the need for many equivalent rules that explicitly manage type conversions for the corresponding input or output values.
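A hypothetical example of where iRegIorL2I pays off is a narrowing long-to-int conversion feeding an int operation:

  // (int) l is a ConvL2I node feeding AddI; because addw only reads the low
  // 32 bits of its source registers, the opclass lets the long register be
  // used directly, so no separate narrowing instruction is emitted.
  static int lowWordPlus(long l, int x) {
      return (int) l + x;
  }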

2) Instruction rules for multi-instruction subgraphs

Note that each pattern is annotated with the target AArch64 instruction employed to translate the match (look for the ==>).

Loads -- zero/sign-extension elimination at the load

  match(Set dst (ConvI2L (LoadB mem)));  ==> ldrsb
  match(Set dst (ConvI2L (LoadUB mem))); ==> ldrb
  match(Set dst (ConvI2L (LoadS mem)));  ==> ldrsh
  match(Set dst (ConvI2L (LoadUS mem))); ==> ldrh
  match(Set dst (ConvI2L (LoadI mem)));  ==> ldrsw
  match(Set dst (AndL (ConvI2L (LoadI mem)) 0xffffffff)); ==> ldrw
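For instance, widening a byte load straight to long is where the first of these patterns applies (a hypothetical example, not from the report):

  // The graph is (ConvI2L (LoadB a[i])); matching the combined pattern emits
  // a single ldrsb into the 64-bit destination instead of a byte load
  // followed by a separate sign extension.
  static long widenByte(byte[] a, int i) {
      return a[i];
  }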

Casts

  match(Set dst (ConvL2I (CastP2X src)));   ==> movw # P --> int
  # only when Universe::narrow_oop_shift() == 0
  match(Set dst (ConvL2I (CastP2X (DecodeN src)))); ==> mov # N -> int

Conditional moves, for X in {I, L, P, N, F, D} -- the zero/one match exists only for I; n.b. cmp is cmpOp or cmpOpU

  match(Set dst (CMoveX (Binary cmp cr) (Binary src1 src2))); ==> csel/fcsel
  match(Set dst (CMoveX (Binary cmp cr) (Binary zero src)));  ==> csel/fcsel
  match(Set dst (CMoveX (Binary cmp cr) (Binary src zero)));  ==> csel/fcsel
  # special case for creating a boolean
  match(Set dst (CMoveI (Binary cmp cr) (Binary one zero)));  ==> csinc
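Ternary expressions are the usual source of these CMove shapes when the compiler chooses a conditional move over a branch; two hypothetical examples:

  static int pickMax(int a, int b) {
      return a > b ? a : b;   // CMoveI of two registers ==> csel
  }

  static int isPositive(int x) {
      return x > 0 ? 1 : 0;   // the (Binary one zero) special case ==> csinc
  }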

Arithmetic, Logical and Shifts

  # elide l2i
  match(Set dst (AddI (ConvL2I src1) src2));        ==> addw
  # elide i2l and/or shift in address arithmetic
  match(Set dst (AddP src1 (ConvI2L src2)));        ==> add sxtw
  match(Set dst (AddP src1 (LShiftL src2 scale)));  ==> lea # aka add LSL
  match(Set dst (AddP src1 (LShiftL (ConvI2L src2) scale))); ==> lea
  # bitfield extract for address arithmetic?
  match(Set dst (LShiftL (ConvI2L src) scale)); ==> sbfiz
  # fused mul add/sub
  match(Set dst (MulL (ConvI2L src1) (ConvI2L src2))); ==> smull
  match(Set dst (AddI src3 (MulI src1 src2)));  ==> maddw
  match(Set dst (SubI src3 (MulI src1 src2)));  ==> msubw
  match(Set dst (AddL src3 (MulL src1 src2)));  ==> madd
  match(Set dst (SubL src3 (MulL src1 src2)));  ==> msub
  # sign extract div1 == div2 == 31
  match(Set dst (URShiftI (RShiftI src1 div1) div2)); ==> lsrw 31
  # div 2 round div1 == div2 == 31
  match(Set dst (AddI src (URShiftI (RShiftI src div1) div2))); ==> addw LSR
  # sign extract div1 == div2 == 63
  match(Set dst (URShiftL (RShiftL src1 div1) div2)); ==> lsr 63
  # div 2 round div1 == div2 == 63
  match(Set dst (AddL src (URShiftL (RShiftL src div1) div2))); ==> add LSR
  # special case for card table stores - no cast, ands src2 with 0x3f
  match(Set dst (URShiftL (CastP2X src1) src2)); ==> lsr
  # bit twiddling,  m1 == -1
  # merge negate into logic op
  match(Set dst (AndI src1 (XorI src2 m1)));  ==> bicw
  match(Set dst (AndL src1 (XorL src2 m1)));  ==> bic
  match(Set dst (OrI src1 (XorI src2 m1)));   ==> ornw
  match(Set dst (OrL src1 (XorL src2 m1)));   ==> orn
  match(Set dst (XorI m1 (XorI src2 src1)));  ==> eonw
  match(Set dst (XorL m1 (XorL src2 src1)));  ==> eon
  # merge negated shift into And
  match(Set dst (AndI src1 (XorI(URShiftI src2 src3) m1))); ==> bicw LSR
  match(Set dst (AndL src1 (XorL(URShiftL src2 src3) m1))); ==> bic LSR
  match(Set dst (AndI src1 (XorI(RShiftI src2 src3) m1)));  ==> bicw ASR
  match(Set dst (AndL src1 (XorL(RShiftL src2 src3) m1)));  ==> bic ASR
  match(Set dst (AndI src1 (XorI(LShiftI src2 src3) m1)));  ==> bicw LSL
  match(Set dst (AndL src1 (XorL(LShiftL src2 src3) m1)));  ==> bic LSL
  # merge shifted Or into negate
  match(Set dst (XorI m1 (XorI(URShiftI src2 src3) src1))); ==> eonw LSR
  match(Set dst (XorL m1 (XorL(URShiftL src2 src3) src1))); ==> eon LSR
  match(Set dst (XorI m1 (XorI(RShiftI src2 src3) src1)));  ==> eonw ASR
  match(Set dst (XorL m1 (XorL(RShiftL src2 src3) src1)));  ==> eon ASR
  match(Set dst (XorI m1 (XorI(LShiftI src2 src3) src1)));  ==> eonw LSL
  match(Set dst (XorL m1 (XorL(LShiftL src2 src3) src1)));  ==> eon LSL
  # merge negated shift into Or
  match(Set dst (OrI src1 (XorI(URShiftI src2 src3) src4)));==> ornw LSR
  match(Set dst (OrL src1 (XorL(URShiftL src2 src3) src4)));==> orn LSR
  match(Set dst (OrI src1 (XorI(RShiftI src2 src3) src4))); ==> ornw ASR
  match(Set dst (OrL src1 (XorL(RShiftL src2 src3) src4))); ==> orn ASR
  match(Set dst (OrI src1 (XorI(LShiftI src2 src3) src4))); ==> ornw LSL
  match(Set dst (OrL src1 (XorL(LShiftL src2 src3) src4))); ==> orn LSL
  # merge shift into And
  match(Set dst (AndI src1 (URShiftI src2 src3))); ==> andw LSR
  match(Set dst (AndL src1 (URShiftL src2 src3))); ==> and LSR
  match(Set dst (AndI src1 (RShiftI src2 src3)));  ==> andw ASR
  match(Set dst (AndL src1 (RShiftL src2 src3)));  ==> and ASR
  match(Set dst (AndI src1 (LShiftI src2 src3)));  ==> andw LSL
  match(Set dst (AndL src1 (LShiftL src2 src3)));  ==> and LSL
  # merge shift into XOr
  match(Set dst (XorI src1 (URShiftI src2 src3))); ==> eorw LSR
  match(Set dst (XorL src1 (URShiftL src2 src3))); ==> eor LSR
  match(Set dst (XorI src1 (RShiftI src2 src3)));  ==> eorw ASR
  match(Set dst (XorL src1 (RShiftL src2 src3)));  ==> eor ASR
  match(Set dst (XorI src1 (LShiftI src2 src3)));  ==> eorw LSL
  match(Set dst (XorL src1 (LShiftL src2 src3)));  ==> eor LSL
  # merge shift into Or
  match(Set dst (OrI src1 (URShiftI src2 src3)));  ==> orrw LSR
  match(Set dst (OrL src1 (URShiftL src2 src3)));  ==> orr LSR
  match(Set dst (OrI src1 (RShiftI src2 src3)));   ==> orrw ASR
  match(Set dst (OrL src1 (RShiftL src2 src3)));   ==> orr ASR
  match(Set dst (OrI src1 (LShiftI src2 src3)));   ==> orrw LSL
  match(Set dst (OrL src1 (LShiftL src2 src3)));   ==> orr LSL
  # merge shift into Add
  match(Set dst (AddI src1 (URShiftI src2 src3))); ==> addw LSR
  match(Set dst (AddL src1 (URShiftL src2 src3))); ==> add LSR
  match(Set dst (AddI src1 (RShiftI src2 src3)));  ==> addw ASR
  match(Set dst (AddL src1 (RShiftL src2 src3)));  ==> add ASR
  match(Set dst (AddI src1 (LShiftI src2 src3)));  ==> addw LSL
  match(Set dst (AddL src1 (LShiftL src2 src3)));  ==> add LSL
  # merge shift into Sub
  match(Set dst (SubI src1 (URShiftI src2 src3))); ==> subw LSR
  match(Set dst (SubL src1 (URShiftL src2 src3))); ==> sub LSR
  match(Set dst (SubI src1 (RShiftI src2 src3)));  ==> subw ASR
  match(Set dst (SubL src1 (RShiftL src2 src3)));  ==> sub ASR
  match(Set dst (SubI src1 (LShiftI src2 src3)));  ==> subw LSL
  match(Set dst (SubL src1 (LShiftL src2 src3)));  ==> sub LSL
  # merge shifts for i2b -- requires l/rshift_count <= 63 or 31
  match(Set dst (RShiftL (LShiftL src lshift_count) rshift_count)); ==> sbfm
  match(Set dst (RShiftI (LShiftI src lshift_count) rshift_count)); ==> sbfmw
  match(Set dst (URShiftL (LShiftL src lshift_count) rshift_count)); ==> ubfm
  match(Set dst (URShiftI (LShiftI src lshift_count) rshift_count)); ==> ubfmw
  # shift and mask bit extract -- mask == 2**k - 1, k <= 29 or 61
  match(Set dst (AndI (URShiftI src rshift) mask));  ==> ubfxw
  match(Set dst (AndL (URShiftL src rshift) mask));  ==> ubfx
  # zero-extended shift and mask bit extract
  match(Set dst (ConvI2L (AndI (URShiftI src rshift) mask))); ==> ubfx
  # left shift mask bit extract lshift <= 32 or 64, mask as above
  match(Set dst (LShiftI (AndI src mask) lshift)); ==> ubfizw
  match(Set dst (LShiftL (AndL src mask) lshift)); ==> ubfiz
  # left shift zero-extended mask bit extract
  match(Set dst (LShiftL (ConvI2L(AndI src mask)) lshift)); ==> ubfiz
  # rotations -- (rshift + lshift) & 63 == 0 (or & 31 for the int forms)
  match(Set dst (OrL (LShiftL src1 lshift) (URShiftL src2 rshift))); ==> extr
  match(Set dst (OrI (LShiftI src1 lshift) (URShiftI src2 rshift)));==> extrw
  # rotated add
  match(Set dst (AddL (LShiftL src1 lshift) (URShiftL src2 rshift))); ==> extr
  match(Set dst (AddI (LShiftI src1 lshift) (URShiftI src2 rshift))); ==> extrw
  # rol/ror
  match(Set dst (OrL (LShiftL src shift) (URShiftL src (SubI 64 shift)))); ==> subw ; rorv
  match(Set dst (OrL (LShiftL src shift) (URShiftL src (SubI 0 shift)))); ==> subw ; rorv
  match(Set dst (OrI (LShiftI src shift) (URShiftI src (SubI 32 shift)))); ==> subw ; rorvw
  match(Set dst (OrI (LShiftI src shift) (URShiftI src (SubI 0 shift)))); ==> subw ; rorvw
  match(Set dst (OrL (URShiftL src shift) (LShiftL src (SubI 64 shift)))); ==> rorv
  match(Set dst (OrL (URShiftL src shift) (LShiftL src (SubI 0 shift)))); ==> rorv
  match(Set dst (OrI (URShiftI src shift) (LShiftI src (SubI 32 shift)))); ==> rorvw
  match(Set dst (OrI (URShiftI src shift) (LShiftI src (SubI 0 shift)))); ==> rorvw
  # merge sign-extend (i2l) into add/sub
  match(Set dst (AddL src1 (ConvI2L src2)));  ==> add sxtw
  match(Set dst (SubL src1 (ConvI2L src2)));  ==> sub sxtw
  # merge zero/sign-extend (shift pair) into add/sub
  match(Set dst (AddI src1 (RShiftI (LShiftI src2 16) 16))); ==> add sxth
  match(Set dst (AddI src1 (RShiftI (LShiftI src2 24) 24))); ==> add sxtb
  match(Set dst (AddI src1 (URShiftI (LShiftI src2 24) 24))); ==> add uxtb
  match(Set dst (AddL src1 (RShiftL (LShiftL src2 48) 48))); ==> add sxth
  match(Set dst (AddL src1 (RShiftL (LShiftL src2 32) 32))); ==> add sxtw
  match(Set dst (AddL src1 (RShiftL (LShiftL src2 56) 56))); ==> add sxtb
  match(Set dst (AddL src1 (URShiftL (LShiftL src2 56) 56))); ==> add uxtb
  # merge integral downcast (mask) into add/sub
  match(Set dst (AddI src1 (AndI src2 0xff)));        ==> addw uxtb
  match(Set dst (AddI src1 (AndI src2 0xffff)));      ==> addw uxth
  match(Set dst (AddL src1 (AndL src2 0xff)));        ==> add uxtb
  match(Set dst (AddL src1 (AndL src2 0xffff)));      ==> add uxth
  match(Set dst (AddL src1 (AndL src2 0xffffffff)));  ==> add uxtw
  match(Set dst (SubI src1 (AndI src2 0xff)));        ==> subw uxtb
  match(Set dst (SubI src1 (AndI src2 0xffff)));      ==> subw uxth
  match(Set dst (SubL src1 (AndL src2 0xff)));        ==> sub uxtb
  match(Set dst (SubL src1 (AndL src2 0xffff)));      ==> sub uxth
  match(Set dst (SubL src1 (AndL src2 0xffffffff)));  ==> sub uxtw
  # merge shifted zero/sign-extend (via shift) into shifted add/sub
  # 0 <= lshift <= 4
  match(Set dst (AddL src1 (LShiftL (RShiftL (LShiftL src2 56) 56) lshift)));  ==> add sxtb #lshift
  match(Set dst (AddL src1 (LShiftL (RShiftL (LShiftL src2 48) 48) lshift)));  ==> add sxth #lshift
  match(Set dst (AddL src1 (LShiftL (RShiftL (LShiftL src2 32) 32) lshift)));  ==> add sxtw #lshift
  match(Set dst (SubL src1 (LShiftL (RShiftL (LShiftL src2 56) 56) lshift)));  ==> sub sxtb #lshift
  match(Set dst (SubL src1 (LShiftL (RShiftL (LShiftL src2 48) 48) lshift)));  ==> sub sxth #lshift
  match(Set dst (SubL src1 (LShiftL (RShiftL (LShiftL src2 32) 32) lshift)));  ==> sub sxtw #lshift
  match(Set dst (AddI src1 (LShiftI (RShiftI (LShiftI src2 24) 24) lshift)));  ==> addw sxtb #lshift
  match(Set dst (AddI src1 (LShiftI (RShiftI (LShiftI src2 16) 16) lshift)));  ==> addw sxth #lshift
  match(Set dst (SubI src1 (LShiftI (RShiftI (LShiftI src2 24) 24) lshift)));  ==> subw sxtb #lshift
  match(Set dst (SubI src1 (LShiftI (RShiftI (LShiftI src2 16) 16) lshift)));  ==> subw sxth #lshift
  # merge shifted zero/sign-extend (i2l or mask) into add/sub
  # 0 <= lshift <= 4
  match(Set dst (AddL src1 (LShiftL (ConvI2L src2) lshift)));  ==> add sxtw #lshift
  match(Set dst (SubL src1 (LShiftL (ConvI2L src2) lshift)));  ==> sub sxtw #lshift
  match(Set dst (AddL src1 (LShiftL (AndL src2 0xff) lshift)));  ==> add uxtb #lshift
  match(Set dst (AddL src1 (LShiftL (AndL src2 0xffff) lshift)));  ==> add uxth #lshift
  match(Set dst (AddL src1 (LShiftL (AndL src2 0xffffffff) lshift))); ==> add uxtw #lshift
  match(Set dst (SubL src1 (LShiftL (AndL src2 0xff) lshift)));  ==> sub uxtb #lshift
  match(Set dst (SubL src1 (LShiftL (AndL src2 0xffff) lshift)));  ==> sub uxth #lshift
  match(Set dst (SubL src1 (LShiftL (AndL src2 0xffffffff) lshift))); ==> sub uxtw #lshift
  match(Set dst (AddI src1 (LShiftI (AndI src2 0xff) lshift)));   ==> addw uxtb #lshift
  match(Set dst (AddI src1 (LShiftI (AndI src2 0xffff) lshift)));   ==> addw uxth #lshift
  match(Set dst (SubI src1 (LShiftI (AndI src2 0xff) lshift)));   ==> subw uxtb #lshift
  match(Set dst (SubI src1 (LShiftI (AndI src2 0xffff) lshift)));   ==> subw uxth #lshift
  # fused multiply add variants
  match(Set dst (FmaF src3 (Binary src1 src2)));  ==> fmadds
  match(Set dst (FmaD src3 (Binary src1 src2)));  ==> fmaddd
  match(Set dst (FmaF src3 (Binary (NegF src1) src2))); ==> fmsubs
  match(Set dst (FmaF src3 (Binary src1 (NegF src2)))); ditto
  match(Set dst (FmaD src3 (Binary (NegD src1) src2))); ==> fmsubd
  match(Set dst (FmaD src3 (Binary src1 (NegD src2)))); ditto
  match(Set dst (FmaF (NegF src3) (Binary (NegF src1) src2))); ==> fnmadds
  match(Set dst (FmaF (NegF src3) (Binary src1 (NegF src2)))); ditto
  match(Set dst (FmaD (NegD src3) (Binary (NegD src1) src2))); ==> fnmaddd
  match(Set dst (FmaD (NegD src3) (Binary src1 (NegD src2)))); ditto
  match(Set dst (FmaF (NegF src3) (Binary src1 src2))); ==> fnmsubs
  match(Set dst (FmaD (NegD src3) (Binary src1 src2))); ==> fnmsubd
  # float sqrt via double sqrt
  match(Set dst (ConvD2F (SqrtD (ConvF2D src))));  ==> fsqrts
  # 32-bit masked i2l i.e. ui2l -- appears in bignum math
  match(Set dst (AndL (ConvI2L src) 0xffffffff));  ==> ubfm
  # overflow checks, for X in {I, L}, op2 in {reg, immediate}
  match(Set cr (OverflowAddX op1 op2)); ==> cmn/cmnw
  match(Set cr (OverflowSubX op1 op2)); ==> cmp/cmpw
  # this has several translations into 5-6 instructions
  match(Set cr (OverflowMulX op1 op2));  ==> . . .
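To tie a couple of the entries above back to Java source, the following hypothetical methods produce the multiply-accumulate and FMA subgraphs directly:

  static int multiplyAccumulate(int acc, int x, int y) {
      return acc + x * y;         // (AddI acc (MulI x y)) ==> maddw
  }

  // Math.fma is intrinsified on AArch64 and produces FmaD/FmaF nodes.
  static double fusedMultiplyAdd(double a, double b, double c) {
      return Math.fma(a, b, c);   // (FmaD c (Binary a b)) ==> fmaddd
  }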

Branches

  # merge oop decode into compare against null
  match(If cmp (CmpP (DecodeN oop) zero)); ==> cbzw/cbnzw
  # use bit test branch for 2^k literal mask
  match(If cmp (CmpL (AndL op1 2^k) 0));  ==> tbz k/tbnz k
  match(If cmp (CmpI (AndI op1 2^k) 0));  ==> tbz k/tbnz k
  # use bit test for 2^k literal mask
  match(Set cr (CmpL (AndL op1 2^k) 0));  ==> tst k
  match(Set cr (CmpI (AndI op1 2^k) 0));  ==> tstw k
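A single-bit flag test is the canonical source of the tbz/tbnz patterns; a hypothetical example:

  // (flags & 0x10) != 0 becomes (CmpI (AndI flags 16) 0); since the mask is a
  // single bit (2^4), the branch can be a tbnz on bit 4 rather than an
  // andw/tst followed by a conditional branch.
  static int countFlagged(int flags, int n) {
      if ((flags & 0x10) != 0) {
          return n + 1;
      }
      return n;
  }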

Subtypecheck

  match(Set result (PartialSubtypeCheck sub super));
  match(Set cr (CmpP (PartialSubtypeCheck sub super) zero));
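These correspond to the slow path of instanceof/checkcast, where the type cannot be decided from the primary supers display and the secondary supers array has to be scanned; a hypothetical example that typically takes that path is a test against an interface:

  // If profiling does not narrow the type, the generated code falls back to a
  // PartialSubtypeCheck (a scan of the receiver's secondary supers).
  static boolean isCharSequence(Object o) {
      return o instanceof CharSequence;
  }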

3) Cases (currently) omitted from consideration

Non-complex matches e.g.

  match(Set dst (LoadB mem));
  match(Set dst (AddL src1 src2));
  match(Set dst (LShiftL src1 src2));
  match(Set dst (CountLeadingZerosI src));
  . . .

String intrinsics -- these come in many flavours

  match(Set result (StrComp (Binary str1 cnt1) (Binary str2 cnt2)));
  match(Set result (StrIndexOf (Binary str1 cnt1) (Binary str2 cnt2)));
  match(Set result (StrEquals (Binary str1 str2) cnt));
  . . .
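These nodes come from the String intrinsics, i.e. library calls such as the following (modulo the Latin1/UTF16 variants introduced by compact strings):

  static int stringOps(String s1, String s2) {
      int cmp = s1.compareTo(s2);     // StrComp
      int idx = s1.indexOf(s2);       // StrIndexOf
      boolean eq = s1.equals(s2);     // StrEquals
      return cmp + idx + (eq ? 1 : 0);
  }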

Float and vector ops -- there is a vast array of such ops, but none of them uses complex matches.

adinn commented 6 years ago

Of course, the rules in the previous comment may not apply to Graal directly, or even in suitably modified form, because of differences in the Graal front-end abstract instructions. Conversely, there may be alternative Graal front-end abstract instructions that require new rules to translate them to the available AArch64 back-end instructions. The C2 rules are merely provided as a starting point for identifying opportunities for better AArch64 code generation.

XiaohongGong commented 4 years ago

Hi @adinn, we have added most of the missing rules for AArch64, and we will continue to add more if new rules are added to C2 in the future. Thanks!

adinn commented 4 years ago

Hi @XiaohongGong Yes, I have been following your work with interest. Thank you very much for your very helpful contribution to this project. It is important that support for AArch64 does not lag behind support for x86_64, and you have done a great deal to improve the situation.