[amdgpu] some VOP3 instructions not defined as commutable

yxsamliu commented 1 month ago

some VOP3 instructions are commutable but are not defined as such. e.g.

# RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s
---
name: test_machine_cse_op_sel
body: |
  bb.0:
    %0:vgpr_32 = IMPLICIT_DEF
    %1:vgpr_32 = IMPLICIT_DEF
    %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0, implicit $mode, implicit $exec
    %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 0, 0, implicit $mode, implicit $exec
    DS_WRITE2_B32_gfx9 undef %4:vgpr_32, %2, %3, 0, 1, 0, implicit $exec
...

V_ADD_NC_U16_e64 is commutable but are not merged by machine-cse.

This issue was found during reviewing https://github.com/llvm/llvm-project/pull/106920

Two changes are needed:

td files need to be updated for commutable VOP3 insts
SIInstrInfo::commuteInstructionImpl need to be fixed to handle op_sel

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

some VOP3 instructions are commutable but are not defined as such. e.g. ``` # RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s --- name: test_machine_cse_op_sel body: | bb.0: %0:vgpr_32 = IMPLICIT_DEF %1:vgpr_32 = IMPLICIT_DEF %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0, implicit $mode, implicit $exec %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 0, 0, implicit $mode, implicit $exec DS_WRITE2_B32_gfx9 undef %4:vgpr_32, %2, %3, 0, 1, 0, implicit $exec ... ``` V_ADD_NC_U16_e64 is commutable but are not merged by machine-cse. This issue was found during reviewing https://github.com/llvm/llvm-project/pull/106920 Two changes are needed: 1. td files need to be updated for commutable VOP3 insts 2. SIInstrInfo::commuteInstructionImpl need to be fixed to handle op_sel

llvmbot commented 1 month ago

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

Check that no other contributor has already been assigned to this issue. If you believe that no one is actually working on it despite an assignment, ping the person. After one week without a response, the assignee may be changed.
In the comments of this issue, request for it to be assigned to you, or just create a pull request after following the steps below. Mention this issue in the description of the pull request.
Fix the issue locally.
Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
Create a Git commit.
Run git clang-format HEAD~1 to format your changes.
Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation. Mention this issue in the description of the pull request.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

llvmbot commented 1 month ago

@llvm/issue-subscribers-good-first-issue

Author: Yaxun (Sam) Liu (yxsamliu)

some VOP3 instructions are commutable but are not defined as such. e.g. ``` # RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s --- name: test_machine_cse_op_sel body: | bb.0: %0:vgpr_32 = IMPLICIT_DEF %1:vgpr_32 = IMPLICIT_DEF %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0, implicit $mode, implicit $exec %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 0, 0, implicit $mode, implicit $exec DS_WRITE2_B32_gfx9 undef %4:vgpr_32, %2, %3, 0, 1, 0, implicit $exec ... ``` V_ADD_NC_U16_e64 is commutable but are not merged by machine-cse. This issue was found during reviewing https://github.com/llvm/llvm-project/pull/106920 Two changes are needed: 1. td files need to be updated for commutable VOP3 insts 2. SIInstrInfo::commuteInstructionImpl need to be fixed to handle op_sel

easyonaadit commented 1 month ago

Hey @yxsamliu, I'm just starting out in Compilers, and would love to try my hand at this issue. Could you please assign it to me?

yxsamliu commented 1 month ago

Hey @yxsamliu, I'm just starting out in Compilers, and would love to try my hand at this issue. Could you please assign it to me?

Thanks for looking into this.

easyonaadit commented 1 month ago

Hey, I had a doubt. %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0 in this V_ADD_NC_U16_e64 instruction, what are the 4 numbers given as modifiers supposed to stand for? I have tried looking at documentation but can't seem to find any reference to them. %2:vgpr_32 = V_ADD_NC_U16e64 0, %0, 0, %1, 0, 0_ what I'm talking about is the 0's which are in bold above. Could you please provide some links to where I can read more about them?

vg0204 commented 1 month ago

Hey, I had a doubt. %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0 in this V_ADD_NC_U16_e64 instruction, what are the 4 numbers given as modifiers supposed to stand for? I have tried looking at documentation but can't seem to find any reference to them. %2:vgpr_32 = V_ADD_NC_U16e64 0, %0, 0, %1, 0, 0_ what I'm talking about is the 0's which are in bold above. Could you please provide some links to where I can read more about them?

Refer this link, it might help you!

arsenm commented 1 month ago

I usually look at the fully expanded tablegen definitions to figure out the operand structure. If you extract an llvm-tblgen invocation out of the build log, and remove the -gen-* argument, you can look at the fully expanded tablegen and find the relevant instruction. In this case, that is:

def V_ADD_NC_U16_e64 {  // InstructionEncoding Instruction AMDGPUInst PredicateControl InstSI VOP SIMCInstr VOP_Pseudo VOP3_Pseudo VOP3InstBase
 field bit isRegisterLoad = 0;
  field bit isRegisterStore = 0;
  field bits<96> SoftFail = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  field bit SALU = 0; 
  field bit VALU = 1;
  field bit SOP1 = 0;
  field bit SOP2 = 0;
  field bit SOPC = 0;
  field bit SOPK = 0;
  field bit SOPP = 0;
  field bit VOP1 = 0;
  field bit VOP2 = 0;
  field bit VOPC = 0;
  field bit VOP3 = 1;
  field bit VOP3P = 0;
  field bit VINTRP = 0;
  field bit SDWA = 0;
  field bit DPP = 0;
  field bit TRANS = 0;
  field bit MUBUF = 0;
  field bit MTBUF = 0; 
  field bit SMRD = 0;
  field bit MIMG = 0;
  field bit VIMAGE = 0;
  field bit VSAMPLE = 0;
  field bit EXP = 0;
  field bit FLAT = 0;
  field bit DS = 0;
  field bit Spill = 0;
  field bit LDSDIR = 0;
  field bit VINTERP = 0;
  field bit VM_CNT = 0;
  field bit EXP_CNT = 0;
  field bit LGKM_CNT = 0;
  field bit WQM = 0;
  field bit DisableWQM = 0;
  field bit Gather4 = 0;
  field bit ScalarStore = 0; 
  field bit FixedSize = 0;
  field bit VOP3_OPSEL = 1;
  field bit maybeAtomic = 1;
  field bit FPClamp = 0;
  field bit IntClamp = 1;
  field bit ClampLo = 1;
  field bit ClampHi = 0;
  field bit IsPacked = 0;
  field bit D16Buf = 0;
  field bit FlatGlobal = 0;
  field bit ReadsModeReg = 0;
  field bit FPDPRounding = 0;
  field bit FPAtomic = 0;
  field bit IsMAI = 0;
  field bit IsDOT = 0;
  field bit FlatScratch = 0;
  field bit IsAtomicNoRet = 0;
  field bit IsAtomicRet = 0;
  field bit IsWMMA = 0;
  field bit TiedSourceNotRead = 0;
  field bit IsNeverUniform = 0;
  field bit GWS = 0;
  field bit IsSWMMAC = 0;
  int Size = 8;
  string DecoderNamespace = "AMDGPU";
  list<Predicate> Predicates = [isGFX10Plus];
  string DecoderMethod = "";
  bit hasCompleteDecoder = 1;
  string Namespace = "AMDGPU";
  dag OutOperandList = (outs anonymous_15962:$vdst);
  dag InOperandList = (ins IntOpSelMods:$src0_modifiers, VSrc_b16:$src0, IntOpSelMods:$src1_modifiers, VSrc_b16:$src1, Clamp0:$clamp, op_sel0:$op_sel);
  string AsmString = "";
  EncodingByHwMode EncodingInfos = ?;
  list<dag> Pattern = [(set i16:$vdst, (add (i16 (VOP3OpSelMods i16:$src0, i32:$src0_modifiers)), (i16 (VOP3OpSelMods i16:$src1, i32:$src1_modifiers))))];
  list<Register> Uses = [EXEC];
  list<Register> Defs = [];
  int CodeSize = 0;
  int AddedComplexity = -1000;
  bit isPreISelOpcode = 0;
  bit isReturn = 0;
  bit isBranch = 0;
  bit isEHScopeReturn = 0;
  bit isIndirectBranch = 0;
  bit isCompare = 0;
  bit isMoveImm = 0;
  bit isMoveReg = 0;
  bit isBitcast = 0;
  bit isSelect = 0;
  bit isBarrier = 0;
  bit isCall = 0;
  bit isAdd = 0;
  bit isTrap = 0;
  bit canFoldAsLoad = 0;
  bit mayLoad = 0;
  bit mayStore = 0;
  bit mayRaiseFPException = 0;
  bit isConvertibleToThreeAddress = 0;
  bit isCommutable = 0;
  bit isTerminator = 0;
  bit isReMaterializable = 0;
  bit isPredicable = 0;
  bit isUnpredicable = 0;
  bit hasDelaySlot = 0;
  bit usesCustomInserter = 0;
  bit hasPostISelHook = 1;
  bit hasCtrlDep = 0;
  bit isNotDuplicable = 0;
  bit isConvergent = 0;
  bit isAuthenticated = 0;
  bit isAsCheapAsAMove = 0;
  bit hasExtraSrcRegAllocReq = 1;
  bit hasExtraDefRegAllocReq = 0;
  bit isRegSequence = 0;
  bit isPseudo = 1;
  bit isMeta = 0;
  bit isExtractSubreg = 0;
  bit isInsertSubreg = 0;
  bit variadicOpsAreDefs = 0;
  bit hasSideEffects = 0;
  bit isCodeGenOnly = 1;
  bit isAsmParserOnly = 0;
  bit hasNoSchedulingInfo = 0;
  InstrItinClass Itinerary = NullALU;
  list<SchedReadWrite> SchedRW = [Write32Bit];
  string Constraints = "";
  string DisableEncoding = "";
  string PostEncoderMethod = "";
  bits<64> TSFlags = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0 };
  string AsmMatchConverter = "cvtVOP3OpSel";
  string TwoOperandAliasConstraint = "";
  string AsmVariantName = "VOP3";
  bit UseNamedOperandTable = 1;
  bit UseLogicalOperandMappings = 0;
  bit FastISelShouldIgnore = 0;
  bit HasPositionOrder = 0;
  Predicate SubtargetPredicate = isGFX10Plus;
  Predicate AssemblerPredicate = TruePredicate;  Predicate WaveSizePredicate = TruePredicate;
  True16PredicateClass True16Predicate = NoTrue16Predicate;
  list<Predicate> OtherPredicates = [];
  string OpName = "v_add_nc_u16";
  string PseudoInstr = "v_add_nc_u16_e64";
  int Subtarget = -1;
  string Mnemonic = "v_add_nc_u16";
  Instruction Opcode = V_ADD_NC_U16_e64;
  bit IsTrue16 = 0;
  VOPProfile Pfl = anonymous_24949;
  string AsmOperands = "$vdst, $src0, $src1$op_sel$clamp";
  bit HasFP8DstByteSel = 0;
}

The relevant piece is: dag OutOperandList = (outs anonymous_15962:$vdst); dag InOperandList = (ins IntOpSelMods:$src0_modifiers, VSrc_b16:$src0, IntOpSelMods:$src1_modifiers, VSrc_b16:$src1, Clamp0:$clamp, op_sel0:$op_sel);

So the immediate 0s are src0_modifiers, src1_modifiers, clamp, and op_sel

easyonaadit commented 1 month ago

Hey Matt, Thanks so much for this, I had no idea this was possible! I was a bit caught up in work for the past few days, I will start tinkering with this now.

arsenm commented 1 month ago

This should just be a matter of adding let isCommutable = 1 to a few of the relevant instruction definitions

easyonaadit commented 1 month ago

Hey, I've put up a draft. I had some questions about this tho:

1. I have just added the isCommutable =1 to V_ADD_NC_U16, V_SUB_NC_U16. I'm looking through the .td for more instructions, but I feel like there must be a better way to find out which instructions need the isCommutable flag. Any suggestions?

2. I have made some changes in the compute-op-sel.mir test(dw, I will change it back before raising a PR). There are 3 cases in the modified test:

Case 1: Both the inputs to the instruction are immediate values. This case works fine and gives the expected output. Case 2: One of the input to the instruction is an immediate value and another one is a global value. This also works fine and gives the expected output. Case 3: (it is commented out): 2 global values are given as input to the instruction. This case crashes with the following error: *** Bad machine code: VOP2/VOP3 instruction uses more than one literal *** - function: test_machine_cse_op_sel - basic block: %bb.0 (0x6066d493dcf8) - instruction: %7:vgpr_32 = V_ADD_NC_U16_e64 0, @foo, 0, @bar, 0, 0, implicit $mode, implicit $exec

According to this ISA on page 45, 2^nd point from the top, it says (in reference to VALU instruction inputs):

At most one literal constant can be used, and only when an SGPR or M0 is not used as a source

So why is it fine when there are 2 immediate values, or one immediate and one global variable, but it's not okay when there are two global variables? Also why does it consider a global variable as a literal?

(PS: I have not as yet done the implementation for FrameInfo literals. I will get to that) (PPS: Also should I squash all the commits into one? Or should I make one commit for adding the isCommutable attribute and another commit for the changes done in the SIInstrInfo.cpp file?) (PPPS: I am using this command to execute the modified test case: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs testcaseMIRTemp.mir -print-after-all 2> output.txt && code output.txt)

arsenm commented 1 month ago

1. I have just added the isCommutable =1 to V_ADD_NC_U16, V_SUB_NC_U16. I'm looking through the .td for more instructions, but I feel like there must be a better way to find out which instructions need the isCommutable flag. Any suggestions?

It's the usual set of arithmetic operations. VSUB is a funny case, because the opcode needs to change to VSUBREV to perform the commute. V_MAD / V_FMA are also commutable, since you don't need to touch the 3rd operand. V_MIN/MAX.

Case 2: One of the input to the instruction is an immediate value and another one is a global value.

The global value case should not occur. Any use should probably be a verifier error anyway

So why is it fine when there are 2 immediate values, or one immediate and one global variable, but it's not okay when there are two global variables? Also why does it consider a global variable as a literal?

A global variable will ultimately be encoded as a literal. In this case you're also encoding a 64-bit global address into a 16/32-bit operand, which isn't valid either.

Not all immediate values are considered literals. Integers -16-64 (plus some FP values) are free in the encoding and don't count against the constant bus restriction. You should see the same error if you use values outside of that range (e.g. 65 + 123 should also violate the rule). But this is also gfx9, so you can't use literals with VOP3 instructions. This is allowed on gfx10+ (but you still are only allowed to use one literal)

(PS: I have not as yet done the implementation for FrameInfo literals. I will get to that)

There's little practical reason to handle this. We won't want to fold those into 16-bit instructions

(PPS: Also should I squash all the commits into one? Or should I make one commit for adding the isCommutable attribute and another commit for the changes done in the SIInstrInfo.cpp file?)

These are one commit. I would do this as one commit per instruction changed

easyonaadit commented 3 weeks ago

Hey, Sorry it took so long. I got stuck in a rabbit hole, and ended up spending wayyyy to much time on other stuff. I have added the Commutable property to the following instructions: V_FMA_F16 V_MAD_U16 V_MAD_I16 V_MAD_F16 V_MAD_U32_U16 V_MAD_I32_I16 V_ADD_NC_U16

All other VOP3 operations which should be commutable (based on my knowledge) have already been marked as such. Also I had to modify some test cases as well.

As far as the V_MIN/V_MAX families are concerned, I came across this comment:

// TODO src0 contains the opsel bit for dst, so if we commute, need to mask and swap this // to the new src0.

So I haven't modified those instructions as yet. Maybe once this issue is closed, I could get working on that.

arsenm commented 3 weeks ago

As far as the V_MIN/V_MAX families are concerned, I came across this comment:

This probably applies to all of the cases

easyonaadit commented 3 weeks ago

Sorry I didn't really understand. "applies for all cases" meaning for all cases of only V_MIN/V_MAX, or for "all cases" as in all the instructions I have changed, even the V_MAD/V_FMA? Also is there anything in the draft which you see as red flags or can I put it up for review?

llvm / llvm-project