Open yxsamliu opened 1 month ago
@llvm/issue-subscribers-backend-amdgpu
Author: Yaxun (Sam) Liu (yxsamliu)
Hi!
This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:
test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
git clang-format HEAD~1 to format your changes.
If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
Hey @yxsamliu, I'm just starting out in Compilers, and would love to try my hand at this issue. Could you please assign it to me?
Thanks for looking into this.
Hey, I had a question.
%2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0
In this V_ADD_NC_U16_e64 instruction, what are the 4 numbers given as modifiers supposed to stand for? I have tried looking at the documentation but can't seem to find any reference to them.
What I'm talking about is the four 0s in the line above (the operands other than %0 and %1). Could you please provide some links where I can read more about them?
Refer to this link, it might help you!
I usually look at the fully expanded tablegen definitions to figure out the operand structure. If you extract an llvm-tblgen invocation out of the build log, and remove the -gen-* argument, you can look at the fully expanded tablegen and find the relevant instruction. In this case, that is:
def V_ADD_NC_U16_e64 { // InstructionEncoding Instruction AMDGPUInst PredicateControl InstSI VOP SIMCInstr VOP_Pseudo VOP3_Pseudo VOP3InstBase
field bit isRegisterLoad = 0;
field bit isRegisterStore = 0;
field bits<96> SoftFail = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
field bit SALU = 0;
field bit VALU = 1;
field bit SOP1 = 0;
field bit SOP2 = 0;
field bit SOPC = 0;
field bit SOPK = 0;
field bit SOPP = 0;
field bit VOP1 = 0;
field bit VOP2 = 0;
field bit VOPC = 0;
field bit VOP3 = 1;
field bit VOP3P = 0;
field bit VINTRP = 0;
field bit SDWA = 0;
field bit DPP = 0;
field bit TRANS = 0;
field bit MUBUF = 0;
field bit MTBUF = 0;
field bit SMRD = 0;
field bit MIMG = 0;
field bit VIMAGE = 0;
field bit VSAMPLE = 0;
field bit EXP = 0;
field bit FLAT = 0;
field bit DS = 0;
field bit Spill = 0;
field bit LDSDIR = 0;
field bit VINTERP = 0;
field bit VM_CNT = 0;
field bit EXP_CNT = 0;
field bit LGKM_CNT = 0;
field bit WQM = 0;
field bit DisableWQM = 0;
field bit Gather4 = 0;
field bit ScalarStore = 0;
field bit FixedSize = 0;
field bit VOP3_OPSEL = 1;
field bit maybeAtomic = 1;
field bit FPClamp = 0;
field bit IntClamp = 1;
field bit ClampLo = 1;
field bit ClampHi = 0;
field bit IsPacked = 0;
field bit D16Buf = 0;
field bit FlatGlobal = 0;
field bit ReadsModeReg = 0;
field bit FPDPRounding = 0;
field bit FPAtomic = 0;
field bit IsMAI = 0;
field bit IsDOT = 0;
field bit FlatScratch = 0;
field bit IsAtomicNoRet = 0;
field bit IsAtomicRet = 0;
field bit IsWMMA = 0;
field bit TiedSourceNotRead = 0;
field bit IsNeverUniform = 0;
field bit GWS = 0;
field bit IsSWMMAC = 0;
int Size = 8;
string DecoderNamespace = "AMDGPU";
list<Predicate> Predicates = [isGFX10Plus];
string DecoderMethod = "";
bit hasCompleteDecoder = 1;
string Namespace = "AMDGPU";
dag OutOperandList = (outs anonymous_15962:$vdst);
dag InOperandList = (ins IntOpSelMods:$src0_modifiers, VSrc_b16:$src0, IntOpSelMods:$src1_modifiers, VSrc_b16:$src1, Clamp0:$clamp, op_sel0:$op_sel);
string AsmString = "";
EncodingByHwMode EncodingInfos = ?;
list<dag> Pattern = [(set i16:$vdst, (add (i16 (VOP3OpSelMods i16:$src0, i32:$src0_modifiers)), (i16 (VOP3OpSelMods i16:$src1, i32:$src1_modifiers))))];
list<Register> Uses = [EXEC];
list<Register> Defs = [];
int CodeSize = 0;
int AddedComplexity = -1000;
bit isPreISelOpcode = 0;
bit isReturn = 0;
bit isBranch = 0;
bit isEHScopeReturn = 0;
bit isIndirectBranch = 0;
bit isCompare = 0;
bit isMoveImm = 0;
bit isMoveReg = 0;
bit isBitcast = 0;
bit isSelect = 0;
bit isBarrier = 0;
bit isCall = 0;
bit isAdd = 0;
bit isTrap = 0;
bit canFoldAsLoad = 0;
bit mayLoad = 0;
bit mayStore = 0;
bit mayRaiseFPException = 0;
bit isConvertibleToThreeAddress = 0;
bit isCommutable = 0;
bit isTerminator = 0;
bit isReMaterializable = 0;
bit isPredicable = 0;
bit isUnpredicable = 0;
bit hasDelaySlot = 0;
bit usesCustomInserter = 0;
bit hasPostISelHook = 1;
bit hasCtrlDep = 0;
bit isNotDuplicable = 0;
bit isConvergent = 0;
bit isAuthenticated = 0;
bit isAsCheapAsAMove = 0;
bit hasExtraSrcRegAllocReq = 1;
bit hasExtraDefRegAllocReq = 0;
bit isRegSequence = 0;
bit isPseudo = 1;
bit isMeta = 0;
bit isExtractSubreg = 0;
bit isInsertSubreg = 0;
bit variadicOpsAreDefs = 0;
bit hasSideEffects = 0;
bit isCodeGenOnly = 1;
bit isAsmParserOnly = 0;
bit hasNoSchedulingInfo = 0;
InstrItinClass Itinerary = NullALU;
list<SchedReadWrite> SchedRW = [Write32Bit];
string Constraints = "";
string DisableEncoding = "";
string PostEncoderMethod = "";
bits<64> TSFlags = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0 };
string AsmMatchConverter = "cvtVOP3OpSel";
string TwoOperandAliasConstraint = "";
string AsmVariantName = "VOP3";
bit UseNamedOperandTable = 1;
bit UseLogicalOperandMappings = 0;
bit FastISelShouldIgnore = 0;
bit HasPositionOrder = 0;
Predicate SubtargetPredicate = isGFX10Plus;
Predicate AssemblerPredicate = TruePredicate;
Predicate WaveSizePredicate = TruePredicate;
True16PredicateClass True16Predicate = NoTrue16Predicate;
list<Predicate> OtherPredicates = [];
string OpName = "v_add_nc_u16";
string PseudoInstr = "v_add_nc_u16_e64";
int Subtarget = -1;
string Mnemonic = "v_add_nc_u16";
Instruction Opcode = V_ADD_NC_U16_e64;
bit IsTrue16 = 0;
VOPProfile Pfl = anonymous_24949;
string AsmOperands = "$vdst, $src0, $src1$op_sel$clamp";
bit HasFP8DstByteSel = 0;
}
The relevant piece is:
dag OutOperandList = (outs anonymous_15962:$vdst);
dag InOperandList = (ins IntOpSelMods:$src0_modifiers, VSrc_b16:$src0, IntOpSelMods:$src1_modifiers, VSrc_b16:$src1, Clamp0:$clamp, op_sel0:$op_sel);
So the immediate 0s are src0_modifiers, src1_modifiers, clamp, and op_sel
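To make that mapping concrete, here is a minimal Python sketch that pairs each explicit operand of the MIR line with the name from the InOperandList above. The helper name label_operands and the naive comma splitting are assumptions made for illustration; this is not how LLVM parses MIR.

```python
# Operand names copied from the InOperandList of V_ADD_NC_U16_e64 shown above.
IN_OPERAND_NAMES = [
    "src0_modifiers", "src0", "src1_modifiers", "src1", "clamp", "op_sel",
]

def label_operands(mir_line):
    """Hypothetical helper: split a simple MIR line and name its explicit inputs."""
    _dst, rhs = mir_line.split("=", 1)
    opcode, operand_text = rhs.strip().split(None, 1)
    values = [v.strip() for v in operand_text.split(",")]
    # Implicit operands (e.g. "implicit $exec") are not part of InOperandList.
    explicit = [v for v in values if not v.startswith("implicit")]
    return opcode, dict(zip(IN_OPERAND_NAMES, explicit))

opcode, operands = label_operands(
    "%2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 0, 0"
)
for name, value in operands.items():
    print(f"{name:>15} = {value}")
```

Reading the result, the two virtual registers land on src0 and src1, and the four immediates land on src0_modifiers, src1_modifiers, clamp and op_sel.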
Hey Matt, Thanks so much for this, I had no idea this was possible! I was a bit caught up in work for the past few days, I will start tinkering with this now.
This should just be a matter of adding let isCommutable = 1 to a few of the relevant instruction definitions
Hey, I've put up a draft. I had some questions about this tho:
Case 1: Both the inputs to the instruction are immediate values. This case works fine and gives the expected output.
Case 2: One of the inputs to the instruction is an immediate value and the other is a global value. This also works fine and gives the expected output.
Case 3 (commented out): two global values are given as inputs to the instruction. This case crashes with the following error:
*** Bad machine code: VOP2/VOP3 instruction uses more than one literal ***
- function: test_machine_cse_op_sel
- basic block: %bb.0 (0x6066d493dcf8)
- instruction: %7:vgpr_32 = V_ADD_NC_U16_e64 0, @foo, 0, @bar, 0, 0, implicit $mode, implicit $exec
According to this ISA document (page 45, 2nd point from the top), in reference to VALU instruction inputs:
At most one literal constant can be used, and only when an SGPR or M0 is not used as a source
So why is it fine when there are 2 immediate values, or one immediate and one global variable, but it's not okay when there are two global variables? Also why does it consider a global variable as a literal?
(PS: I have not yet done the implementation for FrameInfo literals. I will get to that.)
(PPS: Also should I squash all the commits into one? Or should I make one commit for adding the isCommutable attribute and another commit for the changes done in the SIInstrInfo.cpp file?)
(PPPS: I am using this command to execute the modified test case: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs testcaseMIRTemp.mir -print-after-all 2> output.txt && code output.txt)
1. I have just added isCommutable = 1 to V_ADD_NC_U16 and V_SUB_NC_U16. I'm looking through the .td files for more instructions, but I feel like there must be a better way to find out which instructions need the isCommutable flag. Any suggestions?
It's the usual set of arithmetic operations. VSUB is a funny case, because the opcode needs to change to VSUBREV to perform the commute. V_MAD / V_FMA are also commutable, since you don't need to touch the 3rd operand. So are V_MIN/V_MAX.
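The point about subtraction needing an opcode change can be checked with a small Python model. Everything here is an illustrative assumption: the opcode names "sub", "subrev" and "mad" are placeholders rather than real AMDGPU opcodes, and the arithmetic is plain 16-bit wrap-around.

```python
MASK16 = 0xFFFF

# Toy semantics for three two/three-source operations (16-bit wrap-around).
OPS = {
    "sub": lambda a, b: (a - b) & MASK16,
    "subrev": lambda a, b: (b - a) & MASK16,
    "mad": lambda a, b, c: (a * b + c) & MASK16,
}

# Opcodes that must change when src0 and src1 are swapped; all others keep theirs.
COMMUTED_OPCODE = {"sub": "subrev", "subrev": "sub"}

def commute(opcode, operands):
    """Swap the first two sources and pick the opcode that keeps the result."""
    new_opcode = COMMUTED_OPCODE.get(opcode, opcode)
    swapped = [operands[1], operands[0], *operands[2:]]
    return new_opcode, swapped

def evaluate(opcode, operands):
    return OPS[opcode](*operands)

for opcode, operands in [("sub", [5, 3]), ("mad", [2, 3, 4])]:
    new_opcode, new_operands = commute(opcode, operands)
    print(opcode, operands, "->", new_opcode, new_operands)
```

In this model the multiply-add keeps its opcode because only the two multiplicands are swapped and the third operand stays in place, while the subtraction only stays correct if the opcode flips to its reversed form.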
Case 2: One of the inputs to the instruction is an immediate value and the other is a global value.
The global value case should not occur. Any use should probably be a verifier error anyway.
So why is it fine when there are 2 immediate values, or one immediate and one global variable, but it's not okay when there are two global variables? Also why does it consider a global variable as a literal?
A global variable will ultimately be encoded as a literal. In this case you're also encoding a 64-bit global address into a 16/32-bit operand, which isn't valid either.
Not all immediate values are considered literals. Integers from -16 to 64 (plus some FP values) are free in the encoding and don't count against the constant bus restriction. You should see the same error if you use values outside of that range (e.g. 65 + 123 should also violate the rule). But this is also gfx9, so you can't use literals with VOP3 instructions. This is allowed on gfx10+ (but you are still only allowed to use one literal).
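A rough Python sketch of the rule described in this reply, under simplifying assumptions: only integer operands are modeled (the "some FP values" that are also free are left out), registers are represented as strings, and the per-target difference is reduced to a hypothetical max_literals parameter. It is not the machine verifier's actual logic.

```python
def is_inline_int(value):
    """Integers in [-16, 64] are free in the encoding (per the reply above)."""
    return -16 <= value <= 64

def count_literals(operands):
    """Count integer operands that would need a literal; strings stand for registers."""
    return sum(
        1 for v in operands if isinstance(v, int) and not is_inline_int(v)
    )

def violates_literal_limit(operands, max_literals=1):
    # max_literals is an assumed knob: 1 models "at most one literal",
    # 0 models a target where this encoding cannot take literals at all.
    return count_literals(operands) > max_literals

print(violates_literal_limit([1, 2]))       # both inline constants
print(violates_literal_limit([65, "%1"]))   # one literal, one register
print(violates_literal_limit([65, 123]))    # two literals
```

This matches the behaviour reported in the thread: two small immediates pass, while 65 and 123 together trip the same "more than one literal" check as the two global addresses did.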
(PS: I have not yet done the implementation for FrameInfo literals. I will get to that.)
There's little practical reason to handle this. We won't want to fold those into 16-bit instructions.
(PPS: Also should I squash all the commits into one? Or should I make one commit for adding the isCommutable attribute and another commit for the changes done in the SIInstrInfo.cpp file?)
These are one commit. I would do this as one commit per instruction changed
Hey,
Sorry it took so long. I got stuck in a rabbit hole and ended up spending way too much time on other stuff.
I have added the Commutable property to the following instructions:
V_FMA_F16
V_MAD_U16
V_MAD_I16
V_MAD_F16
V_MAD_U32_U16
V_MAD_I32_I16
V_ADD_NC_U16
All other VOP3 operations which should be commutable (based on my knowledge) have already been marked as such. Also I had to modify some test cases as well.
As far as the V_MIN/V_MAX families are concerned, I came across this comment:
// TODO src0 contains the opsel bit for dst, so if we commute, need to mask and swap this
// to the new src0.
So I haven't modified those instructions yet. Maybe once this issue is closed, I can start working on that.
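One way to picture what that TODO asks for is the Python sketch below. The bit positions and the names OP_SEL_0 and DST_OP_SEL are assumptions chosen for illustration and should be checked against the backend's own definitions; the sketch also assumes the dst bit is only ever meaningful in src0_modifiers.

```python
# Assumed bit layout of a source-modifier immediate (illustrative only).
OP_SEL_0 = 1 << 2    # op_sel for the source operand itself
DST_OP_SEL = 1 << 3  # op_sel for dst, assumed to live in src0_modifiers only

def commute_modifiers(src0_mods, src1_mods):
    """Swap the per-source bits but keep the dst op_sel bit on the new src0."""
    dst_bit = src0_mods & DST_OP_SEL
    new_src0 = (src1_mods & ~DST_OP_SEL) | dst_bit
    new_src1 = src0_mods & ~DST_OP_SEL
    return new_src0, new_src1

# src0 selects its high half and also carries the dst bit; src1 is plain.
before = (OP_SEL_0 | DST_OP_SEL, 0)
after = commute_modifiers(*before)
print(before, "->", after)
```

A plain swap of the two modifier immediates would move the dst bit onto src1, which is the situation the comment warns about; masking it out and re-attaching it to the new src0 keeps the destination selection unchanged.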
As far as the V_MIN/V_MAX families are concerned, I came across this comment:
This probably applies to all of the cases
Sorry I didn't really understand. "applies for all cases" meaning for all cases of only V_MIN/V_MAX, or for "all cases" as in all the instructions I have changed, even the V_MAD/V_FMA? Also is there anything in the draft which you see as red flags or can I put it up for review?
Some VOP3 instructions are commutable but are not defined as such. For example,
V_ADD_NC_U16_e64 is commutable, but is not merged by machine-cse.
This issue was found while reviewing https://github.com/llvm/llvm-project/pull/106920
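Why the commutable flag matters for merging can be shown with a toy common-subexpression pass in Python. This is a simplified model written for this thread, not the MachineCSE implementation: instructions are (dst, opcode, src0, src1) tuples, and the set of commutable opcodes is passed in explicitly.

```python
def cse(instructions, commutable):
    """Toy CSE: reuse an earlier result, retrying with swapped sources if allowed."""
    seen = {}      # (opcode, src0, src1) -> dst of the first occurrence
    replaced = {}  # eliminated dst -> surviving dst
    kept = []
    for dst, opcode, src0, src1 in instructions:
        # Rewrite sources that refer to already eliminated results.
        src0 = replaced.get(src0, src0)
        src1 = replaced.get(src1, src1)
        hit = seen.get((opcode, src0, src1))
        if hit is None and opcode in commutable:
            hit = seen.get((opcode, src1, src0))
        if hit is not None:
            replaced[dst] = hit
        else:
            seen[(opcode, src0, src1)] = dst
            kept.append((dst, opcode, src0, src1))
    return kept, replaced

program = [
    ("%2", "add", "%0", "%1"),
    ("%3", "add", "%1", "%0"),  # same value, sources swapped
]

print(cse(program, commutable=set()))
print(cse(program, commutable={"add"}))
```

With an empty commutable set both adds survive; once "add" is treated as commutable the second one is folded into the first, which is the merge the issue asks for.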
Two changes are needed: