Open d0k opened 10 years ago
mentioned in issue llvm/llvm-bugzilla-archive#49406
mentioned in issue llvm/llvm-bugzilla-archive#47530
mentioned in issue llvm/llvm-project#4440
mentioned in issue llvm/llvm-bugzilla-archive#37583
mentioned in issue llvm/llvm-bugzilla-archive#35489
mentioned in issue llvm/llvm-bugzilla-archive#31525
Bug #48750 has been marked as a duplicate of this bug.
[Copy-n-paste of my harebrained idea here]
The major issue with supporting multiple constraints is how we model those constraints until register allocation is complete. (Thank you, Capt. Obvious!) The decision of which constraint to use is made during DAG selection. So there's no (easy) way to change this during register allocation.
Half-baked idea (all very hand-wavy):
Different constraints use different set ups (pre-asm code) and tear downs (post-asm code) for the inline asm. What if we created pseudo-instructions to represent inline asm set up and tear down? Something like this:
INLINEASM_SETUP <representing register/memory setup>
INLINEASM <...>
INLINEASM_TEARDOWN <representing copy of results into vregs/pregs>
The register allocator could then try different constraints (going from most restrictive to least restrictive) until it finds one that works.
One drawback is that the RA needs to process INLINEASM before it can generate the correct code for INLINEASM_SETUP. That might be doable if the three instructions are treated as a single unit.
Bug #46874 has been marked as a duplicate of this bug.
Wouldn't it be better to at least treat 'mr' as 'r' rather than 'm'? This will probably yield better code in many cases.
"mr" isn't being treated as "m". Consider this C function:
void f(int x) {
asm volatile("# x is in %0" :: "mr"(x));
}
With "clang -O3 -m32", it compiles into this mess:
f:
pushl %eax
movl 8(%esp), %eax
movl %eax, (%esp)
# x is in (%esp)
popl %eax
retl
But if I use "m" instead of "mr", then it compiles into what I wanted:
f:
# x is in 4(%esp)
retl
So the presence of "r" is somehow making the codegen worse even though it's putting the value in memory anyway.
Bug #30873 has been marked as a duplicate of this bug.
Bug #36931 has been marked as a duplicate of this bug.
Wouldn't it be better to at least treat 'mr' as 'r' rather than 'm'? This will probably yield better code in many cases.
Bug #34837 has been marked as a duplicate of this bug.
Agreed.
I was thinking of this issue while reading ChooseConstraint in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp.
I was thinking of this issue while reading ChooseConstraint in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp.
We need to talk about your reading habits. :-P
@llvm/issue-subscribers-backend-x86
At the llvm dev conf '22 @topperc or @compnerd or @arsenm mentioned that perhaps we could add a new register flag (akin to early-clobber, def, or kill) and then another pass ("memfold"? or possibly just regalloc) could try to DTRT.
EDIT: I think @topperc was referring to TargetInstrInfo::foldMemoryOperand or InlineSpiller::foldMemoryOperand.
Ah, looks like we haven't made any progress on this since my last report above 9mo ago. Still an issue though. https://lore.kernel.org/lkml/CAHk-=whVvD05T0yD5DQj803uETLD6qDq-Vx-SiLPcrL=eO77LQ@mail.gmail.com/
So I have something hacked up that is slightly working; it's not ready to be published today but maybe by end of next week I'll have something to demonstrate feasibility.
The idea is:
Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands. We need to be able to:
Right now, SelectionDAG picks one of the last two, lacking the ability to express the first, since RA lacks the machinery from 2 below.
There's probably more I need to do for greedy RA to update the live range info (I don't fully understand LiveRangeEdit quite yet). Right now my hacky code is failing to clean up the previously used virtreg which leads to an assertion failure down the line; I hope to have that fixed by EOD but we'll see.
cc @qcolombet
Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands.
I don't remember what we do for inlineasm constraints, but naively I would have expected that the only thing we would need to change is teaching foldMemoryOperand how to handle Kind_SpillableRegister.
BTW, good thinking with the new kind, I didn't think about adding a kind that would represent a mix of both states.
I've been slowly refactoring the InlineAsm class; I think we should be able to steal one bit out of the flags. The class uses unsigned everywhere (and static methods); refactoring it to be a proper class makes it much much more ergonomic and should make us able to add new fields with more confidence (without fear of breaking any of the bit packing it's doing manually; god why did this code never use bitfields...smh).
Then we can add/steal a bit to denote that the operand should be placed in a register, but that inlinespiller or greedy reg alloc MAY spill it if necessary.
Then we'll need the transform code for INLINEASM and INLINEASM_BR; I have it hacked up for INLINEASM in greedy reg alloc at the moment, but that might not be the final destination. We'll see (next week).
Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands.
I don't remember what we do for inlineasm constraints, but naively I would have expected that the only thing we would need to change is teaching foldMemoryOperand how to handle Kind_SpillableRegister.
BTW, good thinking with the new kind, I didn't think about adding a kind that would represent a mix of both states.
For the record, foldMemoryOperand won't be enough. It'll work when the pressure point is between the definition of the virtual register and the inline asm, but not when it is ON the inline asm.
E.g.,
v = ...
... // <-- pressure point
... = inline_asm v
Will unfold fine as:
v = ...
v2 = copy v
... // <-- pressure point
v3 = copy v2
... = inline_asm v3
Then, spilling v2:
v = ...
store @v2, v
... // <-- pressure point
v3 = load @v2
... = inline_asm v3
And folding the reload into the inline asm yields:
v = ...
store @v2, v
... // <-- pressure point
... = inline_asm @v2
However, it won't work for:
v = ...
...
... = inline_asm v // <-- pressure point
Because after splitting, we'll get:
v = ...
v2 = copy v
...
v3 = copy v2
... = inline_asm v3 // <-- pressure point
And v3 is not spillable with respect to greedy regalloc nomenclature.
https://godbolt.org/z/Esc7chfWW is an important example (and note to self).
This behavior only occurs with "m" and not any other memory-like constraint FWICT.
This is important because I plan to steal 1 bit from InlineAsm::Flag to denote that a reg {use|def|clobber} is permitted to be spilled.
Otherwise, we'd need the full list of constraints to rematerialize what the next preferable constraint was during register allocation, i.e. possibly storing the "next best" MemConstraint in the event of register exhaustion. It also means my lazy ass has fewer tests to write. (register exhaustion :heavy_multiplication_x: each memconstraint)
Clang seems to generate spills if "m" exists in the constraint list (-Xclang -disable-llvm-passes) that will need to get cleaned up, too.
aha! I just got this all to work with SelectionDAGISel + Greedy Regalloc! (And only with inputs, and only inline asm and not inline asm goto (though asm goto should have edge cases for outputs).) Now to get GlobalISel, FastISel, and RegAllocFast working.
I might be able to break this up into 3 (or more) distinct PRs.
One thing I struggled with is that MachineInstr isn't ergonomic in terms of replacing MachineOperands. In particular, transforming a TargetOpcode::INLINEASM to use memory operands rather than a register operand involves transforming 2 MachineOperands into 5 (at least for x86; I'll have to write some code to determine how many MachineOperands are necessary to refer to a MachineOperandType::MO_FrameIndex per target). The groundwork I did in 86735a4353aee4a3ba1e2feea173a7cc659c7a60 and 93bd428742f985a4b909dd5efee462ea520c96c0 was helpful. I can understand why MachineInstr generally doesn't allow for arbitrary MachineOperand replacement; for most non-pseudo instructions the operands are quite fixed/regular. For a pseudo-instruction like TargetOpcode::INLINEASM, we might need more flexibility to be able to transform such instructions in order to arbitrarily spill specific operands should we need to scavenge for registers.
Captain's Log, Stardate 43120.8: got my hacky spaghetti to validate with -verify-regalloc
I spent some time looking at the folding idea a bit more.
From the allocator perspective it seems to fit pretty well; the problem is going to be actually implementing the memory folding in inline asm. Like you saw, @nickdesaulniers, MIR is pretty rough when it comes to that.
Anyhow, here is the sketch of a patch. inlineasm.patch
To "productize it" I think we should:
- Merge canBeMemoryFolded with isLiveAtStatepointVarArg
- Teach canBeMemoryFolded which inline asm operands are actually foldable
- Add the support for folding memory operands into inline asm (right now this is a TODO in llvm/lib/CodeGen/TargetInstrInfo.cpp)
Note: with canBeMemoryFolded we could actually open it up to all memory folding, not just patchpoints and inline asm. The problem is that IIRC the foldMemoryOperand helpers don't have a "dry-run" version (e.g., canFoldMemoryOperand). I.e., we would need some refactoring to expose that.
Regarding RegAllocFast, I think it would be best to just use the memory operand directly from ISel. In other words, not even exposing this problem to this allocator. I haven't looked closely, but given this allocator is only aimed at O0, I expect it would not be worth it to bother to handle this.
Anyhow, here is the sketch of a patch.
+ // TODO: Check actual operand index for folding. Must have the right inline
+ // asm permission to do that.
Let's chat more in person next week, but IIUC, that approach starts with "m" being chosen, then foldMemoryOperand is used to try to convert to "r"? One benefit of that approach is that the logic for preferring "m" over "r" in the instruction selection frameworks would not need to be changed. I do wonder if that could result in "m"s not being converted to "r" with the same priority (i.e. we always want that, until we can't due to register pressure)?
I have a different approach I've been working on (maybe going down the wrong path, but it seems to be working). (Excessive delay this week because my workstation failed and is being replaced; I pushed my branch, though I would prefer to clean it up more before showing.) ISEL picks "r", and only when it's about to fail due to register exhaustion does greedy perform one final scan of the operands to see if any of the "r"s are "spill-able." If so, it spills them. The machinery for swapping out register operands with memory operands (or vice versa) is complicated (not too complicated, but not as simple as replacing one operand) and target-specific (so I'm doing that transform in TargetInstrInfo).
This approach doesn't involve inline spiller or CalcSpillWeights, since greedy also has a reference to the LiveIntervals and can modify them with new LiveRanges as well. This guarantees that registers are always preferred, until the point where we know we are about to imminently exhaust registers unless we spill.
Add the support for folding memory operands into inline asm (right now this is a TODO in llvm/lib/CodeGen/TargetInstrInfo.cpp)
sorry, where? I didn't see any relevant TODO or FIXME.
that approach starts with "m" being chosen, then foldMemoryOperand is used to try to convert to "r"?
No, that's the other way around.
We use the r constraints (sorry, not part of that patch; I was assuming that's what you would hand to the allocator).
Then if that one doesn't work, the compiler will try to spill it (hence why it is important to mark the related live-range spillable https://github.com/llvm/llvm-project/files/12819005/inlineasm.patch) and when we spill it we call foldMemoryOperand.
I have a different approach I've been working on[...]
In the end, I think we do the same thing :).
My patch just makes sure we don't have to actually modify anything but the spill weights, which I think is much simpler assuming you can do the rewrite for InlineAsm in foldMemoryOperand.
sorry, where? I didn't see any relevant TODO or FIXME.
I meant in the patch https://github.com/llvm/llvm-project/files/12819005/inlineasm.patch, where I print Ding.
We use the r constraints (sorry not part of that patch, I was assuming that's what you would hand to the allocator).
ah! I missed that part. In that case then, yes:
In the end, I think we do the same thing :).
My patch just makes sure we don't have to actually modify anything but the spill weights, which I think is much simpler assuming you can do the rewrite for InlineAsm in foldMemoryOperand.
Yes, I think this is the issue with my current approach; I end up needing to update the live ranges manually, which while possible, clutters the implementation a bit and seems error prone. With your approach, it looks like I can skip needing to do that manually and rely on the pre-existing infrastructure a bit more.
I have another branch built on top of your approach; in-tree tests are passing. Now I need to get it working for more complex cases that we don't yet have in tree AFAICT, i.e. tied operands.
Writing up a bunch of tests now. I suspect I'll need to teach InlineSpiller about INLINEASM_BR; currently it's emitting the reload just along the fallthrough edge. This is wrong for asm goto with outputs.
define i32 @inout_pressure_goto (i32 inreg %x) nounwind {
%1 = callbr i32 asm "# $0 $1 $2", "=rm,0,!i,~{ax},~{bx},~{cx},~{dx},~{si},~{di},~{bp}"(i32 %x)
to label %ft [label %ft]
ft:
ret i32 %1
}
becomes:
# *** IR Dump After Greedy Register Allocator (greedy) ***:
# Machine code for function inout_pressure_goto: NoPHIs, TracksLiveness, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
fi#0: size=4, align=4, at location [SP+4]
Function Live Ins: $eax in %3
0B bb.0 (%ir-block.0):
successors: %bb.2(0x80000000), %bb.1(0x00000000); %bb.2(100.00%), %bb.1(0.00%)
liveins: $eax
16B MOV32mr %stack.0, 1, $noreg, 0, $noreg, $eax :: (store (s32) into %stack.0)
64B INLINEASM_BR &"# $0 $1 $2" [mayload] [maystore] [attdialect], $0:[mem:m], %stack.0, 1, $noreg, 0, $noreg, $1:[mem:m], %stack.0, 1, $noreg, 0, $noreg, $2:[imm], %bb.1, $3:[clobber], implicit-def dead early-clobber $ax, $4:[clobber], implicit-def dead early-clobber $bx, $5:[clobber], implicit-def dead early-clobber $cx, $6:[clobber], implicit-def dead early-clobber $dx, $7:[clobber], implicit-def dead early-clobber $si, $8:[clobber], implicit-def dead early-clobber $di, $9:[clobber], implicit-def dead early-clobber $bp :: (store (s32) into %stack.0)
72B %9:gr32 = MOV32rm %stack.0, 1, $noreg, 0, $noreg :: (load (s32) from %stack.0)
80B JMP_1 %bb.2
96B bb.1..ft_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
; predecessors: %bb.0
successors: %bb.2(0x80000000); %bb.2(100.00%)
112B bb.2.ft:
; predecessors: %bb.0, %bb.1
128B $eax = COPY %9:gr32
144B RET 0, $eax
which is not right. There originally was a phi in bb.2.ft between two copies of the same virtreg in both bb.1..ft_crit_edge and bb.2.ft that got replaced by a COPY of the virtreg in the common tail by the opt-phis pass.
Ok, I've hit another stumbling block when doing full integration tests for non-x86, non-riscv targets. There may be a prerequisite yak shave necessary here. Basically, I'm confident that I can get this working for x86/riscv at the moment. But every other architecture has a slight difference at this point in MIR (I guess that makes x86/riscv the odd ones out, but I'll argue that x86/riscv are the only ones that make sense to me).
Consider the following C code:
void foo (void) {
int x = 42;
asm("# %0"::"m"(x));
}
After instruction selection (clang foo.c -S -o - -O2 -fno-asynchronous-unwind-tables -mllvm -stop-after=finalize-isel) we have the following MIR:
...
MOV32mi %stack.0.x, 1, $noreg, 0, $noreg, 42 :: (store (s32) into %ir.x, !tbaa !4)
INLINEASM &"# $0", ..., 262190 /* mem:m */, %stack.0.x, 1, $noreg, 0, $noreg, ...
...
This makes sense to me. That 262190 corresponds to an InlineAsm::Flag that says the following operand is a MachineOperandType::MO_FrameIndex AND the next 5 MachineOperands are all logically one operand. That's why we need to be able to splice operands: https://github.com/llvm/llvm-project/pull/67699 to replace 2 MachineOperands (that initial metadata node then a register) with 5.
But let's see if other targets do this too. Same input, same compiler flags with the addition of --target=aarch64-linux-gnu:
STRWui killed %1, %stack.0.x, 0 :: (store (s32) into %ir.x, !tbaa !5)
%2:gpr64sp = ADDXri %stack.0.x, 0, 0
%3:gpr64sp = COPY %2
INLINEASM &"# $0", ..., 262158 /* mem:m */, killed %3, !9
What?! So aarch64 doesn't use a MachineOperandType::MO_FrameIndex to represent the stack slot index. Instead it uses an InlineAsm::Flag::Kind::Mem, but then is followed by a MachineOperandType::MO_Register operand. Um...ok.(?)
How about 32b ARM (--target=arm-linux-gnueabihf):
STRi12 killed %4, %stack.0.x, 0, 14 /* CC::al */, $noreg :: (store (s32) into %ir.x, !tbaa !5)
%5:gpr = ADDri %stack.0.x, 0, 14 /* CC::al */, $noreg, $noreg
INLINEASM &"# $0", ..., 262158 /* mem:m */, killed %5, !9
--target=powerpc64le-linux-gnu:
STW8 killed %3, 0, %stack.0.x :: (store (s32) into %ir.x, !tbaa !4)
%4:g8rc = ADDI8 %stack.0.x, 0
%5:g8rc_nox0 = COPY %4
INLINEASM &"# $0", ..., 262158 /* mem:m */, killed %5
--target=riscv64-linux-gnu:
SW killed %1, %stack.0.x, 0 :: (store (s32) into %ir.x, !tbaa !6)
INLINEASM &"# $0", ..., 262166 /* mem:m */, %stack.0.x, 0, !10
So x86 and riscv work as expected; arm, aarch64, and powerpc do not. And what I really don't like about the arm/aarch64/powerpc approach is that they appear as if they are uses of virtual registers, when they should be uses of frame indexes.
Further, when greedy asks TargetInstrInfo to foldMemoryOperand, it says "see if you can fold a load from stack slot X, given to me by inline spiller." So we can easily splice a MachineOperandType::MO_FrameIndex for slot X. But for targets that instead use a register, there's no mapping to be able to say "given stack slot X, what the hell virt reg corresponds to that?" (At least, not without some ugly per-target code to walk previous MachineInstrs preceding the INLINEASM.)
I assume that's a historical accident and can be fixed up so that all targets use MachineOperandType::MO_FrameIndex MachineOperands to refer to stack slots, at least for INLINEASM/INLINEASM_BR. But that should be fixed first. Perhaps upon starting to shave that yak, I find out there is a good reason for this orthogonality between target backends, and then have to unshave the yak (or put the work done as part of the yak shave in the garbage).
It's also curious why x86 has 5 operands to represent a stack slot (probably something to do with lea, but not sure why that matters for inline asm) and riscv has 2. Why can't we use 1 everywhere? If we could, then we wouldn't even need to support splicing machine operands.
// $ clang --target=arm-linux-gnueabihf
struct foo {
int a, b, c, d, e;
};
int zzz (struct foo my_foo) {
asm ("# %0"::"m"(my_foo.b):"r1","r2","r3","r12","lr","r4","r5","r6","r7","r8","r9","r10","r11");
return my_foo.a;
}
GCC picks [sp, #40], clang picks [r0]. Adding "r0" to the clobber causes clang to fail to allocate a register for %0 while GCC compiles just fine.
I think I should shave that yak...
though, even if some backends have this orthogonal behavior, maybe I should just teach them how to lower Frame Indices properly.
So x86 and riscv work as expected; arm, aarch64, and powerpc do not.
Ah, that's not always the case. Both x86 and riscv have cases that disprove that. The flag outputs for example produce:
...
%3:gr32 = MOV32rm %fixed-stack.0, 1, $noreg, 0, $noreg :: (load (s32) from %fixed-stack.0, align 8)
INLINEASM ... $1:[mem:m], killed %3:gr32, 1, $noreg, 0, $noreg, ...
...
so it is possible that we have "mem" (InlineAsm::Kind::Mem) followed by virtual register MachineOperands on x86. So my understanding above wrt. x86 was wrong.
though, even if some backends have this orthogonal behavior, maybe I should just teach them how to lower Frame Indices properly.
Yeah. After poking at this in https://github.com/llvm/llvm-project/pull/69654, I think the approach I should do instead is handle this curious case, and work on each backend to work through issues they might have translating stack slots into valid addressing modes. For 32b ARM, I can reuse most of https://github.com/llvm/llvm-project/pull/69654 without changing existing codegen behavior for existing test cases.
EDIT: yep, supporting this improved the codegen for those 2 x86 tests listed above. EDIT2: nope, those are indirect memory operands on x86. Can't fold those else we lose a level of indirection.
@qcolombet and I met on Friday to discuss the approach so far. 2 additions he highlighted:
- For fast regalloc, the presence of optnone on a function might be an indicator that we don't want to mark registers as spillable, and just choose "m" when "rm" is observed.
- Rather than try to support all targets all at once, a TargetInstrInfo hook for canMemFoldInlineAsm or some such would allow for more incremental adoption. That probably doesn't depend on regalloc changes, which can land first, but would be used by instruction selection changes that would make these optimizations visible to clang.
Status report:
for fastreg alloc, the presence of optnone on a function might be an indicator that we don't want to mark registers as spillable, and just choose "m" when "rm" is observed.
This isn't going to work. From a discussion with @arsenm and @nikic on discord:
Nick Desaulniers — Yesterday at 12:24 PM: @arsenm is there an assumption that FastRegalloc is only used for optnone? or is it expected that fastregalloc works even for non-optnone fn's?
nikic — Yesterday at 2:20 PM: @nickdesaulniers Not sure if that answers your question, but the O0 backend must be able to deal with O3 IR.
arsenm — Yesterday at 7:09 PM: It should always work. I know I've fixed assorted verifier failures and pass errors with -O3 + -regalloc=fast
So FastReg alloc MUST also be taught how to fold memory operands into inline asm since the decision to choose registers is made beforehand by the instruction selection framework, which cannot account for which register allocation framework will later be used. I'll start work on that today. I suspect support for Greedy and Fast Regalloc can land separately, and before any isel changes.
and work on each backend to work through issues they might have translating stack slots into valid addressing modes.
EDIT2: nope, those are indirect memory operands on x86. Can't fold those else we lose a level of indirection.
rather than try to support all targets all at once, perhaps a TargetInstrInfo (or SubTargetInfo) hook for canMemFoldInlineAsm or some such would allow for more incremental adoption.
Ok, this approach is working. I have x86_64, riscv64, arm, aarch64, and powerpc64le working (tentatively, small tests of just non-goto asm with "rm" inputs). But I'm able in isel to check if an arch supports the memfold optimization via a new TargetLowering hook. Adding support for another arch is starting to look the same between the 4 above:
- a new TargetLowering hook.
- an InstrInfo target-specific hook for the number of MachineOperands used to represent a stack slot index.
- teaching ASMPrinter::PrintAsmMemoryOperand to handle the number of MachineOperands for the memory InlineAsm::Kind. (this is also done already for STACKMAP, PATCHPOINT, and STATEPOINT pseudo MachineInstrs, which appear to also use MachineOperand::MachineOperandType::MO_FrameIndex).
- a RegisterInfo::eliminateFrameIndex target-specific override for which MachineOperand to look at/adjust.
All of the above for the last 4 architectures listed was only 10-40 LoC per arch. (excluding tests)
I'll need to retest all of the above with more complex tests such as outputs and tied operands, then again with asm goto variants. I might even limit the memfold optimization to just vanilla INLINEASM and not INLINEASM_BR (asm goto) to start, in order to land pieces while avoiding excessive churn when pieces need to be backed out. (Since I've already identified one miscompile.)
I noticed a slight issue for ppc64le. Everything works great on inputs/outputs that are the word size. But use an input smaller than the word size, and suddenly the TargetInstrInfo::foldMemoryOperand optimization is inhibited (it seems some targets zero the upper 32b of the GPR; not sure that all other targets do). I wonder if the zeroing is necessary when a spill is performed (should be easy to test that): https://godbolt.org/z/hPY9Wqe59 (GCC doesn't do that, maybe we can remove that from LLVM).
EOW status report
Ok, I have fastregalloc working well enough that I think it's time to start upstreaming the building blocks shared between greedy and fastregalloc for this all to work. I have fastregalloc working for x86_64 and aarch64 with many different test cases; I didn't implement support for asm goto yet, but that doesn't seem problematic (famous last words). I don't foresee issues with other targets for fastregalloc. Still a bit shell-shocked from all of the iterator invalidation issues I had to "wrastle" with to get my fastregalloc impl working (because I'm a bad programmer).
First patch to shed will be machine operand splicing: https://github.com/llvm/llvm-project/pull/67699. I need to add test cases and the downstream fixes I'm carrying for supporting tied operands.
Are helpers. Once those are landed, I think the best approach will be:
asm goto w/ outputs support. In 2 and 3, I'll probably limit them to INLINEASM and not INLINEASM_BR.
For the register allocators I'll probably just add tests for 1 target; I'd like to have more comprehensive integration tests all the way from clang for each target, where this basically tests: for each target cross input/output/tied operands cross asm vs asm goto.
https://github.com/llvm/llvm-project/pull/70832/commits/89dbc57fe61dc90ebe116294644443d0f9578a51 has some very basic MIR tests demonstrating the memory folding in action. There's ongoing discussion in https://github.com/llvm/llvm-project/pull/70738 as to "what to call" this disambiguated machine operand (where the instruction selectors set up the register allocators to dunk), so that part may change/be rebased.
There's ongoing discussion in https://github.com/llvm/llvm-project/pull/70738 as to "what to call" this disambiguated machine operand (where the instruction selectors set up the register allocators to dunk), so that part may change/be rebased.
That's resolved (s/spillable/foldable/).
@topperc and I had a pertinent discussion about this last Wednesday; "rm" only makes sense generally for CISC architectures which support multiple different addressing modes on any given instruction. RISC targets are more likely to only support memory operands for loads+stores. I checked the Linux kernel sources quickly and couldn't find usage of "rm" in inline asm for 32b arm. I'll triple check the other RISC architectures that have kernel ports and that clang can build.
But I suspect I'll only need to support this for x86 and maybe SystemZ. I'll do more due diligence on usage of "rm" in the Linux kernel. Either way, that significantly reduces the test case burden. I have tests in hand for making "rm" spill under pressure for hexagon, lanai, arm, aarch64, ppc64, and riscv; but if none of those even have instructions that could use "r" or "m" then what is the point?
"rm" only makes sense generally for CISC architectures which support multiple different addressing modes on any given instruction.
I will definitely concede this is an extremely weird and slightly dubious thing to do, but my motivation in #71080 was actually to make an optimization barrier, so the asm string doesn't actually read the input. However, I want to force the compiler to generate code as if the input were used. (This is a "please don't optimize this value out" barrier.) So "rm" is ideal in that case because I don't want the compiler to generate a pointless load or store.
FWIW, it does look like GCC gracefully handles "rm" on aarch64 too: https://godbolt.org/z/9614nhs8a
but my motivation in https://github.com/llvm/llvm-project/issues/71080 was actually to make an optimization barrier [...] So "rm" is ideal in that case because I don't want the compiler to generate a pointless load or store.
Sounds fishy, but let's discuss this on https://github.com/llvm/llvm-project/issues/71080?
FWIW, it does look like GCC gracefully handles "rm" on aarch64 too: https://godbolt.org/z/9614nhs8a
Try clobbering all GPRs with GCC and see how graceful it is. ;) The complication wrt. supporting "rm" is not in choosing "r" in the hello world inline asm case; it's in choosing "m" when there's register pressure.
But yeah let me clarify my point you cited; after my changes "r" will be preferred over "m" for all architectures; I suspect that's the grace you're looking for.
@topperc and I had a pertinent discussion about this last Wednesday; "rm" only makes sense generally for CISC architectures which support multiple different addressing modes on any given instruction. RISC targets are more likely to only support memory operands for loads+stores. I checked the Linux kernel sources quickly and couldn't find usage of "rm" in inline asm for 32b arm. I'll triple check the other RISC architectures that have kernel ports and that clang can build.
Ok, it looks like the m68k Linux kernel port uses "g". Otherwise I get no other hits for "rm" or "mr" (even adding "i" and check for all permutations) outside of arch/x86/.
ok, the core parts have all landed for greedy.
(I noticed that peephole-opt will fold loads BEFORE reaching any register allocator, but not any stores.)
Next up:
I noticed that peephole-opt folds loads; we might want to disable that for inline asm...I need to think more about that case.
putting this down for now. If someone else wants to pick up the torch, I can push my WIP branch somewhere which may be useful as a reference (but will probably bit rot).
I can push my WIP branch somewhere which may be useful as a reference (but will probably bit rot).
Please do.
Are some of my branches (I have 6 locally; they're free) but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (Regalloc fast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review).
https://gist.github.com/nickdesaulniers/8f20bea3bcdd9fe97219428ab6e8bf8b were a handful of end-to-end tests I was using to test support for all architectures. You may find these useful.
regallocfast can fail to allocate regs even without this patch. I think it's not supposed to be used in production, just like fast-isel (might be wrong), so another possibility would be just to revert the old (always in-memory) behavior when fast regalloc is selected.
There is no "revert [to] the old behavior" here. ISEL needs to be changed for this to work. Then whichever regalloc framework runs after needs to be able to handle what ISEL has decided. If we don't change ISEL, we don't fix this bug. If we don't fix regallocfast, then we may fail to compile code that previously we could.
Extended Description
When multiple alternatives in an inline asm constraint are given we ignore all of them but the most "general". This gives nasty artifacts in the code.
The spilling is totally unnecessary. GCC gets this one right. On 32 bit x86 it's even worse:
GCC knows a better way:
The constraint "g" is just as bad, being translated into "imr" internally.