Open d0k opened 10 years ago
@nickdesaulniers is there an issue with making isel "decide" to treat "rm" as "m" (old behavior) when fast regalloc is selected?
I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA.
This doesn't change the support though. I agree that tying some isel-level option like "treat rm constraints as m" to the selected regalloc feels a bit icky, but that's the problem with fast regalloc: it can fail and it's best-effort. Any optimization that puts something that was previously in memory into registers can potentially trigger this. The question is whether running fast regalloc on code with inline asm "rm" constraints is a common enough scenario to justify spending so much effort on trying to make it better (it already works, to my understanding).
> I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA.
Maybe in this comment? https://github.com/llvm/llvm-project/issues/20571#issuecomment-1777650819
I'm thinking of following this course of action:
- Treat 'rm' as 'm' (the current behavior) for non-Greedy register allocators.
- Use the 'r' constraint for the Greedy allocator.
This will take care of (almost) all uses of this feature.
What about code from another front-end or hand-written LLVM IR?
If supporting this in RegAllocFast is too hard, I think making sure that the problematic patterns don't reach it is okay, assuming this is done in a reliable manner (making this a clang frontend decision is obviously not acceptable). Having all isel work with all regalloc is a nice goal to have, but I don't think it needs to block this feature if it's too hard to achieve. At least that's my 2c.
> making this a clang frontend decision is obviously not acceptable
While I agree in general, this is how it's done today, and there aren't a lot of alternative methods of input we are desperate to support. The inline ASM syntax is specific to GCC (and only "mostly" supported in Clang). It requires supporting only those languages which either GCC supports or which use its inline ASM syntax (in which case, "What hath God wrought?"). Given that restrictive set, we're then left with IR as the main thing we need to support. It's possible to handle this while reading the IR, and it's important for testing. But I think focusing on the front-end first will give us a greater benefit for our work.
I haven't followed the details here in this giant thread, but
> Okay, if we MUST, we could modify the inline asm instruction before code-gen.
This doesn't sound too horrible to me
Those are some of my branches (I have 6 locally; they're free), but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (RegAllocFast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review)?
I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard.
> Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (RegAllocFast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review).

> I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard.
@MatzeB's feedback seemed to want to avoid doing a pre-emptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs).
Rereading his feedback though, it sounds like he wasn't necessarily against the approach I had proposed (as I had mis-remembered). I don't recall whether I tested his suggestion, and what the results were.
> @MatzeB's feedback seemed to want to avoid doing a pre-emptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs).
MachineFunction already has hasInlineAsm to skip the walk over the function. Could even refine this if really needed.
There's a gross error that happens with the current WIP code:
long int __attribute__((noinline)) foo(long int y) {
  long int x, z;
  asm __volatile__ ("pushf\n\t"
                    "popq %0\n\t"
                    "pushq %2\n\t"
                    "popq %1"
                    : "=rm" (x), "=rm" (z)
                    : "rm" (y)
                    : "ax", "bx", "cx", "dx", "di", "si", "bp",
                      "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15");
  return x;
}

int main() {
  __builtin_printf("flags: 0x%08lx\n", foo(991278));
  return 0;
}
What should be generated is:
movq %rdi, -24(%rsp)
#APP
pushfq
popq -16(%rsp)
pushq -24(%rsp)
popq -8(%rsp)
#NO_APP
movq -16(%rsp), %rax
However, this is generated:
movq %rdi, -32(%rsp) # 8-byte Spill
#APP # 8-byte Folded Reload
pushfq
popq -32(%rsp)
pushq -32(%rsp)
popq -24(%rsp)
#NO_APP
movq -32(%rsp), %rax # 8-byte Reload
So I have something that's working...but, my God!...our inline asm implementation is dog crap. There are simple tests that currently fail if you choose to use the fast allocator outside of -O0.
Anyway, there's one instance that I'm not able to get working in the face of register exhaustion:
struct foo {
  int a, b, c, d, e;
};

#define CLOBBERS "ax", "cx", "dx", "si", "di", "r8", "r9", "r10", "r11", "bx", "bp", "r14", "r15", "r12", "r13"

int test6(struct foo *ptr) {
  asm __volatile__ ("# tied 'rm' pressure -> %0 %1 %2 %3"
                    : "+rm" (ptr->b), "+rm" (ptr->d)
                    :
                    : CLOBBERS);
  return ptr->a;
}
It gives this error:
$ clang -O2 nicks_test.c
nicks_test.c:52:21: error: inline assembly requires more registers than available
52 | asm __volatile__ ("# tied 'rm' pressure -> %0 %1 %2 %3"
| ^
nicks_test.c:52:21: error: inline assembly requires more registers than available
(Ignore the line number.) This happens in the current version of clang too. I'm open to any suggestions people may have on how to deal with this...or if it's not important enough to worry about at this time.
> So I have something that's working... There are simple tests that currently fail if you choose to use the fast allocator outside of -O0.

> (Ignore the line number.) This happens in the current version of clang too. I'm open to any suggestions people may have on how to deal with this...or if it's not important enough to worry about at this time.
I'd say this is not important enough. Regalloc for inline asm was always best effort in the sense that we know the heuristics can fail for "complex" enough asm.
@nickdesaulniers and @qcolombet:
Further, when greedy asks TargetInstrInfo to foldMemoryOperand, it says "see if you can fold a load from stack slot X, given to me by the inline spiller." So we can easily splice in a MachineOperandType::MO_FrameIndex for slot X. But for targets that instead use a register, there's no mapping to be able to say "given stack slot X, what the hell virt reg corresponds to that?" (At least, not without some ugly per-target code to walk the previous MachineInstrs preceding the INLINEASM.)
I've been running into this. The only target that supports folding ASM operands is x86. Like you pointed out, ARM has a weird COPY instruction and uses that register for the frame index (what??). So, hair-pulling frustrations.
Ok, this approach is working. I have x86_64, riscv64, arm, AArch64, and powerpc64le working (tentatively, small tests of just non-goto asm with "rm" inputs). And I'm able in isel to check if an arch supports the memfold optimization via a new TargetLowering hook. Adding support for another arch is starting to look the same between the 4 above:
Which repo has these changes? I'd like to look at them.
This is what I've done so far: code with the 'm' constraint and the fast register allocator should still work. I'm thinking of submitting a PR to add support for x86 only. It'll fix the issue we're seeing in Linux (I think only x86 architectures use the "rm" constraint). It amounts to a hack though.
What I toyed with doing is adding both versions of the constraint into the initial MIR. So something like:
INLINEASM ... $0:[regdef:GPR foldable], def %1:gpr, $1:[mem:m], %4:gpr, $2:[regdef:GPR foldable], def dead %2:gpr, $3:[mem:m], %3:gpr, ...
With this, if we want to use the memory operand, we have it already set up and ready to go. All we need to do is remove the $0:[regdef:GPR foldable], def %1:gpr part, and vice versa. (I'm hoping that later passes will remove dead code and eliminate unused stack slots, etc.) However, given how ARM/AArch64 does things, I'm not sure if the correct register will be used afterwards. We might be able to combine them into some franken-operand, but that way lies madness...
I'm really unsure what to do here. I've been staring at this code for a long time now and the only thing I can come up with is we need to rewrite the whole mess. But that's a ton of work just to get good code gen for "rm" on x86...
Thoughts?
Extended Description
When multiple alternatives in an inline asm constraint are given we ignore all of them but the most "general". This gives nasty artifacts in the code.
The spilling is totally unnecessary. GCC gets this one right. On 32 bit x86 it's even worse:
GCC knows a better way:
The constraint "g" is just as bad, being translated into "imr" internally.