
inline asm "rm" constraint lowered "m" when "r" would be preferable #20571

Open d0k opened 10 years ago

d0k commented 10 years ago
Bugzilla Link 20197
Version trunk
OS Linux
Blocks llvm/llvm-project#4440
CC @legrosbuffle, @echristo, @isanbard, @josephcsible, @nickdesaulniers, @zygoloid, @tstellar

Extended Description

When multiple alternatives are given in an inline asm constraint, we ignore all of them except the most "general" one. This produces nasty artifacts in the code.

int bsr(unsigned v) {
  int ret;
  __asm__("bsr %1, %0" : "=&r"(ret) : "rm"(v) : "cc");
  return ret;
}

$ clang -O3 -S -o - t.c
bsr:
    movl    %edi, -4(%rsp)
    #APP
    bsrl    -4(%rsp), %eax
    #NO_APP
    retq
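
For comparison, choosing the "r" alternative would allow something like this (a hand-written sketch, not actual compiler output):

bsr:
    #APP
    bsrl    %edi, %eax
    #NO_APP
    retq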

The spilling is totally unnecessary. GCC gets this one right. On 32-bit x86 it's even worse:

$ clang -O3 -S -o - t.c -m32
bsr:
    pushl   %eax
    movl    8(%esp), %eax
    movl    %eax, (%esp)
    #APP
    bsrl    (%esp), %eax
    #NO_APP
    popl    %edx
    retl

GCC knows a better way:

$ gcc-4.8 -O3 -S -o - t.c -m32
bsr:
#APP
    bsr 4(%esp), %eax
#NO_APP
    ret

The constraint "g" is just as bad, being translated into "imr" internally.
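
For instance, rewriting the example above with "g" gives the same spill (a hand-edited sketch; bsr_g is just an illustrative name):

int bsr_g(unsigned v) {
  int ret;
  __asm__("bsr %1, %0" : "=&r"(ret) : "g"(v) : "cc");
  return ret;
}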

danilaml commented 6 months ago

@nickdesaulniers is there an issue with making isel decide to treat "rm" as "m" (the old behavior) when the fast regalloc is selected?

nickdesaulniers commented 6 months ago

I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N mapping of ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA.

danilaml commented 6 months ago

This doesn't change the support, though. I agree that tying an isel-level option like "treat rm constraints as m" to the selected regalloc feels a bit icky, but that's the problem with the fast regalloc: it can fail, and it's best-effort. Any optimization that moves something that was previously in memory into registers can potentially trigger this. The question is whether running the fast regalloc on code with inline asm "rm" constraints is a common enough scenario to justify spending so much effort on making it better (to my understanding, it already works).

bwendling commented 6 months ago

I don't recall where I was told this, but I recall being told in no uncertain terms that all ISEL frameworks need to support all regalloc frameworks. This M:N mapping of ISEL frameworks to regalloc frameworks is what makes fixing this bug a PITA.

Maybe in this comment? https://github.com/llvm/llvm-project/issues/20571#issuecomment-1777650819

bwendling commented 6 months ago

I'm thinking of following this course of action:

  1. Have Clang generate a memory constraint for 'rm' (the current behavior) for non-Greedy register allocators.
  2. Have Clang generate a foldable 'r' constraint for the Greedy allocator.

This will take care of (almost) all uses of this feature.

What about code from another front-end or hand-written LLVM IR?

  1. Fuck that shit....
  2. Okay, if we MUST, we could modify the inline asm instruction before code-gen.

nikic commented 6 months ago

If supporting this in RegAllocFast is too hard, I think making sure that the problematic patterns don't reach it is okay, assuming this is done in a reliable manner (making this a clang frontend decision is obviously not acceptable). Having all isel work with all regalloc is a nice goal to have, but I don't think it needs to block this feature if it's too hard to achieve. At least that's my 2c.

bwendling commented 6 months ago

making this a clang frontend decision is obviously not acceptable

While I agree in general, this is how it's done today, and there aren't many alternative methods of input we're desperate to support. The inline asm syntax is specific to GCC (and only "mostly" supported in Clang), so we only need to support languages that either GCC supports or that use its inline asm syntax (in which case, "What hath God wrought?"). Given that restrictive set, we're left with IR as the main thing we need to support. It's possible to handle this while reading the IR, and that's important for testing. But I think focusing on the front end first will give us the greatest benefit for our work.
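
For reference, a hand-written sketch of how the bsr() example from the top of the thread looks at the IR level (simplified; real clang output also carries extra target-specific clobbers). The "rm" alternative survives into the constraint string on the call, so rewriting it before codegen is at least mechanically possible:

; simplified, hand-written IR for the bsr() example above
define i32 @bsr(i32 %v) {
  ; it is ISel, not the IR, that later collapses "rm" to "m"
  %ret = call i32 asm "bsr $1, $0", "=&r,rm,~{cc}"(i32 %v)
  ret i32 %ret
}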

arsenm commented 6 months ago

I haven't followed the details here in this giant thread, but

Okay, if we MUST, we could modify the inline asm instruction before code-gen.

This doesn't sound too horrible to me.

Are some of my branches (I have 6 locally; they're free) but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (Regalloc fast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review).

I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard.

nickdesaulniers commented 6 months ago

Are some of my branches (I have 6 locally; they're free) but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was more so: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (Regalloc fast was not designed to do this, everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or try to spill inline asm "rm" variables proactively (shot down in code review).

I don't see the issue here? The FastRA philosophy is to just spill everything all the time to avoid doing anything hard.

@MatzeB's feedback seemed to want to avoid doing a pre-emptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs).

Rereading his feedback though, it sounds like he wasn't necessarily against the approach I had proposed (as I had mis-remembered). I don't recall whether I tested his suggestion, and what the results were.

arsenm commented 6 months ago

@MatzeB's feedback seemed to want to avoid doing a pre-emptive scan over every instruction to find inline asm instructions (which probably don't exist in most programs).

MachineFunction already has hasInlineAsm to skip the walk over the function. Could even refine this if really needed.

bwendling commented 5 months ago

There's a gross error that happens with the current WIP code:

long int __attribute__((noinline)) foo(long int y) {
  long int x, z;

  asm __volatile__ ("pushf\n\t"
                    "popq %0\n\t"
                    "pushq %2\n\t"
                    "popq %1"
                    : "=rm" (x), "=rm" (z)
                    : "rm" (y)
                    : "ax", "bx", "cx", "dx", "di", "si", "bp",
                      "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
                    );
  return x;
}

int main() {
  __builtin_printf("flags: 0x%08lx\n", foo(991278));
  return 0;
}

What should be generated is:

    movq    %rdi, -24(%rsp)
    #APP
    pushfq
    popq    -16(%rsp)
    pushq   -24(%rsp)
    popq    -8(%rsp)
    #NO_APP
    movq    -16(%rsp), %rax

However, this is generated:

    movq    %rdi, -32(%rsp)                 # 8-byte Spill
    #APP                                    # 8-byte Folded Reload
    pushfq
    popq    -32(%rsp)
    pushq   -32(%rsp)
    popq    -24(%rsp)
    #NO_APP
    movq    -32(%rsp), %rax                 # 8-byte Reload

bwendling commented 5 months ago

So I have something that's working...but, my God!...our inline asm implementation is dog crap. There are simple tests that currently fail if you choose to use the fast allocator outside of -O0.
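
(A sketch of one way to exercise that configuration, assuming the usual -mllvm passthrough to LLVM's -regalloc codegen option:)

$ clang -O2 -mllvm -regalloc=fast -S -o - t.c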

Anyway, there's one instance that I'm not able to get working in the face of register exhaustion:

struct foo {
  int a, b, c, d, e;
};

#define CLOBBERS "ax", "cx", "dx", "si", "di", "r8", "r9", "r10", "r11", "bx", "bp", "r14", "r15", "r12", "r13"

int test6(struct foo *ptr) {
  asm __volatile__ ("# tied 'rm' pressure -> %0 %1 %2 %3"
                    : "+rm" (ptr->b), "+rm" (ptr->d)
                    :
                    : CLOBBERS);
  return ptr->a;
}

It gives this error:

$ clang -O2 nicks_test.c
nicks_test.c:52:21: error: inline assembly requires more registers than available
   52 |   asm __volatile__ ("# tied 'rm' pressure -> %0 %1 %2 %3"
      |                     ^
nicks_test.c:52:21: error: inline assembly requires more registers than available

(Ignore the line number.) This happens in the current version of clang too. I'm open to any suggestions people may have on how to deal with this...or if it's not important enough to worry about at this time.

qcolombet commented 5 months ago

Anyway, there's one instance that I'm not able to get working in the face of register exhaustion: [...] I'm open to any suggestions people may have on how to deal with this...or if it's not important enough to worry about at this time.

I'd say this is not important enough. Regalloc for inline asm was always best effort in the sense that we know the heuristics can fail for "complex" enough asm.

bwendling commented 5 months ago

@nickdesaulniers and @qcolombet:

Further, when greedy asks TargetInstrInfo to foldMemoryOperand, it says "see if you can fold a load from stack slot X, given to me by inline spiller." So we can easily splice a MachineOperandType::MO_FrameIndex for slot X. But for targets that instead use a register, there's no mapping to be able to say "given stack slot X, what the hell virt reg corresponds to that?" (At least, not without some ugly per-target code to walk the MachineInstrs preceding the INLINEASM.)

I've been running into this. The only target that supports folding ASM operands is x86. Like you pointed out, ARM has a weird COPY instruction and uses that register for the frame index (what??). So hair-pulling frustrations.

Ok, this approach is working. I have x86_64, riscv64, arm, AArch64, and powerpc64le working (tentatively, small tests of just non-goto asm with "rm" inputs). But I'm able in isel to check if an arch supports the memfold optimization via a new TargetLowering hook. Adding support for another arch is starting to look the same between the 4 above:

Which repo has these changes? I'd like to look at them.

This is what I've done so far:

  1. If the register allocator isn't Greedy, I fall back to the current behavior: ISel picks the m constraint, and the fast register allocator should still work.
  2. The current support for x86 seems to work just fine for both normal and register pressure cases. ARM/AArch64 works only when under no register pressure.

I'm thinking of submitting a PR to add support for x86 only. It'll fix the issue we're seeing in Linux (I think only x86 architectures use the "rm" constraint). It amounts to a hack, though.

What I toyed with doing is adding both versions of the constraint into the initial MIR. So something like:

INLINEASM ... $0:[regdef:GPR foldable], def %1:gpr, $1:[mem:m], %4:gpr, $2:[regdef:GPR foldable], def dead %2:gpr, $3:[mem:m], %3:gpr, ...

With this, if we want to use the memory operand, we have it already set up and ready to go. All we need to do is remove the $0:[regdef:GPR foldable], def %1:gpr part, and vice versa. (I'm hoping that later passes will remove dead code and eliminate unused stack slots, etc.) However, given how ARM/AArch64 does things, I'm not sure if the correct register will be used afterwards. We might be able to combine them into some franken-operand, but that way lies madness...

I'm really unsure what to do here. I've been staring at this code for a long time now and the only thing I can come up with is we need to rewrite the whole mess. But that's a ton of work just to get good code gen for "rm" on x86...

Thoughts?