llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

inline asm "rm" constraint lowered "m" when "r" would be preferable #20571

Open d0k opened 10 years ago

d0k commented 10 years ago
Bugzilla Link 20197
Version trunk
OS Linux
Blocks llvm/llvm-project#4440
CC @legrosbuffle, @echristo, @isanbard, @josephcsible, @nickdesaulniers, @zygoloid, @tstellar

Extended Description

When multiple alternatives in an inline asm constraint are given, we ignore all of them but the most "general". This gives nasty artifacts in the code.

int bsr(unsigned v) {
  int ret;
  __asm__("bsr %1, %0" : "=&r"(ret) : "rm"(v) : "cc");
  return ret;
}

$ clang -O3 -S -o - t.c
bsr:
    movl    %edi, -4(%rsp)
    #APP
    bsrl    -4(%rsp), %eax
    #NO_APP
    retq

The spilling is totally unnecessary. GCC gets this one right. On 32-bit x86 it's even worse:

$ clang -O3 -S -o - t.c -m32
bsr:
    pushl   %eax
    movl    8(%esp), %eax
    movl    %eax, (%esp)
    #APP
    bsrl    (%esp), %eax
    #NO_APP
    popl    %edx
    retl

GCC knows a better way:

$ gcc-4.8 -O3 -S -o - t.c -m32
bsr:
#APP
    bsr 4(%esp), %eax
#NO_APP
    ret

The constraint "g" is just as bad, being translated into "imr" internally.

efriedma-quic commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#49406

efriedma-quic commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#47530

edwintorok commented 2 years ago

mentioned in issue llvm/llvm-project#4440

llvmbot commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#37583

llvmbot commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#35489

llvmbot commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#31525

efriedma-quic commented 3 years ago

Bug #48750 has been marked as a duplicate of this bug.

isanbard commented 4 years ago

[Copy-n-paste of my harebrained idea here]

The major issue with supporting multiple constraints is how we model those constraints until register allocation is complete. (Thank you, Capt. Obvious!) The decision of which constraint to use is made during DAG selection. So there's no (easy) way to change this during register allocation.

Half-baked idea (all very hand-wavy):

Different constraints use different setups (pre-asm code) and teardowns (post-asm code) for the inline asm. What if we created pseudo-instructions to represent inline asm setup and teardown? Something like this:

INLINEASM_SETUP    <representing register/memory setup>
INLINEASM          <...>
INLINEASM_TEARDOWN <representing copy of results into vregs/pregs>

The register allocator could then try different constraints (going from most restrictive to least restrictive) until it finds one that works.

One drawback is that the RA needs to process INLINEASM before it can generate the correct code for INLINEASM_SETUP. That might be doable if the three instructions are treated as a single unit.

efriedma-quic commented 4 years ago

Bug #46874 has been marked as a duplicate of this bug.

josephcsible commented 4 years ago

Wouldn't it be better to at least treat 'mr' as 'r' rather than 'm'? That would probably yield better code in many cases.

"mr" isn't being treated as "m". Consider this C function:

void f(int x) {
    asm volatile("# x is in %0" :: "mr"(x));
}

With "clang -O3 -m32", it compiles into this mess:

f:
        pushl   %eax
        movl    8(%esp), %eax
        movl    %eax, (%esp)
        # x is in (%esp)
        popl    %eax
        retl

But if I use "m" instead of "mr", then it compiles into what I wanted:

f:
        # x is in 4(%esp)
        retl

So the presence of "r" is somehow making the codegen worse even though it's putting the value in memory anyway.

llvmbot commented 6 years ago

Bug #30873 has been marked as a duplicate of this bug.

llvmbot commented 6 years ago

Bug #36931 has been marked as a duplicate of this bug.

llvmbot commented 6 years ago

Wouldn't it be better to at least treat 'mr' as 'r' rather than 'm'? That would probably yield better code in many cases.

llvmbot commented 6 years ago

Bug #34837 has been marked as a duplicate of this bug.

echristo commented 10 years ago

Agreed.

nickdesaulniers commented 2 years ago

I was thinking of this issue while reading ChooseConstraint in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp.

bwendling commented 2 years ago

I was thinking of this issue while reading ChooseConstraint in llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp.

We need to talk about your reading habits. :-P

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

nickdesaulniers commented 1 year ago

At the llvm dev conf '22 @topperc or @compnerd or @arsenm mentioned that perhaps we could add a new register flag (akin to early-clobber, def, or kill) and then another pass ("memfold"? or possibly just regalloc) could try to DTRT.

EDIT: I think @topperc was referring to TargetInstrInfo::foldMemoryOperand or InlineSpiller::foldMemoryOperand.

nickdesaulniers commented 1 year ago

Ah, looks like we haven't made any progress on this since my last report above 9mo ago. Still an issue though. https://lore.kernel.org/lkml/CAHk-=whVvD05T0yD5DQj803uETLD6qDq-Vx-SiLPcrL=eO77LQ@mail.gmail.com/

nickdesaulniers commented 1 year ago

So I have something hacked up that is slightly working; it's not ready to be published today, but maybe by the end of next week I'll have something to demonstrate feasibility.

The idea is:

  1. have selectiondag's ChooseConstraint be more aggressive, but add a new InlineAsm::Kind_* value between Kind_Register and Kind_Mem called Kind_SpillableRegister (or whatever), which selectiondag will set and which behaves as Kind_Register in all cases but the one described below (see the sketch after this list).
  2. have the greedy register allocator check for these kinds of operands before imminent register exhaustion; if they exist, spill before the inline asm, reload after the inline asm, and update the inline asm to use the newly created stack slot.
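
Roughly, the new kind would slot into InlineAsm.h's operand kinds something like this (a sketch; the names other than Kind_SpillableRegister exist upstream, but the ordering and values here are illustrative only):

// llvm/include/llvm/IR/InlineAsm.h (illustrative sketch)
enum Kind {
  Kind_RegUse,             // input register, "r"
  Kind_RegDef,             // output register, "=r"
  Kind_RegDefEarlyClobber, // early-clobber output register, "=&r"
  Kind_SpillableRegister,  // proposed: register that RA MAY spill ("rm")
  Kind_Clobber,            // clobbered register, "~{reg}"
  Kind_Imm,                // immediate
  Kind_Mem,                // memory operand, "m"
};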

Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands. We need to be able to:

  1. signal to RA the difference between 3 cases:
    • "the constraint string for this operand allows the operand to be spilled" (i.e. "rm")
    • "the constraint string for this operand does not allow the operand to be spilled" (i.e. "r")
    • "the constraint string for this operand does not allow the operand to exist in a register" (i.e. "m")

Right now, selectiondag picks one of the last two, lacking the ability to express the first, since RA lacks the machinery from 2 below.

  2. the transformation machinery for INLINEASM (see the sketch after this list):
    • insert spill before INLINEASM
    • insert reload after INLINEASM (or reloads after every indirect branch target of INLINEASM_BR)
    • transform operand from reguse to mem
    • transform operand from virtreg to stack slot
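
A before/after sketch of that rewrite (illustrative MIR; the x86 memory-operand shape matches the dumps later in this thread):

; before: operand is a register use, and regalloc is out of registers
INLINEASM &"# $0" [attdialect], $0:[reguse], %v:gr32

; after: spill inserted before the asm, operand rewritten to the new slot
MOV32mr %stack.0, 1, $noreg, 0, $noreg, %v :: (store (s32) into %stack.0)
INLINEASM &"# $0" [attdialect], $0:[mem:m], %stack.0, 1, $noreg, 0, $noreg
; (a reload after the asm would follow for outputs, or on every indirect
;  target for INLINEASM_BR)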

There's probably more I need to do for greedy RA to update the live range info (I don't fully understand LiveRangeEdit quite yet). Right now my hacky code is failing to clean up the previously used virtreg which leads to an assertion failure down the line; I hope to have that fixed by EOD but we'll see.

cc @qcolombet

qcolombet commented 1 year ago

Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands.

I don't remember what we do for inlineasm constraints, but naively I would have expected that the only thing we would need to change is teaching foldMemoryOperand how to handle Kind_SpillableRegister.

BTW, good thinking with the new kind; I didn't think about adding a kind that would represent a mix of both states.

nickdesaulniers commented 1 year ago

I've been slowly refactoring the InlineAsm class; I think we should be able to steal one bit out of the flags. The class uses unsigned everywhere (and static methods); refactoring it to be a proper class makes it much, much more ergonomic and should let us add new fields with more confidence (without fear of breaking any of the bit packing it's doing manually; god, why did this code never use bitfields... smh).

Then we can add/steal a bit to denote that the operand should be placed in a register, but that inlinespiller or greedy reg alloc MAY spill it if necessary.

Then we'll need the transform code for INLINEASM and INLINEASM_BR; I have it hacked up for INLINEASM in greedy reg alloc at the moment, but that might not be the final destination. We'll see (next week).

qcolombet commented 1 year ago

Basically, it seems that RA doesn't know how to spill INLINEASM (and INLINEASM_BR) register operands.

I don't remember what we do for inlineasm constraints, but naively I would have expected that the only thing we would need to change is teaching foldMemoryOperand how to handle Kind_SpillableRegister.

BTW, good thinking with the new kind; I didn't think about adding a kind that would represent a mix of both states.

For the record, foldMemoryOperand won't be enough. It'll work when the pressure point is between the definition of the virtual register and the inline asm, but not when it is ON the inline asm.

E.g.,

v = ...
... // <-- pressure point
... = inline_asm v

Will unfold fine as:
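
(The snippet was elided here; presumably the spill lands at the pressure point and the reload is folded into the asm as a memory operand, roughly:)

v = ...
store v, stack.0    // <-- pressure point; v spilled here
...
... = inline_asm stack.0  // reload folded into the asm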

However, it won't work for:

v = ...
...
... = inline_asm v  // <-- pressure point

Because after splitting, we'll get:

v = ...
v2 = copy v
...
v3 = copy v2
... = inline_asm v3  // <-- pressure point

And v3 is not spillable with respect to greedy regalloc nomenclature.

nickdesaulniers commented 1 year ago

https://godbolt.org/z/Esc7chfWW is an important example (and note to self).

This behavior only occurs with "m" and not any other memory-like constraint FWICT.

This is important because I plan to steal 1 bit from InlineAsm::Flag to denote that a reg {use|def|clobber} is permitted to be spilled.

Otherwise, we'd need the full list of constraints to rematerialize what the next preferable constraint was during register allocation, i.e. possibly storing the "next best" MemConstraint in the event of register exhaustion. It also means my lazy ass has fewer tests to write (register exhaustion × each memconstraint).

nickdesaulniers commented 1 year ago

Clang seems to generate spills if "m" exists in the constraint list (visible with -Xclang -disable-llvm-passes); those will need to get cleaned up, too.

nickdesaulniers commented 1 year ago

Aha! I just got this all to work with SelectionDAGISel + greedy regalloc! (Only with inputs for now, and only inline asm, not inline asm goto; though asm goto should have edge cases for outputs.) Now to get GlobalISel, FastISel, and RegAllocFast working.

I might be able to break this up into 3 (or more) distinct PRs.

  1. register allocator changes + tests
  2. instruction selection changes + tests
  3. clang changes + tests

One thing I struggled with is that MachineInstr isn't ergonomic in terms of replacing MachineOperands. In particular, transforming a TargetOpcode::INLINEASM to use memory operands rather than a register operand involves turning 2 MachineOperands into 5 (at least for x86; I'll have to write some code to determine how many MachineOperands are necessary to refer to a MachineOperandType::MO_FrameIndex per target). The groundwork I did in 86735a4353aee4a3ba1e2feea173a7cc659c7a60 and 93bd428742f985a4b909dd5efee462ea520c96c0 was helpful. I can understand why MachineInstr generally doesn't allow for arbitrary MachineOperand replacement; for most non-pseudo instructions the operands are quite fixed/regular. But for a pseudo-instruction like TargetOpcode::INLINEASM, we might need more flexibility to be able to transform such instructions in order to arbitrarily spill specific operands should we need to scavenge for registers.

nickdesaulniers commented 1 year ago

Captain's Log, Stardate 43120.8: got my hacky spaghetti to validate with -verify-regalloc

qcolombet commented 1 year ago

I spent some time looking at the folding idea a bit more. From the allocator's perspective it seems to fit pretty well; the problem is going to be actually implementing the memory folding in inline asm. Like you saw, @nickdesaulniers, MIR is pretty rough when it comes to that.

Anyhow, here is the sketch of a patch. inlineasm.patch

To "productize it" I think we should:

  1. Merge the added canBeMemoryFolded with isLiveAtStatepointVarArg
  2. Teach canBeMemoryFolded which inline asm operands are actually foldable
  3. Add the support for folding memory operands into inline asm (right now this is a TODO in llvm/lib/CodeGen/TargetInstrInfo.cpp)
  4. Bonus points: in canBeMemoryFolded we could actually open it up to all memory folding, not just patchpoints and inline asm. The problem is that IIRC the foldMemoryOperand helpers don't have a "dry-run" version (e.g., canFoldMemoryOperand); i.e., we would need some refactoring to expose that (see the sketch after this list).
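
A sketch of what that dry-run API might look like (hypothetical; it mirrors the shape of the existing TargetInstrInfo::foldMemoryOperand hooks and is not an upstream interface):

// Hypothetical addition to TargetInstrInfo: report whether
// foldMemoryOperand would succeed on operands Ops of MI, without
// mutating anything, so callers can budget spill decisions up front.
virtual bool canFoldMemoryOperand(const MachineInstr &MI,
                                  ArrayRef<unsigned> Ops) const;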

Regarding RegAllocFast, I think it would be best to just use the memory operand directly from ISel; in other words, not even exposing this problem to this allocator. I haven't looked closely, but given this allocator is only aimed at O0, I expect it would not be worth bothering to handle this.

nickdesaulniers commented 1 year ago

Anyhow, here is the sketch of a patch.

+    // TODO: Check actual operand index for folding. Must have the right inline
+    // asm permission to do that.

Let's chat more in person next week, but IIUC, that approach starts with "m" being chosen, then foldMemoryOperand is used to try to convert to "r"? One benefit of that approach is it sounds like the logic for preferring "m" over "r" in the instruction selection frameworks would not need to be changed. I do wonder if that could result in "m"s not being converted to "r" with the same priority (i.e. we always want that, until we can't due to register pressure)?

I have a different approach I've been working on (maybe going down the wrong path, but it seems to be working). (Excessive delay this week because my workstation failed and is being replaced; I pushed my branch, though I would prefer to clean it up more before showing it.) ISEL picks "r", and only when it's about to fail due to register exhaustion does greedy perform one final scan of the operands to see if any of the "r"s are "spill-able." If so, it spills. The machinery for swapping out register operands for memory operands (or vice versa) is complicated (not too complicated, but not as simple as replacing one operand) and target specific (so I'm doing that transform in TargetInstrInfo).

This approach doesn't involve inline spiller or CalcSpillWeights, since greedy also has a reference to the LiveIntervals and can modify them with new LiveRanges as well. This guarantees that registers are always preferred, until the point where we know we are about to imminently exhaust registers unless we spill.

Add the support for folding memory operands into inline asm (right now this is a TODO in llvm/lib/CodeGen/TargetInstrInfo.cpp)

sorry, where? I didn't see any relevant TODO or FIXME.

qcolombet commented 1 year ago

that approach starts with "m" being chosen, then foldMemoryOperand is used to try to convert to "r"?

No, that's the other way around. We use the "r" constraints (sorry, not part of that patch; I was assuming that's what you would hand to the allocator). Then if that one doesn't work, the compiler will try to spill it (hence why it is important to mark the related live range spillable: https://github.com/llvm/llvm-project/files/12819005/inlineasm.patch), and when we spill it we call foldMemoryOperand.

I have a different approach I've been working on[...]

In the end, I think we do the same thing :). My patch just makes sure we don't have to actually modify anything but the spill weights, which I think is much simpler assuming you can do the rewrite for InlineAsm in foldMemoryOperand.

sorry, where? I didn't see any relevant TODO or FIXME.

I meant in the patch https://github.com/llvm/llvm-project/files/12819005/inlineasm.patch, where I print Ding.

nickdesaulniers commented 1 year ago

We use the r constraints (sorry not part of that patch, I was assuming that's what you would hand to the allocator).

ah! I missed that part. In that case then, yes:

In the end, I think we do the same thing :).


My patch just makes sure we don't have to actually modify anything but the spill weights, which I think is much simpler assuming you can do the rewrite for InlineAsm in foldMemoryOperand.

Yes, I think this is the issue with my current approach; I end up needing to update the live ranges manually, which, while possible, clutters the implementation a bit and seems error prone. With your approach, it looks like I can skip doing that manually and rely on the pre-existing infrastructure a bit more.

I have another branch built on top of your approach; in-tree tests are passing. Now I need to get it working for more complex cases that we don't yet have in tree AFAICT, i.e. tied operands.

nickdesaulniers commented 1 year ago

Writing up a bunch of tests now. I suspect I'll need to teach InlineSpiller about INLINEASM_BR; currently it's emitting the reload just along the fallthrough edge. This is wrong for asm goto with outputs.

define i32 @inout_pressure_goto (i32 inreg %x) nounwind {
   %1 = callbr i32 asm "# $0 $1 $2", "=rm,0,!i,~{ax},~{bx},~{cx},~{dx},~{si},~{di},~{bp}"(i32 %x)
   to label %ft [label %ft]
ft:
   ret i32 %1
 }

becomes:

# *** IR Dump After Greedy Register Allocator (greedy) ***:
# Machine code for function inout_pressure_goto: NoPHIs, TracksLiveness, TiedOpsRewritten, TracksDebugUserValues
Frame Objects:
  fi#0: size=4, align=4, at location [SP+4]
Function Live Ins: $eax in %3

0B      bb.0 (%ir-block.0):
          successors: %bb.2(0x80000000), %bb.1(0x00000000); %bb.2(100.00%), %bb.1(0.00%)
          liveins: $eax
16B       MOV32mr %stack.0, 1, $noreg, 0, $noreg, $eax :: (store (s32) into %stack.0)
64B       INLINEASM_BR &"# $0 $1 $2" [mayload] [maystore] [attdialect], $0:[mem:m], %stack.0, 1, $noreg, 0, $noreg, $1:[mem:m], %stack.0, 1, $noreg, 0, $noreg, $2:[imm], %bb.1, $3:[clobber], implicit-def dead early-clobber $ax, $4:[clobber], implicit-def dead early-clobber $bx, $5:[clobber], implicit-def dead early-clobber $cx, $6:[clobber], implicit-def dead early-clobber $dx, $7:[clobber], implicit-def dead early-clobber $si, $8:[clobber], implicit-def dead early-clobber $di, $9:[clobber], implicit-def dead early-clobber $bp :: (store (s32) into %stack.0)
72B       %9:gr32 = MOV32rm %stack.0, 1, $noreg, 0, $noreg :: (load (s32) from %stack.0)
80B       JMP_1 %bb.2

96B     bb.1..ft_crit_edge (machine-block-address-taken, inlineasm-br-indirect-target):
        ; predecessors: %bb.0
          successors: %bb.2(0x80000000); %bb.2(100.00%)

112B    bb.2.ft:
        ; predecessors: %bb.0, %bb.1

128B      $eax = COPY %9:gr32
144B      RET 0, $eax

which is not right. There originally was a phi in bb.2.ft between two copies of the same virtreg in both bb.1..ft_crit_edge and bb.2.ft, which got replaced by a COPY of the virtreg in the common tail by the opt-phis pass.

nickdesaulniers commented 12 months ago

Ok, I've hit another stumbling block when doing full integration tests for non-x86, non-riscv targets. There may be a prerequisite yak shave necessary here. Basically, I'm confident that I can get this working for x86/riscv at the moment. But every other architecture has a slight difference at this point in MIR (I guess that makes x86/riscv the odd ones out, but I'll argue that x86/riscv are the only ones that make sense to me).

Consider the following C code:

void foo (void) {
  int x = 42;
  asm("# %0"::"m"(x));
}

After instruction selection (clang foo.c -S -o - -O2 -fno-asynchronous-unwind-tables -mllvm -stop-after=finalize-isel) we have the following MIR:

...
MOV32mi %stack.0.x, 1, $noreg, 0, $noreg, 42 :: (store (s32) into %ir.x, !tbaa !4)
INLINEASM &"# $0", ..., 262190 /* mem:m */, %stack.0.x, 1, $noreg, 0, $noreg, ...
...

This makes sense to me. That 262190 corresponds to an InlineAsm::Flag that says the following operand is a MachineOperandType::MO_FrameIndex AND the next 5 MachineOperands are all logically one operand. That's why we need to be able to splice operands: https://github.com/llvm/llvm-project/pull/67699 to replace 2 MachineOperands (that initial metadata node then a register) with 5.

But let's see if other targets do this too. Same input, same compiler flags with the addition of --target=aarch64-linux-gnu:

STRWui killed %1, %stack.0.x, 0 :: (store (s32) into %ir.x, !tbaa !5)
%2:gpr64sp = ADDXri %stack.0.x, 0, 0
%3:gpr64sp = COPY %2
INLINEASM &"# $0", ..., 262158 /* mem:m */, killed %3, !9

What?! So aarch64 doesn't use a MachineOperandType::MO_FrameIndex to represent the stack slot index. Instead it uses an InlineAsm::Flag::Kind::Mem, but then is followed by a MachineOperandType::MO_Register operand. Um...ok.(?)

How about 32b ARM (--target=arm-linux-gnueabihf):

STRi12 killed %4, %stack.0.x, 0, 14 /* CC::al */, $noreg :: (store (s32) into %ir.x, !tbaa !5)
%5:gpr = ADDri %stack.0.x, 0, 14 /* CC::al */, $noreg, $noreg
INLINEASM &"# $0", ..., 262158 /* mem:m */, killed %5, !9

--target=powerpc64le-linux-gnu:

STW8 killed %3, 0, %stack.0.x :: (store (s32) into %ir.x, !tbaa !4)
%4:g8rc = ADDI8 %stack.0.x, 0
%5:g8rc_nox0 = COPY %4
INLINEASM &"# $0", ..., 262158 /* mem:m */, killed %5

--target=riscv64-linux-gnu:

SW killed %1, %stack.0.x, 0 :: (store (s32) into %ir.x, !tbaa !6)
INLINEASM &"# $0", ..., 262166 /* mem:m */, %stack.0.x, 0, !10

So x86 and riscv work as expected; arm, aarch64, and powerpc do not. And what I really don't like about the arm/aarch64/powerpc approach is that they appear as if they are uses of virtual registers, when they should be uses of frame indexes.

Further, when greedy asks TargetInstrInfo to foldMemoryOperand, it says "see if you can fold a load from stack slot X, given to me by inline spiller." So we can easily splice in a MachineOperandType::MO_FrameIndex for slot X. But for targets that instead use a register, there's no mapping to be able to say "given stack slot X, what the hell virt reg corresponds to that?" (At least, not without some ugly per-target code to walk previous MachineInstrs preceding the INLINEASM.)

I assume that's a historical accident and can be fixed up so that all targets use MachineOperandType::MO_FrameIndex MachineOperands to refer to stack slots, at least for INLINEASM/INLINEASM_BR. But that should be fixed first. Perhaps upon starting to shave that yak, I find out there is a good reason for this orthogonality between target backends, and then have to unshave the yak (or put the work done as part of the yak shave in the garbage).

It's also curious why x86 has 5 operands to represent a stack slot (probably something to do with lea but not sure why that matters for inline asm) and riscv has 2. Why can't we use 1 everywhere? If we could, then we wouldn't even need to support splicing machine operands.

nickdesaulniers commented 12 months ago

// $ clang --target=arm-linux-gnueabihf
struct foo {
  int a, b, c, d, e;
};
int zzz (struct foo my_foo) {
    asm ("# %0"::"m"(my_foo.b):"r1","r2","r3","r12","lr","r4","r5","r6","r7","r8","r9","r10","r11");
    return my_foo.a;
}

GCC picks [sp, #40], clang picks [r0]. Adding "r0" to the clobber causes clang to fail to allocate a register for %0 while GCC compiles just fine.

I think I should shave that yak...

Though, even if some backends have this orthogonal behavior, maybe I should just teach them how to lower frame indices properly.

nickdesaulniers commented 11 months ago

So x86 and riscv work as expected; arm, aarch64, and powerpc do not.

Ah, that's not always the case. Both

have cases that disprove that. The flag outputs, for example, produce:

...
%3:gr32 = MOV32rm %fixed-stack.0, 1, $noreg, 0, $noreg :: (load (s32) from %fixed-stack.0, align 8)
INLINEASM ... $1:[mem:m], killed %3:gr32, 1, $noreg, 0, $noreg, ...
...

so it is possible that we have "mem" (InlineAsm::Kind::Mem) followed by virtual register MachineOperands on x86. So my understanding above wrt. x86 was wrong.

Though, even if some backends have this orthogonal behavior, maybe I should just teach them how to lower frame indices properly.

Yeah. After poking at this in https://github.com/llvm/llvm-project/pull/69654, I think the approach I should take instead is to handle this curious case, and work on each backend to work through issues they might have translating stack slots into valid addressing modes. For 32b ARM, I can reuse most of https://github.com/llvm/llvm-project/pull/69654 without changing existing codegen behavior for existing test cases.

EDIT: yep, supporting this improved the codegen for those 2 x86 tests listed above.
EDIT2: nope, those are indirect memory operands on x86. Can't fold those, else we lose a level of indirection.


@qcolombet and I met on Friday to discuss the approach so far. 2 additions he highlighted:

  1. for fastreg alloc, the presence of optnone on a function might be an indicator that we don't want to mark registers as spillable, and just choose "m" when "rm" is observed.
  2. rather than try to support all targets all at once, perhaps a TargetInstrInfo (or SubTargetInfo) hook for canMemFoldInlineAsm or some such would allow for more incremental adoption. That probably doesn't depend on the regalloc changes, which can land first, but would be used by the instruction selection changes that would make these optimizations visible to clang.

nickdesaulniers commented 11 months ago

Status report:

for fastreg alloc, the presence of optnone on a function might be an indicator that we don't want to mark registers as spillable, and just choose "m" when "rm" is observed.

This isn't going to work. From a discussion with @arsenm and @nikic on discord:

Nick Desaulniers — Yesterday at 12:24 PM
@arsenm is there an assumption that FastRegalloc is only used for optnone? or is it expected that fastregalloc works even for non-optnone fn's?

nikic — Yesterday at 2:20 PM
@nickdesaulniers Not sure if that answers your question, but the O0 backend must be able to deal with O3 IR.

arsenm — Yesterday at 7:09 PM
It should always work. I know I've fixed assorted verifier failures and pass errors with -O3 + -regalloc=fast

So FastReg alloc MUST also be taught how to fold memory operands into inline asm since the decision to choose registers is made beforehand by the instruction selection framework, which cannot account for which register allocation framework will later be used. I'll start work on that today. I suspect support for Greedy and Fast Regalloc can land separately, and before any isel changes.


and work on each backend to work through issues they might have translating stack slots into valid addressing modes.

EDIT2: nope, those are indirect memory operands on x86. Can't fold those, else we lose a level of indirection.

rather than try to support all targets all at once, perhaps a TargetInstrInfo (or SubTargetInfo) hook for canMemFoldInlineAsm or some such would allow for more incremental adoption.

Ok, this approach is working. I have x86_64, riscv64, arm, aarch64, and powerpc64le working (tentatively; small tests of just non-goto asm with "rm" inputs). In isel I'm able to check whether an arch supports the memfold optimization via a new TargetLowering hook. Adding support for another arch is starting to look the same across the 4 above:

  1. add an override for the TargetLowering hook.
  2. add a target-specific InstrInfo hook for the number of MachineOperands used to represent a stack slot index.
  3. adjust AsmPrinter::PrintAsmMemoryOperand to handle that number of MachineOperands for the memory InlineAsm::Kind. (This is also already done for the STACKMAP, PATCHPOINT, and STATEPOINT pseudo MachineInstrs, which appear to also use MachineOperand::MachineOperandType::MO_FrameIndex.)
  4. (optional) teach the target-specific RegisterInfo::eliminateFrameIndex override which MachineOperand to look at/adjust.
  5. fix other, smaller target-specific bugs that result.

All of the above for the last 4 architectures listed was only 10-40 LoC per arch (excluding tests); step 1 is sketched below.
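
Step 1 boils down to a per-target opt-in, roughly like this self-contained sketch (the hook name follows the canMemFoldInlineAsm suggestion above and is hypothetical, as are the stripped-down class definitions):

// Sketch only; not the upstream class definitions.
class TargetLowering {
public:
  virtual ~TargetLowering() = default;
  // Default: target hasn't been audited/taught, so keep the old behavior.
  virtual bool canMemFoldInlineAsm() const { return false; }
};

class X86TargetLowering : public TargetLowering {
public:
  // x86 opts in: arbitrary instructions accept memory operands.
  bool canMemFoldInlineAsm() const override { return true; }
};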

I'll need to retest all of the above with more complex tests such as outputs and tied operands, then again with asm goto variants. I might even limit the memfold optimization to just vanilla INLINEASM and not INLINEASM_BR (asm goto) to start landing pieces, in order to avoid excessive churn if pieces need to be backed out (since I've already identified one miscompile).


I noticed a slight issue for ppc64le. Everything works great on inputs/outputs that are the word size. But use an input smaller than the word size, and suddenly the TargetInstrInfo::foldMemoryOperand optimization is inhibited (it seems some targets zero the upper 32b of the GPR; I'm not sure that all other targets do). I wonder if the zeroing is necessary when a spill is performed (should be easy to test): https://godbolt.org/z/hPY9Wqe59 (GCC doesn't do that; maybe we can remove it from LLVM).

nickdesaulniers commented 11 months ago

EOW status report

Ok, I have fastregalloc working well enough that I think it's time to start upstreaming the building blocks shared between greedy and fastregalloc for this all to work. I have fastregalloc working for x86_64 and aarch64 with many different test cases; I didn't implement support for asm goto yet, but that doesn't seem problematic (famous last words). I don't foresee issues with other targets for fastregalloc. Still a bit shell-shocked from all of the iterator-invalidation issues I had to "wrastle" with to get my fastregalloc impl working (because I'm a bad programmer).

First patch to send will be machine operand splicing: https://github.com/llvm/llvm-project/pull/67699. I need to add test cases and the downstream fixes I'm carrying for supporting tied operands.

nickdesaulniers commented 11 months ago

Those are helpers. Once they've landed, I think the best approach will be:

  1. publish greedy and fastregalloc support as separate PRs. Those can land independently of one another. They still won't get set up by instruction selection yet, but I can start adding unit tests.
  2. publish GlobalISel support. At this point, clang will start making codegen differences, but only for limited targets & optimization levels. BOTH register allocation framework changes from 1 must land first.
  3. publish SelectionDAGISel support. Technically this doesn't depend on 2, but landing 2 first will de-risk 3 and give us a chance to find more bugs. I might do this in a per-target manner to de-risk further.
  4. asm goto w/ outputs support. In 2 and 3, I'll probably limit them to INLINEASM and not INLINEASM_BR.

For the register allocators I'll probably just add tests for 1 target; I'd like to have more comprehensive integration tests all the way from clang for each target where this basically tests:

  1. "rm" with no register pressure chooses "r" over "m" (change in behavior)
  2. "rm" with register pressure chooses "m"

for each target × input/output/tied operands × asm vs. asm goto (e.g. the sketch below).
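
For instance, for the first two bullets, an end-to-end test pair might look like this (a sketch only; the clobber-everything trick mirrors the inout_pressure_goto test earlier in this thread, and the x86-flavored clobber list would need a per-target equivalent):

int no_pressure(int x) {
  asm("# %0" : "+rm"(x)); /* no pressure: expect %0 in a register */
  return x;
}

int pressure(int x) {
  /* exhaust GPRs so the allocator must fall back to a stack slot */
  asm("# %0" : "+rm"(x) : : "ax", "bx", "cx", "dx", "si", "di", "bp");
  return x;
}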

nickdesaulniers commented 11 months ago

https://github.com/llvm/llvm-project/pull/70832/commits/89dbc57fe61dc90ebe116294644443d0f9578a51 has some very basic MIR tests demonstrating the memory folding in action. There's ongoing discussion in https://github.com/llvm/llvm-project/pull/70738 as to "what to call" this disambiguated machine operand (where the instruction selectors set up the register allocators to dunk), so that part may change/be rebased.

nickdesaulniers commented 11 months ago

There's ongoing discussion in https://github.com/llvm/llvm-project/pull/70738 as to "what to call" this disambiguated machine operand (where the instruction selectors set up the register allocators to dunk), so that part may change/be rebased.

That's resolved (s/spillable/foldable/).


@topperc and I had a pertinent discussion about this last Wednesday; "rm" only makes sense generally for CISC architectures which support multiple different addressing modes on any given instruction. RISC targets are more likely to only support memory operands for loads+stores. I checked the Linux kernel sources quickly and couldn't find usage of "rm" in inline asm for 32b arm. I'll triple check the other RISC architectures that have kernel ports and that clang can build.

But I suspect I'll only need to support this for x86 and maybe SystemZ. I'll do more due diligence on usage of "rm" in the Linux kernel. Either way, that significantly reduces the test case burden. I have tests in hand for making "rm" spill under pressure for hexagon, lanai, arm, aarch64, ppc64, and riscv; but if none of those even have instructions that could use "r" or "m" then what is the point?

davidben commented 11 months ago

"rm" only makes sense generally for CISC architectures which support multiple different addressing modes on any given instruction.

I will definitely concede this is an extremely weird and slightly dubious thing to do, but my motivation in #71080 was actually to make an optimization barrier, so the asm string doesn't actually read the input. However, I want to force the compiler to generate code as if the input were used. (This is a "please don't optimize this value out" barrier.) So "rm" is ideal in that case because I don't want the compiler to generate a pointless load or store.
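
i.e., something like this minimal sketch (a hypothetical helper, not code from #71080):

/* Force the compiler to materialize `v` as if the asm read it, without
   emitting an extra load or store: "rm" lets `v` stay wherever it is. */
#define FORCE_LIVE(v) __asm__ volatile("" : : "rm"(v))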

FWIW, it does look like GCC gracefully handles "rm" on aarch64 too: https://godbolt.org/z/9614nhs8a

nickdesaulniers commented 11 months ago

but my motivation in https://github.com/llvm/llvm-project/issues/71080 was actually to make an optimization barrier [...] So "rm" is ideal in that case because I don't want the compiler to generate a pointless load or store.

Sounds fishy, but let's discuss this on https://github.com/llvm/llvm-project/issues/71080?

FWIW, it does look like GCC gracefully handles "rm" on aarch64 too: https://godbolt.org/z/9614nhs8a

Try clobbering all GPRs with GCC and see how graceful it is. ;) The complication wrt. supporting "rm" is not in choosing "r" in the hello world inline asm case; it's in choosing "m" when there's register pressure.

But yeah, let me clarify the point you cited: after my changes, "r" will be preferred over "m" for all architectures; I suspect that's the grace you're looking for.

nickdesaulniers commented 11 months ago

@topperc and I had a pertinent discussion about this last Wednesday; "rm" only makes sense generally for CISC architectures which support multiple different addressing modes on any given instruction. RISC targets are more likely to only support memory operands for loads+stores. I checked the Linux kernel sources quickly and couldn't find usage of "rm" in inline asm for 32b arm. I'll triple check the other RISC architectures that have kernel ports and that clang can build.

Ok, it looks like the m68k Linux kernel port uses "g". Otherwise I get no other hits for "rm" or "mr" (even adding "i" and checking all permutations) outside of arch/x86/.

nickdesaulniers commented 10 months ago

ok, the core parts have all landed for greedy.

(I noticed that peephole-opt will fold loads BEFORE reaching any register allocator, but not any stores).

Next up:

I noticed that peephole-opt folds loads; we might want to disable that for inline asm...I need to think more about that case.

nickdesaulniers commented 6 months ago

putting this down for now. If someone else wants to pick up the torch, I can push my WIP branch somewhere which may be useful as a reference (but will probably bit rot).

bwendling commented 6 months ago

I can push my WIP branch somewhere which may be useful as a reference (but will probably bit rot).

Please do.

nickdesaulniers commented 6 months ago

Those are some of my branches (I have 6 locally; they're free), but those 3 appear to be the latest/most developed. Really, the issue I ran into with RegAllocFast was: do we try to reassign a register to a stack slot and mutate the inline asm upon discovering we're in a state of register exhaustion (RegAllocFast was not designed to do this; everything is a const reference, and I'd wind up with duplicated instructions inserted in a bunch of places), or do we try to spill inline asm "rm" variables proactively (shot down in code review)?

https://gist.github.com/nickdesaulniers/8f20bea3bcdd9fe97219428ab6e8bf8b has a handful of end-to-end tests I was using to test support for all architectures. You may find these useful.

danilaml commented 6 months ago

regallocfast can fail to allocate regs even without this patch. I think it's not supposed to be used in production, just like fast-isel (I might be wrong), so another possibility would be just to revert to the old (always in-memory) behavior when fast regalloc is selected.

nickdesaulniers commented 6 months ago

There is no "revert [to] the old behavior" here. ISEL needs to be changed for this to work. Then which regalloc frameworks runs after needs to be able to handle what ISEL has decided. If we don't change ISEL, we don't fix this bug. If we don't fix regallocfast, then we may fail to compile code that previously we could.