llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Multiple calls to llvm.prefetch.p0i8 on aarch64 can cause apparently-unnecessary register spills #50516

Open steven-johnson opened 3 years ago

steven-johnson commented 3 years ago
Bugzilla Link 51172
Version trunk
OS All
Attachments prefetch.ll
CC @alinas, @brads55, @dsharletg, @sjoerdmeijer, @TNorthover

Extended Description

Steps to repeat:

(1) See enclosed file prefetch.ll -- notice that there are several calls to @llvm.prefetch.p0i8, with all but the first commented out
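
(For reference, a call to this intrinsic in the .ll file looks roughly like the sketch below. This is illustrative only, not the literal contents of prefetch.ll; the function name and %addr are placeholders. The operands are the address, read/write (0 = read), temporal locality (0 = streaming, 3 = high), and cache type (1 = data); a read prefetch with locality 0 is what lowers to the prfm pldl1strm seen in the assembly.)

    ; Illustrative sketch only -- not the literal contents of prefetch.ll.
    declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)

    define void @prefetch_example(i8* %addr) {
      ; address, rw (0 = read), locality (0 = streaming), cache type (1 = data)
      call void @llvm.prefetch.p0i8(i8* %addr, i32 0, i32 0, i32 1)
      ret void
    }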

(2) Compile to aarch64 assembly with llc -march=aarch64 ~/prefetch.ll -o - -O3 -mattr=+dotprod > ~/prefetch.s

(3) Examine output, search for prfm, see block similar to

    add x27, x26, x16
    prfm    pldl1strm, [x27]
    ldr q28, [x27]
    ldp q8, q29, [x12, #-32]
    ldr q31, [x28, x16]
    ldr q10, [x29, x16]
    ldr q11, [x30, x16]
    ldr q12, [x8, x16]
    ldp q9, q30, [x18, #-32]
    udot    v26.4s, v8.16b, v28.4b[0]
    udot    v24.4s, v8.16b, v31.4b[0]
    udot    v22.4s, v8.16b, v10.4b[0]
    udot    v20.4s, v8.16b, v11.4b[0]
    udot    v18.4s, v8.16b, v12.4b[0]

(4) Edit prefetch.ll to uncomment the call to llvm.prefetch.p0i8 on line 459, re-run llc

(5) Search for prfm again, and note that the start of the block now contains numerous vector-register spills that appear to be completely unnecessary:

    add x27, x26, x16
    prfm    pldl1strm, [x27]
    ldp q31, q30, [x18]
    ldr q0, [x12]
    ldr q1, [x27]
    str q30, [sp, #496]                 // 16-byte Folded Spill
    ldr q30, [x21]
    str q0, [sp, #400]                  // 16-byte Folded Spill
    ldr q0, [x12, #16]
    add x27, x28, x16
    stp q31, q30, [sp, #416]            // 32-byte Folded Spill
    ldr q30, [x21, #16]
    ldp q11, q29, [x12, #-32]
    str q0, [sp, #512]                  // 16-byte Folded Spill
    ldp q10, q0, [x18, #-32]
    str q30, [sp, #480]                 // 16-byte Folded Spill
    ldp q30, q31, [x14, #-32]
    ldp q8, q15, [x21, #-32]
    udot    v17.4s, v10.16b, v1.4b[0]
    udot    v17.4s, v0.16b, v1.4b[1]
    str q31, [sp, #384]                 // 16-byte Folded Spill
    ldr q31, [x14]
    udot    v16.4s, v30.16b, v1.4b[0]
    udot    v26.4s, v11.16b, v1.4b[0]
    udot    v26.4s, v29.16b, v1.4b[1]
    str q31, [sp, #448]                 // 16-byte Folded Spill
    ldr q31, [x14, #16]
    udot    v27.4s, v8.16b, v1.4b[0]
    udot    v27.4s, v15.16b, v1.4b[1]
    subs    x20, x20, #1                    // =1
    str q31, [sp, #464]                 // 16-byte Folded Spill
    prfm    pldl1strm, [x27]
    ldr q9, [x27]
    ldr q31, [x29, x16]
    ldr q12, [x30, x16]
    ldr q13, [x8, x16]
    udot    v2.4s, v10.16b, v9.4b[0]
    udot    v3.4s, v10.16b, v31.4b[0]
    udot    v14.4s, v10.16b, v12.4b[0]
    udot    v4.4s, v10.16b, v13.4b[0]
    udot    v2.4s, v0.16b, v9.4b[1]
    udot    v3.4s, v0.16b, v31.4b[1]
    udot    v14.4s, v0.16b, v12.4b[1]
    udot    v4.4s, v0.16b, v13.4b[1]

These spills don't seem to make any sense: the only instruction that should have been added here was the second prfm instruction, and it doesn't depend on any of the vector registers being spilled and reloaded. Is something about prefetch affecting this (e.g., confusing the lifetime analysis for registers loaded from the prefetch location)?

b4d93b01-5a04-4b02-9638-265208e07b76 commented 3 years ago

Thanks for pointing that out; that was a bug that we've hopefully since fixed. When troubleshooting performance, the first thing that stood out was the stack spills; I didn't get as far as noticing that the prefetch and the load were from the same address.

sjoerdmeijer commented 3 years ago

There are probably two things going on here: how the prefetch intrinsic affects codegen, as has been discussed so far, and the potential performance uplift.

Software prefetching is probably completely counterproductive if the very next instruction performs the load from the same address:

prfm    pldl1strm, [x27]
ldr q28, [x27]

Modern cores and their hardware prefetchers should predict this access pattern, making the prefetch instruction unnecessary and possibly even harmful to performance. On smaller cores with less sophisticated prefetchers there might be a use case for it, though. This is all rather hand-wavy (and unfortunately not described in e.g. the Cortex software optimisation guides), but the point is that it is worth experimenting to check whether you actually need the prefetch and whether it helps your case.
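
For what it's worth, if the prefetch is kept, the usual pattern is to prefetch some distance ahead of the address that is about to be loaded rather than the address itself, so the prefetch has time to complete before the data is needed. A minimal LLVM IR sketch under that assumption (the function name and the 256-byte look-ahead distance are made up for illustration):

    ; Hypothetical sketch: prefetch a fixed distance ahead of the current load.
    declare void @llvm.prefetch.p0i8(i8*, i32, i32, i32)

    define <16 x i8> @load_with_lookahead(i8* %cur) {
      ; touch the data 256 bytes ahead (illustrative distance)...
      %ahead = getelementptr inbounds i8, i8* %cur, i64 256
      call void @llvm.prefetch.p0i8(i8* %ahead, i32 0, i32 0, i32 1)
      ; ...then load the data actually needed now
      %cur.vec = bitcast i8* %cur to <16 x i8>*
      %v = load <16 x i8>, <16 x i8>* %cur.vec, align 1
      ret <16 x i8> %v
    }

Whether this pays off is exactly the experiment suggested above: on cores whose hardware prefetcher already catches the streaming pattern, even the look-ahead form may be redundant.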

steven-johnson commented 3 years ago

FYI: I made an experimental change to Halide's codegen so that all our calls to the Prefetch primitive in LLVM get hoisted to the top of the enclosing loop (rather than interleaved with other instructions in the loop).

The good news: as I suspected, this eliminates all the spills I saw before, and I end up with nice clean aarch64 assembly that looks something like the code below.

The not-so-good news: unsurprisingly, pushing the prefetch calls further away from the instructions that actually need the prefetched data dramatically reduces any benefit the prefetch would give us. (This varies somewhat depending on the type of ARM core running the code; e.g. on an A55 core, I saw essentially no speed difference with vs. without prefetches.)

Because of the limited performance improvement, I don't think this approach is likely to be worth pursuing as a long-term fix, unfortunately.

I'd still love to find a way to insert prefetch calls in an interleaved fashion without injecting unnecessary spills, but it's not clear to me how doable this is without nontrivial effort.

(snippet:)

.LBB0_103:            // %"for convolved.s1.r19$z.r124.us.us.us"
add x4, x18, x30
add x13, x11, x30
add x6, x14, x30
add x15, x8, x30
add x26, x28, x30
prfm    pldl1strm, [x4, #16]
prfm    pldl1strm, [x13, #16]
prfm    pldl1strm, [x6, #16]
prfm    pldl1strm, [x15, #16]
prfm    pldl1strm, [x26, #16]
ldr q7, [x4]
ldp q19, q16, [x3, #-32]
ldr q18, [x13]
ldr q2, [x6]
ldr q4, [x15]
ldr q0, [x26]
ldp q1, q17, [x19, #-32]
udot    v12.4s, v19.16b, v7.4b[0]
udot    v8.4s, v19.16b, v18.4b[0]
udot    v28.4s, v19.16b, v2.4b[0]
udot    v24.4s, v19.16b, v4.4b[0]
udot    v20.4s, v19.16b, v0.4b[0]
ldp q3, q19, [x24, #-32]
udot    v13.4s, v1.16b, v7.4b[0]
udot    v9.4s, v1.16b, v18.4b[0]
udot    v29.4s, v1.16b, v2.4b[0]
udot    v14.4s, v3.16b, v7.4b[0]
udot    v10.4s, v3.16b, v18.4b[0]
udot    v30.4s, v3.16b, v2.4b[0]
udot    v25.4s, v1.16b, v4.4b[0]
udot    v21.4s, v1.16b, v0.4b[0]
udot    v26.4s, v3.16b, v4.4b[0]
udot    v22.4s, v3.16b, v0.4b[0]
    ...etc...
steven-johnson commented 3 years ago

Patch to mark prfm as mayStore instead of hasSideEffects

Here's my attempt at marking the aarch64 PRFM instruction as mayStore instead of hasSideEffects. (This may be totally wrong -- I've never had occasion to edit .td files before, so this is a bit of guesswork.)

Unfortunately, running llc on prefetch_2.ll as described earlier shows no change with this patch in place -- I'm still getting spills. (Ditto for mayLoad instead of mayStore, which I patched in the same way.)

steven-johnson commented 3 years ago

Thanks for the tips, I'll try out mayLoad and/or mayStore and see how the results look.

llvmbot commented 3 years ago

mayStore might work as well.

llvmbot commented 3 years ago

I wouldn't bet on that. At the codegen layer, I don't think we can remove dead loads because we might be changing trap behavior.

TNorthover commented 3 years ago

I did wonder that, but my bet is that a mayLoad that doesn't actually produce any values is fair game for dead removal. We could change that edge case, of course.

llvmbot commented 3 years ago

Would it be viable to mark prefetches as mayLoad rather than hasSideEffects?

TNorthover commented 3 years ago

That does sound plausible. I suppose a "proper" fix would be to add hasPerfSideEffects (a.k.a. please don't delete me) as another level intrinsics might sit at.

steven-johnson commented 3 years ago

After more debugging, I'm now wondering if this may not be a bug at all, but a side-effect of instruction scheduling.

The prfm instruction is modeled as "has side effects" (which is true, and necessary, since otherwise it would get optimized away entirely), and this prevents the optimizer from being aggressive about moving surrounding instructions around.

I think what may be happening here is that in the no-prefetch case (or the one-prefetch-at-top-of-loop case), the machine-instruction scheduler is able to shuffle instructions around effectively, whereas the addition of the extra prfm effectively splits the block into two scheduling regions, so we end up with suboptimal register allocation across them, either out of necessity (literally running out of registers) or simply because optimal register allocation is hard.

One workaround I'm looking at is pushing all of the prfm instructions to the top of the inner loop, in the hopes that the rest of the loop can avoid spills; this isn't ideal, of course, as interleaving the prefetch instructions would be preferable for fairly obvious reasons.

Despite my suspicions, I'm going to keep this issue open for the time being, as I'm still not 100% convinced that the situation can't be improved.

steven-johnson commented 3 years ago

Looking over the disassembly of prefetch_2.ll in more detail, the result makes even less sense (see below): it appears that q0 is "spilled" several times in the inner loop but never reloaded (at least not in any way I can see).

.LBB0_2:                                // %"for convolved.s1.rdom$x"
                                        // Parent Loop BB0_1 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
    prfm    pldl1strm, [x4]
    ldp     q11, q0, [x2]
    add     x5, x2, x11
    ldp     q13, q31, [x5]
    ldr     q24, [x4]
    str     q0, [sp, #16]               // 16-byte Folded Spill
    ldr     q0, [x2, #32]
    subs    x3, x3, #1
    udot    v23.4s, v11.16b, v24.4b[0]
    udot    v21.4s, v13.16b, v24.4b[0]
    str     q0, [sp, #32]               // 16-byte Folded Spill
    ldr     q0, [x2, #48]
    udot    v21.4s, v31.16b, v24.4b[1]
    add     x2, x2, x13
    str     q0, [sp, #48]               // 16-byte Folded Spill
    ldp     q29, q0, [x5, #32]
    add     x5, x5, x11
    ldp     q26, q10, [x5]
    ldp     q8, q30, [x5, #32]
    add     x5, x5, x11
    ldp     q25, q15, [x5]
    ldp     q12, q9, [x5, #32]
    add     x5, x4, x12
    str     q0, [sp]                    // 16-byte Folded Spill
    prfm    pldl1strm, [x5]
    ldr     q14, [x5]
    ldr     q0, [x4, x16]
    ldr     q27, [x4, x17]
    udot    v20.4s, v25.16b, v24.4b[0]
    udot    v7.4s, v25.16b, v14.4b[0]
    udot    v2.4s, v25.16b, v0.4b[0]
    udot    v3.4s, v25.16b, v27.4b[0]
    ldr     q25, [sp, #16]              // 16-byte Folded Reload
    udot    v17.4s, v11.16b, v14.4b[0]
    udot    v16.4s, v11.16b, v0.4b[0]
    udot    v4.4s, v11.16b, v27.4b[0]
    udot    v23.4s, v25.16b, v24.4b[1]
    udot    v17.4s, v25.16b, v14.4b[1]
    udot    v16.4s, v25.16b, v0.4b[1]
    udot    v4.4s, v25.16b, v27.4b[1]
    ldr     q25, [sp, #32]              // 16-byte Folded Reload
    udot    v22.4s, v26.16b, v24.4b[0]
    udot    v19.4s, v26.16b, v14.4b[0]
    udot    v18.4s, v26.16b, v0.4b[0]
    udot    v5.4s, v26.16b, v27.4b[0]
    udot    v23.4s, v25.16b, v24.4b[2]
    udot    v17.4s, v25.16b, v14.4b[2]
    udot    v16.4s, v25.16b, v0.4b[2]
    udot    v4.4s, v25.16b, v27.4b[2]
    ldr     q25, [sp, #48]              // 16-byte Folded Reload
    ldr     q26, [sp]                   // 16-byte Folded Reload
    udot    v6.4s, v13.16b, v14.4b[0]
    udot    v28.4s, v13.16b, v0.4b[0]
    udot    v1.4s, v13.16b, v27.4b[0]
    udot    v6.4s, v31.16b, v14.4b[1]
    udot    v28.4s, v31.16b, v0.4b[1]
    udot    v1.4s, v31.16b, v27.4b[1]
    udot    v22.4s, v10.16b, v24.4b[1]
    udot    v19.4s, v10.16b, v14.4b[1]
    udot    v18.4s, v10.16b, v0.4b[1]
    udot    v5.4s, v10.16b, v27.4b[1]
    udot    v20.4s, v15.16b, v24.4b[1]
    udot    v7.4s, v15.16b, v14.4b[1]
    udot    v2.4s, v15.16b, v0.4b[1]
    udot    v3.4s, v15.16b, v27.4b[1]
    udot    v21.4s, v29.16b, v24.4b[2]
    udot    v6.4s, v29.16b, v14.4b[2]
    udot    v28.4s, v29.16b, v0.4b[2]
    udot    v1.4s, v29.16b, v27.4b[2]
    udot    v22.4s, v8.16b, v24.4b[2]
    udot    v19.4s, v8.16b, v14.4b[2]
    udot    v18.4s, v8.16b, v0.4b[2]
    udot    v5.4s, v8.16b, v27.4b[2]
    udot    v20.4s, v12.16b, v24.4b[2]
    udot    v7.4s, v12.16b, v14.4b[2]
    udot    v2.4s, v12.16b, v0.4b[2]
    udot    v3.4s, v12.16b, v27.4b[2]
    udot    v23.4s, v25.16b, v24.4b[3]
    udot    v21.4s, v26.16b, v24.4b[3]
    udot    v22.4s, v30.16b, v24.4b[3]
    udot    v20.4s, v9.16b, v24.4b[3]
    udot    v17.4s, v25.16b, v14.4b[3]
    udot    v6.4s, v26.16b, v14.4b[3]
    udot    v19.4s, v30.16b, v14.4b[3]
    udot    v7.4s, v9.16b, v14.4b[3]
    udot    v16.4s, v25.16b, v0.4b[3]
    udot    v28.4s, v26.16b, v0.4b[3]
    udot    v18.4s, v30.16b, v0.4b[3]
    udot    v2.4s, v9.16b, v0.4b[3]
    udot    v4.4s, v25.16b, v27.4b[3]
    udot    v1.4s, v26.16b, v27.4b[3]
    udot    v5.4s, v30.16b, v27.4b[3]
    udot    v3.4s, v9.16b, v27.4b[3]
    mov     x4, x5
    b.ne    .LBB0_2

steven-johnson commented 3 years ago

Sorry, I didn't realize the size of the example was the issue. I've enclosed one that is about half the size of the first one. Instructions are the same as before:

(1) llc -march=aarch64 ~/prefetch_2.ll -o - -O3 -mattr=+dotprod > ~/prefetch_2.s

(2) Note that the inner loop (for convolved.s1.rdom$x) has a single prfm and no spills

(3) Uncomment the call to llvm.prefetch.p0i8 on line 206 of prefetch_2.ll, rerun llc

(4) Note that in addition to a second prfm, there are also several spills/reloads in the inner loop now, but none seem to be related to the new prfm instruction.

steven-johnson commented 3 years ago

Smaller repro case

TNorthover commented 3 years ago

I had a very quick look a couple of weeks back. The attached test-case is still pretty big. You might have more luck drumming up support if you can reduce it to something simpler.

Though it's quite possible that reducing it is just as difficult as solving it. If you've tried and failed, it'd be worth mentioning; that might get you some sympathy.

steven-johnson commented 3 years ago

Pinging this a couple of weeks later to see if anyone cc'ed on this bug can help point us in the right direction (or point us at someone who could).

alinas commented 3 years ago

+arm folks who may be better suited to redirect this.

b4d93b01-5a04-4b02-9638-265208e07b76 commented 3 years ago

Hi all, is there any chance we could get some help with this bug? Being able to fix or work around it would unblock some further experiments for us. I am hoping the fix is something like telling LLVM that prefetch doesn't affect vector registers; if that makes sense, could someone point us in the right direction?