Open steven-johnson opened 3 years ago
Thanks for pointing that out; that was a bug we've hopefully since fixed. When troubleshooting performance, the first thing that stood out was the stack spills; I didn't get as far as noticing that the prefetch and the load were from the same address.
There are probably two things going on here: how the prefetch intrinsic affects codegen, as has been discussed so far, and the potential performance uplift.
Software prefetching is probably completely counterproductive if the next instruction performs the load:
prfm pldl1strm, [x27]
ldr q28, [x27]
Modern cores and prefetchers should hopefully predict this access, making the prefetch instruction unnecessary and possibly even harmful to performance. On smaller cores with less sophisticated prefetchers there might still be a use case for it. This is all very hand-wavy (and unfortunately not described in, e.g., the Cortex software optimisation guides), but basically what I want to say is that it is worth experimenting to see whether you actually need this and whether it helps your case.
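To make that concrete, here is a minimal C++ sketch (illustration only, not code from the attached test case) using Clang/GCC's __builtin_prefetch, which lowers to the same llvm.prefetch intrinsic; the function names and the prefetch distance are made up:

#include <cstddef>

// Pattern 1: prefetch the exact address the very next statement loads.
// The demand load already pulls the line into L1, so the prfm mostly just
// costs an issue slot (the situation in the generated code above).
float sum_prefetch_same_address(const float *src, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        __builtin_prefetch(&src[i], /*rw=*/0, /*locality=*/0);
        acc += src[i];
    }
    return acc;
}

// Pattern 2: prefetch a fixed distance ahead so the line has time to arrive
// before the iteration that needs it. The distance (16 elements) is a
// made-up number that would need per-core tuning.
float sum_prefetch_ahead(const float *src, std::size_t n) {
    float acc = 0.0f;
    const std::size_t dist = 16;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n)
            __builtin_prefetch(&src[i + dist], /*rw=*/0, /*locality=*/0);
        acc += src[i];
    }
    return acc;
}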
FYI: I made an experimental change to Halide's codegen so that all our calls to the Prefetch primitive in LLVM get hoisted to the top of the enclosing loop (rather than interleaved with other instructions in the loop).
The good news: as I suspected, this eliminates all the spills I see before, and I end up with nice clean aarch64 assembly that looks something like the code below.
The not-good news: unsurprisingly, pushing the prefetch calls further away from the instructions that actually need the prefetched data largely negates any benefit the prefetch would give us. (This varies somewhat depending on the ARM core running the code; e.g., on an A55 core I saw essentially no speed difference with vs. without prefetches.)
Because of the limited performance improvement, I don't think this approach is likely to be worth pursuing as a long-term fix, unfortunately.
I'd still love to find a way to insert prefetch calls in an interleaved fashion without injecting unnecessary spills, but it's not clear to me how doable this is without nontrivial effort.
(snippet:)
.LBB0_103: // %"for convolved.s1.r19$z.r124.us.us.us"
add x4, x18, x30
add x13, x11, x30
add x6, x14, x30
add x15, x8, x30
add x26, x28, x30
prfm pldl1strm, [x4, #16]
prfm pldl1strm, [x13, #16]
prfm pldl1strm, [x6, #16]
prfm pldl1strm, [x15, #16]
prfm pldl1strm, [x26, #16]
ldr q7, [x4]
ldp q19, q16, [x3, #-32]
ldr q18, [x13]
ldr q2, [x6]
ldr q4, [x15]
ldr q0, [x26]
ldp q1, q17, [x19, #-32]
udot v12.4s, v19.16b, v7.4b[0]
udot v8.4s, v19.16b, v18.4b[0]
udot v28.4s, v19.16b, v2.4b[0]
udot v24.4s, v19.16b, v4.4b[0]
udot v20.4s, v19.16b, v0.4b[0]
ldp q3, q19, [x24, #-32]
udot v13.4s, v1.16b, v7.4b[0]
udot v9.4s, v1.16b, v18.4b[0]
udot v29.4s, v1.16b, v2.4b[0]
udot v14.4s, v3.16b, v7.4b[0]
udot v10.4s, v3.16b, v18.4b[0]
udot v30.4s, v3.16b, v2.4b[0]
udot v25.4s, v1.16b, v4.4b[0]
udot v21.4s, v1.16b, v0.4b[0]
udot v26.4s, v3.16b, v4.4b[0]
udot v22.4s, v3.16b, v0.4b[0]
...etc...
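At the source level, the hoisting experiment amounts to roughly the restructuring below. This is a hand-written C++ sketch of the idea only (the actual change was in Halide's LLVM codegen, not C++ source); __builtin_prefetch stands in for the Prefetch primitive, and the names, loop body, and prefetch distance are invented for illustration:

#include <cstddef>

// Interleaved placement: each prefetch sits next to the load that will want
// the data, but the side-effecting prefetches constrain scheduling and (in
// our case) end up causing spills.
void interleaved(const unsigned char *a, const unsigned char *b,
                 int &acc0, int &acc1, std::size_t n) {
    for (std::size_t i = 0; i + 64 < n; ++i) {
        __builtin_prefetch(&a[i + 64], /*rw=*/0, /*locality=*/0);
        acc0 += a[i];
        __builtin_prefetch(&b[i + 64], /*rw=*/0, /*locality=*/0);
        acc1 += b[i];
    }
}

// Hoisted placement: all prefetches issue at the top of the iteration, so the
// rest of the body has no side-effecting instructions and schedules cleanly,
// but each prefetch is now further from its use.
void hoisted(const unsigned char *a, const unsigned char *b,
             int &acc0, int &acc1, std::size_t n) {
    for (std::size_t i = 0; i + 64 < n; ++i) {
        __builtin_prefetch(&a[i + 64], /*rw=*/0, /*locality=*/0);
        __builtin_prefetch(&b[i + 64], /*rw=*/0, /*locality=*/0);
        acc0 += a[i];
        acc1 += b[i];
    }
}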
Patch to mark prfm as mayStore instead of hasSideEffects

Here's my attempt at marking the aarch64 PRFM instruction as mayStore instead of hasSideEffects. (This may be totally wrong -- I've never had occasion to edit .td files before, so this is a bit of guesswork.)
Unfortunately, running llc on prefetch_2.ll as described earlier shows no change with this patch in place -- still getting spills. (Ditto for mayLoad instead of mayStore, which I patched the same way.)
Thanks for the tips; I'll try out mayLoad and/or mayStore and see how the results look.
mayStore might work as well.
I wouldn't bet on that. At the codegen layer, I don't think we can remove dead loads because we might be changing trap behavior.
I did wonder that, but my bet is that a mayLoad that doesn't actually produce any values is fair game for dead removal. We could change that edge case, of course.
Would it be viable to mark prefetches as mayLoad rather than hasSideEffects?
That does sound plausible. I suppose a "proper" fix would be to add hasPerfSideEffects (a.k.a. "please don't delete me") as another level intrinsics might sit at.
After more debugging, I'm now wondering if this may not be a bug at all, but a side-effect of instruction scheduling.
The prfm instruction is modeled as "has side effects" (which is true and necessary, otherwise it would get optimized away entirely), which constrains the optimizer from being aggressive about moving around surrounding instructions.
I think what may be happening here is that in the no-prefetch case (or with one prefetch at the top of the loop), the machine-instruction scheduler is able to shuffle instructions around effectively, whereas the addition of the extra prfm splits it into two basic blocks. We then end up with suboptimal register allocation across them, either out of necessity (we are literally out of registers) or simply because optimal register allocation is hard.
One workaround I'm looking at is pushing all of the prfm instructions to the top of the inner loop, in the hope that the rest of the loop can avoid spills; this isn't ideal, of course, since interleaving the prefetch instructions would be preferable for fairly obvious reasons.
Despite my suspicions, I'm going to keep this issue open for the time being, as I'm still not 100% convinced that the situation can't be improved.
Looking over the disassembly of prefetch_2.ll in more detail, it makes even less sense (see below); it appears that q0 is "spilled" several times in the inner loop, but never reloaded (at least not in a way I can see).
.LBB0_2: // %"for convolved.s1.rdom$x"
// Parent Loop BB0_1 Depth=1
// => This Inner Loop Header: Depth=2
prfm pldl1strm, [x4]
ldp q11, q0, [x2]
add x5, x2, x11
ldp q13, q31, [x5]
ldr q24, [x4]
str q0, [sp, #16] // 16-byte Folded Spill
ldr q0, [x2, #32]
subs x3, x3, #1
udot v23.4s, v11.16b, v24.4b[0]
udot v21.4s, v13.16b, v24.4b[0]
str q0, [sp, #32] // 16-byte Folded Spill
ldr q0, [x2, #48]
udot v21.4s, v31.16b, v24.4b[1]
add x2, x2, x13
str q0, [sp, #48] // 16-byte Folded Spill
ldp q29, q0, [x5, #32]
add x5, x5, x11
ldp q26, q10, [x5]
ldp q8, q30, [x5, #32]
add x5, x5, x11
ldp q25, q15, [x5]
ldp q12, q9, [x5, #32]
add x5, x4, x12
str q0, [sp] // 16-byte Folded Spill
prfm pldl1strm, [x5]
ldr q14, [x5]
ldr q0, [x4, x16]
ldr q27, [x4, x17]
udot v20.4s, v25.16b, v24.4b[0]
udot v7.4s, v25.16b, v14.4b[0]
udot v2.4s, v25.16b, v0.4b[0]
udot v3.4s, v25.16b, v27.4b[0]
ldr q25, [sp, #16] // 16-byte Folded Reload
udot v17.4s, v11.16b, v14.4b[0]
udot v16.4s, v11.16b, v0.4b[0]
udot v4.4s, v11.16b, v27.4b[0]
udot v23.4s, v25.16b, v24.4b[1]
udot v17.4s, v25.16b, v14.4b[1]
udot v16.4s, v25.16b, v0.4b[1]
udot v4.4s, v25.16b, v27.4b[1]
ldr q25, [sp, #32] // 16-byte Folded Reload
udot v22.4s, v26.16b, v24.4b[0]
udot v19.4s, v26.16b, v14.4b[0]
udot v18.4s, v26.16b, v0.4b[0]
udot v5.4s, v26.16b, v27.4b[0]
udot v23.4s, v25.16b, v24.4b[2]
udot v17.4s, v25.16b, v14.4b[2]
udot v16.4s, v25.16b, v0.4b[2]
udot v4.4s, v25.16b, v27.4b[2]
ldr q25, [sp, #48] // 16-byte Folded Reload
ldr q26, [sp] // 16-byte Folded Reload
udot v6.4s, v13.16b, v14.4b[0]
udot v28.4s, v13.16b, v0.4b[0]
udot v1.4s, v13.16b, v27.4b[0]
udot v6.4s, v31.16b, v14.4b[1]
udot v28.4s, v31.16b, v0.4b[1]
udot v1.4s, v31.16b, v27.4b[1]
udot v22.4s, v10.16b, v24.4b[1]
udot v19.4s, v10.16b, v14.4b[1]
udot v18.4s, v10.16b, v0.4b[1]
udot v5.4s, v10.16b, v27.4b[1]
udot v20.4s, v15.16b, v24.4b[1]
udot v7.4s, v15.16b, v14.4b[1]
udot v2.4s, v15.16b, v0.4b[1]
udot v3.4s, v15.16b, v27.4b[1]
udot v21.4s, v29.16b, v24.4b[2]
udot v6.4s, v29.16b, v14.4b[2]
udot v28.4s, v29.16b, v0.4b[2]
udot v1.4s, v29.16b, v27.4b[2]
udot v22.4s, v8.16b, v24.4b[2]
udot v19.4s, v8.16b, v14.4b[2]
udot v18.4s, v8.16b, v0.4b[2]
udot v5.4s, v8.16b, v27.4b[2]
udot v20.4s, v12.16b, v24.4b[2]
udot v7.4s, v12.16b, v14.4b[2]
udot v2.4s, v12.16b, v0.4b[2]
udot v3.4s, v12.16b, v27.4b[2]
udot v23.4s, v25.16b, v24.4b[3]
udot v21.4s, v26.16b, v24.4b[3]
udot v22.4s, v30.16b, v24.4b[3]
udot v20.4s, v9.16b, v24.4b[3]
udot v17.4s, v25.16b, v14.4b[3]
udot v6.4s, v26.16b, v14.4b[3]
udot v19.4s, v30.16b, v14.4b[3]
udot v7.4s, v9.16b, v14.4b[3]
udot v16.4s, v25.16b, v0.4b[3]
udot v28.4s, v26.16b, v0.4b[3]
udot v18.4s, v30.16b, v0.4b[3]
udot v2.4s, v9.16b, v0.4b[3]
udot v4.4s, v25.16b, v27.4b[3]
udot v1.4s, v26.16b, v27.4b[3]
udot v5.4s, v30.16b, v27.4b[3]
udot v3.4s, v9.16b, v27.4b[3]
mov x4, x5
b.ne .LBB0_2
Sorry, I didn't realize the size of the example was the issue. I've enclosed one that is about half the size of the first one. Instructions are the same as before:
(1) llc -march=aarch64 ~/prefetch_2.ll -o - -O3 -mattr=+dotprod > ~/prefetch_2.s
(2) Note that the inner loop (for convolved.s1.rdom$x) has a single prfm and no spills.
(3) Uncomment the call to llvm.prefetch.p0i8 on line 206 of prefetch_2.ll and rerun llc.
(4) Note that in addition to a second prfm, there are also several spills/reloads in the inner loop now, but none seem to be related to the new prfm instruction.
I had a very quick look a couple of weeks back. The attached test-case is still pretty big. You might have more luck drumming up support if you can reduce it to something simpler.
Though it's quite possible that reducing it is just as difficult as solving it. If you've tried and failed, it'd be worth mentioning; that might get some sympathy.
Pinging this a couple of weeks later to see if anyone cc'ed on this bug can help point us in the right direction (or point us at someone who could).
+arm folks who may be better suited to redirect this.
Hi all, is there any chance we could get some help with this bug? Being able to fix or work around it would unblock some further experiments for us. I am hoping the fix might be something like telling LLVM that a prefetch doesn't affect vector registers; if that makes sense, could someone point us in the right direction?
Extended Description
Steps to repeat:
(1) See the enclosed file prefetch.ll -- notice that there are several calls to @llvm.prefetch.p0i8, with all but the first commented out.
(2) Compile to aarch64 assembly with llc -march=aarch64 ~/prefetch.ll -o - -O3 -mattr=+dotprod > ~/prefetch.s
(3) Examine the output and search for prfm; note the block in which it appears.
(4) Edit prefetch.ll to uncomment the call to llvm.prefetch.p0i8 on line 459 and re-run llc.
(5) Search for prfm again, and note that the start of the block now contains numerous vector-register spills that appear to be completely unnecessary.

These spills don't seem to make any sense: the only instruction that should have been added here is the second prfm, and it doesn't depend on any of the vector registers being spilled and reloaded. Is something about prefetch affecting this (e.g., confusing the lifetime analysis for registers loaded from the prefetch location)?
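For readers less familiar with the intrinsic being toggled, here is a minimal C++ illustration of how the pieces appear to fit together (the attached .ll files call the intrinsic directly; the function name here is made up):

// __builtin_prefetch(p, /*rw=*/0, /*locality=*/0) is lowered by Clang to
//     call void @llvm.prefetch.p0i8(i8* %p, i32 0, i32 0, i32 1)
// (read prefetch, streaming locality, data cache), which the AArch64 backend
// then selects as
//     prfm pldl1strm, [x...]
// i.e. the instruction whose addition triggers the spills described above.
void touch(const void *p) {
    __builtin_prefetch(p, /*rw=*/0, /*locality=*/0);
}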