llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Support for rematerialization and folding of memory broadcasts as alternative to spilling #30679

Open db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 opened 7 years ago

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago
Bugzilla Link 31331
Version trunk
OS Windows NT
CC @adibiagio,@aymanmusa,@delena,@hfinkel,@igor-breger,@RKSimon,@rotateright

Extended Description

The loop in the function below contains a use of a splat vector of float 1.0 values, which becomes a vbroadcastss:

define void @reg_pressure_broadcast(<4 x float>* %arg) local_unnamed_addr nounwind {
bb:
  br label %bb2

bb1:                                              ; preds = %bb2
  ret void

bb2:                                              ; preds = %bb2, %bb
  %tmp = phi i64 [ 0, %bb ], [ %tmp42, %bb2 ]
  %tmp3 = getelementptr inbounds <4 x float>, <4 x float>* %arg, i64 %tmp
  %tmp4 = load volatile <4 x float>, <4 x float>* %tmp3
  %tmp5 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 1
  %tmp6 = load volatile <4 x float>, <4 x float>* %tmp5
  %tmp7 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 2
  %tmp8 = load volatile <4 x float>, <4 x float>* %tmp7
  %tmp9 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 3
  %tmp10 = load volatile <4 x float>, <4 x float>* %tmp9
  %tmp13 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 4
  %tmp11 = load volatile <4 x float>, <4 x float>* %tmp13
  %tmp23 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 6
  %tmp21 = load volatile <4 x float>, <4 x float>* %tmp23
  %tmp24 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 7
  %tmp22 = load volatile <4 x float>, <4 x float>* %tmp24
  %tmp25 = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 8
  %tmp26 = load volatile <4 x float>, <4 x float>* %tmp25
  ; The constant vector here can be generated as a broadcast load
  %tmp12 = fadd <4 x float> %tmp11, <float 1.0, float 1.0, float 1.0, float 1.0>
  %p = getelementptr inbounds <4 x float>, <4 x float>* %tmp3, i64 5
  store volatile <4 x float> %tmp12, <4 x float>* %p
  store volatile <4 x float> %tmp4, <4 x float>* %tmp3
  store volatile <4 x float> %tmp6, <4 x float>* %tmp5
  store volatile <4 x float> %tmp8, <4 x float>* %tmp7
  store volatile <4 x float> %tmp11, <4 x float>* %tmp9
  store volatile <4 x float> %tmp21, <4 x float>* %tmp23
  store volatile <4 x float> %tmp22, <4 x float>* %tmp24
  store volatile <4 x float> %tmp26, <4 x float>* %tmp25
  %tmp42 = add nuw nsw i64 %tmp, 4
  %tmp43 = icmp eq i64 %tmp42, 4096
  br i1 %tmp43, label %bb1, label %bb2
}

LICM will hoist the vbroadcastss out of the loop, and at register-allocation time register pressure leads to spilling the hoisted broadcast:

llc -mtriple=i686 -mattr=+avx2

        pushl   %esi                                             
        subl    $24, %esp                                        
        xorl    %eax, %eax                                       
        movl    32(%esp), %ecx                                   
        vbroadcastss    .LCPI0_0, %xmm0                          
        vmovups %xmm0, (%esp)           # 16-byte Spill          
        xorl    %edx, %edx                                       
        .p2align        4, 0x90                                  
.LBB0_1:                                # %bb2                   
                                        # =>This Inner Loop Header: Depth=1
        movl    %eax, %esi                                                 
        shll    $4, %esi                                                   
        vmovaps (%ecx,%esi), %xmm1                                         
        vmovaps 16(%ecx,%esi), %xmm2                                       
        vmovaps 32(%ecx,%esi), %xmm3
        vmovaps 48(%ecx,%esi), %xmm4
        vmovaps 64(%ecx,%esi), %xmm4
        vmovaps 96(%ecx,%esi), %xmm5
        vmovaps 112(%ecx,%esi), %xmm6
        vmovaps 128(%ecx,%esi), %xmm7
        vaddps  (%esp), %xmm4, %xmm0    # 16-byte Folded Reload
        vmovaps %xmm0, 80(%ecx,%esi)
        vmovaps %xmm1, (%ecx,%esi)
        vmovaps %xmm2, 16(%ecx,%esi)
        vmovaps %xmm3, 32(%ecx,%esi)
        vmovaps %xmm4, 48(%ecx,%esi)
        vmovaps %xmm5, 96(%ecx,%esi)
        vmovaps %xmm6, 112(%ecx,%esi)
        vmovaps %xmm7, 128(%ecx,%esi)
        addl    $4, %eax
        adcl    $0, %edx
        movl    %eax, %esi
        xorl    $4096, %esi             # imm = 0x1000
        orl     %edx, %esi
        jne     .LBB0_1
# BB#2:                                 # %bb1
        addl    $24, %esp
        popl    %esi
        retl

Instead of spilling, it would be better to fold the load directly from the constant pool. Maybe like this:

        pushl   %esi                                             
        xorl    %eax, %eax                                       
        movl    8(%esp), %ecx                                    
        xorl    %edx, %edx                                       
        .p2align        4, 0x90                                  
.LBB0_1:                                # %bb2                   
                                        # =>This Inner Loop Header: Depth=1
        movl    %eax, %esi                                                 
        shll    $4, %esi                                                   
        vmovaps (%ecx,%esi), %xmm1                                         
        vmovaps 16(%ecx,%esi), %xmm2
        vmovaps 32(%ecx,%esi), %xmm3
        vmovaps 48(%ecx,%esi), %xmm4
        vmovaps 64(%ecx,%esi), %xmm4
        vmovaps 96(%ecx,%esi), %xmm5
        vmovaps 112(%ecx,%esi), %xmm6
        vmovaps 128(%ecx,%esi), %xmm7
        vaddps  .LCPI0_0, %xmm4, %xmm0  # <-- load directly from constant pool
        vmovaps %xmm0, 80(%ecx,%esi)
        vmovaps %xmm1, (%ecx,%esi)
        vmovaps %xmm2, 16(%ecx,%esi)
        vmovaps %xmm3, 32(%ecx,%esi)
        vmovaps %xmm4, 48(%ecx,%esi)
        vmovaps %xmm5, 96(%ecx,%esi)
        vmovaps %xmm6, 112(%ecx,%esi)
        vmovaps %xmm7, 128(%ecx,%esi)
        addl    $4, %eax
        adcl    $0, %edx
        movl    %eax, %esi
        xorl    $4096, %esi             # imm = 0x1000
        orl     %edx, %esi
        jne     .LBB0_1
# BB#2:                                 # %bb1
        popl    %esi
        retl
adibiagio commented 7 years ago

> Hi Andrea, thanks for the update.
>
> Another problem is that I don't think that these rematerialized broadcasts will be folded into instructions like what happened in the example above. Will the follow-up patch you mentioned address these two issues?

That was the idea. It would have required another patch to identify those broadcast patterns and fold them into a vector load from constant pool.

That being said, as I wrote in my last comment, I am not convinced that this is the right way to fix this issue.

adibiagio commented 7 years ago

> To be more clear, imagine that the folded refill in the example above were replaced with a VBROADCAST, then the throughput of the entire loop may be reduced (not saying this necessarily will happen in the above, but we can try hard to construct a real measurable example).

I think you are right. I was also under the impression that this approach might have degraded performance. That's why I was waiting before uploading my patch on phabricator.

Maybe we should go back to the idea described in bug 28505 (comments 5 through 7)?

Thanks for your feedback Zvi.

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago

To be more clear, imagine that the folded refill in the example above were replaced with a VBROADCAST, then the throughput of the entire loop may be reduced (not saying this necessarily will happen in the above, but we can try hard to construct a real measurable example).

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago

Hi Andrea, thanks for the update.

Rematerializing a broadcast as a broadcast instruction (as opposed to a full VMOV*) may be bad for performance. As I explained earlier, broadcast instructions on AVX2 targets may be throughput bottlenecks. Another problem is that I don't think that these rematerialized broadcasts will be folded into instructions like what happened in the example above. Will the follow-up patch you mentioned address these two issues?

adibiagio commented 7 years ago

Hi Zvi,

one thing that can easily be done is to teach the code generator that a vbroadcast of a load from the constant pool can be safely rematerialized.

If we do so, then the InlineSpiller would start rematerializing a VBROADCASTSSrm of a constant instead of emitting spill code.

I have a patch (which I plan to upload on phabricator soon) that fixes the issue with the spill of a broadcasted constant.

We would still need a follow-up patch to teach the code generator how to simplify the following sequence on targets that prefer to widen the constant value and get rid of the vbroadcastss entirely:

  X = VBROADCASTSSrm (%constant_pool_scalar_elt)
  USE_OF(X)

into:

  USE_OF(%constant_pool_vector_elt)

If so, that could be done as a separate patch (provided that people agree with this approach).

db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 commented 7 years ago

That's a good point, and yes, on AVX2 processors such as Haswell and Broadwell full loads have higher throughput and shorter latencies than the broadcast instructions. For example, movaps can reach a throughput of 2 instructions/cycle while vbroadcastss can reach a throughput of 1 instruction/cycle (source: Intel Optimization Reference Manual). Broadcasts from the constant pool have the benefit of reducing code size, by shrinking both the loaded value and its alignment requirement. In some (uncommon) cases a code-size reduction may result in faster execution. Having said that, assuming these constant-splat broadcasts will be either hoisted out of loops when register pressure permits, or folded/rematerialized as full vector loads, then we may be closer to the sweet spot of minimizing both execution time and code size.

adibiagio commented 7 years ago

I wonder why we prefer to use a "vbroadcast" to materialize a splat constant instead of just emitting a plain vector load from the constant pool?

Shouldn't we check whether the splat value is a constant and, if so, avoid selecting a vbroadcast (if the BUILD_VECTOR is legal for the target)?

Maybe this is a stupid question. However, is there a real advantage in selecting a vbroadcast to materialize a constant (other than saving some space in the constant pool)?

adibiagio commented 7 years ago

This is a very interesting case.

It is slightly more complicated than what you described, since the broadcasted constant is not a vector.

.LCPI0_0:
        .long   1065353216              # float 1

So, a direct folding of .LCPI0_0 would not work in this particular case. To be able to fold the load into the operand of vaddps we would need a new constant (a vector value) in the constant pool. Essentially, we would need fold away the broadcastss by materializing a new splat vector constant.