Open db4cdc3a-4335-4c8c-9bf1-3141f9c3cba2 opened 7 years ago
Hi Andrea, thanks for the update.
Another problem is that i don't think that these rematerialized broadcasts will be folded into instructions like what happened in the example above. Will the follow-up patch you mentioned address these two issues?
That was the idea. It would have required another patch to identify those broadcast patterns and fold them into a vector load from constant pool.
That being said - as I wrote in my last comment - I am not convinced that this is the right way to fix this issue.
To be more clear, imagine that the folded refill in the example above were replaced with a VBROADCAST, then the throughput of the entire loop may be reduced (not saying this necessarily will happen in the above, but we can try hard to construct a real measurable example).
I think you are right. I was also under the impression that this approach might have degraded performance. That's why I was waiting before uploading my patch on phabricator.
Maybe we should go back to the idea described in bug 28505 (from comment 5 to comment 7)?
Thanks for your feedback Zvi.
To be more clear, imagine that the folded refill in the example above were replaced with a VBROADCAST, then the throughput of the entire loop may be reduced (not saying this necessarily will happen in the above, but we can try hard to construct a real measurable example).
Hi Andrea, thanks for the update.
Rematerializing a broadcast as a broadcast instruction (as opposed to a full VMOV*) may be bad for performance. As i explained earlier broadcast instructions on AVX2 targets may be throughput bottlenecks. Another problem is that i don't think that these rematerialized broadcasts will be folded into instructions like what happened in the example above. Will the follow-up patch you mentioned address these two issues?
Hi Zvi,
one thing that can be easily done is to teach the code generator that a vbroadcast of a load from constant pool can be safely rematerialized.
If we do so, then the InlineSpiller would start rematerializing a VBROADCASTSSrm of a constant instead of emitting spill code.
I have a patch (which I plan to upload on phabricator soon) that fixes the issue with the spill of a broadcasted constant.
We would stll need a follow up patch to teach the code generato how to simplify the following sequence on targets that prefer to widen the constant value and get rid of the vbroadcastss entirely:
X = VBROADCASTSSrm (%constant_pool_scalar_elt) USE_OF(X)
into: USE_OF(%constant_pool_vector_elt)
In case, that could be done as a separate patch (provided that people agree with this approach).
That's a good point and yes on AVX2 processors such as Haswell and Broadwell full loads have higher throughput and shorter latencies than the broadcast instructions. For examples, movaps can reach a throughput of 2 instructions/cycle while vbroadcastss can reach a throughput of 1 instruction/cycle (source: Optimization Reference Manual). The broadcasts from constant-pool have the benefit of reducing code size by reducing the loaded value size and its alignment requirement. In some (uncommon) cases reduction of code size may result in faster execution. Having said that, assuming these constant-splat broadcasts will be either hoisted out of loops when register pressure permits, or folded/rematerialized as full vector loads, then we may be closer to the sweetpost of minimizing both execution time and code size.
I wonder why do we prefer to use a "vbroadcast" to materialize a splat constant instead of just emitting a plain vector load from constant pool?
Shouldn't we check if the splat value is a constant, and in case avoid to select a vbroadcast (if the BUILD_VECTOR is legal for the target)?
Maybe this is a stupid question. However, is there a real advantage in selecting a vbroadcast to materialize a constant (other than saving some space in the constant pool)?
This is a very interesting case.
It is slightly more complicated than what you described, since the broadcasted constant is not a vector.
.LCPI0_0: .long 1065353216 # float 1
So, a direct folding of .LCPI0_0 would not work in this particular case. To be able to fold the load into the operand of vaddps we would need a new constant (a vector value) in the constant pool. Essentially, we would need fold away the broadcastss by materializing a new splat vector constant.
Extended Description
The loop in the function below contains a use of a splat-vector of float 1.0's which becomes a vbroadcastss:
LICM will hoist the broadcastss out of the loop, and at register-allocation time register pressure leads to spilling the hoisted-broadcast:
llc -mtriple=i686 -mattr=+avx2
Instead of spilling it would be better to fold the load directly from the constant-pool. Maybe like this: