heltonmc opened this issue 1 year ago
Maybe consider a different algorithm while still keeping the first approach. An alternative would be to use the first approach to load the values efficiently into an `LVec` and then use two `shufflevector` instructions to get the values into a `ComplexVec` (of course, this can be generalized).

So this approach has two `ld2` instructions compared to two `ldp` instructions. This seems like an optimal load pattern for the first approach, at least, and I don't think it would be too much slower than the second approach.

@oscardssmith I'm curious what you think about this approach? Going with the second approach instead would require a rewrite, though (so I'm less inclined to push through with that).
```julia
function ComplexVec3(z::NTuple{4, Complex{T}}) where {T <: FloatTypes}
    x = LVec{8, Float64}((reim(z[1])..., reim(z[2])..., reim(z[3])..., reim(z[4])...))
    b = shufflevector(x, Val(0:2:6))
    c = shufflevector(x, Val(1:2:7))
    return ComplexVec{4, Float64}(b, c)
end
```
```llvm
julia> @code_llvm debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
define void @julia_ComplexVec3_2553([2 x <4 x double>]* noalias nocapture sret([2 x <4 x double>]) %0, [4 x [2 x double]]* nocapture nonnull readonly align 8 dereferenceable(64) %1) #0 {
top:
  %2 = bitcast [4 x [2 x double]]* %1 to <8 x double>*
  %3 = load <8 x double>, <8 x double>* %2, align 8
  %res.i = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %res.i2 = shufflevector <8 x double> %3, <8 x double> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 0
  store <4 x double> %res.i, <4 x double>* %.sroa.0.0..sroa_idx, align 32
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x <4 x double>], [2 x <4 x double>]* %0, i64 0, i64 1
  store <4 x double> %res.i2, <4 x double>* %.sroa.2.0..sroa_idx1, align 32
  ret void
}
```
```asm
julia> @code_native debuginfo=:none SIMDMath.ComplexVec3((a, b, c, d))
	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	_julia_ComplexVec3_2577         ; -- Begin function julia_ComplexVec3_2577
	.p2align	2
_julia_ComplexVec3_2577:                ; @julia_ComplexVec3_2577
	.cfi_startproc
; %bb.0:                                ; %top
	ld2	{ v0.2d, v1.2d }, [x0], #32
	ld2	{ v2.2d, v3.2d }, [x0]
	stp	q0, q2, [x8]
	stp	q1, q3, [x8, #32]
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols
```
I don't know much about optimal vectorization strategies here. @celrod, do you have thoughts on this?
Yeah, I think there has to be some compromise. (I think you meant to tag @chriselrod here?)
Also, I'm on ARM here (Apple M1). I do quite enjoy how simple the complex multiply (and therefore muladd) instructions are with the current approach: https://github.com/heltonmc/SIMDMath.jl/blob/abf7ec4b35d614c274c42cada34a60eaea5b06ef/src/complex.jl#L13-L24
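For reference, the split-layout multiply is essentially shuffle-free. Here's a minimal sketch of the idea, with plain tuples standing in for the library's vectors (`split_cmul` and the signature are illustrative, not SIMDMath's actual API):

```julia
# Illustrative sketch: with the split (first) approach, a group of complex
# numbers is stored as separate re/im vectors, so a complex multiply is
# just muls and fmas with no lane shuffles.
function split_cmul(xre, xim, yre, yim)
    re = @. muladd(xre, yre, -xim * yim)   # xr*yr - xi*yi
    im = @. muladd(xre, yim,  xim * yre)   # xr*yi + xi*yr
    return re, im
end

# (1+2im)*(5+6im) = -7+16im and (3+4im)*(7+8im) = -11+52im
split_cmul((1.0, 3.0), (2.0, 4.0), (5.0, 7.0), (6.0, 8.0))
# → ((-7.0, -11.0), (16.0, 52.0))
```

With real SIMD vectors in place of tuples, those two lines lower to two multiplies and two fmas per lane group.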
The second approach would require more `faddsub` instructions, which aren't as widely supported, so I would have to optimize for that if I went with it. I'm inclined to think that the `ComplexVec3` I show in the second comment is vastly superior to the library's current approach shown in the first comment.
Looking more into the optimization guide: compared to standard loads, an extra cycle is required to forward results to the FP/ASIMD pipelines. So `ld2` has a latency of 7 and throughput of 1, whereas Q-form `ldp` also has a latency of 7 and throughput of 1. So it seems these approaches would actually be similar in runtime, which is great!
Generic implementation at https://github.com/heltonmc/SIMDMath.jl/pull/9, which should work for any length tuple (still needs tests, etc.).
The second approach would be good on x64 CPUs with the FMA instruction set, since they have `fmaddsub` instructions. Note that ARM also has helper instructions: https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/FCMLA--Floating-point-Complex-Multiply-Accumulate-
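To make the shuffle cost concrete, here's a hedged sketch of how an interleaved complex multiply maps onto `fmaddsub`-style semantics (subtract in the even 0-indexed lanes, add in the odd ones), emulated with plain tuples; `interleaved_cmul` is a hypothetical name, not an existing API:

```julia
# Emulation of an interleaved-layout complex multiply: one in-register
# shuffle of y plus two broadcasts of x, finished by fmaddsub semantics.
# x = (xr1, xi1, xr2, xi2) holds two complex numbers; y likewise.
function interleaved_cmul(x::NTuple{4,Float64}, y::NTuple{4,Float64})
    xr = (x[1], x[1], x[3], x[3])        # broadcast real parts of x
    xi = (x[2], x[2], x[4], x[4])        # broadcast imag parts of x
    yswap = (y[2], y[1], y[4], y[3])     # the shuffle: swap re/im within pairs of y
    t = xi .* yswap
    # fmaddsub: subtract t in the "real" lanes, add it in the "imag" lanes
    return ntuple(i -> isodd(i) ? muladd(xr[i], y[i], -t[i]) :
                                  muladd(xr[i], y[i],  t[i]), 4)
end

# (1+2im)*(5+6im) = -7+16im, (3+4im)*(7+8im) = -11+52im
interleaved_cmul((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))
# → (-7.0, 16.0, -11.0, 52.0)
```

The `yswap` step is the per-multiply shuffle being discussed; the split layout avoids it entirely.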
EDIT: I'd have to read this document (and this issue) more carefully to say whether FCMLA actually does something useful here.
You can see here for an idea of how to wrap special instructions without resorting to ASM: https://github.com/JuliaSIMD/VectorizationBase.jl/blob/e2480f9490f91a2b1e47decab87f07feb7170146/src/llvm_intrin/vfmaddsub.jl PRs are of course welcome to VectorizationBase.jl to add more specialized instructions. Ideally, we should add higher level wrappers for approach 2 that work on ARM and x64, and have some fallback otherwise.
Thank you for the links! I'll try playing around with the FCMLA instruction to see if it is helpful. I've been a little hesitant to develop separate routines for x64 and ARM, though, just from the standpoint of wanting to put as much time into the special function implementations as I can. The downside is that I'm relying pretty heavily on LLVM to do the appropriate optimizations.
The `fmaddsub` instruction will be very helpful, but I think we'll also need a shuffle between these instructions within a complex multiply, whereas the current approach does not. This was one of the reasons we were able to compute four complex `evalpoly`s faster than two complex `evalpoly`s in Base (which were able to autovectorize): when computing a string of multiplies (say 15 muladds in a row), it seems it was faster not to have to perform any shuffles within the hot loop. Again, I'd have to benchmark both approaches more carefully, but it does seem helpful if you are performing a lot of work with the vectors. I'm also not sure exactly how well this approach would interplay with `muladd(a, b, c)`, where the arguments could be any permutation of real or complex vecs or scalars. I think there would also have to be some `constantvector` instructions with the second approach to get this to work appropriately.
But yeah, I guess if the architectures are going to go with a (re, im, re, im) layout then it might be hard to beat the native instructions....
Thank you for the tips!
> This was one of the reasons we were able to compute four complex evalpolys faster than two complex evalpolys in base (which were able to autovectorize) because if we were computing a string of multiplies (say like 15 muladds in a row) it seems that it was faster not to have to perform any shuffles within the hot loop.
Ah, yes, this may be better on ARM/with the FCMLA instruction. I think it can avoid the shuffles.
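As a sanity check on that, here is a scalar emulation of the usual two-instruction FCMLA idiom (rotations #0 and #90); the operand roles follow my reading of the linked ARM documentation, so treat this as a sketch rather than exact hardware semantics:

```julia
# FCMLA rotation #0: accumulate using only the real part of n.
fcmla0(d, n, m)  = (d[1] + n[1] * m[1], d[2] + n[1] * m[2])
# FCMLA rotation #90: accumulate using only the imaginary part of n.
fcmla90(d, n, m) = (d[1] - n[2] * m[2], d[2] + n[2] * m[1])

# Two fcmla's accumulate a full complex multiply, d += n*m, with no
# separate shuffle instructions in the vectorized version.
n, m = (1.0, 2.0), (3.0, 4.0)             # 1+2im and 3+4im
fcmla90(fcmla0((0.0, 0.0), n, m), n, m)   # (1+2im)*(3+4im) → (-5.0, 10.0)
```

Any duplication/swap happens inside the instruction itself, which is consistent with the decomposition quoted from the paper below.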
Depending on what you're doing, `fmaddsub` + `fmsubadd` reduces the amount of shuffling you need to do, versus needing to shuffle on load when you have arrays of `Complex` (that aren't `StructArray`s).

It also has the advantage of not needing as much unrolling.
I'll try to test this out over the next few weeks. There isn't too much information on what FCMLA is actually doing.
Here is a quote from https://arxiv.org/pdf/2103.03013.pdf:

> "The fcmla instruction is decomposed into two shuffle instructions and one floating-point instruction. Hence the throughput of this instruction is limited at most to one issue every two cycles."

So perhaps it is doing shuffling?
From another paper (https://arxiv.org/pdf/2103.03013.pdf):

> "OSACA reveals that this is predominantly due to high occupation of floating-point ports on the A64FX caused by the high cost of SVE instructions for complex arithmetics (fcmla and fcadd). These instructions block the floating-point ports for three and two cycles, respectively. Furthermore, the fcmla instruction is imbalanced between the FLA and FLB ports; two out of the three cycles are scheduled to the FLA port. OSACA indicates that FLB has a 35% lower occupancy than FLA. One option to mitigate the pipeline imbalance is to avoid the SVE complex instructions and to use ordinary floating-point instructions instead. In this case the cost would only be two cycles for a complex FMA and one cycle for a complex ADD operation. We discuss the implementation of this option in the next subsection."
It seems that that paper also recommended the split layout.
Finally, this paper (https://csrc.nist.gov/csrc/media/Events/2022/fourth-pqc-standardization-conference/documents/papers/fast-falcon-signature-generation-and-verification-pqc2022.pdf):

> "To find the highest performance gain, we implemented vectorized versions of the first and second approaches mentioned above. The former approach offers better cache locality for small l ≤ 4. However, it introduces a vector permutation overhead to compute.... Eventually, we chose the second approach as our best-vectorized implementation."
I'll have to dive into these papers more, and I'm not sure there is a real clear winner, but it seemed like they all preferred the split format in at least some instances.... Again, I just very quickly skimmed these.
There are really two choices here, mainly in how we want to represent the complex numbers.
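Concretely, for two complex numbers (1+2im and 3+4im) the two representations look like this (a sketch with tuples, not the library's actual types):

```julia
# First approach: split ("struct of arrays") — real and imaginary parts
# live in separate vectors.
split_layout = (re = (1.0, 3.0), im = (2.0, 4.0))

# Second approach: interleaved — one vector laid out (re, im, re, im),
# matching how Complex{Float64} is stored in memory.
interleaved_layout = (1.0, 2.0, 3.0, 4.0)
```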
Now, initially I liked the first approach better, as it's more natural (especially coming from physics) to consider the real and imaginary pieces separately. It does make multiplication much simpler, which is great, and add/subtract I think is a wash between the two.
On the other hand, something I didn't initially consider: if you only have two complex numbers, then in the first case you are operating on `<2 x double>` vectors, whereas if you store them the second way you can operate on a single `<4 x double>` vector. So with only two numbers, the second approach fills the registers better and will be much faster. Of course, for four complex numbers this advantage immediately falls off, because now you can operate on two `<4 x double>` vectors instead of one `<8 x double>` vector (unless using AVX-512). So for algorithms where we can only vectorize over two complex numbers, the second method is superior.
The final piece comes from how we cast complex numbers in Julia to complex vectors. Say we want to compute vector cross products, dot products, or really any BLAS-type routine where we need to load a small subset of a larger array and perform operations on it. The casting becomes quite significant, especially if it has to be done at runtime.
For example, the current approach isn't great at this...
The first issue to tackle here: can we improve this? That would help alleviate some of the concerns with this approach.
Consider the second approach more seriously?
The second issue is perhaps more fundamental: should we more strongly consider the second approach? For example, the casting could look something like...

Which is significantly cleaner if we have to do this many times within a hot loop. I'm wondering if it would be possible to better reinterpret the first approach. Again, I think multiplication is more efficient with the first approach, but that comes with significant downsides.
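As an illustration of the kind of cast in question (a hedged sketch using Base's `reinterpret`, not SIMDMath's API): because `Complex{Float64}` is stored as (re, im) pairs, the interleaved (second) layout falls out of an array of complex numbers for free, while the split layout needs a gather/shuffle at load time.

```julia
z = ComplexF64[1 + 2im, 3 + 4im, 5 + 6im, 7 + 8im]

# Interleaved view: memory is already (re, im, re, im, ...) — a free cast.
flat = reinterpret(Float64, z)   # 8 elements: 1.0, 2.0, 3.0, ...

# The split (first) layout instead requires deinterleaving on load:
res = flat[1:2:end]              # [1.0, 3.0, 5.0, 7.0]
ims = flat[2:2:end]              # [2.0, 4.0, 6.0, 8.0]
```

In a vectorized kernel the strided gathers above become the `ld2`/`shufflevector` patterns from the first comment, which is exactly the runtime cost being weighed here.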