google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0

Support for complex arithmetics #2047

Open Ryo-not-rio opened 6 months ago

Ryo-not-rio commented 6 months ago

Hi,

I would like to propose adding complex arithmetic instructions to Highway. This would let us take advantage of the SVE complex arithmetic instructions (svcadd, svcmla and svcdot), improving the performance of complex arithmetic on Arm. I imagine the difficulty would be the need to implement and maintain equivalent functions for x86 and NEON, where these instructions do not exist natively.

jan-wassenberg commented 6 months ago

We are happy to maintain contributed functions. Assuming only SVE supports these instructions natively, it is actually pretty easy to implement a fallback for other platforms because it can be done just once, without repeating for each platform, by putting it in generic_ops-inl.h.

One general principle is that we want the code to be reasonably efficient on all platforms. I wonder whether it would be better, if we did not have the SVE instructions, to organize complex numbers into two regs re and im, rather than in odd/even lanes of one vector?

Let's imagine an app willing to have a special case for SVE, and a second codepath for other platforms. Would this be faster than if we always used odd/even layout for Z numbers? If so, it sounds like an #if might be a better fit; if not, then a single function with either SVE or emulated implementation sounds reasonable.
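To make the layout question concrete, here is a minimal scalar sketch of a complex multiply in both layouts. The function names `MulSplit` and `MulInterleaved` are made up for illustration and are not Highway ops. The point is only that the split (re/im) form is plain element-wise arithmetic, whereas the interleaved form mixes neighbouring lanes, which is what forces shuffles in a SIMD version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split ("planar") layout: real and imaginary parts live in separate arrays.
// Each output element depends only on inputs at the same index, so a SIMD
// version needs no lane shuffles.
void MulSplit(const std::vector<float>& a_re, const std::vector<float>& a_im,
              const std::vector<float>& b_re, const std::vector<float>& b_im,
              std::vector<float>& out_re, std::vector<float>& out_im) {
  for (std::size_t i = 0; i < a_re.size(); ++i) {
    out_re[i] = a_re[i] * b_re[i] - a_im[i] * b_im[i];
    out_im[i] = a_re[i] * b_im[i] + a_im[i] * b_re[i];
  }
}

// Interleaved layout: {re0, im0, re1, im1, ...}.
// Each output lane reads both lanes of its pair, so a SIMD version has to
// duplicate or swap lanes (or use a dedicated complex instruction).
void MulInterleaved(const std::vector<float>& a, const std::vector<float>& b,
                    std::vector<float>& out) {
  for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
    out[i] = a[i] * b[i] - a[i + 1] * b[i + 1];
    out[i + 1] = a[i] * b[i + 1] + a[i + 1] * b[i];
  }
}
```

Which layout is faster in practice depends on the target and on whether the data already arrives interleaved, so this is a way to frame the measurement rather than an answer.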

Ryo-not-rio commented 6 months ago

I see your point. We did indeed find that de-interleaving the complex numbers first was faster for Highway on NEON and SVE. I'm not sure about the x86 side of things though. Even if that holds there too, it would be nice to be able to access the SVE instructions from Highway, since they seem to perform significantly better. Either way, it sounds like this needs further investigation on x86.

johnplatts commented 6 months ago

F32 AddSub(a, b) is equivalent to SVE svcadd_f32_x(svptrue_b32(), a, Reverse2(d, b), 90) and SSSE3 _mm_addsub_ps(a.raw, b.raw).

The F16/F32/F64 AddSub op should be re-implemented using svcadd on SVE targets, as svcadd is more efficient there than the default AddSub implementation in generic_ops-inl.h.

F16/F32/F64 MulAddSub(a, b, c) should be re-implemented as MulAdd(a, b, AddSub(Set(DFromV<decltype(c)>(), -0.0), c)) on SVE targets (which allows the MulAddSub to be carried out using a svcadd op followed by a svmad op).
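The two AddSub identities can be checked with a small scalar model. This is only a sketch: the functions below mimic the lane-wise behaviour of the ops on a 4-lane array, they are not the Highway or SVE implementations, and `ComplexAddRot90` and `EquivalencesHold` are names invented here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec = std::array<float, 4>;  // two interleaved complex numbers

// Models AddSub: a - b in even lanes, a + b in odd lanes.
Vec AddSub(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] - b[i];
    r[i + 1] = a[i + 1] + b[i + 1];
  }
  return r;
}

// Models Reverse2: swaps the two lanes of each adjacent pair.
Vec Reverse2(const Vec& v) { return {v[1], v[0], v[3], v[2]}; }

// Models a complex add with 90-degree rotation: a + i*b per complex element,
// i.e. re = a.re - b.im and im = a.im + b.re.
Vec ComplexAddRot90(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] - b[i + 1];
    r[i + 1] = a[i + 1] + b[i];
  }
  return r;
}

// Models MulAdd: a * b + c in every lane (not fused in this scalar model).
Vec MulAdd(const Vec& a, const Vec& b, const Vec& c) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] * b[i] + c[i];
  return r;
}

// Models MulAddSub: a * b - c in even lanes, a * b + c in odd lanes.
Vec MulAddSub(const Vec& a, const Vec& b, const Vec& c) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] * b[i] - c[i];
    r[i + 1] = a[i + 1] * b[i + 1] + c[i + 1];
  }
  return r;
}

// Returns true if both identities hold for the given inputs.
bool EquivalencesHold(const Vec& a, const Vec& b, const Vec& c) {
  const Vec neg_zero = {-0.0f, -0.0f, -0.0f, -0.0f};
  const bool add_sub_ok = AddSub(a, b) == ComplexAddRot90(a, Reverse2(b));
  const bool mul_add_sub_ok =
      MulAddSub(a, b, c) == MulAdd(a, b, AddSub(neg_zero, c));
  return add_sub_ok && mul_add_sub_ok;
}
```

The comparison is exact only for inputs whose products and sums are representable in float, such as small integers; a real implementation with fused multiply-add may round differently from this unfused model.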

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 0) is equivalent to MulAdd(DupEven(b), c, a).

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 90) is equivalent to NegMulAdd(DupOdd(b), Reverse2(d, AddSub(Set(DFromV<decltype(b)>(), -0.0), c)), a).

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 180) is equivalent to NegMulAdd(DupEven(b), c, a).

SVE svcmla_f32_x(svptrue_b32(), a, b, c, 270) is equivalent to MulAdd(DupOdd(b), Reverse2(d, AddSub(Set(DFromV<decltype(b)>(), -0.0), c)), a).
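For readers less familiar with the rotations, the following scalar model shows what each one accumulates per complex element, based on my reading of the Arm FCMLA description. `CmlaModel` and `ComplexMulAccumulate` are hypothetical names for this sketch, not intrinsics or Highway ops. It also shows the usual pairing: rotation 0 followed by rotation 90 gives a full complex multiply-accumulate.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec = std::array<float, 4>;  // two interleaved complex numbers

// Scalar model of a complex multiply-add with rotation.
// acc, b and c hold interleaved (re, im) pairs; rot is 0, 90, 180 or 270.
// Any other rot value leaves acc unchanged in this model.
Vec CmlaModel(const Vec& acc, const Vec& b, const Vec& c, int rot) {
  Vec r = acc;
  for (std::size_t i = 0; i < r.size(); i += 2) {
    const float b_re = b[i], b_im = b[i + 1];
    const float c_re = c[i], c_im = c[i + 1];
    switch (rot) {
      case 0:  // uses the real part of b
        r[i] += b_re * c_re;
        r[i + 1] += b_re * c_im;
        break;
      case 90:  // uses the imaginary part of b, c rotated by +90 degrees
        r[i] -= b_im * c_im;
        r[i + 1] += b_im * c_re;
        break;
      case 180:  // negation of rotation 0
        r[i] -= b_re * c_re;
        r[i + 1] -= b_re * c_im;
        break;
      case 270:  // negation of rotation 90
        r[i] += b_im * c_im;
        r[i + 1] -= b_im * c_re;
        break;
      default:
        break;
    }
  }
  return r;
}

// acc + b * c for each complex element, built from rotations 0 and 90.
Vec ComplexMulAccumulate(const Vec& acc, const Vec& b, const Vec& c) {
  return CmlaModel(CmlaModel(acc, b, c, 0), b, c, 90);
}
```

With acc = (1+1i, 0), b = (1+2i, 3+4i) and c = (3+4i, 5+6i), `ComplexMulAccumulate` yields (-4+11i, -9+38i) in this model.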

jan-wassenberg commented 6 months ago

Thanks @johnplatts for pointing out that we can already target svcadd with the existing (Mul)AddSub. @Ryo-not-rio, how close does that get us to what you had in mind?

johnplatts commented 6 months ago

I have re-implemented AddSub and MulAddSub on SVE using svcadd in pull request #2054.

Ryo-not-rio commented 6 months ago

It's good to know that svcadd is already being used in Highway! I think we're still missing a direct link to the svcmla instructions. Even when there are equivalent ways of writing things in Highway, we've seen a performance hit due to the extra instructions required. For example, svcmla_f32_m(pg, acc0, vec_a, vec_b, 90); requires an extra reverse instruction in Highway.

jan-wassenberg commented 6 months ago

hm. It seems that the CMLA instruction is 'exotic' in the sense that other ISAs do not provide such an instruction. Do you have any suggestion on how we could handle that without performance cliffs in one ISA?

johnplatts commented 6 months ago

Here is a link to a generic implementation of the ComplexAddRot90/270 ops (equivalent to SVE svcadd_*_x) and ComplexMulAdd[Rot90/180/270] (equivalent to SVE svcmla_*_x): https://godbolt.org/z/1zn949a5f

There are also vcaddq_rot90/270_f16/f32/f64 intrinsics (equivalent to SVE svcadd_*_x) and vcmlaq[_rot90/180/270]_f16/f32/f64 intrinsics (equivalent to SVE svcmla_*_x), available with the FCMA extension (the FCADD and FCMLA instructions) on Armv8.3-A or later.

The generic implementation of the ComplexAdd/ComplexMulAdd ops linked above is efficient on most SIMD targets, including SSSE3/SSE4/AVX2/AVX3/NEON.

The SSSE3/SSE4/AVX2/AVX3 targets have AddSub instructions for F32/F64 vectors of 32 bytes or smaller, which helps the performance of the ComplexAdd/ComplexMulAdd ops.
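As an illustration of why AddSub helps on targets without a complex instruction, here is a scalar sketch of a common way to multiply interleaved complex numbers with two multiplies and one AddSub. It is not taken from the linked Compiler Explorer code, and `ComplexMul` is a name invented for this example; the other functions only model the lane behaviour of the ops they are named after.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec = std::array<float, 4>;  // two interleaved complex numbers

// Models DupEven: copies each even lane into the odd lane above it.
Vec DupEven(const Vec& v) { return {v[0], v[0], v[2], v[2]}; }

// Models DupOdd: copies each odd lane into the even lane below it.
Vec DupOdd(const Vec& v) { return {v[1], v[1], v[3], v[3]}; }

// Models Reverse2: swaps the two lanes of each adjacent pair.
Vec Reverse2(const Vec& v) { return {v[1], v[0], v[3], v[2]}; }

// Lane-wise multiply.
Vec Mul(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] * b[i];
  return r;
}

// Models AddSub: a - b in even lanes, a + b in odd lanes.
Vec AddSub(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] - b[i];
    r[i + 1] = a[i + 1] + b[i + 1];
  }
  return r;
}

// Complex multiply of interleaved vectors:
//   even lanes: a.re * b.re - a.im * b.im
//   odd lanes:  a.re * b.im + a.im * b.re
Vec ComplexMul(const Vec& a, const Vec& b) {
  const Vec re_part = Mul(DupEven(a), b);            // {a.re*b.re, a.re*b.im}
  const Vec im_part = Mul(DupOdd(a), Reverse2(b));   // {a.im*b.im, a.im*b.re}
  return AddSub(re_part, im_part);
}
```

On a target with a fused multiply-AddSub, the second multiply and the AddSub could presumably be merged into one op, but that is an assumption about the target rather than something this model demonstrates.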

jan-wassenberg commented 6 months ago

Thanks, those implementations look good to me! Are we proposing to add those as new ops, with single-instruction implementations for SVE?

That seems fine provided we are confident that apps would want to use those ops as defined. One remaining concern I have (because I am not familiar with complex arithmetic): are there perhaps other equivalent ways of implementing the desired formulas that would be more efficient than these generic implementations when run on non-SVE targets?