google / highway

Performance-portable, length-agnostic SIMD with runtime dispatch
Apache License 2.0

Adding new HWY_AVX10_2 target #2348

Open johnplatts opened 1 week ago

johnplatts commented 1 week ago

The upcoming Intel AVX10.2 instruction set (specification: https://www.intel.com/content/www/us/en/content-details/828965/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html) adds the following operations:

GCC 15 and Clang 20, both currently under development and scheduled for release in Spring 2025, will support the new AVX10.2 intrinsics.

The new _mm_cvttsp[h,s,d]_epi intrinsics available on AVX10.2 should also fix the undefined behavior that occurs with GCC when out-of-range floating-point vectors are converted to integer vectors (described in https://github.com/google/highway/issues/2183).
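To illustrate what a saturating, truncating conversion buys here, below is a minimal scalar sketch. It is not Highway code and not the AVX10.2 intrinsic itself: `SaturatingTruncToInt32` is a hypothetical helper, and the choice to map NaN to 0 is an assumption that should be checked against the specification. The point is only that every input gets a defined result, whereas a plain cast of an out-of-range float to an integer is undefined behavior in C++.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar model of a saturating float -> int32 conversion with
// truncation toward zero. Assumptions (not taken from the issue text):
// NaN maps to 0, and out-of-range inputs clamp to the nearest int32_t.
inline int32_t SaturatingTruncToInt32(float v) {
  if (std::isnan(v)) return 0;
  // 2^31 is exactly representable as a float; INT32_MAX is not.
  constexpr float kTwoPow31 = 2147483648.0f;
  if (v >= kTwoPow31) return std::numeric_limits<int32_t>::max();
  if (v < -kTwoPow31) return std::numeric_limits<int32_t>::min();
  // v is now within [-2^31, 2^31), so the cast is well defined.
  return static_cast<int32_t>(v);
}
```

With this model, `SaturatingTruncToInt32(3e9f)` yields `INT32_MAX` instead of depending on how the compiler happens to lower the cast.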

We also need to move some of the ops for 256-bit or smaller vectors, which are currently implemented in the hwy/ops/x86_512-inl.h header for AVX3 targets, into a separate header, because support for 512-bit vectors is optional on AVX10.2.

jan-wassenberg commented 1 week ago

Thanks for starting the discussion! Looks like GNR (Granite Rapids) has also just launched, but I think that supports AVX10.1.

Min/MaxNumber (Min with proper NaN handling per IEEE754:2019) and Min/MaxMagnitude look useful, as does F16 WidenMulPairwiseAdd. Would be very happy to see those added :) I don't see a burning need for bf16 ops. This target is AFAIK the only platform that has them, and just about the only demand I see for bf16 is mul/add, which is mostly covered by the existing WidenMul.
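For readers unfamiliar with these ops, here is a rough scalar reference sketch of the semantics as I understand them from IEEE 754-2019 (minimumNumber, maximumNumber and their magnitude variants). The `*Ref` function names are hypothetical and this is not the Highway implementation; the tie-breaking for equal magnitudes and signed zeros is an assumption based on the standard's definitions.

```cpp
#include <cmath>

// Hypothetical scalar references, not Highway APIs.
// If exactly one input is NaN, the other input is returned.
inline float MinNumberRef(float a, float b) {
  if (std::isnan(a)) return b;
  if (std::isnan(b)) return a;
  if (a == b) return std::signbit(a) ? a : b;  // treat -0 as less than +0
  return a < b ? a : b;
}

inline float MaxNumberRef(float a, float b) {
  if (std::isnan(a)) return b;
  if (std::isnan(b)) return a;
  if (a == b) return std::signbit(a) ? b : a;  // treat +0 as greater than -0
  return a > b ? a : b;
}

// Returns the input with the smaller absolute value. Equal magnitudes and
// NaN inputs fall through to MinNumberRef.
inline float MinMagnitudeRef(float a, float b) {
  const float abs_a = std::fabs(a);
  const float abs_b = std::fabs(b);
  if (abs_a < abs_b) return a;
  if (abs_b < abs_a) return b;
  return MinNumberRef(a, b);
}

// Returns the input with the larger absolute value. Equal magnitudes and
// NaN inputs fall through to MaxNumberRef.
inline float MaxMagnitudeRef(float a, float b) {
  const float abs_a = std::fabs(a);
  const float abs_b = std::fabs(b);
  if (abs_a > abs_b) return a;
  if (abs_b > abs_a) return b;
  return MaxNumberRef(a, b);
}
```

For example, `MinMagnitudeRef(-3.0f, 2.0f)` returns `2.0f` and `MaxMagnitudeRef(-3.0f, 2.0f)` returns `-3.0f`, while `MinNumberRef(1.0f, NAN)` returns `1.0f` rather than propagating the NaN.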

I agree we'd want to split the "AVX3" and "512-bit" aspects of x86_512-inl.h.

How about I make a TODO for around 2025-03 to lay the groundwork by creating the HWY_AVX10_2 (or HWY_AVX102?) target/boilerplate? Would you later like to add some of its functionality?

johnplatts commented 1 week ago

MinMagnitude/MaxMagnitude ops are implemented in pull request #2353.