Slow code generated for _mm256_mulhi_epi16

turalcar commented 18 hours ago

I suspect this is an issue in upstream LLVM. The SSE2 version (_mm_mulhi_epi16) and the unsigned version (_mm256_mulhi_epu16) show the same problem. If a wider register is available (xmm -> ymm -> zmm), it is used for the widened values instead of splitting them across two registers.

Code

https://godbolt.org/z/9Eqb45Keq

I tried this code:

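use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub unsafe fn bad(a: __m256i) -> __m256i {
    // Clear the sign bit of every 16-bit lane, so each lane is non-negative.
    let a = _mm256_and_si256(a, _mm256_set1_epi16(0x7FFF));
    _mm256_mulhi_epi16(a, _mm256_set1_epi16(1000))
}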

I expected to see this happen: more or less the same codegen as with -1000 as the multiplier.

Instead, this happened: it looks like the vector is widened to i32 for no good reason.
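
For reference, here is a rough scalar sketch of what each 16-bit lane of _mm256_mulhi_epi16 computes: the signed 32-bit product with only its upper 16 bits kept. The names mulhi_i16 and mulhi_lanes are hypothetical helpers for illustration; this models the intrinsic's result only, not how LLVM lowers it.

```rust
/// Scalar model of one 16-bit lane of a signed "multiply high":
/// widen both operands to i32, multiply, keep the upper 16 bits.
/// (Hypothetical helper, not part of std::arch.)
fn mulhi_i16(a: i16, b: i16) -> i16 {
    // The arithmetic shift keeps the sign of the 32-bit product.
    ((i32::from(a) * i32::from(b)) >> 16) as i16
}

/// The same model applied to all 16 lanes of a 256-bit vector.
/// (Hypothetical helper, not part of std::arch.)
fn mulhi_lanes(a: [i16; 16], b: [i16; 16]) -> [i16; 16] {
    std::array::from_fn(|i| mulhi_i16(a[i], b[i]))
}
```

The widening to i32 here is only conceptual. A single 16-bit multiply-high instruction produces the same result per lane, which is why widening in the generated code looks unnecessary.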

Version it worked on

It most recently worked on: Rust 1.74

Version with regression

I checked on godbolt with 1.75-1.81 and with the current beta and nightly.

nikic commented 18 hours ago

Upstream issue: https://github.com/llvm/llvm-project/issues/109790

nikic commented 11 hours ago

Fixed upstream in LLVM 20.
