rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

_bzhi_u32/_bzhi_u64 has poor codegen without bmi2 feature #88566

Open mqudsi opened 3 years ago

mqudsi commented 3 years ago

As reported in this article about some issues encountered using SIMD with Rust, calls to the bzhi intrinsics made without the bmi2 CPU target feature enabled produce some odd codegen. The bzhi instruction isn't emulated and is still executed directly, but the intrinsic is never inlined, resulting in a completely unnecessary function call (which may be in a hot path).

e.g.

#[target_feature(enable = "avx2")]
pub unsafe fn bzhi(num: u32) -> u32 {
    core::arch::x86_64::_bzhi_u32(num, 31)
}

compiles to

core::core_arch::x86::bmi2::_bzhi_u32:
        sub     rsp, 4
        bzhi    eax, edi, esi
        mov     dword ptr [rsp], eax
        mov     eax, dword ptr [rsp]
        add     rsp, 4
        ret

example::bzhi:
        push    rax
        mov     esi, 31
        call    core::core_arch::x86::bmi2::_bzhi_u32
        mov     dword ptr [rsp + 4], eax
        mov     eax, dword ptr [rsp + 4]
        pop     rcx
        ret

(and this is what it looks like with optimizations enabled when it can't just jmp to the intrinsic: godbolt)

Other intrinsics get inlined after emulation all the time (e.g. ctlz); this one isn't emulated but it's not inlined, either.
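For comparison, here's a minimal sketch (the function name is just for illustration) that inlines fine even without any extra target features enabled:

// Illustrative only: without the `lzcnt` feature this is lowered to a
// BSR-based sequence (plus handling for the zero case), and it still gets
// inlined into callers instead of going through an out-of-line intrinsic call.
pub fn clz(num: u32) -> u32 {
    num.leading_zeros()
}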

I'm not sure whether there's a good reason for this, so please pardon me if I'm missing something obvious. Is it intentional, to prevent a #UD in some odd case where the mere presence of the unrecognized instruction (even if never executed) is a problem, but moving it into a separate function somehow avoids that problem?

(Also, it would be amazing if we could discuss having a feature like avx2 unlock guaranteed available (but not synonymous) features like bmi1 and bmi2 but this is not the place for that.)

mqudsi commented 2 years ago

@rustbot label +T-compiler +A-LLVM +C-bug

thomcc commented 2 years ago

I think this is deliberate. If an instruction is unavailable, we can't inline it because various code motion transformations may move it outside a conditionally executed branch -- for example, we don't want if is_x86_feature_enabled!("bmi2") { call_function_using_bmi2(); } to end up with bmi2 instructions outside the branch, and can't really guarantee that that won't happen if these things are inlined.
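Roughly, the pattern I mean (the helper names here are illustrative, not real APIs):

#[target_feature(enable = "bmi2")]
unsafe fn call_function_using_bmi2(num: u32) -> u32 {
    // Clears the bits of `num` at index 31 and above.
    core::arch::x86_64::_bzhi_u32(num, 31)
}

fn fallback(num: u32) -> u32 {
    num & 0x7fff_ffff
}

fn dispatch(num: u32) -> u32 {
    if is_x86_feature_detected!("bmi2") {
        // Sound only because the branch just verified BMI2 at runtime.
        // If the body of the callee were inlined here, the compiler would
        // be free to hoist the bzhi instruction out of this branch, where
        // it could execute on a CPU without BMI2.
        unsafe { call_function_using_bmi2(num) }
    } else {
        fallback(num)
    }
}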

So, the only real solution here I think would be

it would be amazing if we could discuss having a feature like avx2 unlock guaranteed available (but not synonymous) features like bmi1 and bmi2

If you can show where it's documented that these are actually guaranteed available, that would help... But given that this kind of thing can be enabled or disabled by the OS in a fine-grained way, I kind of expect that these have to be treated separately. I could be mistaken though, and it would be nice to simplify this if possible.

CC @workingjubilee who is interested in target feature stuff at the moment.

mqudsi commented 2 years ago

... because various code motion transformations may move it outside a conditionally executed branch

That's a good consideration, but I wonder whether the current approach provably mitigates that issue?

Presumably, you have

#[inline(never)]
fn _bzhi_u32() { asm!("bzhi    eax, edi, esi; ....") }

but I think that only guarantees that the specific bzhi instruction won't be inlined, right? If you have a second function using this wrapper function (in lieu of a bzhi instruction directly), that wouldn't stop the calling function, with its call to _bzhi_u32(), from being inlined itself, right? (not that inlining is exactly the case where this problem appears, but let's just pretend it is).

EDIT: I see your point - the function call itself is an implicit compiler fence, so the _bzhi_u32 call would never be reordered above the if block

If is_x86_feature_enabled(...) issues a core::sync::atomic::compiler_fence(Ordering::SeqCst), I think that would suffice to prevent anything (inlined or not, via a function call or as a direct instruction) from being speculatively executed in advance of the jump - would that be good enough?

If you can show where it's documented that these are actually guaranteed available, that would help

I did some research (first on Wikipedia, then in the assembly reference manuals from Intel and AMD) for my article on SIMD pitfalls in Rust; the salient part is summarized here:

AMD introduced support for the BMI2 instructions at the same time as they first introduced support for AVX2 (as part of the Excavator microarchitecture, in 2015). Intel likewise introduced BMI2 support (along with BMI1 support, as a matter of fact) as part of the Haswell microarchitecture in 2013, also at the same time they debuted support for AVX2. No AVX2 CPU has shipped from either company without BMI2 since then, and it’s pretty unfathomable that any would in the future.

But you're right, there's nothing stopping a microcode update from disabling bmi separately from avx2 due to some weird errata (even if I can't imagine such a case right now), some new vulnerability could be found tomorrow that causes an OS to disable bzhi (and again, somehow not avx2), or a virtualization platform could virtualize avx2 but not bmi2. To be honest though, I can't imagine any of those cases not also trapping bmi2 instructions and emulating them in software (given the amount of hand-written simd code out there, if nothing else); but still, tomorrow a new CPU manufacturer could release a soft (or even hard) core with avx2 and no bmi2 - I guess it's all a question of how much risk you want to take and whether or not the benefits would be worth it.

(In my article, I mention and give examples of how many developers might not realize a particular instruction isn't automatically available if a technical superset of it is, but that's what profiling and disassembly are for, I guess.)

Thanks for giving me something to think about and for cc'ing others; I think the original issue and this tangent are both worth having a conversation about, even if we end up with the same status quo. I'm eager to hear some other thoughts on how this could work (or fail to work), especially from anyone that's had more compiler-wrangling experience!

mqudsi commented 2 years ago

Another interesting point regarding whether or not a feature like avx2 unlocks predecessor features is how many people actually test for all of them before branching. I just realized that my code ended up looking like this:

At the site of dispatch: https://github.com/neosmart/tac/blob/b9e134adf4fbb97b09594de05a226d24df6de6a7/src/tac.rs#L128-L139

#[allow(unreachable_code)]
fn search_auto<W: Write>(bytes: &[u8], mut output: &mut W) -> Result<(), std::io::Error> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2") {
        return unsafe { search256(bytes, &mut output) };
    }

    #[cfg(all(feature = "nightly", target_arch = "aarch64"))]
    return search128(bytes, &mut output);

    search(bytes, &mut output)
}

Followed by the actual implementation: https://github.com/neosmart/tac/blob/b9e134adf4fbb97b09594de05a226d24df6de6a7/src/tac.rs#L186-L203

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
#[target_feature(enable = "lzcnt")]
#[target_feature(enable = "bmi2")]
#[allow(unused_unsafe)]
/// This isn't in the hot path, so prefer dynamic dispatch over a generic `Write` output.
/// This is an AVX2-optimized newline search function that searches a 32-byte (256-bit) window
/// instead of scanning character-by-character (once aligned). This is a *safe* function, but must
/// be adorned with `unsafe` to guarantee it's not called without first checking for AVX2 support.
///
/// We need to explicitly enable lzcnt support for u32::leading_zeros() to use the `lzcnt`
/// instruction instead of an extremely slow combination of branching + BSR. We do not need to test
/// for lzcnt support before calling this method as lzcnt was introduced by AMD alongside SSE4a, long
/// before AVX2, and by Intel on Haswell.
///
/// BMI2 is explicitly opted into to inline the BZHI instruction; otherwise a call to the intrinsic
/// function is added and not inlined.
unsafe fn search256<W: Write>(bytes: &[u8], mut output: &mut W) -> Result<(), std::io::Error> {

So I'm only testing for avx2 before branching, but I've had to add both bmi2 and lzcnt as explicit codegen features at the implementation site (lzcnt is another that you're not going to find an avx2 processor not supporting). I guess that means my dispatch site should actually check for avx2, lzcnt, and bmi2 before branching! I wonder how much code there is out there like this or if I'm the only one.
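For the record, a sketch of what the stricter dispatch would look like (mirroring the snippet above, not the actual code in the repository):

#[allow(unreachable_code)]
fn search_auto<W: Write>(bytes: &[u8], mut output: &mut W) -> Result<(), std::io::Error> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("lzcnt")
        && is_x86_feature_detected!("bmi2")
    {
        // Only take the AVX2 path once every feature the implementation
        // enables via #[target_feature] has been verified at runtime.
        return unsafe { search256(bytes, &mut output) };
    }

    search(bytes, &mut output)
}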

thomcc commented 2 years ago

If is_x86_feature_enabled(...) issues a core::sync::atomic::compiler_fence(Ordering::SeqCst), I think that would suffice to prevent anything (inlined or not, via a function call or as a direct instruction) from being speculatively executed in advance of the jump - would that be good enough?

It would not. compiler_fence only prevents reordering of memory accesses (and even then only volatile and atomic accesses). In general it does not prevent code motion in the way you would need for this, nor is there anything else we have available that could.
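For reference, this is roughly the extent of what compiler_fence is specified to do: ordering memory accesses within a single thread (e.g. against a signal handler). It says nothing about non-memory instructions, so it can't keep something like bzhi on one side of a feature-detection branch:

use std::sync::atomic::{compiler_fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    // Within this thread, the store to DATA cannot be compiler-reordered
    // after the store to READY (useful against e.g. a signal handler on
    // the same thread). The fence only constrains memory accesses; it does
    // not pin arbitrary instructions in place.
    compiler_fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}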