Open mqudsi opened 3 years ago
@rustbot label +T-compiler +A-LLVM +C-bug
I think this is deliberate. If an instruction is unavailable, we can't inline it, because various code motion transformations may move it outside a conditionally executed branch. For example, we don't want `if is_x86_feature_enabled!("bmi2") { call_function_using_bmi2(); }` to end up with bmi2 instructions outside the branch, and we can't really guarantee that won't happen if these things are inlined.
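To make the hazard concrete, here is a minimal sketch of that pattern (the function names are hypothetical; `is_x86_feature_detected!` is the actual std macro behind the shorthand above). If `uses_bmi2` were inlined into `dispatch`, nothing would structurally prevent the optimizer from hoisting the BMI2 instruction above the feature check:

```rust
// Hypothetical example of the feature-gated dispatch discussed above.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn uses_bmi2(x: u32) -> u32 {
    // bzhi: zero the bits of x at and above bit index 16
    core::arch::x86_64::_bzhi_u32(x, 16)
}

#[cfg(target_arch = "x86_64")]
fn dispatch(x: u32) -> u32 {
    if is_x86_feature_detected!("bmi2") {
        // Safe: we just confirmed bmi2 is available on this CPU.
        unsafe { uses_bmi2(x) }
    } else {
        x & 0xFFFF // scalar fallback with the same result
    }
}
```

As long as `uses_bmi2` stays an out-of-line call, the bzhi can only execute after the branch is taken.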
So, the only real solution here I think would be:

> it would be amazing if we could discuss having a feature like `avx2` unlock guaranteed available (but not synonymous) features like `bmi1` and `bmi2`
If you can show where it's documented that these are actually guaranteed available, that would help... But given that this kind of thing can be enabled or disabled by the OS in a fine-grained way, I kind of expect that these have to be treated separately. I could be mistaken though, and it would be nice to simplify this if possible.
CC @workingjubilee who is interested in target feature stuff at the moment.
> ... because various code motion transformations may move it outside a conditionally executed branch
That's a good consideration, but I wonder if it is a provable mitigation for the issue in question?
Presumably, you have

```rust
#[inline(never)]
fn _bzhi_u32() { asm!("bzhi eax, edi, esi; ....") }
```

but I think that only guarantees that the specific `bzhi` instruction won't be inlined, right? If you have a second function using this wrapper function (in lieu of a `bzhi` instruction directly), that wouldn't stop the calling function, with its call to `_bzhi_u32()`, from being inlined itself, right? (Not that inlining is exactly the case where this problem appears, but let's just pretend it is.)
EDIT: I see your point - the function call itself is an implicit compiler fence, so the `_bzhi_u32` call would never be reordered above the `if` block.
If `is_x86_feature_enabled(...)` issues a `core::sync::atomic::compiler_fence(Ordering::SeqCst)`, I think that would suffice to prevent anything (inlined or not, via a function call or as a direct instruction) from being speculatively executed in advance of the jump - would that be good enough?
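For concreteness, a sketch of what that suggestion would look like (the helper name is hypothetical; as the reply further down notes, `compiler_fence` only constrains memory accesses, so this would not actually pin the instruction below the branch):

```rust
use core::sync::atomic::{compiler_fence, Ordering};

#[cfg(target_arch = "x86_64")]
fn fenced_dispatch(x: u32) -> u32 {
    if is_x86_feature_detected!("bmi2") {
        // The proposed barrier: intended to keep the bzhi below the check.
        // (It orders memory accesses only, so it does not achieve this.)
        compiler_fence(Ordering::SeqCst);
        unsafe { core::arch::x86_64::_bzhi_u32(x, 8) }
    } else {
        x & 0xFF // scalar fallback with the same result
    }
}
```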
> If you can show where it's documented that these are actually guaranteed available, that would help
I did some research (first on Wikipedia, then in the assembly reference manuals from Intel and AMD) for my article on SIMD pitfalls in Rust; the salient part is summarized here:
> AMD introduced support for the BMI2 instructions at the same time as they first introduced support for AVX2 (as part of the Excavator microarchitecture, in 2015). Intel likewise introduced BMI2 support (along with BMI1 support, as a matter of fact) as part of the Haswell microarchitecture in 2013, also at the same time they debuted support for AVX2. No AVX2 CPU has shipped from either company without BMI2 since then, and it’s pretty unfathomable that any would in the future.
But you're right: there's nothing stopping a microcode update from disabling bmi2 separately from avx2 due to some weird errata (even if I can't imagine such a case right now), some new vulnerability could be found tomorrow that causes an OS to disable bzhi (and again, somehow not avx2), or a virtualization platform could expose avx2 but not bmi2. To be honest, though, I can't imagine any of those cases not also trapping bmi2 instructions and emulating them in software (given the amount of hand-written SIMD code out there, if nothing else). Still, tomorrow a new CPU manufacturer could release a soft (or even hard) core with avx2 and no bmi2 - I guess it's all a question of how much risk you want to take and whether or not the benefits would be worth it.
(In my article, I mention and give examples of how many developers might not realize a particular instruction isn't automatically available if a technical superset of it is, but that's what profiling and disassembly are for, I guess.)
Thanks for giving me something to think about and for cc'ing others; I think the original issue and this tangent are both worth having a conversation about, even if we end up with the same status quo. I'm eager to hear some other thoughts on how this could work (or fail to work), especially from anyone that's had more compiler-wrangling experience!
Another interesting point regarding whether or not a feature like avx2 unlocks predecessor features is how many people actually test for all of them before branching. I just realized that my code ended up looking like this:
At the site of dispatch: https://github.com/neosmart/tac/blob/b9e134adf4fbb97b09594de05a226d24df6de6a7/src/tac.rs#L128-L139
```rust
#[allow(unreachable_code)]
fn search_auto<W: Write>(bytes: &[u8], mut output: &mut W) -> Result<(), std::io::Error> {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2") {
        return unsafe { search256(bytes, &mut output) };
    }

    #[cfg(all(feature = "nightly", target_arch = "aarch64"))]
    return search128(bytes, &mut output);

    search(bytes, &mut output)
}
```
Followed by the actual implementation: https://github.com/neosmart/tac/blob/b9e134adf4fbb97b09594de05a226d24df6de6a7/src/tac.rs#L186-L203
```rust
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
#[target_feature(enable = "lzcnt")]
#[target_feature(enable = "bmi2")]
#[allow(unused_unsafe)]
/// This isn't in the hot path, so prefer dynamic dispatch over a generic `Write` output.
/// This is an AVX2-optimized newline search function that searches a 32-byte (256-bit) window
/// instead of scanning character-by-character (once aligned). This is a *safe* function, but must
/// be adorned with `unsafe` to guarantee it's not called without first checking for AVX2 support.
///
/// We need to explicitly enable lzcnt support for u32::leading_zeros() to use the `lzcnt`
/// instruction instead of an extremely slow combination of branching + BSR. We do not need to test
/// for lzcnt support before calling this method as lzcnt was introduced by AMD alongside SSE4a, long
/// before AVX2, and by Intel on Haswell.
///
/// BMI2 is explicitly opted into to inline the BZHI instruction; otherwise a call to the intrinsic
/// function is added and not inlined.
unsafe fn search256<W: Write>(bytes: &[u8], mut output: &mut W) -> Result<(), std::io::Error> {
```
So I'm only testing for avx2 before branching, but I've had to add both bmi2 and lzcnt as explicit codegen features at the implementation site (lzcnt is another feature you're not going to find an avx2 processor without). I guess that means my dispatch site should actually check for avx2, lzcnt, and bmi2 before branching! I wonder how much code is out there like this, or if I'm the only one.
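A sketch of what the corrected dispatch check could look like (the helper name is hypothetical; the idea is simply to test every feature that `search256` enables via `#[target_feature]`, not just avx2, before taking the SIMD path):

```rust
// Hypothetical helper: gate the SIMD path on all three features that the
// AVX2 implementation actually relies on, not just avx2 itself.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn simd_path_available() -> bool {
    is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("lzcnt")
        && is_x86_feature_detected!("bmi2")
}
```

On real hardware the three checks will virtually always agree (per the history quoted above), but this makes the safety argument for calling the `unsafe fn` line up exactly with what the function enables.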
> If `is_x86_feature_enabled(...)` issues a `core::sync::atomic::compiler_fence(Ordering::SeqCst)`, I think that would suffice to prevent anything (inlined or not, via a function call or as a direct instruction) from being speculatively executed in advance of the jump - would that be good enough?
It would not. `compiler_fence` only prevents reordering of memory accesses (and even then, only volatile and atomic accesses). In general it does not prevent code motion in the way you would need for this, nor is there anything else we have available that could.
As reported in this article about some issues encountered using SIMD with Rust, calls to `bzhi` intrinsics made without the `bmi2` CPU target feature enabled give some odd codegen. The `bzhi` instruction isn't emulated and is still executed directly - but the intrinsic is never inlined, resulting in a completely unnecessary function call (which may be in a hot path). e.g.

compiles to

(and this is what it looks like with optimizations enabled, when it can't just `jmp` to the intrinsic: godbolt)

Other intrinsics get inlined after emulation all the time (e.g. ctlz); this one isn't emulated, but it's not inlined, either.
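The code snippets referenced above were not preserved here. As an illustrative reconstruction (not the exact code from the issue), a call like the following compiles even though the enclosing function never enables `bmi2`, and without `-C target-feature=+bmi2` the intrinsic shows up in codegen as an out-of-line call rather than an inlined `bzhi`:

```rust
// Illustrative sketch: bzhi intrinsic called without the bmi2 target feature.
#[cfg(target_arch = "x86_64")]
fn clear_high_bits(value: u32, index: u32) -> u32 {
    // Safety caveat: bzhi executes unconditionally here; on a CPU without
    // BMI2 this raises #UD. That is the hazard discussed in this issue.
    unsafe { core::arch::x86_64::_bzhi_u32(value, index) }
}
```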
I'm not sure if there's a good reason for this or not, so please pardon me if I'm just missing something obvious. Is it intentional, to prevent a `#UD` in some odd cases where the mere presence of the unrecognized instruction, even if never executed, is a problem, but it can be moved into another function without issue?

(Also, it would be amazing if we could discuss having a feature like `avx2` unlock guaranteed available (but not synonymous) features like `bmi1` and `bmi2`, but this is not the place for that.)