rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Inlining causes miscompilation of code that mixes target features #116573

Open RalfJung opened 11 months ago

RalfJung commented 11 months ago

The following code ought to be completely fine and UB-free:

use std::mem::transmute;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

extern "C" fn no_target_feature(_dummy: f32, x: __m256) {
    let val = unsafe { transmute::<_, [u32; 8]>(x) };
    dbg!(val);
}

#[inline(always)] 
fn no_target_feature_intermediate(dummy: f32, x: __m256) {
    no_target_feature(dummy, x);
}

#[target_feature(enable = "avx")]
unsafe fn with_target_feature(x: __m256) {
  // Critical call: caller and callee have different target features.
  // However, we use the Rust ABI, so this is fine.
  no_target_feature_intermediate(0.0, x);
}

fn main() {
    assert!(is_x86_feature_detected!("avx"));
    // SAFETY: we checked that the `avx` feature is present.
    unsafe {
        with_target_feature(transmute([1; 8]));
    }
}

There's some unsafe going on, but the safety comment explains why that is okay. We are even taking care to follow the target-feature related ABI rules (see https://github.com/rust-lang/rust/pull/115476); all calls between functions with different target-features use the "Rust" ABI.

And yet, this prints (when built without optimizations)

[src/main.rs:9] val = [
    1,
    1,
    1,
    1,
    538976288,
    538976288,
    538976288,
    538976288,
]

The value got clobbered while being passed through the various functions.

Replacing inline(always) by inline(never) makes the issue disappear. But inline attributes must never cause miscompilation, so there's still a soundness bug here.

I don't know if this is the MIR inliner (Cc @rust-lang/wg-mir-opt) or the LLVM inliner going wrong.

saethlin commented 11 months ago

This still miscompiles with -Zmir-opt-level=0. The playground does not let you pass flags (https://github.com/rust-lang/rust-playground/pull/781), so I generally advise against using it. godbolt supports flags, execution, setting environment variables, and picking among old toolchains: https://godbolt.org/z/WcMq14MPG

RalfJung commented 11 months ago

Okay so it's an LLVM bug then it seems. Cc @nikic

nikic commented 11 months ago

Is there a way to reproduce this without #[inline(always)]? Forcing inlining disables target-feature safety checks in LLVM.

(Incidentally, there was an attempt to not do that in LLVM 17, but this was reverted due to the large amount of regressions it caused. People rely on that a lot, including in Rust.)

saethlin commented 11 months ago

Forcing inlining disables target-feature safety checks in LLVM.

Are you saying #[inline(always)] is unsound?

RalfJung commented 11 months ago

Yeah that is no good, we can't have (safe!) attributes just override checks which are needed for soundness.

I don't know a reproducer without inline(always), but I consider this a critical bug even with inline(always).

RalfJung commented 11 months ago

(Incidentally, there was an attempt to not do that in LLVM 17, but this was reverted due to the large amount of regressions it caused. People rely on that a lot, including in Rust.)

Perf regressions are acceptable when fixing soundness bugs. We then have to see how much of the perf we can get back without compromising soundness.

briansmith commented 11 months ago

There wouldn't necessarily need to be a perf regression. I would expect it to compile down to the code that would exist as if the intermediate function were not there:

use std::mem::transmute;
#[cfg(target_arch = "x86")]
use std::arch::x86::*;
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

extern "C" fn no_target_feature(_dummy: f32, x: __m256) {
    let val = unsafe { transmute::<_, [u32; 8]>(x) };
    dbg!(val);
}

#[target_feature(enable = "avx")]
unsafe fn with_target_feature(x: __m256) {
  // Critical call: caller and callee have different target features.
  // The compiler needs to deal with the ABI transition here.
  no_target_feature(0.0, x);
}

fn main() {
    assert!(is_x86_feature_detected!("avx"));
    // SAFETY: we checked that the `avx` feature is present.
    unsafe {
        with_target_feature(transmute([1; 8]));
    }
}

I would expect that inlining a function does not, in itself, have any effect on ABI issues.

BTW, I hope to be writing code just like this very soon, but instead of extern "C" we'll sometimes force the sysv ABI (even on Windows), ideally.

RalfJung commented 11 months ago

I would expect it to compile down to the code that would exist as if the intermediate function were not there:

That's what it does, and that's the bug. That code is wrong, see https://github.com/rust-lang/rust/issues/116558.

Basically LLVM tied together flags affecting ABI and flags relevant for codegen, and I think that was a huge mistake. This issue and https://github.com/rust-lang/rust/issues/116558 show why.

BTW, I hope to be writing code just like this very soon, but instead of extern "C" we'll sometimes force the sysv ABI (even on Windows), ideally.

This issue affects all non-"Rust" ABIs.

workingjubilee commented 11 months ago

If inline(always) is unsound, we need to castrate it so it's just inline. We can do that on our end, without any need to consult LLVM for its preferences.

nikic commented 11 months ago

(Incidentally, there was an attempt to not do that in LLVM 17, but this was reverted due to the large amount of regressions it caused. People rely on that a lot, including in Rust.)

Perf regressions are acceptable when fixing soundness bugs. We then have to see how much of the perf we can get back without compromising soundness.

It's a bit more complex than that. Examples of regressions this caused include:

- Failure to inline platform vector intrinsics into functions.

You might call this "just a perf issue", but inlining of platform vector intrinsics is an important part of their semantics. They are useless if this does not happen reliably.

These issues are not fundamental, but caused by target feature checks being too conservative, especially for non-X86 targets.

The semantics of always_inline can be changed, but it would require some work to make sure we have at least somewhat accurate compatibility checks across targets.


I believe something that was discussed in the past but never happened, is that we should add a lint for calling a function with less target features, while passing vector values to it. Independent of the soundness issues discussed here, the lack of inlining makes this a performance footgun, and it's almost certainly not what people want to do.
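For illustration, the pattern such a lint would flag is essentially the shape from the original report: a #[target_feature] function passing a vector value to a callee with fewer target features (a minimal sketch; the names are made up):

use core::arch::x86_64::__m256;

fn takes_vector(_x: __m256) {}

#[target_feature(enable = "avx")]
unsafe fn has_avx(x: __m256) {
    // A lint could fire here: the callee has fewer target features than the
    // caller, and a vector value crosses the call. LLVM may refuse to inline
    // such a call, so the vector is likely to travel through an out-of-line
    // call instead of staying in registers.
    takes_vector(x);
}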

briansmith commented 11 months ago

If inline(always) is unsound, we need to castrate it so it's just inline.

Pretty much anybody who would write code like the above would very much appreciate at least a warning if that is going to happen. If/when I see such a warning I would remove the "intermediate" wrapper. Then I would rewrite the code into the form I shared.

I would expect it to compile down to the code that would exist as if the intermediate function were not there:

That's what it does, and that's the bug. That code is wrong, see https://github.com/rust-lang/rust/issues/116558.

Then how is this particular issue a distinct bug from #116558, especially considering that nobody wants their #[inline(always)] function to not be inlined?

workingjubilee commented 11 months ago

@briansmith: Pretty much anybody who would write code like the above would very much appreciate at least a warning if that is going to happen. If/when I see such a warning I would remove the "intermediate" wrapper. Then I would rewrite the code into the form I shared.

Completely understandable. We should design a lint that will fire on all cases here.

@nikic: I believe something that was discussed in the past but never happened, is that we should add a lint for calling a function with less target features, while passing vector values to it. Independent of the soundness issues discussed here, the lack of inlining makes this a performance footgun, and it's almost certainly not what people want to do.

Sounds good to me.

@nikic: You might call this "just a perf issue", but inlining of platform vector intrinsics is an important part of their semantics. They are useless if this does not happen reliably.

And yeah, nikic is right here. We might have to hack in an #[rustc_REALLY_always_inline] for usage by core::arch, while we work on fixing the soundness issues. I think that's fine.

RalfJung commented 11 months ago

Then how is this particular issue a distinct bug from https://github.com/rust-lang/rust/issues/116558, especially considering that nobody wants their #[inline(always)] function to not be inlined?

This is a clear soundness bug IMO, https://github.com/rust-lang/rust/issues/116558 is "just" very odd semantics and ABI footguns. I think we should resolve #116558 by refusing to compile the example there but I'm not convinced that will suffice to fix this soundness bug.

Failure to inline platform vector intrinsics into functions.

If the caller had the target feature, they should still get inlined, no? And if someone calls an AVX2 intrinsic from a function that doesn't have the AVX2 feature then surely exploding that code is fine, it should probably not even compile...

workingjubilee commented 11 months ago

People use dynamic feature dispatch, however?

nikic commented 11 months ago

Failure to inline platform vector intrinsics into functions.

If the caller had the target feature, they should still get inlined, no? And if someone calls an AVX2 intrinsic from a function that doesn't have the AVX2 feature then surely exploding that code is fine, it should probably not even compile...

The relevant case is more along the lines of: the caller has features +a,+b and the platform intrinsic has +a. LLVM refuses to inline because this is potentially unsafe. LLVM's default assumptions about what is safe to inline are very conservative. If the target doesn't tell it that e.g. subset inlining is always safe, it's only going to inline if the target features are exactly the same. Not all targets implement the necessary hook to provide a more precise compatibility check.
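For illustration, the subset case might look like this on x86_64 (a hedged sketch: the core::arch wrapper for _mm_max_epi32 is gated on sse4.1 while the caller enables avx2; whether it actually gets inlined depends on the target's compatibility hook described above):

use core::arch::x86_64::{__m128i, _mm_max_epi32};

#[target_feature(enable = "avx2")]
unsafe fn max_lanes(a: __m128i, b: __m128i) -> __m128i {
    // The intrinsic wrapper requires only sse4.1; this caller enables avx2.
    // This is the "caller has +a,+b, callee has +a" situation: fine to inline
    // in principle, but only if the target reports the feature sets as
    // compatible.
    _mm_max_epi32(a, b)
}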

Or to give a less obvious example, you have a function with +armv8-a and an intrinsic with +armv7-a. That's not a case of subset inlining and requires special handling (and I wouldn't be able to say off the top of my head whether that is universally safe in the first place or not).

workingjubilee commented 11 months ago

Inlining across Arm "major versions" is honestly pretty dangerous because they routinely retire older instructions on the majors.

RalfJung commented 11 months ago

Okay so it sounds like the inlining check LLVM has is too naive and that's why properly enforcing it failed? It's "just" more work (such as enough support for subset detection) to e.g. get the vendor intrinsics to actually be inlined?

Is inlining only concerned with ABI here or are there other issues? When I do something like

if have_feature {
  function_with_more_target_features();
}

Then ignoring ABI that should be entirely fine to inline since all the potentially problematic instructions are inside the if, so a CPU without the feature will never see them. (Except on wasm which verifies functions before they are called... I'm inclined to ignore wasm for now.)

Is that how it works? Or is the issue that LLVM can only represent "which features are available" on a whole-function basis and so the instructions might then subsequently "leak out" from the if to the surrounding code?
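Spelled out as real Rust, the if have_feature { ... } sketch above is the usual runtime-detection dispatch pattern (a sketch with made-up function names; process_avx2 and process_scalar are hypothetical):

fn process(data: &mut [f32]) {
    if is_x86_feature_detected!("avx2") {
        // All AVX2 code is behind the runtime check, so a CPU without the
        // feature never reaches those instructions.
        unsafe { process_avx2(data) }
    } else {
        process_scalar(data)
    }
}

#[target_feature(enable = "avx2")]
unsafe fn process_avx2(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}

fn process_scalar(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}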

RalfJung commented 11 months ago

An alternative to the inlining check would be to stop treating ABI so implicitly -- a function call shouldn't just use the target features from its context to determine the ABI, it should be given an explicit list of target features to use. Then inlining couldn't possibly affect ABI, and frontends would also have a much easier time dealing with the ABI implications of target features. Is that something that could realistically be done in LLVM?

briansmith commented 11 months ago

Is that how it works? Or is the issue that LLVM can only represent "which features are available" on a whole-function basis and so the instructions might then subsequently "leak out" from the if to the surrounding code?

That is how it is in C.

An alternative to the inlining check would be to stop treating ABI so implicitly -- a function call shouldn't just use the target features from its context to determine the ABI, it should be given an explicit list of target features to use. Then inlining couldn't possibly affect ABI, and frontends would also have a much easier time dealing with the ABI implications of target features. Is that something that could realistically be done in LLVM?

The simplest workable model is that when a function is actually inlined then it will be compiled as though it has the target features of the function it was inlined into. (If it is inlined multiple times into multiple functions then each inlined copy will potentially be using different target features.) Note that this applies even to functions that aren't marked #[inline]. (If an #[inline] function has #[cfg(target_feature)] that is not a subset of a caller's target features then it can't be inlined into that function.)

It only makes sense to talk about "ABI" or calling convention of a copy of a function that isn't inlined and that is in some way exposed to some kind of separate compilation. The (exported version of the) function must be called using the ABI/calling convention that it is declared with when it is not inlined. If the function is inlined then anything can happen. The compiler can also generate alternative out-of-line versions of the function that it calls within the compilation unit with custom calling conventions/ABI.

In the example you gave, everything should ideally be inlined into with_target_feature because there's no reason to generate out-of-line versions of the functions it calls (transitively). with_target_feature won't be inlined into main by LLVM-based compilers because the target feature selection of main doesn't contain avx and the feature selection is per-function. In theory it could be smarter. If this were a library and no_target_feature were exported from the library then things might be different.

It seems like the bug here is that the compiler assumes that any function that a target_feature(enable = "avx") function calls also has that target feature set even when calling a not-inlined version of the function.

briansmith commented 11 months ago

a function call shouldn't just use the target features from its context to determine the ABI, it should be given an explicit list of target features to use.

The explicit list of target features to use is already given in the declaration/definition of the function being called. If the calling convention/ABI is a function c of the declared ABI a + declared target features f then the compiler, when generating the function call, can determine which ABI to use by c(a(x), f(x)) where x is the function being called.

Then inlining couldn't possibly affect ABI, and frontends would also have a much easier time dealing with the ABI implications of target features. Is that something that could realistically be done in LLVM?

Clang has the same kinds of features that we're trying to implement here and it seems to work the way I described, so yes.

RalfJung commented 11 months ago

The simplest workable model is that when a function is actually inlined then it will be compiled as though it has the target features of the function it was inlined into.

This is not the problem. The problem is functions that are being called from the code that is being inlined. The behavior of the call operation depends on its context, and so inlining a call is unsound unless you can prove that in the other context you are moving the call to, the ABI used by this call will be the same.

In my example, the problematic call is that of no_target_feature: it's originally a call in a no-target-feature function, and that works fine, but when no_target_feature_intermediate gets inlined then the call to no_target_feature moves into a different context; it's now in a function with the AVX target feature, meaning it behaves differently. That's why inlining is wrong.
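Concretely, in terms of the code at the top of this issue (restated here with comments, not new code):

// Before inlining: the call below sits in a function without the AVX feature,
// so it is lowered for a non-AVX context, matching what `no_target_feature`
// expects.
#[inline(always)]
fn no_target_feature_intermediate(dummy: f32, x: __m256) {
    no_target_feature(dummy, x);
}

// After LLVM inlines the intermediate function, the same call sits in an AVX
// function and is emitted assuming AVX is available, where `__m256` arguments
// are passed in AVX registers. The non-AVX callee does not expect that.
// (This is effectively the version briansmith posted above.)
#[target_feature(enable = "avx")]
unsafe fn with_target_feature(x: __m256) {
    no_target_feature(0.0, x);
}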

This is an LLVM bug. What LLVM should do when inlining a call is remember the set of target features that were enabled at the place where the call was originally located, and then after inlining it can still generate a call with the right ABI.

The explicit list of target features to use is already given in the declaration/definition of the function being called

We can't check the target features of the declaration, since we don't know the declaration. Remember this has to work with function pointers.

apiraino commented 11 months ago

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-critical

briansmith commented 11 months ago

The simplest workable model is that when a function is actually inlined then it will be compiled as though it has the target features of the function it was inlined into.

This is not the problem. The problem is functions that are being called from the code that is being inlined. The behavior of the call operation depends on its context, and so inlining a call is unsound unless you can prove that in the other context you are moving the call to, the ABI used by this call will be the same.

Let's think this through. Let's say I have a function no_target_feature written in assembly language so that it is clear that Rust/LLVM code generation has no effect on how no_target_feature is supposed to be called.

Now consider two functions, one that has the AVX feature activated and one that doesn't. Each function calls no_target_feature. They need to generate the same machine code for calling no_target_feature, regardless of which target features they (the callers) have. How do each of the functions decide which calling convention/ABI to use? They have to use the declaration of no_target_feature to decide. There is no other choice.

From here I hope you can see that inlining shouldn't affect the way somebody calls no_target_feature.

Further, people writing code like this absolutely want all the inlining to be done. Especially because no_target_feature might be written in assembly language and so might be unsafe; I would then have a safe #[inline(always)] wrapper around it to make it convenient to use with NO performance cost. So solving whatever problems we have by prohibiting inlining would be terrible.
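The shape being described is roughly the following (a hypothetical sketch; asm_negate and its contract are made up for illustration):

extern "C" {
    // Hypothetical routine implemented in assembly; its calling convention is
    // fixed by the assembly source, not by how the Rust caller is compiled.
    fn asm_negate(out: *mut u8, input: *const u8, len: usize);
}

// Safe, intended-to-be-zero-cost wrapper: the whole point is that this always
// inlines into callers, leaving just the direct call to the assembly routine.
#[inline(always)]
pub fn negate(out: &mut [u8], input: &[u8]) {
    assert_eq!(out.len(), input.len());
    unsafe { asm_negate(out.as_mut_ptr(), input.as_ptr(), input.len()) }
}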

briansmith commented 11 months ago

It might also be instructive to consider __vectorcall. In MSVC, the parameter passing mechanism for vectors to a function differs depending on whether the function is declared as __vectorcall or not--not on how the caller is compiled. It has to be the same for no_target_feature in Rust (just as it is in Clang, IIUC); the function's declaration has to encode the information for how to pass it parameters.

RalfJung commented 11 months ago

I can't tell where you are disagreeing with me. Inlining would be fine if LLVM were able to represent "target features used for this call's ABI" separately from "target features available in this function". But since LLVM equates those two things, currently disabling inlining is the only sound option.

RalfJung commented 11 months ago

Now consider two functions, one that has the AVX feature activated and one that doesn't. Each function calls no_target_feature. They need to generate the same machine code for calling no_target_feature, regardless of which target features they (the callers) have. How do each of the functions decide which calling convention/ABI to use? They have to use the declaration of no_target_feature to decide. There is no other choice.

As I already said, using the features at declaration site is impossible. We don't know the declaration site, since these could be indirect calls via function pointers. You are pre-supposing that we add target features to our fn types so that the caller can know about them, or something like that? And then we'd still not be able to represent those call semantics to LLVM.

But I think we're completely talking past each other at this point. The inlining LLVM does is unsound, and I have explained above why. I never claimed that it was impossible to do inlining of such calls, I just claimed that LLVM's inlining is wrong and within the confines of what LLVM IR can represent at the moment, there's no sound way to do cross-target-feature inlining. Talking about assembly completely misses the point. So (I think) you're trying to convince me of something that I always agreed with.

briansmith commented 11 months ago

We don't know the declaration site, since these could be indirect calls via function pointers.

None of the examples here use function pointers. I think the discussion would be much clearer if they did.

A function pointer type has to encode the calling convention in order to allow a function to be called through it. That's why __vectorcall is a type modifier in MSVC so that you can do typedef __m256 (__vectorcall * vcfnptr)(double, double, double, double); and the like.

RalfJung commented 11 months ago

None of the examples here use function pointers.

Sure, but so what? They can be trivially rewritten to use function pointers, and any solution obviously has to handle function pointers.
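For instance, the critical call from the original example can be made indirect (a sketch; it is the earlier code with the callee taken by pointer):

#[target_feature(enable = "avx")]
unsafe fn with_target_feature(x: __m256) {
    // The fn-pointer type `extern "C" fn(f32, __m256)` carries no target
    // feature information, so there is no declaration the compiler could
    // consult to pick the callee's ABI for this indirect call.
    let f: extern "C" fn(f32, __m256) = no_target_feature;
    f(0.0, x);
}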

A function pointer type has to encode the calling convention

It should do that, yes. But it doesn't. That's the entire problem. (Well, that's https://github.com/rust-lang/rust/issues/116558. This one here is about inlining. But they share the common cause of "LLVM can't represent certain kinds of function calls".)

The function pointer type extern "C" fn(f32, __m256) does not encode whether AVX was used or not used to generate this function. For better or worse we stabilized __m256 and x86 target features without resolving this mess, so this is where we are now.

Furthermore, even if extern "C" fn(f32, __m256) did indicate whether AVX was used or not, LLVM currently (to my knowledge) has no way to use that information: when inside an AVX function, __m256 arguments are passed via AVX registers; even if the fn type tells us that it should be using a different ABI, we can't tell LLVM about that.

That's why I said above that LLVM needs an explicit way to control ABI. Then the inliner could do inlining without changing what a call means, and we could consider tracking target features in fn pointers to resolve #116558. (Though I'm not sure that's the solution we want. But it'd be a possible solution, at least.)

briansmith commented 11 months ago

Furthermore, even if extern "C" fn(f32, __m256) did indicate whether AVX was used or not, LLVM currently (to my knowledge) has no way to use that information: when inside an AVX function, __m256 arguments are passed via AVX registers; even if the fn type tells us that it should be using a different ABI, we can't tell LLVM about that.

Clang can do it. See https://clang.llvm.org/docs/AttributeReference.html#vectorcall. We also use __attribute__((sysv_abi)) in ring's C code already and we hope to move the callers from C to Rust (thus my interest in these issues).

The function pointer type extern "C" fn(f32, __m256) does not encode whether AVX was used or not used to generate this function. For better or worse we stabilized __m256 and x86 target features without resolving this mess, so this is where we are now.

That has nothing to do with inlining though. That issue should be solved independently of inlining. Function pointer types need to encode calling convention and it's simply wrong for it not to. You cannot solve this problem with some hack in Rust because a Rust program can pass a function pointer to C code and that C code needs to know which ABI to use to call that function.

RalfJung commented 11 months ago

Clang can do it. See https://clang.llvm.org/docs/AttributeReference.html#vectorcall. We also use __attribute__((sysv_abi)) in ring's C code already and we hope to move the callers from C to Rust (thus my interest in these issues).

How can that represent "use the ABI exactly like extern "C" but without AVX"? That's not just a binary flag, there are levels here: having SSE2, having AVX, having AVX2 -- each of these changes the ABI. A single boolean "vectorcall yes/no" is insufficient.

We have extern "vectorcall" fn (unstably). It's not enough.

That has nothing to do with inlining though. That issue should be solved independently of inlining. Function pointer types need to encode calling convention and it's simply wrong for it not to. You cannot solve this problem with some hack in Rust because a Rust program can pass a function pointer to C code and that C code needs to know which ABI to use to call that function.

C function pointers also don't have that information though so that case is completely impossible for us to resolve.

Also I don't think we're even talking about inlining any more. I explained above what's going on with inlining. The function pointer discussion could have been helpful since it illustrates the point (a lack of expressiveness in LLVM IR) but looks like it just confused everyone even more. Assembly and C are all completely irrelevant here, this is an LLVM IR bug. Let's leave function pointers to https://github.com/rust-lang/rust/issues/116558 then.

RalfJung commented 11 months ago

Here we can just sit and wait until LLVM fixes their bug. I'm not sure if we have any way of working around it from the Rust side. Never emit inline(always) if any argument could have its ABI affected? That's the only way I can think of that would work. I still think this is the best way to resolve the problem properly, but that might take a while...

Clang has the same kinds of features that we're trying to implement here and it seems to work the way I described, so yes.

I am fairly certain clang has exactly the same bug as we do. It's an LLVM bug, the frontend can do little to avoid it. I just don't know enough C to construct the example. How does one set target features? How does one get access to __m256? How does one set inline(always)?

workingjubilee commented 11 months ago

How does one get access to __m256?

Including the intrinsic header should be sufficient.

#include <immintrin.h>

How does one set target features? How does one set inline(always)?

C has real attributes now as of C23, and they are written [[vendor::attribute]], so:

[[gnu::target("avx2")]] [[clang::always_inline]]

Yes, clang respects certain gnu attributes with an in-theory-identical implementation; that's why they're namespaced in the first place. There's a much grosser way to write it but for your sanity I recommend the way that might not work on older compilers.

RalfJung commented 11 months ago

I started modifying some example by @chorman0773 to try to reproduce this with clang, but the compiler is realizing what I am doing and stopping me. Example Link

<source>:14:5: warning: AVX vector argument of type '__m256' (vector of 8 'float' values) without 'avx' enabled changes the ABI [-Wpsabi]
   14 |     no_target_feature(x, y);
      |     ^
<source>:18:5: error: AVX vector argument of type '__m256' (vector of 8 'float' values) without 'avx' enabled changes the ABI
   18 |     no_target_feature_intermediate(0.0, y);
      |     ^

I don't know under which exact conditions this is an error in clang, but it seems totally justified for Rust to also refuse to compile this code.

RalfJung commented 10 months ago

I am trying to catch LLVM in the act of moving a call instruction between functions with different target features, but so far I have not succeeded. Somehow when I translate the example here to LLVM IR and pass that to clang, it doesn't get the optimizations I am hoping for. Here is what I got so far -- does anyone have an idea how to produce such an example?

Here's another version, still doesn't get inlined though.

RalfJung commented 10 months ago

I think I finally got it. Not sure what is different about this than my previous attempts...

RalfJung commented 10 months ago

Here's an LLVM issue for the problem: https://github.com/llvm/llvm-project/issues/70563

sarah-ek commented 9 months ago

i found an example that doesn't use extern "C"

it should print (0, 1, 2, 3), but instead, when executed in release mode on the playground, it shows (0, 1, 206158430224, 140735013294640) for me

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=f9070ae872e66ba389fcba256e4f00fc

use core::arch::x86_64::__m256i;
use core::hint::black_box;
use core::mem::transmute;

#[allow(non_camel_case_types)]
#[derive(Copy, Clone, Debug)]
pub struct u64x4(u64, u64, u64, u64);

#[inline(never)]
#[target_feature(enable = "avx")]
unsafe fn return_as_is_avx(a: __m256i) -> __m256i {
    a
}

#[inline(never)]
unsafe fn return_as_is(a: u64x4) -> u64x4 {
    transmute(return_as_is_avx(transmute(a)))
}

#[target_feature(enable = "avx")]
#[inline]
unsafe fn imbue_avx<F: Fn()>(f: F) -> F::Output {
    f()
}

pub unsafe fn buggy() {
    imbue_avx(
        #[inline(always)]
        || {
            dbg!(return_as_is(black_box(u64x4(0, 1, 2, 3))));
        },
    );
}

pub fn main() {
    assert!(is_x86_feature_detected!("avx"));
    unsafe {
        buggy();
    }
}

RalfJung commented 9 months ago

On Zulip, someone suggested this might be due to LLVM turning a ptr argument into a by-val argument as an optimization.

(Please mention such observations when carrying issues from Zulip to Github, or else people will have to waste time re-discovering the same thing!)

sarah-ek commented 9 months ago

im not sure if that's what's causing the issue. even when passing the argument with multiple indirections and black_boxing the reference so it doesn't get promoted, i still get the same issue

https://godbolt.org/z/EaGxGjWhT

#[inline(never)]
#[target_feature(enable = "avx")]
unsafe fn return_as_is_avx(a: &&__m256i) -> u64x4 {
    transmute(**black_box(a))
}

#[inline(never)]
unsafe fn return_as_is(a: u64x4) -> u64x4 {
    return_as_is_avx(&&transmute(a))
}

this is the asm for return_as_is_avx, so it is performing the pointer dereferences

example::return_as_is_avx:
  mov qword ptr [rsp - 8], rsi
  lea rax, [rsp - 8]
  mov rax, qword ptr [rsp - 8]
  mov rax, qword ptr [rax]
  vmovaps ymm0, ymmword ptr [rax]
  vmovups ymmword ptr [rdi], ymm0
  vzeroupper
  ret

output

[/app/example.rs:30] return_as_is(black_box(u64x4(0, 1, 2, 3))) = u64x4(
    0,
    1,
    206158430224,
    140726132754736,
)

sarah-ek commented 9 months ago

this part looks suspicious to me

i might be misreading this, but it looks like return_as_is is expecting the input to be split in xmm0 and xmm1

example::return_as_is:
  push rbp
  mov rbp, rsp
  and rsp, -32
  sub rsp, 96
  movaps xmmword ptr [rsp + 48], xmm1  // <--
  movaps xmmword ptr [rsp + 32], xmm0  // <--
  lea rax, [rsp + 32]
  mov qword ptr [rsp + 24], rax
  lea rsi, [rsp + 24]
  call example::return_as_is_avx
  mov rsp, rbp
  pop rbp
  ret

but in imbue_avx it might be getting passed in one register ymm0 (no mention of xmm1 or ymm1)

example::imbue_avx:
  push r14
  push rbx
  sub rsp, 168
  vmovaps ymm0, ymmword ptr [rip + .LCPI6_0]
  vmovups ymmword ptr [rsp + 80], ymm0
  lea r14, [rsp + 80]
  vmovups ymm0, ymmword ptr [rsp + 80]  // <--
  lea rbx, [rsp + 136]
  mov rdi, rbx
  call example::return_as_is

RalfJung commented 9 months ago

Hm, strange. Maybe the ABI for closures is buggy and doesn't do the "ptr" indirection the way our normal ABI does.

sarah-ek commented 9 months ago

i don't think it's a closure issue, still happens if i get rid of it https://godbolt.org/z/cW9GdPWdM

RalfJung commented 9 months ago

Then the only other idea I have is that LLVM tries to optimize passing u64x4 (which is defined as a regular tuple struct here) but applies the optimization in an inconsistent way. That might be worth an LLVM bug report, if someone can turn this into an LLVM IR example.

Interestingly, one can even remove the target-feature from return_as_is_avx, the issue remains.

sarah-ek commented 9 months ago

Interestingly, one can even remove the target-feature from return_as_is_avx, the issue remains.

could you post an example? i can't reproduce this

RalfJung commented 9 months ago

Here you go: https://godbolt.org/z/nqf8Ee9PM

sarah-ek commented 9 months ago

thanks! i tried reproducing the issue outside of godbolt/playground and i noticed an interesting pattern.

this is the project structure

// src/lib.rs
use std::arch::x86_64::__m256i;
use std::hint::black_box;
use std::mem::transmute;

#[allow(non_camel_case_types)]
#[derive(Copy, Clone)]
pub struct u64x4(u64, u64, u64, u64);

#[inline(never)]
pub unsafe fn return_as_is_avx(a: &&__m256i) -> u64x4 {
    transmute(**black_box(a))
}

#[inline(never)]
pub unsafe fn return_as_is(a: u64x4) -> u64x4 {
    return_as_is_avx(&&transmute(a))
}

#[inline(always)]
pub unsafe fn buggy_intermediate() {
    let result = return_as_is(black_box(u64x4(13, 14, 15, 16)));
    println!("({}, {}, {}, {})", result.0, result.1, result.2, result.3)
}

#[target_feature(enable = "avx")]
#[inline(never)]
pub unsafe fn buggy_avx() {
    buggy_intermediate();
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    pub fn test_inner() {
        if !is_x86_feature_detected!("avx") {
            return;
        }
        unsafe { buggy_avx() };
    }
}

// tests/bug.rs
use abi_bug::*;

#[test]
pub fn test_outer() {
    if !is_x86_feature_detected!("avx") {
        return;
    }
    unsafe { buggy_avx() };
}

test_inner shows the wrong result, but test_outer shows the correct one.

after disassembling the test binaries, it looks like the one in src/lib.rs uses the fastcc calling convention, since it sees that everything is in the same crate. but when it's exported and used from tests/bug.rs, it uses the usual calling convention.

so this might be a bug with fastcc

sarah-ek commented 9 months ago

here's an example of the buggy llvm-ir (i think, im not very familiar with llvm)

https://godbolt.org/z/hTfKxcETY

sarah-ek commented 9 months ago

actually, i don't think fastcc is the issue. it looks like u64x4 is being turned into <4 x i64> at some point in the signature of return_as_is, which results in the abi mismatch between the caller and the callee

im not sure how it's being promoted to <4 x i64>, but since (i assume) it's rustc that generates this IR, maybe the bug is fixable on our side?

RalfJung commented 9 months ago

I'm pretty sure rustc doesn't automatically do such transformations, so it's likely an LLVM optimization.

sarah-ek commented 9 months ago

looks like you're right, with RUSTFLAGS="-C llvm-args=-print-after-all", it looks like the transformation is being done by the argument promotion pass