Open newpavlov opened 2 years ago
Yeah, the 32-bit cpuid implementation at https://doc.rust-lang.org/src/core/up/up/stdarch/crates/core_arch/src/x86/cpuid.rs.html#62 is compiling to something wrong. This is compiling to
0x565ba631 <+193>: mov %ebx,%eax
0x565ba633 <+195>: cpuid
0x565ba635 <+197>: xchg %eax,%ebx
on rustc 1.63.0 (4b91a6ea7 2022-08-08)
which of course does actually clobber %ebx, because apparently a lateout
register can overlap an explicitly-chosen inlateout
register. I don't know if that's an intentional LLVM thing, an LLVM bug, or what. According to godbolt this asm!() results in
%0 = tail call { i32, i32, i32, i32 } asm sideeffect "movl %ebx, ${0:k}\0Acpuid\0Axchgl %ebx, ${0:k}", "=r,={ax},={cx},={dx},1,2,~{memory}"(i32 1, i32 1) #1, !dbg !10, !srcloc !16
Yeah, after your post in the RustCrypto issue I also tried it in godbolt and got the same weird assembly: https://rust.godbolt.org/z/4z9nrY1Eh
Rust:
use std::arch::x86::__cpuid;
pub unsafe fn foo() -> u32 {
// get manufacture ID
let res = __cpuid(0);
res.ebx
}
Generated assembly which is obviously wrong:
example::foo:
xor eax, eax ; set EAX to zero
xor ecx, ecx ; set ECX to zero
mov eax, ebx ; "cache" calee-saved EBX to EAX
cpuid ; WHOOPS not only we now use EAX which contains EBX's value,
; but also CPUID overwrites EAX, making the caching useless
xchg eax, ebx ; move EBX into EAX as return result
ret
Bad news is that on x68-64 we get the same wrong assembly:
example::foo:
xor eax, eax
xor ecx, ecx
mov rax, rbx
cpuid
xchg rax, rbx
ret
I think correct codegen should cache EBX to stack. Honestly, I am amazed that this issue has not surfaced earlier.
If foo
returns res.eax
(or any other register) instead of res.ebx
codegen works correctly:
example::foo:
push esi
xor eax, eax
xor ecx, ecx
mov esi, ebx
cpuid
xchg esi, ebx
pop esi
ret
It's supposed to pick a different register and save it there, and that both should work (with proper clobbers) and should be better than pushing to the stack. Changing it to out(reg) ebx,
works in godbolt, though I can't swear that it is guaranteed as opposed to just avoiding this example. out(reg) ebx, lateout("eax") eax,
does still work, so it presumably still won't be able to conflict with the non-input edx
. I'm still not sure if this is an LLVM bug or intentionally part of the rules of inline asm (constraint rules have never been very clear).
https://rust.godbolt.org/z/Gr1ve77a6 suggests that it might be that LLVM considers unused lateout arguments to be fair game to delete along with the clobber.
I think the issue is with incorrect use of lateout(reg) ebx,
together with inlateout("eax") leaf => eax,
. Because of the late
specifiers LLVM assumes that it's legal to use the same register for both of them (i.e. it assigns {0}
to eax
).
This code produces incorrect result: https://rust.godbolt.org/z/j9v6v69Pz
But by changing lateout(reg) ebx,
to out(reg) ebx,
we get correct assembly: https://rust.godbolt.org/z/nWEEa9YsE
Yeah, the thing is that that doesn't really make sense in general - lateout(reg) ebx, lateout("eax") eax
(which still fails) really shouldn't be able to assign them to be the same register any more than it could for lateout(reg) x, lateout(reg) y
.
Indeed, LLVM does something weird with lateout
registers when one of them does not get used.
That one is probably just pop arbitrary_register
- it needs to have push/poped something around the asm for stack alignment reasons since it isn't marked nostack.
You are right, options(nostack)
removes the weird pop/push.
A better example, which probably can be used as a minified demonstration of the issue: https://rust.godbolt.org/z/3xMenGjMe
pub fn foo() -> u32 {
let t1: u32;
let t2: u32;
unsafe {
asm!(
"mov {0}, 1",
"mov eax, 42",
lateout(reg) t1,
lateout("eax") t2, // `t2` can be replaced by `_`
options(nostack),
);
}
t1
}
example::foo:
mov rax, 1
mov eax, 42
ret
Either way, the late
specifiers are useless in the case of CPUID intrinsics and I think they should be simply removed. We only need one register to cache EBX and return result stored in it, which is not EAX, EBX, ECX, or EDX. There are no opportunities for register reuse.
For what it's worth, the original asm!
block seems to me like it would misbehave if {0}
happened to be assigned as ebx
itself. Why does it do this backup and restore rather than just specifying ebx
as a clobber a constraint for the output?
But that's not what's happening here, so it does seem like LLVM is miscompiling the code.
You can't specify ebx
as a constraint because LLVM uses it internally for something (maybe exactly this plt cache? I forget) and doesn't handle clobbers of it like a sane compiler (iirc you get compile errors sometimes).
That said, I'm not sure that that's true anymore - while I vaguely recall some sort of error along those lines, I can't trigger one from the PLT usage of ebx
. https://github.com/rust-lang/rust/issues/90083 doesn't give details and I can't find them quickly re LLVM/clang.
Oh it's 64-bit that uses rbx and 32-bit uses esi, so the 32-bit cpuid could just do that.
Trying to clobber rbx
in the usual way currently results in the following compilation error:
error: cannot use register `bx`: rbx is used internally by LLVM and cannot be used as an operand for inline asm
Honestly, it's quite annoying and I wish we did not have such exception.
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-critical T-compiler
Does the change that landed in rust-lang/stdarch#1329 resolve this? (In other words, do we just need to pull in those changes in some manner to resolve this issue?)
Or is there something else that will need to happen?
also, given that this is a P-critical issue that affects 1.57 and later: Should we be talking about beta backports of the changes associated with fixing this?
That change should resolve this, yes.
The issue affects all uses of asm!
where lateout
operands are used. As far as I can tell cpuid
is the only one that is affected in the standard library, but uses of asm!
in other crates could also be affected by this bug.
Issue was visited during T-compiler meeting on Zulip (notes). Demoting priority to P-high
since #101861 reduced the impact and the LLVM upstream fix seems to be formalized
@rustbot label -P-critical P-high
T-compiler P-high review. Nikita volunteered to be default assignee for LLVM miscompilations.
@rustbot assign @nikic
(issue is loosely owned, in terms of P-high tracking, by @wesleywiser and @pnkfelix keeping their eyes on https://github.com/llvm/llvm-project/issues/57550)
Updated bug description
The following Rust function:
Gets compiled into this obviously incorrect assembly:
Godbolt link: https://rust.godbolt.org/z/Yb9v7WobM
LLVM incorrectly reuses register for a pair of
lateout
s if it can see that one of those does not get used later.Original description below
We get spurious segfaults in the
chacha20
crate when we run tests compiled for i686 target (i686-unknown-linux-gnu
to be exact), see https://github.com/RustCrypto/stream-ciphers/issues/304 for more information. Interestingly enough, Rust 1.56 does not have this issue, only 1.57 and later. Changes in thecpufeatures
crate which revealed the issue look quite innocent. The issue also disappears ifzeroize
feature is disabled, which is quite tangential tocpufeatures
(it only addsDrop
impls). Even weirder, @aumetra's findings show that segfault happens at the following line:Granted, the
chacha20
crate contains a fair bit of unsafe code, as well as its dependencies, so the issue may be caused by unsoundness somewhere in our crates. But the circumstantial evidence makes a potential compiler bug quite probable.