rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.8k stars 12.77k forks source link

with_exposed_provenance(0).with_addr(addr) is compiled as gep null #131741

Open saethlin opened 1 month ago

saethlin commented 1 month ago

Miri executes this program without error, but I believe our lowering to LLVM IR adds UB:

#![feature(strict_provenance, exposed_provenance)]
#[no_mangle]
pub fn from_exposed_null(addr: usize) -> *const u8 {
    let ptr = 0x0 as *const u8;
    ptr.with_addr(addr)
}

// Main for Miri
fn main() {
    let data = 0u8;
    let addr = std::ptr::addr_of!(data).expose_provenance();
    let ptr = from_exposed_null(addr);
    unsafe {
        println!("{}", *ptr);
    }
}

With optimizations enabled, LLVM cleans up from_exposed_null to

define noalias noundef ptr @from_exposed_null(i64 noundef %addr) unnamed_addr #0 !dbg !7 {
  %0 = getelementptr i8, ptr null, i64 %addr, !dbg !12
  ret ptr %0, !dbg !28
}

With -Cno-prepopulate-passes I believe the codegen also returns a pointer that definitely has no provenance, though it's just harder to read. godbolt: https://godbolt.org/z/s414rfK44

I think this happens because codegen is implicitly const-propagating through the int-to-pointer cast via OperandValue. At least I don't see any other way to get from this MIR:

    bb0: {
        _2 = const 0_usize as *const u8 (PointerWithExposedProvenance);
        StorageLive(_3);
        StorageLive(_5);
        StorageLive(_6);
        StorageLive(_4);
        _4 = copy _2 as usize (Transmute);
        _3 = move _4 as isize (IntToInt);
        StorageDead(_4);
        _5 = copy _1 as isize (IntToInt);
        _6 = Sub(copy _5, copy _3);
        _0 = arith_offset::<u8>(move _2, move _6) -> [return: bb1, unwind unreachable];
    }

    bb1: {
        StorageLive(_7);
        _7 = copy _0 as *const () (PtrToPtr);
        StorageDead(_7);
        StorageDead(_6);
        StorageDead(_5);
        StorageDead(_3);
        return;
    }

To this LLVM IR

define noundef ptr @from_exposed_null(i64 noundef %addr) unnamed_addr #0 !dbg !7 {
start:
  %0 = alloca [8 x i8], align 8, !dbg !12
  %offset = sub i64 %addr, 0, !dbg !12
  call void @llvm.lifetime.start.p0(i64 8, ptr %0), !dbg !28
  %1 = getelementptr i8, ptr null, i64 %offset, !dbg !28
  store ptr %1, ptr %0, align 8, !dbg !28
  %self = load ptr, ptr %0, align 8, !dbg !28
  call void @llvm.lifetime.end.p0(i64 8, ptr %0), !dbg !28
  ret ptr %self, !dbg !34
}

godbolt: https://godbolt.org/z/MjPvcdj9h

jwong101 commented 1 month ago

This ends up not mattering, since LLVM also folds inttoptr i64 0 to ptr to null:

define void @src(i64 noundef) {
start:
  %1 = inttoptr i64 0 to ptr
  %2 = getelementptr i8, ptr %1, i64 %0
  store i8 0, ptr %2
  ret void
}

With -passes='early-cse,instcombine' -debug:

; EarlyCSE Simplify:   %1 = inttoptr i64 0 to ptr  to: ptr null
define void @src(i64 noundef %0) {
  %1 = getelementptr i8, ptr null, i64 %0
  store i8 poison, ptr %1, align 1
  ret void
}

Alive2 thinks this is illegal:

https://alive2.llvm.org/ce/z/EJxazG

However, their LangRef also says that:

  • An integer constant other than zero or a pointer value returned from a function not defined within LLVM may be associated with address ranges allocated through mechanisms other than those provided by LLVM.
  • A pointer value formed from a scalar getelementptr operation is based on the pointer-typed operand of the getelementptr.
  • A pointer value formed by an inttoptr is based on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value.

So they might have a special case where inttoptr i64 0 to ptr has no provenance? I'm not sure what their intended semantics are for that.

saethlin commented 1 month ago

This ends up not mattering

It definitely matters. There's no hope of the full compiler getting this right if rustc corrupts the program. We should have an LLVM bug filed for this; can you file one or link an existing issue?

jwong101 commented 1 month ago

There appears to be another issue about this in the UCG repo here: https://github.com/rust-lang/unsafe-code-guidelines/issues/507.

Also, I think this is a similar issue about i2p 0 != null: https://github.com/rust-lang/rust/issues/107326

On the LLVM side:

I'll see if I can find any issues about i2p 0 != null. If not I'll open a new issue.

I think the underlying cause is this line of code: https://github.com/llvm/llvm-project/blob/ec78f0da0e9b1b8e2b2323e434ea742e272dd913/llvm/lib/IR/ConstantFold.cpp#L146

  if (V->isNullValue() && !DestTy->isX86_AMXTy() &&
      opc != Instruction::AddrSpaceCast)
    return Constant::getNullValue(DestTy);

which gets called by simplifyInstruction on int2ptr.

RalfJung commented 2 weeks ago

Cc @rust-lang/opsem @nikic

nikic commented 2 weeks ago

Yeah, the fact that inttoptr 0 folds to null is a "well known" issue in LLVM, and as usual, hard to fix :) We do go out of the way not to produce ptrtoint 0 in cases where this is likely to matter in practice.

See also https://github.com/AliveToolkit/alive2/issues/929 for some related discussion from an alive2 perspective. I guess the memset(0) question at the end is resolved on the Rust side by saying that transmuting that memset to a pointer wouldn't have provenance anyway, but on the LLVM side this is still an unresolved issue (cf byte type).