rust-lang / wg-allocators

Home of the Allocators working group: Paving a path for a standard set of allocator traits to be used in collections!
http://bit.ly/hello-wg-allocators
203 stars 9 forks source link

Bug Report: Incorrect Optimization on `Global.allocate{_zeroed}`+`core::mem::forget` Loop #119

Open Coekjan opened 8 months ago

Coekjan commented 8 months ago

Recently, I tried to write a unit-test about memory allocation & deallocation for my embedded system, with usage of Global.allocate_zeroed (in fact Global.allocate also reproduces this bug) and #[global_allocator]. When I wanted to test the behavior of memory-exhaustion, I found that rustc seems to "optimized" my loop into dead-loop. Details are as follows.

I tried this code (as a sample that can reproduce the bug) in current nightly:

#![feature(allocator_api)]

use std::{alloc::{Global, Allocator, Layout}, sync::Arc, mem::forget};

fn main() {
    while let Ok(f) = Global.allocate_zeroed(Layout::new::<u8>()).map(|ptr| {
        Arc::new(ptr.cast::<u8>())
    }) {
        // Allocate memory and forget it (prevent it from deallocaion)
        forget(f)
    }
    // It indicates memory-exhaustion if the program reaches here.  
}

I expected to see this happen: main function reaches its end (not dead loop)

Instead, this happened: main function never returns

I used RUSTFLAGS='--emit=llvm-ir' to check its LLVM-IR, which seemed to show that the allocating-loop was optimized wrongly into a dead-loop.

; ModuleID = 'rust_global_allocator_test.f136ea4df42ba696-cgu.0'
source_filename = "rust_global_allocator_test.f136ea4df42ba696-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@vtable.0 = private unnamed_addr constant <{ ptr, [16 x i8], ptr, ptr, ptr }> <{ ptr @"_ZN4core3ptr85drop_in_place$LT$std..rt..lang_start$LT$$LP$$RP$$GT$..$u7b$$u7b$closure$u7d$$u7d$$GT$17hc6c148eab5de7a06E", [16 x i8] c"\08\00\00\00\00\00\00\00\08\00\00\00\00\00\00\00", ptr @"_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17hf8901dcec16a4c23E", ptr @"_ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h4ebbb110649ea1cfE", ptr @"_ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h4ebbb110649ea1cfE" }>, align 8
@__rust_no_alloc_shim_is_unstable = external global i8

; std::sys_common::backtrace::__rust_begin_short_backtrace
; Function Attrs: noinline nonlazybind uwtable
define internal fastcc void @_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17hc156c1d87387bc95E(ptr nocapture noundef nonnull readonly %f) unnamed_addr #0 {
start:
  tail call void %f()
  tail call void asm sideeffect "", "~{memory}"() #7, !srcloc !4
  ret void
}

; std::rt::lang_start
; Function Attrs: noinline nonlazybind uwtable
define hidden noundef i64 @_ZN3std2rt10lang_start17h317e8b9ef2918859E(ptr noundef nonnull %main, i64 noundef %argc, ptr noundef %argv, i8 noundef %sigpipe) unnamed_addr #0 {
start:
  %_8 = alloca ptr, align 8
  call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %_8)
  store ptr %main, ptr %_8, align 8
; call std::rt::lang_start_internal
  %0 = call noundef i64 @_ZN3std2rt19lang_start_internal17h84518abf65cfd370E(ptr noundef nonnull align 1 %_8, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) @vtable.0, i64 noundef %argc, ptr noundef %argv, i8 noundef %sigpipe)
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %_8)
  ret i64 %0
}

; std::rt::lang_start::{{closure}}
; Function Attrs: inlinehint nonlazybind uwtable
define internal noundef i32 @"_ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h4ebbb110649ea1cfE"(ptr noalias nocapture noundef readonly align 8 dereferenceable(8) %_1) unnamed_addr #1 {
start:
  %_4 = load ptr, ptr %_1, align 8, !nonnull !5, !noundef !5
; call std::sys_common::backtrace::__rust_begin_short_backtrace
  tail call fastcc void @_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17hc156c1d87387bc95E(ptr noundef nonnull %_4)
  ret i32 0
}

; core::ops::function::FnOnce::call_once{{vtable.shim}}
; Function Attrs: inlinehint nonlazybind uwtable
define internal noundef i32 @"_ZN4core3ops8function6FnOnce40call_once$u7b$$u7b$vtable.shim$u7d$$u7d$17hf8901dcec16a4c23E"(ptr nocapture noundef readonly %_1) unnamed_addr #1 personality ptr @rust_eh_personality {
start:
  %0 = load ptr, ptr %_1, align 8, !nonnull !5, !noundef !5
; call std::sys_common::backtrace::__rust_begin_short_backtrace
  tail call fastcc void @_ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17hc156c1d87387bc95E(ptr noundef nonnull %0), !noalias !6
  ret i32 0
}

; core::ptr::drop_in_place<std::rt::lang_start<()>::{{closure}}>
; Function Attrs: inlinehint mustprogress nofree norecurse nosync nounwind nonlazybind willreturn memory(none) uwtable
define internal void @"_ZN4core3ptr85drop_in_place$LT$std..rt..lang_start$LT$$LP$$RP$$GT$..$u7b$$u7b$closure$u7d$$u7d$$GT$17hc6c148eab5de7a06E"(ptr noalias nocapture readnone align 8 %_1) unnamed_addr #2 {
start:
  ret void
}

; rust_global_allocator_test::main
; Function Attrs: nofree norecurse noreturn nounwind nonlazybind memory(readwrite, argmem: none) uwtable
define internal void @_ZN26rust_global_allocator_test4main17h8211586a067c7b05E() unnamed_addr #3 personality ptr @rust_eh_personality {
start:
  br label %bb1

bb1:                                              ; preds = %bb1, %start
  %0 = load volatile i8, ptr @__rust_no_alloc_shim_is_unstable, align 1, !noalias !9
  br label %bb1
}

; std::rt::lang_start_internal
; Function Attrs: nonlazybind uwtable
declare noundef i64 @_ZN3std2rt19lang_start_internal17h84518abf65cfd370E(ptr noundef nonnull align 1, ptr noalias noundef readonly align 8 dereferenceable(24), i64 noundef, ptr noundef, i8 noundef) unnamed_addr #4

; Function Attrs: nonlazybind uwtable
declare noundef i32 @rust_eh_personality(i32 noundef, i32 noundef, i64 noundef, ptr noundef, ptr noundef) unnamed_addr #4

; Function Attrs: nonlazybind
define i32 @main(i32 %0, ptr %1) unnamed_addr #5 {
top:
  %2 = sext i32 %0 to i64
; call std::rt::lang_start
  %3 = tail call i64 @_ZN3std2rt10lang_start17h317e8b9ef2918859E(ptr @_ZN26rust_global_allocator_test4main17h8211586a067c7b05E, i64 %2, ptr %1, i8 0)
  %4 = trunc i64 %3 to i32
  ret i32 %4
}

; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite)
declare void @llvm.lifetime.start.p0(i64 immarg, ptr nocapture) #6

; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite)
declare void @llvm.lifetime.end.p0(i64 immarg, ptr nocapture) #6

attributes #0 = { noinline nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #1 = { inlinehint nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #2 = { inlinehint mustprogress nofree norecurse nosync nounwind nonlazybind willreturn memory(none) uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #3 = { nofree norecurse noreturn nounwind nonlazybind memory(readwrite, argmem: none) uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #4 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #5 = { nonlazybind "target-cpu"="x86-64" }
attributes #6 = { mustprogress nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) }
attributes #7 = { nounwind }

!llvm.module.flags = !{!0, !1, !2}
!llvm.ident = !{!3}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 7, !"PIE Level", i32 2}
!2 = !{i32 2, !"RtLibUseGOT", i32 1}
!3 = !{!"rustc version 1.75.0-nightly (2f1bd0729 2023-10-27)"}
!4 = !{i32 2481009}
!5 = !{}
!6 = !{!7}
!7 = distinct !{!7, !8, !"_ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h4ebbb110649ea1cfE: %_1"}
!8 = distinct !{!8, !"_ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h4ebbb110649ea1cfE"}
!9 = !{!10}
!10 = distinct !{!10, !11, !"_ZN5alloc5boxed12Box$LT$T$GT$3new17hc4eca2e43385ad69E: %x"}
!11 = distinct !{!11, !"_ZN5alloc5boxed12Box$LT$T$GT$3new17hc4eca2e43385ad69E"}
Amanieu commented 8 months ago

This is an allowed optimization for Global: the compiler can see that the allocation is unused and optimizes it away. This is documented here: https://doc.rust-lang.org/nightly/alloc/alloc/trait.GlobalAlloc.html#safety

However if you call your allocator directly without going through #[global_allocator] then this optimization doesn't apply.

elichai commented 8 months ago

The docs here state: https://doc.rust-lang.org/beta/std/alloc/trait.GlobalAlloc.html#safety

You must not rely on allocations actually happening, even if there are explicit heap allocations in the source. The optimizer may detect unused allocations that it can either eliminate entirely or move to the stack and thus never invoke the allocator. The optimizer may further assume that allocation is infallible, so code that used to fail due to allocator failures may now suddenly work because the optimizer worked around the need for an allocation.

I think removing unnecessary heap allocations is an important enough optimization to put these kind of restrictions on GlobalAlloc If you want to test your allocator, passing the object through core::hint::black_box before forgetting should prevent the optimizer from removing the allocation

Coekjan commented 8 months ago

This is an allowed optimization for Global: the compiler can see that the allocation is unused and optimizes it away. This is documented here: https://doc.rust-lang.org/nightly/alloc/alloc/trait.GlobalAlloc.html#safety

However if you call your allocator directly without going through #[global_allocator] then this optimization doesn't apply.

Thanks for your explanation!

But when I define the #[global_allocator] in an other crate (as one of the dependencies), this optimization still happens.

Coekjan commented 8 months ago

The docs here state: https://doc.rust-lang.org/beta/std/alloc/trait.GlobalAlloc.html#safety

You must not rely on allocations actually happening, even if there are explicit heap allocations in the source. The optimizer may detect unused allocations that it can either eliminate entirely or move to the stack and thus never invoke the allocator. The optimizer may further assume that allocation is infallible, so code that used to fail due to allocator failures may now suddenly work because the optimizer worked around the need for an allocation.

I think removing unnecessary heap allocations is an important enough optimization to put these kind of restrictions on GlobalAlloc If you want to test your allocator, passing the object through core::hint::black_box before forgetting should prevent the optimizer from removing the allocation

Thanks for your advice! I will try the black_box later.

Coekjan commented 8 months ago

core::hint::black_box helps to prevent the optimization.

However, I am still confused by the optimizer behavior. When I mark my allocator with #[global_allocator] inside my binary crate, the optimization doesn't apply (my test function reaches its endpoint). Whereas, when I move the allocator outside of my crate as a library, this optimization applies (my test function is optimized into a dead-loop).