rust-lang / miri

An interpreter for Rust's mid-level intermediate representation

Massive slowdown caused by stacked borrows on code involving raw pointers + UnsafeCell #3735

Open newpavlov opened 1 month ago

newpavlov commented 1 month ago

I was trying to minimize a massive slowdown observed in our code and arrived at the following snippet:

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};

const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;

type AtomicPage = [AtomicU32; PAGE_SIZE];

#[test]
fn init_atomic() {
    let pages: [*mut AtomicPage; PAGES] = [(); PAGES]
        .map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() })
        .map(Box::into_raw);

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = unsafe { &*(pages[page_idx]) };
        page[inpage_idx].store(u32::MAX, Release);
    }

    for page_ptr in pages {
        drop(unsafe { Box::from_raw(page_ptr) });
    }
}

On my laptop I get the following results (more than 50x slowdown):

$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real    0m5,049s
user    0m4,786s
sys 0m0,265s
$ time cargo miri test

real    4m19,256s
user    4m18,831s
sys 0m0,334s

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

saethlin commented 1 month ago

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

I cannot reproduce this result. Can you post the version that uses Box<AtomicPage>?

newpavlov commented 1 month ago
#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        pages[page_idx][inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real    0m5,215s
user    0m4,970s
sys 0m0,245s

$ time cargo miri test

real    0m7,015s
user    0m6,743s
sys 0m0,274s

Interestingly, keeping the page variable results in a ~30x slowdown:

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = &pages[page_idx];
        page[inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real    0m5,174s
user    0m4,862s
sys 0m0,314s

$ time cargo miri test

real    2m26,782s
user    2m26,401s
sys 0m0,361s
saethlin commented 1 month ago

Interestingly, keeping the page variable results in a ~30x slowdown:

There is an extremely subtle difference between those two programs. You are specifying the type to be &AtomicPage, but without the annotation that variable would be a &Box<AtomicPage>. Specifying the type changes the borrowing structure of the program by creating additional SharedReadWrite tags in the bytes of the AtomicPages, which stresses this loop:

https://github.com/rust-lang/miri/blob/8f40dd256384883b78700bc348e7f5f272f131ca/src/borrow_tracker/stacked_borrows/mod.rs#L207-L215

saethlin commented 1 month ago

Hunh. In fact adding the type annotation at all, even if it's fully redundant, still produces an extra reborrow. That is new to me.

saethlin commented 1 month ago

Ah of course, the primary difficulty in debugging this performance issue is that the GC is doing its absolute best to cover up the problem. Perhaps a better minimization is this:

use std::cell::UnsafeCell;

const ITEMS: usize = 1 << 12;

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];

    for _ in 0..ITEMS {
        // Each iteration reborrows `page`, pushing a new tag.
        let _page = &page;
    }
}

When run with -Zmiri-provenance-gc=0 this has clearly super-linear runtime: at 1 << 10 iterations I get 2.4 seconds, at 1 << 11 iterations 7.8 seconds, and at 1 << 12 iterations 25 seconds.

This is very similar to our slice-get-unchecked benchmark: https://github.com/rust-lang/miri/blob/8f40dd256384883b78700bc348e7f5f272f131ca/bench-cargo-miri/slice-get-unchecked/src/main.rs#L5-L12 In both cases the program stresses a code path that in Stacked Borrows is technically a linear search of ever-growing length.

But for this search we don't have a stack cache, so the only mitigation against a true runtime explosion is the GC.

So I have two observations. First, all the borrow stacks here have a Unique base tag and are SharedReadWrite all the way up; if stacks had a bool indicating this property, we could skip the loop that makes the runtime overhead here so dramatic. Second, I suspect this property might be pretty common: shouldn't most atomics only be accessed through SharedReadWrite?

I wanted to print the stacks for this program and I got this:

use std::cell::UnsafeCell;

extern "Rust" {
    fn miri_get_alloc_id(ptr: *const u8) -> u64;
    fn miri_print_borrow_state(alloc_id: u64, show_unnamed: bool);
}

fn get_alloc_id(ptr: *const u8) -> u64 {
    unsafe { miri_get_alloc_id(ptr) }
}

fn print_borrow_stacks(alloc_id: u64) {
    unsafe {
        miri_print_borrow_state(alloc_id, /* ignored: show_unnamed */ false)
    }
}

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];
    let id = get_alloc_id(page.as_ptr().cast());
    print_borrow_stacks(id);
}
...
4091..4092: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4092..4093: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4093..4094: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4094..4095: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4095..4096: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.

RalfJung commented 1 month ago

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.

Yes, but that only kicks in when we iterate the stacks and notice that two adjacent stacks are equal. It also limits how much comparing and merging it does, to avoid unnecessary overhead.

But here nothing at all seems to be merged, not even the first bytes:

0..1: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
1..2: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
2..3: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
...

That is indeed strange.