rust-lang / rust

Compiler spends all available memory when large array is initialized with lazy_static #93215

Open hniksic opened 2 years ago

hniksic commented 2 years ago

I tried this code:

use std::sync::Mutex;
use lazy_static::lazy_static;

const ROWS: usize = 100000;
const COLS: usize = 10000;

lazy_static! {
    static ref TWODARRAY: Mutex<[[u128; COLS]; ROWS]> = Mutex::new([[0; COLS]; ROWS]);
} 

fn main() {
    let mut twodarray = TWODARRAY.lock().unwrap();
}

I expected it to compile successfully.

Instead, the compiler spent all available RAM and swap, and I had to kill it. From top:

26208 hniksic   20   0  0.102t 7.939g  86392 S  96.8 58.0   0:04.78 rustc                                               

Meta

rustc --version --verbose:

rustc 1.58.1 (db9d1b20b 2022-01-20)
binary: rustc
commit-hash: db9d1b20bba1968c1ec1fc49616d4742c1725b4b
commit-date: 2022-01-20
host: x86_64-unknown-linux-gnu
release: 1.58.1
LLVM version: 13.0.0

saethlin commented 2 years ago

In fairness to rustc, that array is 14.9 GB. If you only have 8 GB of memory, you'll never be able to run this program even if you do manage to compile it.
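
(For reference: ROWS × COLS × size_of::<u128>() = 100000 × 10000 × 16 bytes = 16,000,000,000 bytes ≈ 14.9 GiB.)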

And if you do have enough memory to get through whatever's going on here, it looks to me like the incremental compilation system isn't prepared to handle this:

thread 'rustc' panicked at 'assertion failed: pos <= u32::MAX as usize', compiler/rustc_query_impl/src/on_disk_cache.rs:119:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: internal compiler error: unexpected panic

note: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md

note: rustc 1.60.0-dev running on x86_64-unknown-linux-gnu

note: compiler flags: -C embed-bitcode=no -C debuginfo=2 -C incremental --crate-type bin

note: some of the compiler flags provided by cargo are hidden

query stack during panic:
end of query stack

(this is a local build because I want debug symbols)

All that said, I feel like this shouldn't be so memory-intensive. The memory consumption appears to come from the MIR interpreter's InitMask... I won't claim that with any confidence because I'm truly lost in this area of the code, but I'm looking.

eddyb commented 2 years ago

The memory consumption appears to come from the MIR interpreter's InitMask...

So I believe what you have here is a very large static that's mostly uninitialized (except for presumably the Once in it?). I forget the current implementation of InitMask, but if it's one bit per byte, the 14.9GiB static will require a 1.86GiB InitMask.
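
(At one bit per byte, a 16,000,000,000-byte allocation needs 2,000,000,000 bytes of mask, which is about 1.86 GiB.)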

We could maybe reduce the cost of some of that tracking, but it's still not great overall.

Your executable would likely also need to be 14.9GiB in size, because the static isn't fully uninitialized, so AFAIK it can't go into .bss and instead has to be fully present in .data (unless LLVM gets clever?).


What you may want to do instead is Box your very large 2D array.
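
One possible shape for that (a sketch; the Mutex<Box<[[u128; COLS]]>> form here is just one option, not necessarily the exact playground code):

use std::sync::Mutex;
use lazy_static::lazy_static;

const ROWS: usize = 100000;
const COLS: usize = 10000;

lazy_static! {
    // Heap-allocate the rows so the static itself stays tiny; the 16 GB
    // buffer is only allocated at runtime, when TWODARRAY is first touched.
    static ref TWODARRAY: Mutex<Box<[[u128; COLS]]>> =
        Mutex::new(vec![[0u128; COLS]; ROWS].into_boxed_slice());
}

fn main() {
    let twodarray = TWODARRAY.lock().unwrap();
    assert_eq!(twodarray.len(), ROWS);
}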

That results in successful compilation and this runtime message (on the playground):

memory allocation of 16000000000 bytes failed

saethlin commented 2 years ago

Ah, the allocations are all happening along a code path that calls copy_op. So I suspect there's somehow a copy happening in the MIR of this array? And... a lot of copies??? (I clearly don't understand much of MIR). The Size being passed to InitMask::grow when they're created makes sense.

Peak memory of the build is 44.8 GB, btw.

oli-obk commented 2 years ago

Yea, there are a few extra copies of values happening that we could probably avoid, but we always need at least two at some point during CTFE.

It may help to implement a "deinit source on move" scheme; currently, CTFE treats Operand::Move and Operand::Copy the same.
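
To illustrate the idea with a toy model (hypothetical types only; nothing here corresponds to the real rustc_const_eval API):

// Toy model of "deinit source on move"; purely illustrative.
#[derive(Clone)]
struct ToyAlloc {
    bytes: Vec<u8>,
    init: Vec<bool>, // one init flag per byte, like a naive InitMask
}

enum Operand {
    Copy(usize), // index of the source allocation
    Move(usize),
}

fn assign(allocs: &mut Vec<ToyAlloc>, dest: usize, src: Operand) {
    match src {
        Operand::Copy(s) => {
            let copied = allocs[s].clone();
            allocs[dest] = copied;
        }
        Operand::Move(s) => {
            let copied = allocs[s].clone();
            allocs[dest] = copied;
            // Deinit the source: it will never be read again, so its bytes and
            // init tracking can be reclaimed instead of keeping two
            // fully-initialized copies alive for the rest of the evaluation.
            allocs[s].bytes.clear();
            allocs[s].init.clear();
        }
    }
}

fn main() {
    let big = ToyAlloc { bytes: vec![0; 16], init: vec![true; 16] };
    let empty = ToyAlloc { bytes: Vec::new(), init: Vec::new() };
    let mut allocs = vec![big, empty];
    assign(&mut allocs, 1, Operand::Move(0));
    assert!(allocs[0].bytes.is_empty());
    assert_eq!(allocs[1].bytes.len(), 16);
}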

eddyb commented 2 years ago

So I suspect there's somehow a copy happening in the MIR of this array?

It's likely something like (Once, MaybeUninit<Mutex<[[T; N]; M]>>) - so the big array is in the same allocation, but that's not the thing being allocated and copied around, merely an uninitialized field of it.

The InitMask doesn't have to be that big, if the uninitialized data is a suffix of the allocation, but managing InitMask in the presence of clever encodings isn't easy. Supporting only an uninitialized suffix might be possible?

saethlin commented 2 years ago

OP's example is insufficiently minimized. A build of this program peaks at 33.6 GB:

const ROWS: usize = 100000;
const COLS: usize = 10000;

static TWODARRAY: [[u128; COLS]; ROWS] = [[0; COLS]; ROWS];

fn main() {}

Perhaps that makes sense: if we're making even one copy of this huge array, the peak would be twice the size of the array. Right?

hniksic commented 2 years ago

@saethlin

In fairness to rustc, that array is 14.9 GB. If you only have 8 GB of memory, you'll never be able to run this program even if you do manage to compile it.

I do have more than 8 GiB of memory, but rustc was spending 100+ GiB, which exceeds array size by far. And even if it didn't, I would hope to be able to allocate large arrays without the compiler having to reproduce the allocation. I often compile programs locally and then deploy them on servers with much more RAM, more CPU cores, etc.

@eddyb

What you may want to do instead is Box your very large 2D array.

I neglected to note in the issue that I am aware of the possibility of heap-allocating the array. In this case, however, I was explicitly experimenting with allocating the large array statically, akin to a C declaration of static int twodarray[ROWS][COLS]. It surprised me that attempting to do so broke the compiler, so I wanted to report it so it can hopefully get fixed or at least looked at.

Mark-Simulacrum commented 2 years ago

It seems like replacing the InitMask (which is basically a BitSet, not sure why we have a distinct impl of it for MIR allocations?) with something like the recently-added IntervalSet may make sense -- presumably it's pretty rare for a truly random pattern of initialization to occur.
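
For intuition, a toy sketch of interval-based init tracking (illustrative only, not the actual IntervalSet or InitMask API): initialization done in large contiguous chunks stays a handful of ranges instead of one bit per byte.

// Toy interval set: disjoint, sorted, half-open ranges of initialized bytes.
struct InitIntervals {
    ranges: Vec<(u64, u64)>,
}

impl InitIntervals {
    fn new() -> Self {
        InitIntervals { ranges: Vec::new() }
    }

    // Mark [start, end) as initialized, merging overlapping/adjacent ranges.
    fn set_init(&mut self, mut start: u64, mut end: u64) {
        let mut merged = Vec::new();
        for &(s, e) in &self.ranges {
            if e < start || s > end {
                merged.push((s, e));
            } else {
                start = start.min(s);
                end = end.max(e);
            }
        }
        merged.push((start, end));
        merged.sort_unstable();
        self.ranges = merged;
    }

    fn is_init(&self, offset: u64) -> bool {
        self.ranges.iter().any(|&(s, e)| s <= offset && offset < e)
    }
}

fn main() {
    let mut mask = InitIntervals::new();
    // A 16 GB allocation written in one go is a single entry, not ~2 GB of bits.
    mask.set_init(0, 16_000_000_000);
    assert!(mask.is_init(123));
    assert!(!mask.is_init(16_000_000_000));
}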

eddyb commented 2 years ago

@Mark-Simulacrum I believe something like IntervalSet used to exist and was considered too cumbersome. Maybe the presence of a reusable abstraction would be enough to let miri use that?


In this case, however, I was explicitly experimenting with allocating the large array statically, akin to a C declaration of static int twodarray[ROWS][COLS].

To what end? Accessing it won't be more efficient than a pointer to a heap area - passing a &mut [[u128; COLS]; ROWS] around as a function argument (i.e. no globals), pointing to the heap, is probably the best you can do.

Putting it in a static just makes the file on disk unnecessarily large, and changes whose responsibility it is (kernel vs your program) to allocate it. And mutation requires a Mutex that you might be able to avoid otherwise.
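
A sketch of that shape (illustrative only; fill_diagonal is a made-up placeholder): allocate once on the heap, with no static and no Mutex, and pass a &mut to whatever needs it. It still needs ~16 GB of RAM at runtime, of course.

const ROWS: usize = 100000;
const COLS: usize = 10000;

// Placeholder for whatever real work would be done on the array.
fn fill_diagonal(arr: &mut [[u128; COLS]; ROWS]) {
    for i in 0..ROWS.min(COLS) {
        arr[i][i] = 1;
    }
}

fn main() {
    // Building the buffer through Vec keeps the 16 GB array off the stack; the
    // try_into converts Box<[[u128; COLS]]> into Box<[[u128; COLS]; ROWS]>
    // after checking the length.
    let mut arr: Box<[[u128; COLS]; ROWS]> = vec![[0u128; COLS]; ROWS]
        .into_boxed_slice()
        .try_into()
        .expect("length matches ROWS");
    fill_diagonal(&mut arr);
}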

Out of curiosity, I tried something similar (tho probably not sound on non-trivially-copyable types, heh) in C++ (using constexpr to match Rust's behavior) and it looks like GCC is worse than Clang at this.


OP's example is insufficiently minimized. A build of this program peaks at 33.6 GB:

const ROWS: usize = 100000;
const COLS: usize = 10000;

static TWODARRAY: [[u128; COLS]; ROWS] = [[0; COLS]; ROWS];

fn main() {}

Perhaps that makes sense: if we're making even one copy of this huge array, the peak would be twice the size of the array. Right?

That's worse than the OP's example (in terms of initialization state); I'd say this is closer:

pub static BIG: Option<[u128; 1 << 30]> = None;

For 1 << 25 instead of 1 << 30, even godbolt can compile it (and it even works if it has to evaluate a const fn).

Surprisingly, this is the best I can get to compile on godbolt even fully-uninitialized:

pub static BIG_UNINIT: MaybeUninit<[u128; 1 << 27]> = MaybeUninit::uninit();

(though you'll want to replace u128 with some atomic type, or use some other way to enable interior mutability, to get it to go into .bss instead of .rodata)
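
For example, something along these lines (a sketch; whether it actually lands in .bss depends on the target and codegen):

use std::mem::MaybeUninit;
use std::sync::atomic::AtomicU64;

// The atomics give the static interior mutability, so it can't be placed in a
// read-only section; being fully uninitialized, it can then end up in .bss.
// 1 << 28 AtomicU64s is the same 2 GiB as 1 << 27 u128s.
pub static BIG_UNINIT: MaybeUninit<[AtomicU64; 1 << 28]> = MaybeUninit::uninit();

fn main() {}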

saethlin commented 2 years ago

@Mark-Simulacrum I think I was mistaken to point to InitMask as a problem. It's surely where most of the compile time is coming from, but I think the memory consumption comes from MIR just doing a copy (in the minimal case) or two (in the case of lazy_static with a Mutex) of the intended array.

@eddyb

I believe something like IntervalSet used to exist and was considered too cumbersome.

It looks to me like there is currently InitMask and InitMaskCompressed, and in this program, the interpreter creates a full InitMask then attempts to do run-length encoding on it after-the-fact. I feel like this is just IntervalSet with extra steps?

Surprisingly, this is the best I can get to compile on godbolt even fully-uninitialized:

On godbolt I see "Compiler returned: 137", which from some Googling means something outside the compiler sent it a SIGKILL. The last build that works peaks at 2.1 GB, and of course the next power-of-2 size peaks at 4.1 GB. I think this is supposed to be unsurprising, because const eval has to construct the const. I suspect Godbolt has a (quite generous) 4 GB memory limit. So perhaps this is "as expected, but unfortunate"?

@hniksic

I do have more than 8 GiB of memory, but rustc was spending 100+ GiB, which exceeds array size by far.

Wowee, that's significantly worse. Can you share the lazy_static that was responsible for that amount of memory usage?

hniksic commented 2 years ago

Can you share the lazy_static that was responsible for that amount of memory usage?

That should actually be the original code in the report. The line from top shows the compiler having allocated 100 GiB virtual memory (and presumably not getting killed due to overcommit). Maybe the difference is that I was compiling it in release mode?

oli-obk commented 2 years ago

Supporting only an uninitialized suffix might be possible?

yea, we could lazily grow it and treat anything beyond the size as uninitialized.
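
A toy sketch of that (not the real InitMask): only materialize mask bits up to the highest initialized offset, and treat anything past the end of the bit vector as uninitialized.

// Toy mask: offsets >= bits.len() are implicitly uninitialized.
struct LazyInitMask {
    bits: Vec<bool>,
}

impl LazyInitMask {
    fn new() -> Self {
        LazyInitMask { bits: Vec::new() }
    }

    fn set_init(&mut self, start: usize, end: usize) {
        if self.bits.len() < end {
            // Grow only as far as data is actually initialized: a 16 GB
            // allocation whose first few bytes are written stays tiny.
            self.bits.resize(end, false);
        }
        for b in &mut self.bits[start..end] {
            *b = true;
        }
    }

    fn is_init(&self, offset: usize) -> bool {
        self.bits.get(offset).copied().unwrap_or(false)
    }
}

fn main() {
    let mut mask = LazyInitMask::new();
    mask.set_init(0, 64); // e.g. just the Once at the front of the allocation
    assert!(mask.is_init(8));
    assert!(!mask.is_init(1_000_000)); // far past anything written
}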

RalfJung commented 2 years ago

It looks to me like there is currently InitMask and InitMaskCompressed, and in this program, the interpreter creates a full InitMask then attempts to do run-length encoding on it after-the-fact. I feel like this is just IntervalSet with extra steps?

Basically, we have InitMask to store the mask of an allocation, and InitMaskCompressed is used when copying the mask from one allocation to another (which might overlap, and we might be asked to repeat this N times, so we have to keep a copy of the original mask).

If it is possible, without perf loss for typical CTFE, to use the same data structure for both of these cases, that would be nice. :D But I think the problem is that testing if bit i is set in an InitMaskCompressed actually costs linear time since we have to scan the list of intervals.

tmiasko commented 2 years ago

But I think the problem is that testing if bit i is set in an InitMaskCompressed actually costs linear time since we have to scan the list of intervals.

If the interval lengths were replaced with prefix sums of their lengths, testing whether the i-th bit is set can be performed in logarithmic time.
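
A sketch of that suggestion (toy code, not the real InitMaskCompressed layout): store the run boundaries as prefix sums and binary-search them with partition_point.

// Sketch of a run-length-encoded mask: runs alternate between `initial` and
// `!initial`, and `run_ends[i]` is the prefix sum of run lengths 0..=i.
struct CompressedMask {
    initial: bool,
    run_ends: Vec<u64>,
}

impl CompressedMask {
    // O(log n) per query instead of scanning the run list linearly.
    fn is_init(&self, offset: u64) -> bool {
        let run = self.run_ends.partition_point(|&end| end <= offset);
        if run % 2 == 0 { self.initial } else { !self.initial }
    }
}

fn main() {
    // Runs: [0, 16) initialized, [16, 4016) uninitialized, [4016, 4032) initialized.
    let mask = CompressedMask { initial: true, run_ends: vec![16, 4016, 4032] };
    assert!(mask.is_init(0));
    assert!(!mask.is_init(100));
    assert!(mask.is_init(4020));
}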