mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io
Other
379 stars 69 forks source link

Improve nogc error when out of space #1175

Open eileencodes opened 4 months ago

eileencodes commented 4 months ago

When using NoGC if the machine runs out of memory, the user will see an error that says garbage collection failed:

ERROR: An MMTk GC thread panicked.  This is a bug.
panicked at /Users/eileencodes/src/github.com/open_source/mmtk-core/src/plan/nogc/global.rs:74:9:
internal error: entered unreachable code: GC triggered in nogc
Backtrace is disabled.
run with `RUST_BACKTRACE=1` environment variable to display a backtrace
running file: test/zlib/test_zlib.rb

I added a check in poll to see if the plan being used supports GC and if not, raise an error. I could have updated the schedule_collection function to return this error, but I felt that since that function is regarding scheduling, it was better to raise if any plan does not support GC than change the purpose of schedule_collection. We still want to know if a GC ended up being triggered in schedule_collection.

The new error looks like this:

[2024-07-24T18:43:45Z INFO  mmtk::memory_manager] Initialized MMTk with NoGC (FixedHeapSize(1073741824))
thread '<unnamed>' panicked at /Users/eileencodes/src/github.com/open_source/mmtk-core/src/util/heap/gc_trigger.rs:80:17:
internal error: entered unreachable code: User ran out of space and the plan does not support collecting garbage.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
running file: test/zlib/test_zlib.rb

Note: I've written very little rust so while I know this works (tested in CRuby) I don't know if from a design perspective this is correct.

wks commented 4 months ago

Yes. It makes sense to let it exit immediately when a mutator thread has an allocation failure. The previous code relies on the unreachable!() macro in schedule_collection executed on a GC thread, and the error message appeared as if it were a bug in MMTk while it is just normal for NoGC.

I made some slight changes. I retained the info! log in poll so that when we see the number of pages it prints is too small, we know we need to increase the heap size. I also changed unreachable! to panic! because it is technically not "unreachable code". unreachable! is intended to guard code paths that shouldn't be executed unless there is a bug in the code, but running out of memory is a run-time error.

wks commented 4 months ago

There are two additional problems.

  1. We currently do mock testing using MockVM, and it hooks the Collection::block_for_gc call back function to identify if the mutator triggered GC in a code snippet. If we panic early in GCTrigger::poll, it will not reach block_for_gc.
  2. We actually have an out-of-memory handler Collection::out_of_memory. Its default behavior is panicking, but it allows the VM binding to override it in order to handle the event of out-of-memory in a VM-specific manner. For example, JVM should throw OutOfMemoryError. Ruby doesn't seem to have a hard heap size limit, but only a heap growth factor, and the operating system will start killing innocent processes after a Ruby program allocates too many large objects (which have buffers allocated by malloc).

I think we need to think more about this topic, i.e. out-of-memory handling in NoGC. Intuitively, if we run out of memory in NoGC, then Collection::out_of_memory should be called, which defaults to panicking.

qinsoon commented 4 months ago

We can just rewrite related tests. I am also okay if you would like to call out_of_memory for NoGC.