mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

Panic with 'VO bit already set' in GenImmix #1070

Closed qinsoon closed 10 months ago

qinsoon commented 10 months ago

Build OpenJDK with fastdebug and VO_BIT=1.

Run dacapo 2006 with GenImmix.

MMTK_PLAN=GenImmix /home/runner/work/mmtk-openjdk/mmtk-openjdk/bundles/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:MetaspaceSize=100M -Xms20M -Xmx20M -jar /home/runner/work/mmtk-openjdk/mmtk-openjdk/dacapo/dacapo-2006-10-MR2.jar antlr

We will see the following panic:

[2024-01-15T04:43:14Z INFO  mmtk::util::heap::gc_trigger] [POLL] nursery: Triggering collection (5127/5120 pages)
[2024-01-15T04:43:14Z INFO  mmtk::plan::generational::global] Nursery GC
[2024-01-15T04:43:15Z INFO  mmtk::scheduler::gc_work] End of GC (2692/5120 pages, took 7 ms)
thread '<unnamed>' panicked at '40000000: VO bit already set', /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/79fb0bb/src/util/metadata/vo_bit/mod.rs:65:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
wks commented 10 months ago

Did this happen in CI? We can set RUST_BACKTRACE to 1 when running in CI so that we always collect a stack trace.
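For example, this is just the reproduction command from the report with the environment variable prepended:

RUST_BACKTRACE=1 MMTK_PLAN=GenImmix /home/runner/work/mmtk-openjdk/mmtk-openjdk/bundles/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:MetaspaceSize=100M -Xms20M -Xmx20M -jar /home/runner/work/mmtk-openjdk/mmtk-openjdk/dacapo/dacapo-2006-10-MR2.jar antlr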

qinsoon commented 10 months ago

Yeah. In my PR https://github.com/mmtk/mmtk-openjdk/pull/263, I changed the heap size used in CI to a reasonable value (for https://github.com/mmtk/mmtk-core/issues/424), and started to see this error. I did not look into the problem though.

wks commented 10 months ago

This is 100% reproducible on my machine. I'll have a look at it.

wks commented 10 months ago

The culprit is the check current_chunk != self.common.start:

impl<VM: VMBinding> CopySpace<VM> {
    // ... 
    #[cfg(feature = "vo_bit")]
    unsafe fn reset_vo_bit(&self) {
        let current_chunk = self.pr.get_current_chunk();
        if self.common.contiguous {
            // If we have allocated something into this space, we need to clear its VO bit.
            if current_chunk != self.common.start {  // ERROR!
                crate::util::metadata::vo_bit::bzero_vo_bit(
                    self.common.start,
                    current_chunk + BYTES_IN_CHUNK - self.common.start,
                );
            }
        } else {
            for (start, size) in self.pr.iterate_allocated_regions() {
                crate::util::metadata::vo_bit::bzero_vo_bit(start, size);
            }
        }
    }
    // ...
}

It is intended to skip clearing VO bits for the nursery if mutators did not allocate anything into the nursery between the previous GC and the current one. However, the check is done at chunk granularity. If mutators allocated less than one chunk of memory into the nursery, current_chunk will still be equal to self.common.start, and no VO bits will be cleared, even though some objects have been allocated.
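For concreteness, here is an illustrative calculation with hypothetical addresses, assuming the usual 4 MB chunks and that get_current_chunk() reports the chunk-aligned address of the allocation cursor:

// Hypothetical values, for illustration only.
let start: usize = 0x4000_0000;                     // self.common.start
let cursor = start + 64 * 1024;                     // mutators allocated only 64 KB
let current_chunk = cursor & !(BYTES_IN_CHUNK - 1); // chunk-aligned: still 0x4000_0000
assert_eq!(current_chunk, start);                   // the `!=` guard is false,
                                                    // so no VO bits are cleared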

One obvious fix is removing that if statement so that we unconditionally clear the VO bits of the entire nursery after every GC.

Another obvious fix is to clear from self.common.start to self.pr.cursor(), because the cursor marks the end (at page granularity) of the allocated region.

The third obvious fix is to just use self.pr.iterate_allocated_regions() regardless of whether the space is contiguous or not. For contiguous monotonic page resources, the iterator always yields a single range from pr.start to pr.cursor (the cursor is aligned up to a multiple of chunks, so if pr.start == pr.cursor, it will be a range of 0 bytes).

I think we can just use self.pr.iterate_allocated_regions() as it is the simplest solution that works.
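A minimal sketch of that change (not the actual patch), reusing the existing discontiguous branch for contiguous spaces as well:

#[cfg(feature = "vo_bit")]
unsafe fn reset_vo_bit(&self) {
    // Ask the page resource which regions have actually been allocated and
    // clear the VO bits of exactly those regions. For a contiguous monotonic
    // page resource this yields a single (possibly empty) range from its
    // start to its cursor, so the contiguous/discontiguous distinction is no
    // longer needed here.
    for (start, size) in self.pr.iterate_allocated_regions() {
        crate::util::metadata::vo_bit::bzero_vo_bit(start, size);
    }
}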