mmtk / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

High resident set size (RSS) #1220

Open wks opened 3 weeks ago

wks commented 3 weeks ago

We observed on several VM bindings that the RSS memory consumption is higher when using MMTk than when using those VMs' default GCs. We need to investigate where that memory is being used. Possibilities include (but are not limited to)

qinsoon commented 2 weeks ago

I tried to log RSS usage during a fop run with our OpenJDK binding. The large RSS issue is mostly caused by the initialization of SFT and VMMap.

Before MMTk initializes, the RSS is 14MB. After MMTk initializes SFT_MAP, the RSS is 526MB. After MMTk initializes VM_MAP, the RSS is 783MB. After the plan is created, the RSS is 785MB. After MMTk finishes the initialization, the RSS is 801MB. At the end of the fop run, the RSS is 900MB+.
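For reference, RSS at each of these points can be sampled by reading the VmRSS field of /proc/self/status; a minimal sketch of such a probe (not the exact logging code used for the numbers above) looks like this:

```rust
use std::fs;

/// Return the current process's resident set size in kilobytes, read from
/// the VmRSS field of /proc/self/status. Returns None if the field is missing.
fn rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}

fn main() {
    // Call this before/after SFT_MAP and VM_MAP initialization to see the jump.
    println!("RSS = {} kB", rss_kb().unwrap_or(0));
}
```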

Though it is worth investigating how the RSS grows over the run, we should focus on the SFT and VM map first: together they contribute about 80% of the RSS usage.

wks commented 2 weeks ago

@qinsoon Can you also evaluate the effect of

and see whether they affect the RSS impact of SFT_MAP and VM_MAP, too? I think they affect the heap layout and may also change how much memory SFT_MAP and VM_MAP try to mmap.

qinsoon commented 2 weeks ago

I was using a compressed pointer build. So it used SFTSparseChunkMap and Map32.

k-sareen commented 2 weeks ago

Yes. I ran tests a few months ago and mentioned back then that we had regressed because the restricted address space uses the sparse chunk map. That, VMMap, and not returning pages back to the OS were, from memory, the largest sources of RSS overhead.

qinsoon commented 2 weeks ago

Right. Without compressed pointers (using SFTSpaceMap and Map64), there is no substantial RSS increase during initialization. At the end of initialization, the RSS was 17MB. During the fop run, the RSS increased from 43MB (first GC) to 221MB (last GC).

In contrast, with compressed pointers (using SFTSparseChunkMap and Map32), the RSS was 801MB at the end of initialization, and during the run it increased from 833MB (first GC) to 1023MB (last GC).

wks commented 18 hours ago

I compared the mmap entries after running the Liquid benchmark on mmtk-ruby. I used the same binary, and used a command-line argument to control whether to use MMTk or CRuby's default GC. When using MMTk, the plan is StickyImmix, and the heap size is set to 36MB, i.e. 1.5x the minimum heap size. I printed the mmap entries and calculated how many of their pages are in RAM using the methodology described here. The data was collected at the time of harness_end. I tried to match the mmap entries from /proc/pid/maps between the two executions; the results are in the following spreadsheet:

compare2.ods
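For anyone who wants to reproduce roughly comparable per-entry numbers, one approximation is to sum the Rss field of each mapping in /proc/&lt;pid&gt;/smaps (the linked methodology may compute pages in RAM differently, e.g. via /proc/&lt;pid&gt;/pagemap):

```rust
use std::fs;

/// Print the resident size (Rss) of each mapping in /proc/<pid>/smaps.
/// This approximates per-entry "pages in RAM"; the exact methodology in the
/// linked post may differ.
fn main() {
    let pid = std::env::args().nth(1).expect("usage: smaps-rss <pid>");
    let smaps = fs::read_to_string(format!("/proc/{}/smaps", pid)).expect("read smaps");

    let mut header = String::new();
    for line in smaps.lines() {
        let first = line.split_whitespace().next().unwrap_or("");
        if first.contains('-') {
            // A mapping header, e.g. "55d1..-55d2.. rw-p 00000000 00:00 0 [heap]"
            header = line.to_string();
        } else if let Some(rest) = line.strip_prefix("Rss:") {
            // A field line, e.g. "Rss:                  12 kB"
            let kb: u64 = rest.trim().trim_end_matches("kB").trim().parse().unwrap_or(0);
            if kb > 0 {
                println!("{:>8} kB  {}", kb, header);
            }
        }
    }
}
```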

Note that the overhead of SFTMap is trivial because the mmtk-ruby binding currently uses SFTSpaceMap on 64-bit systems, and its tables don't have many entries. The overhead of Map64 is also trivial because the length of its tables (descriptor map, base address and high watermark) is MAX_SPACE, which is only 16.
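As a back-of-the-envelope check (assuming word-sized entries purely for illustration; the real entry types may be wider), the Map64 side tables amount to only a few hundred bytes:

```rust
fn main() {
    const MAX_SPACE: usize = 16; // table length quoted above
    const TABLES: usize = 3;     // descriptor map, base address, high watermark
    // Assuming each entry is one machine word; the actual entry types may differ.
    let bytes = MAX_SPACE * TABLES * std::mem::size_of::<usize>();
    println!("~{} bytes across all Map64 tables", bytes); // ~384 bytes -> negligible RSS
}
```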

The mmap entries specific to MMTk include:

  1. MMTk heap spaces and metadata.
    • All spaces and metadata have 26.5MB in RAM
    • The ImmixSpace alone has 24.0MB in RAM
  2. Stacks and thread-local mmap memory for GC worker threads.
    • Stacks: 1.19MB in total, 76.0KB per thread
    • Malloc: 8.20MB in total, 525KB per thread
    • Stacks+malloc: 9.39MB in total, 601KB per thread

The mmap entries specific to CRuby's own GC include:

  1. CRuby's heap
    • 7MB in total.

The main mmap entries for malloc (the [heap] entry plus other un-annotated entries) are:

In summary, MMTk has a larger RSS footprint in

In this execution, the MMTk heap size was set to 1.5x the min heap size. If we divide the ImmixSpace size by 1.5, we get 16MB, which is still larger than CRuby's own GC heap, which is 7MB. One possible explanation is that in the current implementation of mmtk-ruby, we allocate Array, String and MatchData payloads in the GC heap, while vanilla CRuby allocates them in the malloc heap. That gives the illusion that MMTk is using more GC heap, when CRuby is simply allocating those objects off-heap. If we assume half of the heap objects are the payloads of such objects (which I think is still conservative for the Liquid benchmark because it uses regular expressions and strings very intensively), that leaves 8MB, which is close to the 7MB of the default GC.
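The arithmetic in the paragraph above, spelled out (all numbers approximate):

```rust
fn main() {
    // Numbers from the comparison above (approximate).
    let immix_rss_mb = 24.0_f64;              // ImmixSpace resident size at 1.5x min heap
    let at_min_heap = immix_rss_mb / 1.5;     // ~16 MB if the heap were min-sized
    let without_payloads = at_min_heap / 2.0; // ~8 MB if half the heap is Array/String/MatchData payloads
    let cruby_heap_mb = 7.0;                  // CRuby's own GC heap
    println!("{at_min_heap} MB -> {without_payloads} MB, vs {cruby_heap_mb} MB for CRuby");
}
```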

MMTk uses the work packet system. All work packets are allocated in the malloc heap, and the Vec members of work packets are also allocated in the malloc heap. This explains the malloc heap usage of the GC worker threads. If we reduce the number of GC worker threads by setting the env var MMTK_THREADS, the RSS footprint related to pthread stacks and thread-local malloc buffers will shrink proportionally. It is arguable that this part of the RSS footprint is not a problem because the memory is not leaked: Rust's ownership mechanism always frees unused memory correctly. But the RSS number does look worse than that of vanilla CRuby.
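To illustrate the point about malloc usage (a schematic sketch only, not the actual mmtk-core work-packet types or trait): both a boxed work packet and the backing storage of its Vec live in the malloc heap, and the allocator may keep freed memory cached in per-thread arenas rather than returning it to the OS, which is what shows up as thread-local malloc RSS.

```rust
/// Schematic stand-in for a work packet; not the real mmtk-core types.
struct ScanObjects {
    // Stand-in for a Vec of object references; its backing buffer is malloc'd.
    buffer: Vec<usize>,
}

trait Work {
    fn do_work(&mut self);
}

impl Work for ScanObjects {
    fn do_work(&mut self) {
        // The buffer is freed when drained and the packet dropped, so nothing
        // is leaked, but the allocator may keep the pages cached per thread.
        for _obj in self.buffer.drain(..) {
            /* trace the object */
        }
    }
}

fn main() {
    // The Box and the Vec's buffer are both malloc-heap allocations.
    let mut packet: Box<dyn Work> = Box::new(ScanObjects { buffer: vec![0; 4096] });
    packet.do_work();
}
```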