Investigate memory fragmentation from ClrMD snapshots

Qonfused commented 4 months ago

ClrMD calls PssCaptureSnapshot to create a 'copy-on-write' clone of the MTGO client's process memory in a new snapshot process. The clone shares the same physical memory pages as the original virtual address space until a page is modified. If the client process then writes to one of these shared pages, the write instead goes to a new page initialized with a copy of the original page contents.

It's worth noting that the process is frozen for the duration of the call to PssCaptureSnapshot, so no page-ins can occur until the snapshot has been created. However, writes made while the snapshot is still active will incur overhead from these new page-ins. After the snapshot is disposed of, no further page-ins will be made, as the snapshot process no longer holds references to the shared pages (they are now exclusively owned by the MTGO process).

Therefore, the optimization path for shortening the window in which these snapshots are alive is twofold:

1. Avoid performance penalties from new page-ins for the snapshot's lifecycle.
2. Avoid maintaining stale copies of the object heap that may reference changed or non-existent memory locations.

Memory fragmentation in the MTGO process (irrespective of any behaviors caused by this snapshot) may trigger GC cycles that touch shared pages and cause page-ins even if there was no actual modification of data. At the same time, any unused pages of memory left over from copy-on-write modifications also need to be garbage collected, consuming additional CPU cycles.

This leaves a third optimization path to investigate with GC, which is further exacerbated by the memory handling of the MTGO process. This may require a custom GC mechanism to control/hide shared regions of memory created by the snapshot to reduce redundant copying.
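Before building anything custom, it may be worth checking how often GC actually runs while a snapshot is alive. Below is a minimal sketch, assuming the probe runs inside the process being snapshotted (e.g. from injected code). `GcActivityProbe` and `CountCollectionsDuring` are hypothetical names and not part of ClrMD or the SDK; only `GC.CollectionCount` and `GC.MaxGeneration` are real CLR APIs.

```csharp
using System;

// Hypothetical helper (not part of MTGOSDK or ClrMD): reports how many
// collections of each generation ran while a block of work executed.
public static class GcActivityProbe
{
    public static int[] CountCollectionsDuring(Action work)
    {
        if (work == null) throw new ArgumentNullException(nameof(work));

        int generations = GC.MaxGeneration + 1;

        // Record the per-generation collection counters before the work.
        int[] before = new int[generations];
        for (int gen = 0; gen < generations; gen++)
            before[gen] = GC.CollectionCount(gen);

        // In practice: create the snapshot, use it, and dispose of it here.
        work();

        // The difference is the number of collections during the window.
        int[] delta = new int[generations];
        for (int gen = 0; gen < generations; gen++)
            delta[gen] = GC.CollectionCount(gen) - before[gen];
        return delta;
    }
}
```

Note that a collection of an older generation also collects the younger ones, so the gen 0 counter includes gen 1 and gen 2 collections. A non-zero count only shows that GC ran during the window, not how many pages it wrote to.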

Qonfused commented 4 months ago

The CLR exposes controls for suspending GC through the GC.TryStartNoGCRegion() and GC.EndNoGCRegion() methods (available since .NET Framework 4.6; added in https://github.com/dotnet/coreclr/commit/4f74a99e296d929945413c5a65d0c61bb7f2c32a):

using System;
using System.Runtime; // GCSettings, GCLatencyMode

// Assuming workstation GC, compute the maximum size for TryStartNoGCRegion().
// For 64-bit processes, use 256 MB; for 32-bit processes, use 16 MB.
int ephemeralSegmentSize = (Environment.Is64BitProcess ? 256 : 16) << 20;

// Avoid calling TryStartNoGCRegion() if we're already in a no GC region.
bool noGCRegionEntered = false;
if (GCSettings.LatencyMode != GCLatencyMode.NoGCRegion)
  noGCRegionEntered = GC.TryStartNoGCRegion(ephemeralSegmentSize, true);

try
{
  // - Create a new snapshot using ClrMD (calling PssCaptureSnapshot)
  // - Carry out any work immediately after the snapshot is taken
  //   (i.e. returning values from getters/setters, function calls)
  // - Dispose of the snapshot, restoring COW-marked memory pages
}
finally
{
  // Ensure we leave the no GC region if we entered one and it is still active.
  if (noGCRegionEntered && GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
    GC.EndNoGCRegion();
}

The hope is to prevent GC from removing objects or re-arranging and compacting memory for the duration of our snapshot handling. This is particularly helpful when accessing short-lived objects (generation 0 and 1), which the GC collects and compacts most often.

Frequent calls to GC.TryStartNoGCRegion() do appear to carry additional restrictions that aren't mentioned in the documentation. Exceptions thrown from GC.EndNoGCRegion() as a side effect of this behavior may need explicit handling.
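One way to contain those exceptions is to wrap the region in a disposable scope. This is only a sketch: `NoGcRegionScope` is a hypothetical name, and the catch blocks cover the documented failure modes (`ArgumentOutOfRangeException` and `InvalidOperationException` from `GC.TryStartNoGCRegion()`, `InvalidOperationException` from `GC.EndNoGCRegion()`), not the undocumented behavior mentioned above.

```csharp
using System;
using System.Runtime;

// Hypothetical wrapper (name and shape are assumptions): enters a no-GC
// region if possible and leaves it defensively on disposal.
public sealed class NoGcRegionScope : IDisposable
{
    // True only if this scope started the region itself.
    public bool Entered { get; }

    public NoGcRegionScope(long totalSize)
    {
        // Skip if some other caller already put the process in a region.
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
            return;

        try
        {
            Entered = GC.TryStartNoGCRegion(totalSize, true);
        }
        catch (ArgumentOutOfRangeException)
        {
            // totalSize exceeds the ephemeral segment size; continue
            // without a no-GC region rather than failing the caller.
        }
        catch (InvalidOperationException)
        {
            // A region was started concurrently after the check above.
        }
    }

    public void Dispose()
    {
        // Only end a region we started and that is still in effect.
        if (!Entered || GCSettings.LatencyMode != GCLatencyMode.NoGCRegion)
            return;

        try
        {
            GC.EndNoGCRegion();
        }
        catch (InvalidOperationException)
        {
            // The region already ended, e.g. because a GC was induced or
            // allocations exceeded the requested budget.
        }
    }
}
```

The snapshot work would then sit inside `using (new NoGcRegionScope(size)) { ... }`. Swallowing the exception keeps the snapshot path alive, at the cost of not knowing whether GC ran during the window, so logging inside the catch blocks is probably worthwhile.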

Qonfused commented 3 months ago

With normal IO priority, the overhead for PssCaptureSnapshot typically falls between 10-20 ms (<30 ms for most page table sizes; fixed cost). Though this part of the process is relatively cheap, the enclosing ClrMD method call to CreateSnapshotAndAttach (which involves process creation) takes about 200-300 ms.
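For reproducing numbers like these, a small timing harness around the snapshot lifecycle is enough. The sketch below is not how the figures above were measured; `SnapshotTiming` and `Measure` are hypothetical names, and the snapshot is abstracted as any `IDisposable` so the helper has no ClrMD dependency.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper (not part of ClrMD or the SDK).
public static class SnapshotTiming
{
    // Returns the total lifetime of the snapshot in milliseconds and
    // reports the creation cost separately through createMs.
    public static double Measure(
        Func<IDisposable> createSnapshot,
        Action<IDisposable> work,
        out double createMs)
    {
        if (createSnapshot == null) throw new ArgumentNullException(nameof(createSnapshot));
        if (work == null) throw new ArgumentNullException(nameof(work));

        Stopwatch clock = Stopwatch.StartNew();
        using (IDisposable snapshot = createSnapshot())
        {
            // Time spent creating the snapshot (and attaching to it).
            createMs = clock.Elapsed.TotalMilliseconds;
            work(snapshot);
        }
        // Includes creation, the work, and disposal of the snapshot.
        return clock.Elapsed.TotalMilliseconds;
    }
}
```

With ClrMD referenced, the factory would be something like `() => DataTarget.CreateSnapshotAndAttach(pid)`, where `pid` is the MTGO process ID. The total is the figure that matters for copy-on-write overhead, since shared pages stay shared until the snapshot is disposed of.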

Older versions of Visual Studio (2017 version 15.9) set a cutoff time between subsequent snapshots around 300 ms (about P99 latency), where 81 ms lies at about the 75th percentile for the average snapshot time (article).

Qonfused commented 2 months ago

Memory fragmentation overall appears to be minimal; this point is somewhat obvious as the .NET GC is unaware of OS-level CoW attributes of the page table (and most GC operations like finding roots don't require writes to random regions of heap memory). This mainly applies to passive operations of GC that would be expected to be largely read-only.

Another point is that GC-internal data structures are not kept in (managed) heap memory, which isolates the effect of page-ins from an active GC (aside from freeing or compacting memory). Often GC will update a card table/bitmap and various metadata (e.g. timestamps) that track when references are updated without modifying the object itself.

The caveat is that metadata stored in or near write barriers tends to be located near the actual object, so these writes span more pages. They are often observed as coalesced write operations, which makes it more predictable when memory is actually being written to.

Qonfused commented 2 months ago

Overall the impact is fairly small for snapshots taken to locate heap objects for ScubaDiver's IL indirection + reflection; the event-driven architecture of the SDK greatly mitigates this risk by pinning object references only as needed and not relying on the snapshot 'runtime' after subscription.

Will have better benchmarks for analyzing impact after #4, though I'd consider this a non-issue for the SDK as it currently stands.

videre-project / MTGOSDK

Investigate memory fragmentation from ClrMD snapshots #11