ivberg opened 3 months ago
Related: [ExternalData - On Windows document proper offset for memory mapping support](https://github.com/onnx/onnx/pull/6247) and [Python external data conversion - Add support for aligning data for mmap](https://github.com/onnx/onnx/pull/6248)
Describe the feature request
Overview
Users want memory utilization to be as low as possible, and system performance to be as good as possible, when running AI models.
The benefits are large, multi-GB process memory commit savings and, for some models, possible savings in total overall memory usage.
The feature ask is to allow a fully optimized model to be serialized to disk and used as-is, without large heap allocations.
Windows examples are used in the following, but this would likely apply to other OSes and to ORT in general, since memory-mapping APIs are available on other OSes and the ORT code is cross-platform in this respect.
About
1. If not all of a model's weights are always used (sparse tensors?), then only the weights actually used could be read in from disk, occupying memory only when accessed. In this case, total memory usage for running an AI model is less than the on-disk size of the model.
2. For most Large Language Models (LLMs), all of the weights are usually needed and accessed during inference through the attention mechanism. However, the way they are accessed and read in from disk has performance and memory implications.
Reducing Process Memory Commit Usage
Using case (2), LLMs, as the example: there are techniques to reduce the memory commit usage of a process (using OS memory-mapping APIs) and sometimes to obtain higher performance, including inference performance, especially under low-memory conditions. Some AI models are large and therefore much more likely to push a system to its memory limits. If these memory-mapping APIs are used and heap memory does not need to be allocated, then weights / initializer data from an ONNX model can simply be paged in when accessed and would not occupy process commit.
This is very beneficial because total system commit is a precious resource. The commit limit is physical memory plus pagefile size; e.g. 16 GB RAM + 16 GB pagefile = at most 32 GB of memory that can be allocated. Once this limit is reached, no more memory can be allocated anywhere on the system. See also: Commit charge, Pushing the Limits of Windows: Virtual Memory, and Virtual Address Space and Physical Storage.
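As a point of reference, commit charge and commit limit can be observed directly. Below is a minimal standalone sketch (plain Win32, not ORT-specific) that queries them via GetPerformanceInfo; the values are reported in pages, so they are scaled by PageSize:

```cpp
// Sketch: print the system-wide commit charge and commit limit on Windows.
// Link against Psapi.lib (or use K32GetPerformanceInfo from kernel32 on newer SDKs).
#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main() {
    PERFORMANCE_INFORMATION pi = {};
    pi.cb = sizeof(pi);
    if (GetPerformanceInfo(&pi, sizeof(pi))) {
        const double gib = 1024.0 * 1024.0 * 1024.0;
        std::printf("Commit charge: %.2f GiB\n", pi.CommitTotal * pi.PageSize / gib);
        std::printf("Commit limit:  %.2f GiB\n", pi.CommitLimit * pi.PageSize / gib);
    }
    return 0;
}
```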
All further examples use an example SLM (Small Language Model), Phi Silica, of around 1.85 GB on-disk size (3.2B params).
Part 1 - Use ONNX External Data file with proper alignment + disable ORT Arena allocator
For our first experiment, we used ONNX External Data files with proper alignment fixes to generate a file that could be successfully memory-mapped on Windows for all large initializers (see "External Data Conversion is not saving most data with alignment support. Therefore, mmap support disabled for these initializers"). We also disabled the Arena memory allocator, since on CPU the arena greedily consumes much more memory and clouds the memory picture:
```cpp
m_session_options.DisableCpuMemArena();
```
With this in place, ONNX Runtime was able to save a few hundred MB (233 MB) of process commit. This comes just from having an aligned external data file and thus letting ORT use memory-mapped file support. However, it is not much commit savings compared to the entire size of the model.
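For context, here is a minimal sketch of what the Part 1 configuration looks like with the ORT C++ API (model path and names are placeholders for the Phi Silica example above; ORT resolves the external-data file relative to the model's directory):

```cpp
// Sketch: load a model that uses an aligned external-data file, with the CPU arena disabled.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "mmap-experiment");

    Ort::SessionOptions session_options;
    session_options.DisableCpuMemArena();  // Part 1: avoid the greedy CPU arena allocations

    // With properly aligned external initializer data, ORT can memory-map the data file
    // instead of copying the initializers onto the heap.
    Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
    return 0;
}
```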
Part 2 - Disable pre-packing
For our second experiment, in addition to the technique and settings above, we disabled ORT pre-packing, which we determined from tracing was still making the largest memory allocations (SessionState::PrepackConstantInitializedTensors):
```cpp
// Disable pre-packing - saves commit but REALLY REALLY bad for inference perf and overall runtime
m_session_options.AddConfigEntry(kOrtSessionOptionsConfigDisablePrepacking, "1");
```
With this, the commit memory savings were large and in line with most of the size of the model (77%), in this case around 1436 MB of commit. The issue is that disabling pre-packing had a severe runtime inference performance cost (200x worse), making the model unusable performance-wise, but great memory-wise.
General framework for implementing the feature
What follows is technical information on the general approach ORT might take to both pre-pack a model and then serialize it to disk, such that memory mapping could work AND large memory allocations would not be needed by ORT. This would give the best of both worlds: great runtime performance while making the best use of system memory.
Changes in how the model weights are accessed
What would happen is simply that the OS would page in the initializers and weights as needed, on demand, during inference. Weights that were routinely accessed would be kept in physical memory, not much differently than how heap memory for active working sets is kept in physical memory when needed. The difference from before is that these pages would be backed by the model file on disk rather than by committed heap memory, so they would not count against process commit and could be repurposed by the OS under memory pressure and re-read from the model file later.
Feature request suggestions
So how to go about implementing this feature request?
ONNX Runtime already has the notion of graph optimizations that can be serialized/written to disk, for example in offline mode tied to a specific class of hardware (see graph-optimizations).
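For reference, serializing the offline-optimized graph is already possible today with the session options below (a sketch; paths are placeholders). The feature request is essentially to extend this idea so that pre-packed initializers can also land in the serialized artifact:

```cpp
// Sketch: write the graph-optimized model to disk once (offline), then ship/load that file.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "offline-optimize");

    Ort::SessionOptions session_options;
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    // Serialize the optimized graph so later sessions can skip re-optimization.
    session_options.SetOptimizedModelFilePath(ORT_TSTR("model.optimized.onnx"));

    Ort::Session session(env, ORT_TSTR("model.onnx"), session_options);
    // Later sessions load "model.optimized.onnx" directly, but pre-packing still
    // happens in memory at session load time today.
    return 0;
}
```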
However, even when using this offline-optimized model, large memory allocations will still occur in ORT due to something called pre-packing. Pre-packing has large positive runtime inference performance benefits. However, our view is that these pre-packing optimizations should be done once and be serializable to disk, so that the on-disk data structure matches the most optimized in-memory layout that ORT will use.
Once pre-packed data is serialized on disk and used during session load, the data structures needed for inference are already mapped into the process address space via memory mapping and MapViewOfFile. With no other major allocations needed, when ORT accesses a data structure or weights, the OS would simply page those weights in from disk to memory.
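To illustrate the mechanism being relied on, here is a minimal Win32 sketch (the file name is a placeholder, independent of ORT internals) of a read-only, file-backed mapping. Touching a mapped page faults the data in from the file on demand, and read-only file-backed pages do not count against process commit:

```cpp
// Sketch: map a weights/external-data file read-only and touch a byte to demand-page it in.
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE file = CreateFileW(L"model.onnx.data", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping == nullptr) { CloseHandle(file); return 1; }

    // Map the whole file; individual tensors would be views/offsets into it, which is
    // why external-data offsets need to respect the OS allocation granularity.
    const auto* base = static_cast<const unsigned char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    if (base != nullptr) {
        std::printf("First byte of mapped weights: %u\n", base[0]);  // page fault reads from disk
        UnmapViewOfFile(base);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```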
FYI @pranavsharma and @yuslepukhin, with whom we have already been working on ORT on this.
Describe scenario use case
This will be useful for optimizing memory usage in on-device client scenarios with limited physical RAM running large CPU models.
Larger models on disk (1 GB+), for example those with billions of parameters, would utilize memory better with fully working memory-map support.
Examples include Large Language Models (LLMs) and Small Language Models such as Phi Silica 3.