[Open] silvasean opened this issue 1 year ago
cc @benvanik @antiagainst
See also the description on https://github.com/openxla/iree/pull/11979.
One snippet from there:
Most usage that would benefit today falls into benchmarks and those are fine with unbounded. Since this style of allocator can easily lead to out-of-memory situations it's off by default and something a user needs to opt into based on their usage patterns and configuration needs. User binding layers can decide how they want to expose allocator configuration; some may want their own defaults and simple enums while others may want to expose deep configuration. This is like clang/gcc not forcing tcmalloc/jemalloc/mimalloc and requiring users to opt-in based on their needs.
Defining one single "default production configuration" for a low level library like IREE will be difficult/impractical. We'll likely want to carefully choose options like which allocator (and settings) to use for each application. Server training with large batch sizes will want a very different configuration than edge device inference on intermittent bursts of data samples.
I vaguely recall hearing that --device_allocator=caching is not really a good final solution -- is that correct?
Is it safe to say that for server training workloads with large batch sizes, --device_allocator=caching is fully supported and ready for production? Or is there further work to be done in maturing it, or are alternative approaches the "right" solution for such a workload? (I've heard something about a CUDA HAL rewrite and stream-ordered allocations, but I'm not sure if those matter here.)
Hearing things like "can easily lead to out-of-memory situations" scares me.
The caching allocator is not a general purpose solution. You can turn it on by default with a configuration appropriate for your requirements in your own application, but since we run everything from resnet up to LLMs (stateless and stateful, single-model vs. multi-model, and pipelined/multi-tenant workloads) there's no one solution that works for everyone: what is an optimization for some cases will prevent other cases from running at all, etc.
The block-based suballocator is a better general purpose solution and it just needs to be finished. It'll still require hosting applications to tune things based on their usage but is much more resilient to the above concerns such that it could become opt-out instead of opt-in in the default tools. The caching allocator can only be opt-in.
You should try running with #13440 patched in - that changes all of the transient allocations made by IREE into ones using CUDA's memory pool. External allocations made by the user that aren't queue-ordered (iree_hal_device_queue_alloca/dealloca) will still hit our allocator and benefit from the caching/suballocator, though.
RE: out of memory, the code has configuration options with good comments: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/utils/caching_allocator.h (unbounded growth is convenient for benchmarking but can OOM on heavily dynamic programs; bounded growth has a few different knobs that users can tune).
@jpienaar We need an owner for this item - can you take a look? @aaron-schneider
We have it as opt-in for SHARK since we have folks running Stable Diffusion-like models on everything from 6-year-old cards like the RX480 to the latest w9000 / 4090. Happy to run any experiments in our nightly builds if it would help decide on the best defaults.
Is the proposal here to finish the block allocator instead? Or perhaps to enable some simple heuristic form of the caching allocator? (I don't know how much work the block allocator fix needs, or whether the latter is actually "good" versus whether a couple of simple guidelines for folks to set things manually would get us 70% of the way there until then.)
Updates from offline conversation: Related to #13545. We have a potential owner candidate, work to start ~July. Some spot fixes until then.
Request description
I am running an end-to-end workload and I find that running iree-benchmark-module with --device_allocator=caching results in a 15-20% speedup. After subsequent optimizations to the workload, it is likely to exceed a 2x speedup. IREE's default production configuration should support this workload at full performance, via the caching allocator or otherwise. Do we have a plan for how to get those gains?
What component(s) does this issue relate to?
Runtime
Additional context
No response