[Open] silvasean opened this issue 1 year ago
cc @benvanik @antiagainst
See also the description on https://github.com/openxla/iree/pull/11979.
One snippet from there:
Most usage that would benefit today falls into benchmarks and those are fine with unbounded. Since this style of allocator can easily lead to out-of-memory situations it's off by default and something a user needs to opt into based on their usage patterns and configuration needs. User binding layers can decide how they want to expose allocator configuration; some may want their own defaults and simple enums while others may want to expose deep configuration. This is like clang/gcc not forcing tcmalloc/jemalloc/mimalloc and requiring users to opt-in based on their needs.
Defining one single "default production configuration" for a low level library like IREE will be difficult/impractical. We'll likely want to carefully choose options like which allocator (and settings) to use for each application. Server training with large batch sizes will want a very different configuration than edge device inference on intermittent bursts of data samples.
I vaguely recall hearing that --device_allocator=caching is not really a good final solution -- is that correct?
Is it safe to say that for server training workloads with large batch sizes, --device_allocator=caching is fully supported and ready for production? Or is there further work to be done in maturing it, or are alternative approaches the "right" solution for such a workload? (I've heard something about a CUDA HAL rewrite and stream-ordered allocations, but I'm not sure if those matter here.)
Hearing things like "can easily lead to out-of-memory situations" scares me.
The caching allocator is not a general purpose solution. You can turn it on by default with a configuration appropriate for your requirements in your own application, but since we run everything from resnet up to LLMs (stateless and stateful, single-model vs. multi-model, and pipelined/multi-tenant workloads) there's no one solution that works for everyone: what is an optimization for some cases will prevent other cases from running at all, etc.
The block-based suballocator is a better general purpose solution and it just needs to be finished. It'll still require hosting applications to tune things based on their usage but is much more resilient to the above concerns such that it could become opt-out instead of opt-in in the default tools. The caching allocator can only be opt-in.
You should try running with #13440 patched in - that changes all of the transient allocations made by IREE into ones using CUDA's memory pool. External allocations made by the user that aren't queue-ordered (iree_hal_device_queue_alloca/dealloca) will still hit our allocator and benefit from the caching/suballocator, though.
RE: out of memory, the code has configuration options with good comments: https://github.com/openxla/iree/blob/main/runtime/src/iree/hal/utils/caching_allocator.h (unbounded growth is convenient for benchmarking but can OOM on heavily dynamic programs; bounded growth has a few different knobs that users can tune).
@jpienaar We need an owner for this item - can you take a look? @aaron-schneider
We have it as opt-in for SHARK since we have folks running Stable Diffusion-like models on everything from 6-year-old cards like the RX480 to the latest w9000 / 4090. Happy to run any experiments in our nightly builds if it would help decide on the best defaults.
Is the proposal here to finish the block allocator instead? Or perhaps to enable some simple heuristic form of the caching allocator? (I don't know how much work the block allocator fix needs, or whether the latter is actually "good" versus whether a couple of simple guidelines for folks to set things manually would get us 70% of the way there until then.)
Updates from offline conversation: Related to #13545. We have a potential owner candidate, work to start ~July. Some spot fixes until then.
Request description
I am running an end-to-end workload and I find that running iree-benchmark-module with --device_allocator=caching results in a 15-20% speedup. After subsequent optimizations to the workload, it is likely to exceed a 2x speedup. IREE's default production configuration should support this workload at full performance, via the caching allocator or otherwise. Do we have a plan for how to get those gains?
What component(s) does this issue relate to?
Runtime
Additional context
No response