mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

CUDA - Custom memory management to avoid expensive alloc/dealloc #112

Open mratsim opened 7 years ago

mratsim commented 7 years ago

Memory allocation and release will probably become a bottleneck during forward and backward propagation.

During the forward pass, the network holds input tensors in cache. During the backward pass, it creates intermediate gradient tensors that are consumed during gradient descent.

As evidenced by several articles and CUDA forum threads such as https://devtalk.nvidia.com/default/topic/1011686/need-a-better-buffer-management-pool-for-improving-performance/, memory management is often an issue.

Challenges:

Other frameworks:

Overview of the field

Dedicated GPU memory management techniques

In other fields

data-man commented 7 years ago

CPU implementation in C

jemalloc (repo), nedmalloc

edubart commented 7 years ago

@data-man For CPU, Nim already has a good allocator that works pretty well for our use case; it's also lockless because of how Nim threads work. I've done some benchmarks against jemalloc, tcmalloc, mkl_alloc and rpmalloc, and the Nim allocator outperforms all of them.

A specialized CPU allocator for tensors could improve performance slightly, but the overhead of allocation is already pretty low.

mratsim commented 7 years ago

For CPU, the biggest gains will probably come from enforcing memory reuse through a memory pool.

The Nim GC apparently does that, but I have to check when and how to make it happy. In any case, Araq mentioned on IRC that he wants to rework the seq/string/ref internals to work without the GC, and we might benefit a lot from that.

I've checked the (unreadable) readme of nedmalloc; it seems impressive. Since it's a caching allocator, it can serve as a reference for building a custom one for the GPU.
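
To make the reuse idea concrete, here is a minimal sketch (illustrative names, not Arraymancer code) of a size-bucketed pool: freed blocks are parked in per-size free lists and handed back on the next request of the same size, so steady-state iterations stop hitting the allocator entirely.

import tables

type
  NaivePool = object
    freeLists: Table[int, seq[pointer]]  # block size -> stack of freed blocks

proc request(pool: var NaivePool, size: int): pointer =
  ## Reuse a cached block of exactly `size` bytes if one is available,
  ## otherwise fall back to a real allocation.
  if size in pool.freeLists and pool.freeLists[size].len > 0:
    result = pool.freeLists[size].pop()  # cache hit: no alloc call
  else:
    result = alloc(size)                 # cache miss

proc giveBack(pool: var NaivePool, p: pointer, size: int) =
  ## Park the block in the pool instead of deallocating it.
  pool.freeLists.mgetOrPut(size, @[]).add p

when isMainModule:
  var pool: NaivePool                # Table fields are default-initialized since Nim 0.20
  let a = pool.request(1024)
  pool.giveBack(a, 1024)
  doAssert pool.request(1024) == a   # the same block is recycled

The decaying object pool RFC further down builds on this mechanism and adds timed eviction so that idle memory is eventually returned.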

mratsim commented 7 years ago

Liveness Analysis (suggested by Scott Gray): http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf

Read "register" as Tensor

mratsim commented 7 years ago

 Another caching allocator implementation by Nvidia

Another way to approach this:

For a quick and maintainable first implementation:

mratsim commented 6 years ago

Thread opened on Nim forum to RFC this object pool structure.

import  tables,
        intsets,
        deques,
        times

type
  BlockSize = int
  Allocator = proc (size: Natural): pointer {.noconv.}
  Deallocator = proc (p: pointer) {.noconv.}

  CachedBlock = tuple[epoch: Time, address: ByteAddress, size: BlockSize]
    ## Describes a contiguous, allocated and available memory chunk in the pool:
    ##   - starting timestamp of availability,
    ##   - starting address,
    ##   - size in bytes

  DecayingObjectPool = object
    ## Object pool / caching allocator with timed eviction
    ##   - Keeps track of the available block sizes in the pool.
    ##   - Free blocks that are too old expire and are returned to the OS/device via the deallocator.
    ##   - Eviction is managed by a queue; however, reallocated objects must
    ##     be removed from the eviction queue as well and can be at arbitrary positions.
    ##     To avoid costly deletion in the middle of the queue, reused objects are tracked
    ##     in lazyReused and will be removed lazily from the queue when they expire, but will not
    ##     trigger the deallocator.
    freeBlocks: Table[BlockSize, ByteAddress]
    evictionQueue: Deque[CachedBlock]
    lazyReused: IntSet
    allocator: Allocator
    deallocator: Deallocator
    when defined(debug) or defined(test):
      unusedMem: Natural
      usedMem: Natural
      nbAllocations: Natural
      nbDeallocations: Natural
      cacheHits: Natural
      cacheMisses: Natural

proc initDecayingObjectPool(proc_alloc: Allocator, proc_dealloc: Deallocator): DecayingObjectPool =
  result.freeBlocks = initTable[BlockSize, ByteAddress]()
  result.evictionQueue = initDeque[CachedBlock]()
  result.lazyReused = initIntSet()
  result.allocator = proc_alloc
  result.deallocator = proc_dealloc

when isMainModule:
  let foo = initDecayingObjectPool(alloc0, dealloc)
  echo "cached block sizes: ", foo.freeBlocks.len
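
For discussion, here is a hedged sketch of how the acquire/release path could look on top of this type (it assumes the imports and definitions from the snippet above; get, release, evictExpired and maxAge are placeholder names, not a decided API).

proc get(pool: var DecayingObjectPool, size: Natural): pointer =
  ## Serve `size` bytes from the pool cache if possible,
  ## otherwise fall back to the real allocator.
  if size in pool.freeBlocks:
    let address = pool.freeBlocks[size]
    pool.freeBlocks.del(size)
    # Flag the block as reused: its queue entry will be dropped lazily.
    pool.lazyReused.incl(address)
    result = cast[pointer](address)
  else:
    result = pool.allocator(size)

proc release(pool: var DecayingObjectPool, p: pointer, size: Natural) =
  ## Return a block to the pool and schedule it for timed eviction.
  let address = cast[ByteAddress](p)
  pool.freeBlocks[size] = address
  pool.evictionQueue.addLast((epoch: getTime(), address: address, size: BlockSize(size)))

proc evictExpired(pool: var DecayingObjectPool, maxAge: Duration) =
  ## Deallocate blocks that sat unused for longer than maxAge.
  ## Entries flagged in lazyReused were reallocated in the meantime:
  ## they leave the queue without triggering the deallocator.
  let now = getTime()
  while pool.evictionQueue.len > 0 and
        now - pool.evictionQueue.peekFirst.epoch > maxAge:
    let blk = pool.evictionQueue.popFirst()
    if blk.address in pool.lazyReused:
      pool.lazyReused.excl(blk.address)
    else:
      if pool.freeBlocks.getOrDefault(blk.size) == blk.address:
        pool.freeBlocks.del(blk.size)
      pool.deallocator(cast[pointer](blk.address))

One known limitation of this sketch: the Table[BlockSize, ByteAddress] layout caches at most one free block per size; a Table[BlockSize, seq[ByteAddress]] would lift that.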

Description

Complexity

mratsim commented 6 years ago

Note: if this becomes a bottleneck, it might be worth evaluating Nim's Table implementation against the TommyDS benchmark, since all the hashtable implementations there are in the same performance ballpark (from C++ unordered_map to libdynamic and tommy-hashtable). See the sketch after the list below for a first measurement.

tommy_hashtable - Fixed size chained hashtable.
tommy_hashdyn - Dynamic chained hashtable.
tommy_hashlin - Linear chained hashtable.
tommy_trie - Trie optimized for cache usage.
tommy_trie_inplace - Trie completely inplace.
rbtree - Red-black tree by Jason Evans.
nedtrie - Binary trie inplace by Niall Douglas.
khash - Dynamic open addressing hashtable by Attractive Chaos.
uthash - Dynamic chaining hashtable by Troy D. Hanson.
judy - Burst trie (JudyL) by Doug Baskins.
judyarray - Burst trie by Karl Malbrain.
googledensehash - Dynamic open addressing hashtable by Craig Silverstein at Google.
googlebtree - Btree by Google.
stxbtree - STX Btree by Timo Bingmann.
c++unordered_map - C++ STL unordered_map<> template.
c++map - C++ STL map<> template.
tesseract - Binary Search Tesseract by Gregorius van den Hoven.
googlelibchash - LibCHash by Craig Silverstein at Google.
libdynamic - Hash set by Fredrik Widlund.
concurrencykit - Non-blocking hash set by Samy Al Bahra.

[Benchmark result charts from the TommyDS page]
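
A throwaway harness along these lines (purely illustrative: integer keys, a mix of inserts and lookups, epochTime for timing) could first measure the stdlib Table under our actual access pattern; a real comparison would then drive the TommyDS structures through their C APIs.

import tables, times, random

proc benchTable(n: int): float =
  ## Time n random inserts followed by n random lookups.
  var t = initTable[int, int]()
  var r = initRand(42)              # fixed seed for reproducibility
  let start = epochTime()
  for i in 0 ..< n:
    t[r.rand(n)] = i                # insert / overwrite
  var acc = 0
  for i in 0 ..< n:
    acc += t.getOrDefault(r.rand(n))
  result = epochTime() - start
  doAssert acc >= 0                 # keep the lookup loop from being optimized out

when isMainModule:
  echo "2M mixed ops: ", benchTable(1_000_000), " s"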

mratsim commented 6 years ago

Another data structure to look at is Colony. It is designed for fast insert/delete with pointer stability in performance-critical code where order does not matter.

The author claims it's the fastest.

One of the use cases mentioned is game programming, where objects are created and deleted in a tight loop; that has even more constraints than deep learning (being slow in a game is a problem).
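
Since plf::colony itself is C++, here is a toy Nim analogue of the mechanism for flavor: a growable buffer plus a free list of erased slots, giving O(1) insert/erase and stable integer handles (the real colony additionally guarantees pointer stability through chained memory blocks).

type
  Slot[T] = object
    value: T
    occupied: bool

  ColonyLike[T] = object
    slots: seq[Slot[T]]
    freeList: seq[int]   # indices of erased slots, recycled on insert

proc insert[T](c: var ColonyLike[T], v: T): int =
  ## O(1) insert; returns a stable handle (slot index).
  if c.freeList.len > 0:
    result = c.freeList.pop()
    c.slots[result] = Slot[T](value: v, occupied: true)
  else:
    c.slots.add Slot[T](value: v, occupied: true)
    result = c.slots.high

proc erase[T](c: var ColonyLike[T], handle: int) =
  ## O(1) erase; the slot is reused by a later insert.
  c.slots[handle].occupied = false
  c.freeList.add handle

iterator items[T](c: ColonyLike[T]): T =
  ## Iteration is in storage order, not insertion order.
  for s in c.slots:
    if s.occupied: yield s.value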