mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

CUDA - Custom memory management to avoid expensive alloc/dealloc #112

Open mratsim opened 7 years ago

mratsim commented 7 years ago

Memory allocation and release will probably become a bottleneck during forward and backward propagation.

During the forward pass, the network holds input tensors in cache. During the backward pass, it creates intermediate gradient tensors that are consumed during gradient descent.

As evidenced by several articles and CUDA forum threads such as https://devtalk.nvidia.com/default/topic/1011686/need-a-better-buffer-management-pool-for-improving-performance/, memory management is often an issue.

Challenges:

Other frameworks:

Overview of the field

Dedicated GPU memory management techniques

In other fields

data-man commented 7 years ago

CPU implementation in C

jemalloc (repo), nedmalloc

edubart commented 7 years ago

@data-man For CPU, Nim already has a good allocator that works pretty well for our use case; it's also lockless because of how Nim threads work. I've done some benchmarks against jemalloc, tcmalloc, mkl_alloc and rpmalloc, and the Nim allocator outperforms all of them.

A specialized CPU allocator for tensors could improve performance slightly, but the overhead of allocation is already pretty low.

mratsim commented 7 years ago

For CPU, the biggest gains will probably come from enforcing memory reuse through a memory pool.

The Nim GC apparently does that, but I have to check when and how to make it happy. In any case, Araq mentioned on IRC that he wants to rework the seq/string/ref internals to work without the GC, and we might benefit a lot from that.

I've checked the (unreadable) readme of nedmalloc; it seems impressive. Since it's a caching allocator, it can serve as a reference for building a custom one for the GPU.
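
To make the reuse idea concrete, here is a minimal sketch (illustrative names, not Arraymancer code) of a size-bucketed pool: freed blocks are parked in per-size free lists and handed back on the next request of the same size, so steady-state iterations stop hitting the allocator entirely.

import tables

type
  NaivePool = object
    freeLists: Table[int, seq[pointer]]  # block size -> stack of freed blocks

proc request(pool: var NaivePool, size: int): pointer =
  ## Reuse a cached block of exactly `size` bytes if one is available,
  ## otherwise fall back to a real allocation.
  if size in pool.freeLists and pool.freeLists[size].len > 0:
    result = pool.freeLists[size].pop()  # cache hit: no alloc call
  else:
    result = alloc(size)                 # cache miss

proc giveBack(pool: var NaivePool, p: pointer, size: int) =
  ## Park the block in the pool instead of deallocating it.
  pool.freeLists.mgetOrPut(size, @[]).add p

when isMainModule:
  var pool: NaivePool                # Table fields are default-initialized since Nim 0.20
  let a = pool.request(1024)
  pool.giveBack(a, 1024)
  doAssert pool.request(1024) == a   # the same block is recycled

The decaying object pool RFC further down builds on this mechanism and adds timed eviction so that idle memory is eventually returned.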

mratsim commented 7 years ago

Liveness Analysis (suggested by Scott Gray): http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf

Read "register" as Tensor

mratsim commented 7 years ago

 Another caching allocator implementation by Nvidia

Another way to approach this:

For a quick and maintainable first implementation:

mratsim commented 6 years ago

Thread opened on Nim forum to RFC this object pool structure.

import  tables,
        intsets,
        deques,
        times

type
  BlockSize = int
  Allocator = proc (size: Natural): pointer {.noconv.}
  Deallocator = proc (p: pointer) {.noconv.}

  CachedBlock = tuple[epoch: Time, address: ByteAddress, size: BlockSize]
    ## Describes a contiguous, allocated and available memory chunk in the pool:
    ##   - starting timestamp of availability,
    ##   - starting address,
    ##   - size in bytes

  DecayingObjectPool = object
    ## Object pool / caching allocator with timed eviction
    ##   - Keeps track of the available block sizes in the pool.
    ##   - Free blocks that are too old expire and are returned to the OS/device via the deallocator.
    ##   - Eviction is managed by a queue; however, reallocated objects must
    ##     be removed from the eviction queue as well and can be at arbitrary positions.
    ##     To avoid costly deletion in the middle of the queue, reused objects are tracked
    ##     in lazyReused and will be removed lazily from the queue when they expire, but will not
    ##     trigger the deallocator.
    freeBlocks: Table[BlockSize, ByteAddress]
    evictionQueue: Deque[CachedBlock]
    lazyReused: IntSet
    allocator: Allocator
    deallocator: Deallocator
    when defined(debug) or defined(test):
      unusedMem: Natural
      usedMem: Natural
      nbAllocations: Natural
      nbDeallocations: Natural
      cacheHits: Natural
      cacheMisses: Natural

proc initDecayingObjectPool(proc_alloc: Allocator, proc_dealloc: Deallocator): DecayingObjectPool =
  result.freeBlocks = initTable[BlockSize, ByteAddress]()
  result.evictionQueue = initDeque[CachedBlock]()
  result.lazyReused = initIntSet()
  result.allocator = proc_alloc
  result.deallocator = proc_dealloc

when isMainModule:
  let foo = initDecayingObjectPool(alloc0, dealloc)
  echo "cached block sizes: ", foo.freeBlocks.len
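
For discussion, here is a hedged sketch of how the acquire/release path could look on top of this type (it assumes the imports and definitions from the snippet above; get, release, evictExpired and maxAge are placeholder names, not a decided API).

proc get(pool: var DecayingObjectPool, size: Natural): pointer =
  ## Serve `size` bytes from the pool cache if possible,
  ## otherwise fall back to the real allocator.
  if size in pool.freeBlocks:
    let address = pool.freeBlocks[size]
    pool.freeBlocks.del(size)
    # Flag the block as reused: its queue entry will be dropped lazily.
    pool.lazyReused.incl(address)
    result = cast[pointer](address)
  else:
    result = pool.allocator(size)

proc release(pool: var DecayingObjectPool, p: pointer, size: Natural) =
  ## Return a block to the pool and schedule it for timed eviction.
  let address = cast[ByteAddress](p)
  pool.freeBlocks[size] = address
  pool.evictionQueue.addLast((epoch: getTime(), address: address, size: BlockSize(size)))

proc evictExpired(pool: var DecayingObjectPool, maxAge: Duration) =
  ## Deallocate blocks that sat unused for longer than maxAge.
  ## Entries flagged in lazyReused were reallocated in the meantime:
  ## they leave the queue without triggering the deallocator.
  let now = getTime()
  while pool.evictionQueue.len > 0 and
        now - pool.evictionQueue.peekFirst.epoch > maxAge:
    let blk = pool.evictionQueue.popFirst()
    if blk.address in pool.lazyReused:
      pool.lazyReused.excl(blk.address)
    else:
      if pool.freeBlocks.getOrDefault(blk.size) == blk.address:
        pool.freeBlocks.del(blk.size)
      pool.deallocator(cast[pointer](blk.address))

One known limitation of this sketch: the Table[BlockSize, ByteAddress] layout caches at most one free block per size; a Table[BlockSize, seq[ByteAddress]] would lift that.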

Description

Complexity

mratsim commented 6 years ago

Note: if this becomes a bottleneck, it might be worth evaluating Nim's Table implementation against the TommyDS benchmark, since all the hashtable implementations there are in the same performance ballpark (from C++ unordered_map to libdynamic and tommy-hashtable). See the sketch after the list below for a first measurement.

tommy_hashtable - Fixed size chained hashtable.
tommy_hashdyn - Dynamic chained hashtable.
tommy_hashlin - Linear chained hashtable.
tommy_trie - Trie optimized for cache usage.
tommy_trie_inplace - Trie completely inplace.
rbtree - Red-black tree by Jason Evans.
nedtrie - Binary trie inplace by Niall Douglas.
khash - Dynamic open addressing hashtable by Attractive Chaos.
uthash - Dynamic chaining hashtable by Troy D. Hanson.
judy - Burst trie (JudyL) by Doug Baskins.
judyarray - Burst trie by Karl Malbrain.
googledensehash - Dynamic open addressing hashtable by Craig Silverstein at Google.
googlebtree - Btree by Google.
stxbtree - STX Btree by Timo Bingmann.
c++unordered_map - C++ STL unordered_map<> template.
c++map - C++ STL map<> template.
tesseract - Binary Search Tesseract by Gregorius van den Hoven.
googlelibchash - LibCHash by Craig Silverstein at Google.
libdynamic - Hash set by Fredrik Widlund.
concurrencykit - Non-blocking hash set by Samy Al Bahra.

[Benchmark result charts from the TommyDS page]
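
A throwaway harness along these lines (purely illustrative: integer keys, a mix of inserts and lookups, epochTime for timing) could first measure the stdlib Table under our actual access pattern; a real comparison would then drive the TommyDS structures through their C APIs.

import tables, times, random

proc benchTable(n: int): float =
  ## Time n random inserts followed by n random lookups.
  var t = initTable[int, int]()
  var r = initRand(42)              # fixed seed for reproducibility
  let start = epochTime()
  for i in 0 ..< n:
    t[r.rand(n)] = i                # insert / overwrite
  var acc = 0
  for i in 0 ..< n:
    acc += t.getOrDefault(r.rand(n))
  result = epochTime() - start
  doAssert acc >= 0                 # keep the lookup loop from being optimized out

when isMainModule:
  echo "2M mixed ops: ", benchTable(1_000_000), " s"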

mratsim commented 6 years ago

Another data structure to look at is Colony. It is designed for fast insert/delete with pointer stability in performance-critical code where order does not matter.

The author claims it's the fastest.

One of the use cases mentioned is game programming, where objects are created and deleted in a tight loop; that has even more constraints than deep learning (being slow in a game is a problem).
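
Since plf::colony itself is C++, here is a toy Nim analogue of the mechanism for flavor: a growable buffer plus a free list of erased slots, giving O(1) insert/erase and stable integer handles (the real colony additionally guarantees pointer stability through chained memory blocks).

type
  Slot[T] = object
    value: T
    occupied: bool

  ColonyLike[T] = object
    slots: seq[Slot[T]]
    freeList: seq[int]   # indices of erased slots, recycled on insert

proc insert[T](c: var ColonyLike[T], v: T): int =
  ## O(1) insert; returns a stable handle (slot index).
  if c.freeList.len > 0:
    result = c.freeList.pop()
    c.slots[result] = Slot[T](value: v, occupied: true)
  else:
    c.slots.add Slot[T](value: v, occupied: true)
    result = c.slots.high

proc erase[T](c: var ColonyLike[T], handle: int) =
  ## O(1) erase; the slot is reused by a later insert.
  c.slots[handle].occupied = false
  c.freeList.add handle

iterator items[T](c: ColonyLike[T]): T =
  ## Iteration is in storage order, not insertion order.
  for s in c.slots:
    if s.occupied: yield s.value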