mratsim opened this issue 7 years ago
@data-man For CPU, Nim already has a good allocator that works pretty well for our use case; it's also lockless because of how Nim threads work. I've run some benchmarks against jemalloc, tcmalloc, mkl_alloc and rpmalloc, and the Nim allocator outperforms all of them.
A specialized CPU allocator for tensors could improve performance slightly, although the overhead of allocation is already pretty low.
For CPU the most gains will probably be by enforcing reuse of memory through a memory pool.
Nim's GC apparently does that, but I have to check when and how to make it happy. In any case, Araq mentioned on IRC that he wants to rework seq/string/ref internals to make them work without the GC, and we might benefit a lot from that.
I've checked the (unreadable) readme of nedmalloc; it seems impressive. Since it's a caching allocator, it can serve as a reference for building a custom one for GPU.
Liveness Analysis (suggested by Scott Gray): http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf
Read "register" as "tensor".
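To make the register-allocation analogy concrete, here is a minimal sketch of the liveness idea applied to tensors. The linear IR, the `Op` type and the `lastUses` proc are invented for illustration (they are not from this repo): each op writes one tensor ID and reads some others, and a tensor's buffer can be recycled into the pool right after the op that uses it last.

```nim
import tables

# Hypothetical linear IR for illustration: each op writes one tensor ID
# and reads some others.
type Op = tuple[output: int, inputs: seq[int]]

proc lastUses(program: seq[Op]): Table[int, int] =
  ## Maps each tensor ID to the index of the op that uses it last.
  ## After that op runs, the tensor's buffer can be recycled into the pool.
  result = initTable[int, int]()
  for i, op in program:
    result[op.output] = i     # the definition counts as a use
    for t in op.inputs:
      result[t] = i           # later uses overwrite earlier ones

when isMainModule:
  # t2 = f(t0, t1); t3 = g(t2); t4 = h(t2, t3)
  let prog: seq[Op] = @[
    (output: 2, inputs: @[0, 1]),
    (output: 3, inputs: @[2]),
    (output: 4, inputs: @[2, 3])
  ]
  let lu = lastUses(prog)
  doAssert lu[0] == 0 and lu[1] == 0  # t0 and t1 die after op 0
  doAssert lu[2] == 2                 # t2 must live until op 2
```

A single backward scan over the program gives the same information; the point is only that "register" liveness maps directly onto "when can this tensor's buffer go back to the pool".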
Another way to approach this: plain `malloc`/`free`.

For a quick and maintainable first implementation, a thread was opened on the Nim forum to RFC this object pool structure:
```nim
import tables, intsets, deques, times

type
  BlockSize = int
  Allocator = proc (size: Natural): pointer {.noconv.}
  Deallocator = proc (p: pointer) {.noconv.}

  CachedBlock = tuple[epoch: Time, address: ByteAddress, size: BlockSize]
    ## Describes a contiguous allocated and available memory chunk in the pool:
    ## - starting timestamp of availability,
    ## - starting address,
    ## - size in bytes

  DecayingObjectPool = object
    freeBlocks: Table[BlockSize, ByteAddress]
    evictionQueue: Deque[CachedBlock]
    lazyReused: IntSet
    allocator: Allocator
    deallocator: Deallocator
    ## Object pool / caching allocator with timed eviction
    ## - Keeps track of available block sizes in the pool.
    ## - Free blocks that are too old expire and are returned to the
    ##   OS/devices via the deallocator.
    ## - Eviction is managed by a queue; however, reallocated objects must be
    ##   removed from the eviction queue as well and can be at arbitrary positions.
    ##   To avoid costly deletion in the middle of the queue, reused objects are
    ##   tracked in lazyReused and will be removed lazily from the queue when
    ##   they expire, but will not trigger the deallocator.
    when defined(debug) or defined(test):
      unusedMem: Natural
      usedMem: Natural
      nbAllocations: Natural
      nbDeallocations: Natural
      cacheHits: Natural
      cacheMisses: Natural

proc initDecayingObjectPool(proc_alloc: Allocator, proc_dealloc: Deallocator): DecayingObjectPool =
  result.allocator = proc_alloc
  result.deallocator = proc_dealloc

when isMainModule:
  let foo = initDecayingObjectPool(alloc0, dealloc)
  echo foo
```
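The RFC only stores the allocator callbacks; the request/release/eviction logic described in the doc comment might look like the following sketch. The proc names (`get`, `release`, `evictExpired`) and the TTL handling are my assumptions, not part of the forum thread, and removing an evicted block's matching `freeBlocks` entry is elided for brevity. It assumes the imports and types of the block above.

```nim
# Sketch of the pool's hot paths, under the assumptions stated above.

proc get(pool: var DecayingObjectPool, size: Natural): pointer =
  var address: ByteAddress
  if pool.freeBlocks.take(size.BlockSize, address):
    # Cache hit: mark the block as reused so the eviction queue can
    # skip it lazily instead of deleting from its middle (O(n)).
    pool.lazyReused.incl(address.int)
    result = cast[pointer](address)
  else:
    # Cache miss: fall back to the real allocator.
    result = pool.allocator(size)

proc release(pool: var DecayingObjectPool, p: pointer, size: Natural) =
  # The block becomes available immediately and enters the timed
  # eviction queue; it is freed for real only once it expires.
  let address = cast[ByteAddress](p)
  pool.freeBlocks.add(size.BlockSize, address)
  pool.evictionQueue.addLast((epoch: getTime(), address: address, size: size.BlockSize))

proc evictExpired(pool: var DecayingObjectPool, ttl: Duration) =
  # The oldest blocks sit at the front of the queue; stop at the first
  # block that has not expired yet.
  let now = getTime()
  while pool.evictionQueue.len > 0 and
        now - pool.evictionQueue.peekFirst.epoch > ttl:
    let blk = pool.evictionQueue.popFirst()
    if blk.address.int in pool.lazyReused:
      # Reused before expiring: lazily drop the stale queue entry,
      # without calling the deallocator.
      pool.lazyReused.excl(blk.address.int)
    else:
      # Truly expired: return it to the OS/device.
      pool.deallocator(cast[pointer](blk.address))
```

Note how every operation touches only the ends of the queue or a single hash/set entry, which is what keeps the whole design O(1) per allocation.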
The pool relies on three data structures:

- A `table` to keep track of available cached blocks, with `BlockSize` as keys and `ByteAddress` as values. Note that a Nim table can store several values for the same key (say, 90MB), and with `take` it returns the values in FIFO order.
- A `deque` (`queue` is deprecated). Each `CachedBlock` has an epoch timestamp; after a threshold, blocks are freed for real and returned to the OS/GPU driver.
- An `intset` (i.e. a set of ints). Blocks reused before their expiration are added there, because deleting them from the middle of the queue at an arbitrary position would be too costly (O(n)).

Expected complexity:

- Table part: `add`, `del`, `take`, all O(1).
- Queue part: `addLast`, `peekFirst` (to check the epoch), `popFirst`, all O(1).
- Intset part: `contains`, `incl`, `excl`. I'm not sure of the complexity, but it should be O(1) as well.

Note: if it becomes a bottleneck, it might be good to evaluate the Table implementation against the TommyDS benchmark, as all hashtable implementations there are in the same performance ballpark (from C++ `unordered_map` to libdynamic and tommy-hashtable):
tommy_hashtable - Fixed size chained hashtable.
tommy_hashdyn - Dynamic chained hashtable.
tommy_hashlin - Linear chained hashtable.
tommy_trie - Trie optimized for cache usage.
tommy_trie_inplace - Trie completely inplace.
rbtree - Red-black tree by Jason Evans.
nedtrie - Binary trie inplace by Niall Douglas.
khash - Dynamic open addressing hashtable by Attractive Chaos.
uthash - Dynamic chaining hashtable by Troy D. Hanson.
judy - Burst trie (JudyL) by Doug Baskins.
judyarray - Burst trie by Karl Malbrain.
googledensehash - Dynamic open addressing hashtable by Craig Silverstein at Google.
googlebtree - Btree by Google.
stxbtree - STX Btree by Timo Bingmann.
c++unordered_map - C++ STL unordered_map<> template.
c++map - C++ STL map<> template.
tesseract - Binary Search Tesseract by Gregorius van den Hoven.
googlelibchash - LibCHash by Craig Silverstein at Google.
libdynamic - Hash set by Fredrik Widlund.
concurrencykit - Non-blocking hash set by Samy Al Bahra.
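As a quick check of the duplicate-key table behaviour the pool relies on (assuming the stdlib's `tables.add`, which permits several entries under one key and was current when this was written, together with `take`):

```nim
import tables

var freeBlocks = initTable[int, string]()
# Two cached blocks of the same size (say, 90 MB) coexist under one key.
freeBlocks.add(90, "block at 0x1000")
freeBlocks.add(90, "block at 0x2000")

var address: string
doAssert freeBlocks.take(90, address)       # one block handed out
doAssert freeBlocks.take(90, address)       # the second one is still cached
doAssert not freeBlocks.take(90, address)   # pool exhausted for this size
```

If `add` is ever removed in favour of single-value tables, `Table[BlockSize, seq[ByteAddress]]` with an explicit seq per size class would be the obvious fallback.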
Another data structure to look at is Colony: it is designed for fast insert/delete with pointer stability, in performance-critical code where order does not matter.
The author claims it's the fastest.
One of the use cases mentioned is game programming, for objects created and deleted in a tight loop, which has even more constraints than deep learning (being slow in a game is a problem).
Memory allocation and release will probably become a bottleneck during forward and backward propagation.
During the forward pass, the pool will hold input tensors in cache. During the backward pass, it will create intermediate gradient tensors that are consumed during gradient descent.
As evidenced by this article, or by this, this, or the https://devtalk.nvidia.com/default/topic/1011686/need-a-better-buffer-management-pool-for-improving-performance/ CUDA forum threads, memory management is often an issue.
Challenges:
Other frameworks:
Overview of the field
Dedicated GPU memory management technique
In other fields