kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Memory allocator for CUDA #2195

Open danpovey opened 6 years ago

danpovey commented 6 years ago

This is a long-term issue, not something that's very time-sensitive. It's about how the memory manager for CUDA works (class CuAllocator). I believe it's quite inefficient, especially when the training has different chunk sizes (which it often does).

The issue is that the pieces of memory it offers to the user have all been allocated at exactly that size by the CUDA library (cudaMalloc, cudaMallocPitch) -- it doesn't attempt to break them up. This makes it very hard to strike a good balance between caching allocated blocks and releasing them back to CUDA. If the allocator caches too many blocks and we then run out of memory, freeing the unused cached blocks may still not allow CUDA to allocate what it needs, because of fragmentation. So we're certainly not using the memory very efficiently, and this can limit the minibatch sizes. It's hard to fix this via minor changes without substantially affecting speed (cudaMalloc and cudaMallocPitch are very slow).
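
To make the failure mode concrete, here's a minimal sketch (made-up names, not the actual CuAllocator code) of exact-size caching; note that a cached block can only ever satisfy a request for exactly the same size:

```cpp
// Hypothetical sketch of exact-size caching: freed blocks are keyed by
// their cudaMalloc'd size, so a cached block can never serve a request
// for any other size.
#include <cuda_runtime.h>
#include <unordered_map>
#include <vector>

class ExactSizeCache {
 public:
  void *Malloc(size_t bytes) {
    auto it = cache_.find(bytes);
    if (it != cache_.end() && !it->second.empty()) {
      void *p = it->second.back();
      it->second.pop_back();
      return p;                  // reuse a block of exactly this size
    }
    void *p = nullptr;
    cudaMalloc(&p, bytes);       // otherwise fall through to CUDA (slow)
    return p;
  }
  void Free(void *p, size_t bytes) {
    cache_[bytes].push_back(p);  // cached whole; never subdivided
  }
 private:
  std::unordered_map<size_t, std::vector<void*>> cache_;
};
```

With varying chunk sizes, the cache fills up with blocks of sizes that never recur, while requests for new sizes keep hitting cudaMalloc.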

It would probably make more sense to have a much smarter allocator that gets memory from the GPU in huge pieces (or maybe just pre-allocates the bulk of GPU memory), and then when chunks are freed, it merges them again into larger chunks that are stored in some kind of data structure on the CPU. There may even be a solution out there already. Any solution also needs to be sufficiently readable and not so complicated that it's unmaintainable. Bear in mind that speed of the allocator code is an issue; the data structures have to be designed such that the average time taken doing the necessary bookkeeping is not too large.
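
As a very rough sketch of the splitting/coalescing idea (first-fit over an offset-ordered map on the CPU; all names made up, error handling omitted):

```cpp
// Hypothetical sketch: carve sub-blocks out of one big cudaMalloc'd
// region, and merge freed blocks with adjacent free neighbors.
#include <cuda_runtime.h>
#include <iterator>
#include <map>

class RegionAllocator {
 public:
  explicit RegionAllocator(size_t region_bytes) {
    cudaMalloc(&base_, region_bytes);  // one big allocation up front
    free_[0] = region_bytes;           // offset -> length of free run
  }
  void *Malloc(size_t bytes) {
    bytes = (bytes + 255) & ~size_t(255);  // keep 256-byte alignment
    for (auto it = free_.begin(); it != free_.end(); ++it) {
      if (it->second >= bytes) {           // first fit
        size_t off = it->first, len = it->second;
        free_.erase(it);
        if (len > bytes) free_[off + bytes] = len - bytes;  // split rest
        used_[off] = bytes;
        return static_cast<char*>(base_) + off;
      }
    }
    return nullptr;                        // region exhausted
  }
  void Free(void *p) {
    size_t off = static_cast<char*>(p) - static_cast<char*>(base_);
    size_t len = used_[off];               // assumes p was live
    used_.erase(off);
    auto it = free_.emplace(off, len).first;
    auto next = std::next(it);
    if (next != free_.end() && it->first + it->second == next->first) {
      it->second += next->second;          // merge with following free run
      free_.erase(next);
    }
    if (it != free_.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second == it->first) {
        prev->second += it->second;        // merge with preceding free run
        free_.erase(it);
      }
    }
  }
 private:
  void *base_ = nullptr;
  std::map<size_t, size_t> free_;  // free runs, ordered by offset
  std::map<size_t, size_t> used_;  // live allocations: offset -> size
};
```

The first-fit scan here is linear in the number of free runs; a real version would want an additional size-ordered index so that the average bookkeeping time stays small, per the speed concern above.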

There also needs to be an option to stop the allocator from being too smart, and to have it just pass straight through to cudaMalloc and cudaMallocPitch, so that cuda-memcheck will work correctly.
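
Concretely, the escape hatch could be as simple as this (a sketch only; the option name is made up):

```cpp
// With caching disabled, every request becomes a real cudaMalloc, so
// cuda-memcheck sees true allocation boundaries.
#include <cuda_runtime.h>

struct CuAllocatorOptions {
  bool cache_memory = true;  // hypothetical flag: false = pass-through mode
};

void *CuMalloc(const CuAllocatorOptions &opts, size_t bytes) {
  void *p = nullptr;
  if (!opts.cache_memory) {
    cudaMalloc(&p, bytes);   // slow, but memcheck-friendly
    return p;
  }
  // ... otherwise take the caching fast path (as sketched earlier) ...
  cudaMalloc(&p, bytes);     // placeholder standing in for the cached path
  return p;
}
```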

Anyway, this isn't a must-have, it's a nice-to-have, and any improvement in available memory would have to be balanced against speed, complexity and ease of debugging. (We would also need it not to break too badly when multiple processes use the same GPU, or at least to have options that make that case work.) Another thing that makes this less urgent is that I have implemented some changes to the nnet3 and chain codebases that reduce memory use by as much as a factor of 2; these are in a branch that I hope to merge within a couple of weeks.

galv commented 6 years ago

Two existing libraries would probably get us most of the way to reasonable performance: cnmem or CUB's caching allocator.

My justification is that these are the two libraries that other frameworks seem to use (Theano, Caffe2, etc.). While looking this up, I noticed as well that Caffe2 removed cnmem at some point recently, though no real justification was given other than something to the effect of "CUB is better."
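
For reference, CUB's caching allocator has roughly this interface (values here are illustrative; see cub/util_allocator.cuh in your CUB version for the exact defaults):

```cpp
// Rough usage of cub::CachingDeviceAllocator; parameter values are
// illustrative, not recommendations.
#include <cub/util_allocator.cuh>

int main() {
  // bin_growth controls the geometric growth between bin sizes
  // (8 is CUB's default).
  cub::CachingDeviceAllocator allocator(/*bin_growth=*/8);

  void *d_buf = nullptr;
  allocator.DeviceAllocate(&d_buf, 1 << 20);  // 1 MiB, rounded up to a bin
  // ... launch kernels using d_buf ...
  allocator.DeviceFree(d_buf);  // returned to the cache, not cudaFree'd
  return 0;
}
```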

Note that I don't fully understand how Kaldi's existing memory allocator works.

The one problem I see with your approach is that cudaMallocPitch does not have a sister function that reports how large the pitch would be without actually allocating memory (a "dry-run" function; I've looked, because I've wanted this in the past). This makes chunking up a single large block in an API-friendly way hard.

An allocator should not mistrigger cuda-memcheck, by the way.

danpovey commented 6 years ago

Hm. I had a look at CUB's allocator code (I'm not crazy about NVidia's coding guidelines). It seems quite wasteful of memory in the default setup: by default it stores allocations in bins whose sizes grow by factors of 8, and it never subdivides the blocks, so an allocation could end up actually occupying nearly 8 times the memory you requested. You can specify that growth factor, but it's an integer, so the minimum is 2. Kaldi's allocator code errs in the opposite direction: it will only return the exact size you asked for. (Because Kaldi's allocator supports pitch, mixing and matching sizes is harder; but at some point we could change it so that internally it doesn't support pitch, since pitch isn't really that important for speed on modern GPUs anyway.)
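
To spell out the arithmetic (assuming simple round-up-to-the-next-power-of-bin_growth binning, which is my reading of the code):

```cpp
// Illustration of geometric-binning waste under the round-up-to-a-power-
// of-bin_growth assumption stated above.
#include <cstdio>

size_t BinSize(size_t n, size_t b) {
  size_t bin = 1;
  while (bin < n) bin *= b;  // round n up to the next power of b
  return bin;
}

int main() {
  // With bin_growth = 8, a 520-unit request lands in the 4096 bin,
  // i.e. nearly 8x the requested size:
  std::printf("%zu -> %zu\n", size_t(520), BinSize(520, 8));  // 520 -> 4096
  // With bin_growth = 2 (the minimum), worst-case waste is just under 2x:
  std::printf("%zu -> %zu\n", size_t(520), BinSize(520, 2));  // 520 -> 1024
  return 0;
}
```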

Also, it would be hard to configure the CUB code in a way that works well when people run multiple training jobs on the same GPU. (We don't recommend this, but people still do it.) You have to specify the maximum cached memory as an absolute value in bytes, not as a factor like the Kaldi version, and it would be hard to pick a value that is fast both when running one big-memory job and when running many smaller-memory jobs.

Regarding the 'dry-run' thing for cudaMallocPitch -- this is actually impossible given how the CUDA API works; cudaMallocPitch isn't guaranteed to always return the same pitch, as it depends on whether memory is tight. In any case it's never obligatory to use the 'pitch' version; it's a speed thing.
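
To illustrate why: the pitch is an output of the allocation call itself, so there is nothing to query beforehand:

```cpp
// The pitch is filled in by the driver only when the allocation happens;
// no CUDA API call reports it without allocating.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  void *d_mat = nullptr;
  size_t pitch = 0;  // output parameter, set at allocation time
  cudaMallocPitch(&d_mat, &pitch, /*widthBytes=*/1000, /*height=*/500);
  std::printf("row pitch chosen by CUDA: %zu bytes\n", pitch);
  cudaFree(d_mat);
  return 0;
}
```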

My concern about cuda-memcheck is not about mis-triggering it; it's that if we subdivide large blocks, cuda-memcheck would fail to trigger when it should. But this is a small point.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.