idiap / pkwrap

A pytorch wrapper for LF-MMI training and parallel training in Kaldi

Kaldi cuda memory caching #18

Open pchampio opened 2 years ago

pchampio commented 2 years ago

I was running into a weird issue: on an RTX 2080 Ti, training uses ~11GB of GPU RAM (with my batch size). But when I switched to a Quadro RTX 8000, the exact same experiment uses ~27GB of GPU RAM! I looked at the PyTorch-only memory consumption and was quite surprised to find that the model/forward/backward passes have a relatively small footprint: less than 5GB, regardless of the hardware. So what are the remaining (11 - 5) 6GB and (27 - 5) 22GB of GPU RAM allocated to??? Why does my experiment use more RAM on different hardware? You guessed it from the issue title: Kaldi.

If I understood everything correctly, Kaldi allocates 50% of the available GPU memory at startup because "CUDA malloc and free routines are very slow". https://github.com/kaldi-asr/kaldi/blob/3b8a97d859464912ddc50877f7280048f654c275/src/cudamatrix/cu-allocator.h#L56-L59 In normal circumstances, where training is performed inside the Kaldi toolkit, this pre-allocated memory is used for the model/forward/backward/features... But not in the case of a wrapper such as pkwrap, where PyTorch handles that side. So we are wasting a LOT of GPU RAM that could be used!

For now, I'm simply patching kaldi:

diff --git a/src/cudamatrix/cu-allocator.h b/src/cudamatrix/cu-allocator.h
index d7d65da80..d7087ff44 100644
--- a/src/cudamatrix/cu-allocator.h
+++ b/src/cudamatrix/cu-allocator.h
@@ -65,7 +65,7 @@ struct CuAllocatorOptions {
   int32 num_subregions;

   CuAllocatorOptions():
-      cache_memory(true), memory_proportion(0.5), num_subregions(20) { }
+      cache_memory(true), memory_proportion(0.2), num_subregions(20) { }

   void Register(OptionsItf *po) {
     po->Register("cuda-cache-memory", &cache_memory, "True if you want "

I've searched around, and only https://github.com/Snowdar/asv-subtools seems to apply the same kind of patch: https://github.com/Snowdar/asv-subtools/blob/2a39e73e8ea1456d8c08bd90b40aab5a2e4d8785/kaldi/patch/runPatch-GPU.sh#L22-L32

Maybe pkwrap can do it internally? https://groups.google.com/g/kaldi-help/c/y-sH76YPwfM

It's the --cuda-memory-proportion option, but it's not exposed unless the binary calls RegisterCuAllocatorOptions() before doing po.Read(argc, argv);

example: https://github.com/kaldi-asr/kaldi/blob/cbed4ff688a172a7f765493d24771c1bd57dcd20/src/nnet3bin/nnet3-train.cc#L56-L58

How much GPU RAM do KaldiChainObjfFunction, DenominatorGraph, and the other Kaldi-related functions require?

mrsrikanth commented 2 years ago

Thanks @pchampio for the detailed description! Frankly, I didn't know that the default proportion was 50% of the available memory. Indeed, this should be made configurable from pkwrap. I can patch that, but will you be able to test it?

During the computation of the cost function, I don't believe a lot of memory is being used. It should really be equal to the size of the output (or twice that, but I'm not sure). I think the major consumption comes from the use of Kaldi's NSGD, since for every NaturalAffineTransform layer used, there are additional matrices for the gradient input and output at that layer.

pchampio commented 2 years ago

I can patch that, but will you be able to test it?

I will! And thanks for the information about kaldi's NSGD.

mrsrikanth commented 2 years ago

Thanks! What I plan to do is the following:

Let me know if you have other suggestions.

Edit: I forgot to mention that after this you won't have to patch Kaldi.