Open pchampio opened 2 years ago
Thanks @pchampio for the detailed description! Frankly, I didn't know that the initial portion was 50% of the available memory. Indeed, this should be made configurable from pkwrap. I can patch that, but will you be able to test it?
During the computation of the cost function, I don't believe a lot of memory is used; it should be roughly equal to the size of the output (or twice that, but I'm not sure). I think the major consumption comes from Kaldi's NG-SGD, since every NaturalGradientAffineTransform layer keeps additional matrices for the gradients at its input and output.
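To get a feel for the scale involved, here is a rough back-of-the-envelope sketch of that per-layer overhead. The layer dimensions and minibatch size below are made up for illustration, and it assumes fp32 storage and one extra matrix each for the input and output gradients per NaturalGradientAffineTransform; the real bookkeeping in Kaldi may differ.

```python
def extra_ngsgd_bytes(batch_frames, in_dim, out_dim, bytes_per_float=4):
    """Rough estimate of the extra memory one NaturalGradientAffineTransform
    layer holds: a grad-input matrix (batch_frames x in_dim) plus a
    grad-output matrix (batch_frames x out_dim), in fp32 by default."""
    return batch_frames * (in_dim + out_dim) * bytes_per_float

# Hypothetical network: 6 layers of width 1536, ~100k frames per minibatch.
total = 6 * extra_ngsgd_bytes(100_000, 1536, 1536)
print(f"{total / 2**30:.1f} GiB")  # roughly 6.9 GiB with these made-up dimensions
```

With dimensions in that ballpark, the gradient copies alone reach several GiB, which is consistent with the gap being dominated by per-layer matrices rather than by the cost-function output.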
> I can patch that, but will you be able to test it?
I will! And thanks for the information about kaldi's NSGD.
Thanks! What I plan to do is the following:
Let me know if you have other suggestions.
Edit: I forgot to mention that after this you won't have to patch Kaldi.
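A minimal sketch of what a configurable entry point on the pkwrap side could look like. Everything here is hypothetical: pkwrap does not expose such an API in the source above, and the function and binding names are invented for illustration. The idea is simply to forward a memory proportion into Kaldi's CuAllocatorOptions before the GPU is selected, instead of patching Kaldi itself.

```python
def select_gpu(memory_proportion=0.05):
    """Hypothetical pkwrap helper: set Kaldi's
    CuAllocatorOptions::memory_proportion before SelectGpuId is called.

    This is a stand-in stub; a real binding would call into Kaldi's C++
    layer here rather than just validating and returning the value.
    """
    if not (0.0 < memory_proportion < 1.0):
        raise ValueError("memory_proportion must be in (0, 1)")
    # A real implementation would do something along these lines
    # (names are assumptions, not actual pkwrap bindings):
    #   kaldi.cudamatrix.allocator_options().memory_proportion = memory_proportion
    #   kaldi.cudamatrix.select_gpu()
    return memory_proportion
```

With something like this, a small proportion (e.g. 0.05) would leave the rest of the GPU free for PyTorch's own allocator.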
I was having this weird issue: on an RTX 2080 Ti, training uses ~11GB of GPU RAM (with my batch size). But when I switched to a Quadro RTX 8000, the exact same experiment used ~27GB of GPU RAM! I looked at the PyTorch-only memory consumption and was quite surprised that the model/forward/backward has a relatively small RAM footprint: less than 5GB, regardless of the hardware. So what are those (11 - 5) 6GB and (27 - 5) 22GB of GPU RAM allocated to? Why do my experiments use more RAM on other hardware? You guessed it from the issue title: Kaldi.
If I understood everything correctly, Kaldi allocates 50% of the available GPU memory at startup because "CUDA malloc and free routines are very slow". https://github.com/kaldi-asr/kaldi/blob/3b8a97d859464912ddc50877f7280048f654c275/src/cudamatrix/cu-allocator.h#L56-L59 In normal circumstances, where training is performed entirely inside the Kaldi toolkit, the pre-allocated memory is used for the model/forward/backward/features... But that is not the case with a wrapper such as pkwrap, where PyTorch handles this aspect. So we are wasting a LOT of GPU RAM that could otherwise be used!
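As a quick sanity check, the 50% pre-allocation roughly reconciles the numbers above, assuming ~11GB total on the 2080 Ti, ~48GB total on the RTX 8000, and ~5GB of PyTorch-side usage (the exact split will differ in practice since Kaldi takes a proportion of *free* memory and the CUDA context has its own overhead):

```python
def expected_usage(total_gb, pytorch_gb=5.0, proportion=0.5):
    """Kaldi grabs `proportion` of the GPU memory at startup;
    PyTorch's own allocations come on top of that."""
    return proportion * total_gb + pytorch_gb

print(expected_usage(11))  # RTX 2080 Ti: ~10.5 GB, close to the observed ~11 GB
print(expected_usage(48))  # Quadro RTX 8000: ~29 GB, same ballpark as the observed ~27 GB
```

The bigger the card, the more memory the fixed 50% proportion wastes, which matches the observation that the same experiment balloons on the RTX 8000.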
For now, I'm simply patching Kaldi:
I've searched around, and only https://github.com/Snowdar/asv-subtools seems to apply the same kind of patch: https://github.com/Snowdar/asv-subtools/blob/2a39e73e8ea1456d8c08bd90b40aab5a2e4d8785/kaldi/patch/runPatch-GPU.sh#L22-L32
Maybe pkwrap can do it internally? https://groups.google.com/g/kaldi-help/c/y-sH76YPwfM
example: https://github.com/kaldi-asr/kaldi/blob/cbed4ff688a172a7f765493d24771c1bd57dcd20/src/nnet3bin/nnet3-train.cc#L56-L58
How much GPU RAM do KaldiChainObjfFunction, DenominatorGraph, and the other Kaldi-related functions require?