[Open] nicolabertoldi opened this issue 3 years ago
Hi,
This is all expected. `--workspace` does not determine the total memory usage (that is actually a bit hard to predict, since everything in Marian gets allocated lazily), but the memory that the forward/backward process can use for the activations and their gradients.
Model parameters, gradients and optimizer parameters will be added on top of that. So in your case, for an 835 MB model, that will be 4x835 MB (model, gradients, first Adam moments, second Adam moments). That also roughly explains what's inside the `optimizer.npz`: it holds the master parameters and the Adam moments. With exponential smoothing enabled it will also contain the unsmoothed model, so between 3 and 4 times the model size.
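To put numbers on that, here is my own back-of-the-envelope sketch of the sizes described above (not anything Marian reports itself):

```python
# Rough estimate for the 835 MB model from this issue (sketch only).
model_mb = 835

# On-GPU training state on top of --workspace:
# parameters + gradients + first and second Adam moments.
training_state_mb = 4 * model_mb
print(f"params + grads + Adam moments: ~{training_state_mb} MB")      # ~3340 MB

# model.npz.optimizer.npz: master parameters + both Adam moments, plus the
# unsmoothed model when exponential smoothing is on, i.e. roughly 3-4x model size.
print(f"optimizer checkpoint: ~{3 * model_mb} to {4 * model_mb} MB")  # 2505-3340 MB
```

Which is in the same ballpark as the 3.3 GB `model.npz.optimizer.npz` reported in this issue.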
So, with `--mini-batch-fit` the workspace is being respected, but other things like model size and optimizer parameters depend on the chosen architecture and the number of GPUs used. Due to sharding, the space used by the optimizer actually goes down with more GPUs.
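As an illustration of the sharding effect, here is a simplified sketch that assumes the optimizer state (master parameters plus the two Adam moments) is split evenly across GPUs; the exact split inside Marian may differ:

```python
def optimizer_state_per_gpu(model_mb: float, num_gpus: int) -> float:
    """Optimizer state (master params + two Adam moments), split evenly across GPUs.

    Simplification for illustration only.
    """
    return 3 * model_mb / num_gpus

for gpus in (1, 2, 4):
    print(f"{gpus} GPU(s): ~{optimizer_state_per_gpu(835, gpus):.0f} MB of optimizer state per GPU")
```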
Thanks for the clear explanation.
A few more questions:
1) Does the runtime GPU RAM usage correspond to `--workspace` plus the memory for model + gradients + moments?
2) Does the size of `model.npz.optimizer.npz` correspond to the runtime GPU RAM usage for model + gradients + moments?
3) Is there anything else that takes up significant GPU memory?

1) Yes, that should be the case.
2) Yes, when training on a single GPU. On multiple GPUs, sharding reduces GPU memory compared to the checkpoint.
3) Not that I can think of.
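So for the concrete numbers in this issue, a rough single-GPU estimate would look like the sketch below; it is only a lower bound, since anything else Marian allocates lazily comes on top of it:

```python
# Rough single-GPU lower bound from the numbers in this issue (sketch only).
workspace_mb = 4000          # --workspace value used for training
model_mb = 835               # size of model.npz
state_mb = 4 * model_mb      # model + gradients + two Adam moments
print(f"~{workspace_mb + state_mb} MB plus whatever else gets allocated lazily")  # 7340 MB
```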
Bug description
Using Marian I observed a couple of odd behaviours:

- setting `--workspace` to 4000, the GPU RAM used is much higher, often above 10 GB
- `model.npz` is about 836 MB, while the file `model.npz.optimizer.npz` is 4 times bigger (3.3 GB)

So I would like to get some clarification:
- Why is `--workspace` not respected?
- What is inside `model.npz.optimizer.npz`?

How to reproduce
Context