marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Excessive GPU memory usage and model file size #859


nicolabertoldi commented 3 years ago

Bug description


Using Marian I observed a couple of odd behaviours:

- GPU memory usage during training is much higher than the value given with --workspace.
- The saved model and optimizer checkpoint files are very large (optimizer.npz is several times the size of the model file).

So I would like to get some clarification about them.

How to reproduce

```
marian --task transformer-big --devices 0 --layer-normalization \
  --model model_path --train-sets train_path.sl train_path.tl \
  --max-length 4096 --vocabs vocab_path vocab_path \
  --mini-batch-fit --workspace 4000 --mini-batch 1000 --maxi-batch 1000 \
  --save-freq 5000 --beam-size 4 --normalize 1 --keep-best --early-stopping 5 \
  --cost-type ce-mean-words --enc-depth 6 --dec-depth 6 --tied-embeddings-all \
  --label-smoothing 0.1 --learn-rate 1 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 \
  --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --sync-sgd --seed 0 \
  --no-restore-corpus --shuffle none --transformer-dropout 0 --exponential-smoothing 0
```


emjotde commented 3 years ago

Hi, this is all expected. --workspace does not determine the total memory usage (which is actually a bit hard to predict, since everything in Marian gets allocated lazily), but only the memory that the forward/backward pass can use for the activations and their gradients.

Model parameters, gradients and optimizer state are added on top of that. So in your case, for an 835 MB model, that will be about 4 × 835 MB ≈ 3.3 GB (model, gradients, first Adam moments, second Adam moments). That also roughly explains what is inside optimizer.npz: it holds the master parameters and the Adam moments, and with exponential smoothing enabled it also contains the unsmoothed model, so it ends up between 3 and 4 times the model size.
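As a rough sanity check, these numbers can be added up directly. The sketch below is only a back-of-envelope estimate under the assumptions stated in this thread (an 835 MB model, a 4000 MB workspace, plain Adam), not an exact accounting:

```python
# Back-of-envelope GPU memory estimate for Adam training, following the
# breakdown above: workspace for activations/gradients plus roughly 4x the
# model size for parameters, gradients and the two Adam moment tensors.
# Figures are illustrative; actual usage also includes lazily allocated buffers.
workspace_mb = 4000          # --workspace 4000 from the command above
model_mb = 835               # model size mentioned in this thread
copies = 4                   # params + gradients + 1st + 2nd Adam moments

training_state_mb = copies * model_mb
total_estimate_mb = workspace_mb + training_state_mb
print(f"training state: ~{training_state_mb} MB")   # ~3340 MB
print(f"rough total:    ~{total_estimate_mb} MB")    # ~7340 MB plus overhead
```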

So, with --mini-batch-fit the workspace limit is respected, but the other contributions, such as the model size and the optimizer state, depend on the chosen architecture and the number of GPUs used. Due to sharding, the memory used by the optimizer actually goes down as more GPUs are used.
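A minimal sketch of that sharding effect; the even split across devices and the 3x factor for the optimizer state are illustrative assumptions, not Marian's exact bookkeeping:

```python
# Illustration of the sharding effect described above: the optimizer state is
# split across devices, so each GPU only holds 1/N of it, while the model
# parameters themselves are replicated on every GPU. The even split and the
# 3x factor for the optimizer state are assumptions for illustration only.
model_mb = 835
optimizer_state_mb = 3 * model_mb   # roughly master params + two Adam moments

for num_gpus in (1, 2, 4, 8):
    shard_mb = optimizer_state_mb / num_gpus
    print(f"{num_gpus} GPU(s): ~{shard_mb:.0f} MB of optimizer state per device")
```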

nicolabertoldi commented 3 years ago

Thanks for the clear explanation.

A few more questions:

emjotde commented 3 years ago

1. Yes, that should be the case.
2. Yes, when training on a single GPU. On multiple GPUs, sharding reduces the GPU memory used compared to what is stored in the checkpoint.
3. Not that I can think of.
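As a follow-up to point 2, the checkpoint-side sizes can be checked directly, since Marian checkpoints are regular NumPy .npz archives. A minimal sketch with assumed file names:

```python
# Inspect what the saved checkpoints actually contain and how large they are.
# The file names below are assumptions (adjust to the actual --model path);
# Marian checkpoints are ordinary NumPy .npz archives, so numpy can read them.
import numpy as np

for path in ("model.npz", "model.npz.optimizer.npz"):
    archive = np.load(path)
    total_mb = sum(archive[name].nbytes for name in archive.files) / 2**20
    print(f"{path}: {len(archive.files)} tensors, ~{total_mb:.0f} MiB")
```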