marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Excessive GPU memory usage and model file size #859


nicolabertoldi commented 3 years ago

Bug description


Using Marian I observed a couple of odd behaviours:

- GPU memory usage during training is much higher than the value given with --workspace.
- The saved model and optimizer checkpoint files are very large (optimizer.npz is several times the size of the model file).

So I would like to get some clarification about them.

How to reproduce

```
marian --task transformer-big --devices 0 --layer-normalization \
  --model model_path --train-sets train_path.sl train_path.tl \
  --max-length 4096 --vocabs vocab_path vocab_path \
  --mini-batch-fit --workspace 4000 --mini-batch 1000 --maxi-batch 1000 \
  --save-freq 5000 --beam-size 4 --normalize 1 --keep-best --early-stopping 5 \
  --cost-type ce-mean-words --enc-depth 6 --dec-depth 6 --tied-embeddings-all \
  --label-smoothing 0.1 --learn-rate 1 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 \
  --lr-report --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --sync-sgd --seed 0 \
  --no-restore-corpus --shuffle none --transformer-dropout 0 --exponential-smoothing 0
```


emjotde commented 3 years ago

Hi, this is all expected. --workspace does not determine the total memory usage (which is actually a bit hard to predict, since everything in Marian gets allocated lazily), but only the memory that the forward/backward pass can use for the activations and their gradients.

Model parameters, gradients and optimizer state are added on top of that. So in your case, for an 835 MB model, that will be about 4 × 835 MB ≈ 3.3 GB (model, gradients, first Adam moments, second Adam moments). That also roughly explains what is inside optimizer.npz: it holds the master parameters and the Adam moments, and with exponential smoothing enabled it also contains the unsmoothed model, so it ends up between 3 and 4 times the model size.
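As a rough sanity check, these numbers can be added up directly. The sketch below is only a back-of-envelope estimate under the assumptions stated in this thread (an 835 MB model, a 4000 MB workspace, plain Adam), not an exact accounting:

```python
# Back-of-envelope GPU memory estimate for Adam training, following the
# breakdown above: workspace for activations/gradients plus roughly 4x the
# model size for parameters, gradients and the two Adam moment tensors.
# Figures are illustrative; actual usage also includes lazily allocated buffers.
workspace_mb = 4000          # --workspace 4000 from the command above
model_mb = 835               # model size mentioned in this thread
copies = 4                   # params + gradients + 1st + 2nd Adam moments

training_state_mb = copies * model_mb
total_estimate_mb = workspace_mb + training_state_mb
print(f"training state: ~{training_state_mb} MB")   # ~3340 MB
print(f"rough total:    ~{total_estimate_mb} MB")    # ~7340 MB plus overhead
```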

So, with --mini-batch-fit the workspace limit is respected, but the other contributions, such as the model size and the optimizer state, depend on the chosen architecture and the number of GPUs used. Due to sharding, the memory used by the optimizer actually goes down as more GPUs are used.
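A minimal sketch of that sharding effect; the even split across devices and the 3x factor for the optimizer state are illustrative assumptions, not Marian's exact bookkeeping:

```python
# Illustration of the sharding effect described above: the optimizer state is
# split across devices, so each GPU only holds 1/N of it, while the model
# parameters themselves are replicated on every GPU. The even split and the
# 3x factor for the optimizer state are assumptions for illustration only.
model_mb = 835
optimizer_state_mb = 3 * model_mb   # roughly master params + two Adam moments

for num_gpus in (1, 2, 4, 8):
    shard_mb = optimizer_state_mb / num_gpus
    print(f"{num_gpus} GPU(s): ~{shard_mb:.0f} MB of optimizer state per device")
```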

nicolabertoldi commented 3 years ago

Thanks for the clear explanation.

A few more questions:

emjotde commented 3 years ago

1. Yes, that should be the case.
2. Yes, when training on a single GPU. On multiple GPUs, sharding reduces the GPU memory used compared to what is stored in the checkpoint.
3. Not that I can think of.
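As a follow-up to point 2, the checkpoint-side sizes can be checked directly, since Marian checkpoints are regular NumPy .npz archives. A minimal sketch with assumed file names:

```python
# Inspect what the saved checkpoints actually contain and how large they are.
# The file names below are assumptions (adjust to the actual --model path);
# Marian checkpoints are ordinary NumPy .npz archives, so numpy can read them.
import numpy as np

for path in ("model.npz", "model.npz.optimizer.npz"):
    archive = np.load(path)
    total_mb = sum(archive[name].nbytes for name in archive.files) / 2**20
    print(f"{path}: {len(archive.files)} tensors, ~{total_mb:.0f} MiB")
```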