Open Gldkslfmsd opened 5 years ago
The total number of model parameters must be divisible by the number of GPUs. It’s a limitation of NCCL that we have not yet worked around.
From: Dominik Macháček notifications@github.com Sent: Sunday, February 17, 2019 2:35 AM To: marian-nmt/marian Cc: Subscribed Subject: [marian-nmt/marian] all shards must have the same size -- problem with 6GPUs, but not with 5 (#248)
Hello,
I trained my model on 5 GPUs and it worked without problems. Then I aborted it and wanted to continue with 6 GPUs, but I got the error below. The same happened when I switched from 2 to 3.
Does this happen because the number of batches in the training data must be divisible by the number of GPUs?
[2019-02-17 10:52:52] [data] Restoring the corpus state to epoch 1, batch 206000
[2019-02-17 11:18:06] Training started
[2019-02-17 11:18:06] [memory] Reserving 725 MB, device gpu0
[2019-02-17 11:18:06] [memory] Reserving 725 MB, device gpu1
[2019-02-17 11:18:06] [memory] Reserving 725 MB, device gpu4
[2019-02-17 11:18:06] [memory] Reserving 725 MB, device gpu5
[2019-02-17 11:18:06] [memory] Reserving 725 MB, device gpu2
[2019-02-17 11:18:06] [memory] Reserving 725 MB, device gpu3
[2019-02-17 11:18:07] Loading model from model/model.npz
[2019-02-17 11:18:09] [memory] Reserving 725 MB, device cpu0
[2019-02-17 11:18:10] Error: presently, all shards must have the same size
[2019-02-17 11:18:10] Error: Aborted from size_t marian::NCCLCommunicator::shardSize() const in /lnet/troja/projects/elitr/marian/src/training/communicator_nccl.h:76
Or can I avoid it by changing parameters?
[2019-02-17 11:18:23] [marian] Marian v1.7.6 02f4af4e 2018-12-12 18:51:10 -0800
[2019-02-17 11:18:23] [marian] Running on tdll1 as process 13625 with command line:
[2019-02-17 11:18:23] [marian] /lnet/troja/projects/elitr/marian/build/marian --ignore-model-config --model model/model.npz --train-sets ../data/auth+csmono1617.shuffled.cs-en.en ../data/auth+csmono1617.shuffled.cs-en.cs --log model/train.log --valid-log model/valid.log --devices 0 1 2 3 4 5 --seed 123456789 -w 12400 --no-shuffle --dim-vocabs 32000 32000 --vocabs model/vocab.encs.spm model/vocab.encs.spm --type transformer --enc-depth 6 --dec-depth 6 --dim-emb 1024 --transformer-dim-ffn 4096 --transformer-heads 16 --max-length 150 --transformer-dropout 0.0 --transformer-dropout-attention 0.1 --transformer-dropout-ffn 0.1 --lr-warmup 8000 --mini-batch-fit --lr-decay-inv-sqrt 8000 --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 --label-smoothing 0.1 --learn-rate 0.0003 --transformer-decoder-autoreg average-attention --transformer-aan-nogate --transformer-dim-aan 1024 --transformer-aan-activation relu --valid-freq 1000 --save-freq 1000 --disp-freq 50 --valid-sets ../data/valid.en ../data/valid.cs --valid-metrics cross-entropy perplexity bleu valid-script --valid-script-path ./validate.sh --valid-translation-output valid/valid.cs.output --quiet-translation --beam-size 4 --normalize=1 --valid-mini-batch 64 --early-stopping 100 --after-epochs 0 --cost-type=ce-mean-words --overwrite --keep-best --tied-embeddings-all --lr-report --sync-sgd --exponential-smoothing
I also do not recommend using NCCL with a number of GPUs that's not a power of 2. You may get a small performance improvement going from, say, 6 to 7, but you get a comparatively larger improvement when going from 7 to 8.
The total number of model parameters must be divisible by the number of GPUs. It’s a limitation of NCCL that we have not yet worked around.
Then, what is the easiest way to count the exact number of parameters in the model?
I want my models to be trainable on 1-10 GPUs. How difficult is it to add some dummy parameters to the model so that the total number of parameters is divisible by all numbers 1-10 (= by 2520)?
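For reference, the smallest count divisible by every GPU count from 1 to 10 is the least common multiple of those numbers. A quick check in Python (the helper name `lcm_of_range` is illustrative, not part of Marian) gives 2520; 840 only covers 1 to 8, because it is not divisible by 9.

```python
from functools import reduce
from math import gcd


def lcm_of_range(lo, hi):
    """Least common multiple of all integers in [lo, hi]."""
    return reduce(lambda a, b: a * b // gcd(a, b), range(lo, hi + 1), 1)


print(lcm_of_range(1, 10))  # 2520
print(lcm_of_range(1, 8))   # 840
```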
I also do not recommend using NCCL with a number of GPUs that's not a power of 2. You may get a small performance improvement going from, say, 6 to 7, but you get a comparatively larger improvement when going from 7 to 8.
I need to train lots of models and I can't always allocate a power-of-2 number of GPUs. Every small improvement helps.
Thanks for the replies!
Not entirely easy, as we also pad parameter sizes to multiples of 256 bytes or so.
The easiest would be to just add dummy parameter values equal to the largest number of GPUs we support, and then simply chop off at the end to make it divisible by the number of GPUs.
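A minimal sketch of that pad-and-chop idea, in plain Python rather than Marian code. The function name `usable_size` and the default `max_gpus=10` are assumptions made for illustration; the point is only that appending `max_gpus` dummy values always leaves enough room to trim the total down to a multiple of the actual GPU count without cutting into real parameters.

```python
def usable_size(num_params, num_gpus, max_gpus=10):
    """Append max_gpus dummy values, then chop the tail so that the
    result is divisible by num_gpus. max_gpus is a hypothetical limit."""
    if not 1 <= num_gpus <= max_gpus:
        raise ValueError("num_gpus must be between 1 and max_gpus")
    padded = num_params + max_gpus
    # The remainder is < num_gpus <= max_gpus, so only dummies are chopped.
    return padded - padded % num_gpus


for gpus in (3, 5, 6, 7):
    size = usable_size(999, gpus)
    print(gpus, size, size // gpus)  # GPU count, padded total, shard size
```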
We had an internal discussion about this before, and I even had some initial code written. The problem is that it would require reworking memory management to support this, and reworking which parts of the system talk to each other and when. We eventually decided that it would take quite some effort and be quite an intrusive change for a not-so-common use case. We are secretly hoping that Nvidia will just fix their API. I understand that this is not satisfactory for you.
Maybe we can file an Issue with Nvidia? They’d just need to add an additional “actualSize” parameter.
As for the runtime of non-powers-of-two, I don’t understand why powers of two make any difference for Nvidia’s ring-reduction algorithm.
Yes, based on the padding to 256 bytes, the only guarantee you have right now is that any divisor of 256 will work. Unfortunately, these are only powers of 2. Even changing the embedding sizes to values divisible by, say, 6 does not help here. The 5 GPUs might have worked out because your total number of tensors was divisible by 5 by luck.
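To illustrate the point about padding, here is a small simulation in Python. It is not Marian's allocator: the 256-byte granularity comes from the comments above, while the 4-byte (float32) value size, the helper names, and the tensor sizes are assumptions chosen for the example. Under those assumptions every padded tensor holds a multiple of 64 values, so the total is always divisible by 2, 4, and 8, while divisibility by 3, 5, or 6 depends on the particular sizes.

```python
ALIGN_BYTES = 256     # padding granularity mentioned in the thread
BYTES_PER_VALUE = 4   # assumption: float32 parameters


def padded_elements(num_elements):
    """Number of values a tensor occupies after padding to ALIGN_BYTES."""
    nbytes = num_elements * BYTES_PER_VALUE
    padded_bytes = -(-nbytes // ALIGN_BYTES) * ALIGN_BYTES  # round up
    return padded_bytes // BYTES_PER_VALUE


def total_padded(tensor_sizes):
    return sum(padded_elements(n) for n in tensor_sizes)


# Made-up tensor sizes, only for illustration.
sizes = [32000 * 1024, 1024 * 4096, 4096, 1024, 7]
total = total_padded(sizes)
for gpus in range(1, 11):
    print(gpus, "ok" if total % gpus == 0 else "shards would differ in size")
```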
Maybe we can file an Issue with Nvidia? They’d just need to add an additional “actualSize” parameter.
Please, do.
The 5 GPUs might have worked out because your total number of tensors was divisible by 5 by luck.
I looked into my logfiles again and found I was wrong. In fact, it never worked with 5 or 6 GPUs; it always crashed with "all shards must have the same size" or used 1 GPU.
@frankseide Given a more concrete suggestion, I'll send it off to my NVIDIA contact.
So, based on Kenneth's communication with NVIDIA, we have this hint:
slide 15 of http://on-demand.gputechconf.com/gtc/2018/presentation/s8462-multi-gpu-training-with-nccl.pdf
Basically we do a second gather/scatter if the last shard is smaller, right?
I think we can do that second scatter/gather in parallel?
Yes, I understand that putting those two operations into one group will do that for us.
So e.g. to distribute 999 elements to 4 GPUs, one would first do a 4-way exchange of 249 elements, and then another of 1 element each but only for the first 3 GPUs?
This would work if the first 4-way exchange would work on elements 0..248, 250..498, 500..748, and 750..998, and the second on elements 249, 499, and 749.
But I believe that is not supported by ncclReduceScatter(). I remember it expects the data to be consecutive in memory. I.e. the above would instead have to operate on elements 0..248, 249..497, 498..746, and 747..995, and the second round on 996, 997, and 998.
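The two index layouts discussed here can be reproduced with a short Python sketch. The function names are made up for this illustration and nothing below calls NCCL; it only computes which element ranges each round would touch for 999 elements on 4 GPUs.

```python
def interleaved_rounds(total, gpus):
    """Uneven shards (the first `rem` GPUs hold one extra element).
    Round 1 covers the first `base` elements of each shard; round 2
    covers the leftover last element of each larger shard."""
    base, rem = divmod(total, gpus)
    first, second, start = [], [], 0
    for i in range(gpus):
        first.append((start, start + base - 1))
        if i < rem:
            second.append(start + base)
        start += base + (1 if i < rem else 0)
    return first, second


def contiguous_rounds(total, gpus):
    """Round 1 covers gpus * base consecutive elements; round 2 covers
    the tail that is left over at the end of the buffer."""
    base = total // gpus
    first = [(i * base, (i + 1) * base - 1) for i in range(gpus)]
    second = list(range(gpus * base, total))
    return first, second


print(interleaved_rounds(999, 4))
print(contiguous_rounds(999, 4))
```

The second layout is the one that matches a consecutive-memory requirement, which is why it leads to the awkward shard sizes raised in the next comment.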
Wouldn't that require some nasty sharding? Like 3 shards of 249 elements, and one with 252 (which we can then sub-slice)?
Ugh, maybe we should just pad.
The Nvidia contact responded that we can roll our own ncclReduceScatter() that supports this by a combination of ncclGroupStart(), ncclReduce(), and ncclGroupEnd(). That would work indeed, although I'd rather have them fix their broken API...
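As a rough picture of what such a hand-rolled reduce-scatter computes, here is a pure-Python simulation. It is a sketch under stated assumptions, not Marian or NCCL code: lists stand in for device buffers, the function names are hypothetical, and each loop iteration plays the role of one ncclReduce() rooted at the rank that owns the shard. In real code those calls would be issued between ncclGroupStart() and ncclGroupEnd(), as suggested above.

```python
def shard_bounds(total, ranks):
    """Half-open [lo, hi) bounds per rank; the first ranks take the remainder."""
    base, rem = divmod(total, ranks)
    bounds, start = [], 0
    for r in range(ranks):
        size = base + (1 if r < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def simulated_reduce_scatter(buffers):
    """buffers[r] is rank r's full-length gradient vector.
    Returns, for each rank, the summed values of the shard it owns."""
    ranks, total = len(buffers), len(buffers[0])
    result = []
    for root, (lo, hi) in enumerate(shard_bounds(total, ranks)):
        # Stand-in for one reduce with root=`root` over elements lo..hi-1.
        result.append([sum(buf[i] for buf in buffers) for i in range(lo, hi)])
    return result


grads = [[rank + 1] * 999 for rank in range(4)]  # toy data: 4 ranks, 999 values
shards = simulated_reduce_scatter(grads)
print([len(s) for s in shards])
```

Because each shard is reduced separately, the shards no longer need to have the same size, which is exactly the restriction that triggers the error in this issue.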