marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Multi-node training seems to be even slower than single node #410

Open memnoh opened 5 years ago

memnoh commented 5 years ago

Hi all, I've got a 2-node setup with one 1080 Titan on each node. I've successfully managed to configure MPI with NCCL and to run Marian on both nodes (with sync-sgd). On the same small corpus, the single-node run converged in about 12 hours, but the multi-node variant took almost 30 hours.

I'm stumped as to why this could be. Even if running multi-node with a mini-batch size of 64 means that my "global" batch is 128, that doesn't really explain a slowdown of this size (by the way, is this true? Is my "global" batch actually 128 in this scenario?).

Looking at the logs, I also see that one epoch takes about 6 minutes in the single-node run, while on each node of the multi-node run the presumably identical epoch takes about 20 minutes.

Could you please point me in the right direction here? At this point I'm completely lost and would really appreciate the help.

Single node parameters: dropout-trg 0.1 disp-freq 10000 seed 0 valid-metrics cross-entropy translation dropout-rnn 0.2 overwrite normalize exponential-smoothing valid-freq 10000 beam-size 12 type amun early-stopping 5 save-freq 10000 devices 0 quiet-translation dropout-src 0.1 keep-best

Multi-node parameters: dropout-trg 0.1 disp-freq 10000 seed 0 valid-metrics cross-entropy translation multi-node-overlap 0 dropout-rnn 0.2 overwrite normalize exponential-smoothing valid-freq 10000 beam-size 12 num-devices 1 keep-best type amun sync-sgd early-stopping 5 save-freq 10000 devices 0 0 quiet-translation dropout-src 0.1

The corpus size is 158k sentences.
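In case it helps, this is roughly how the multi-node job is launched (a sketch only; hostnames and paths are placeholders, and I use Open MPI, so the host options may be spelled differently with other MPI implementations):

mpirun -np 2 --host node1,node2 \
    /path/to/build/marian \
    --sync-sgd --type amun --devices 0 \
    --train-sets corpus.src corpus.trg --vocabs vocab.src.yml vocab.trg.yml \
    [remaining options as listed above]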

frankseide commented 5 years ago

Sadly, multi-node training for large models typically only provides efficiency benefits if the nodes are connected via InfiniBand with RDMA enabled. Without RDMA, we also cannot get any gain. The reason is that the time to exchange the data (roughly 2 x the model size divided by the network bandwidth, independent of the number of nodes) should be small compared to the time it takes to compute a batch. I once saw a 7x increase in data-exchange time when RDMA was disabled, which was too high to see any benefit.
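As a purely illustrative calculation (assumed numbers, not measurements): exchanging a 250 MB model over plain 1 Gbit/s Ethernet costs on the order of 2 x 250 MB / 125 MB/s ≈ 4 s per update, whereas the same exchange over a 56 Gbit/s InfiniBand link with RDMA takes well under 0.1 s; only in the latter case does it stay small relative to the time needed to compute a batch on one GPU.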

To see what your connection is, could you set this environment variable, which enables NCCL diagnostic messages, and post those here?

NCCL_DEBUG=INFO
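For example, assuming the job is launched with Open MPI, its -x flag forwards the variable to all ranks (other MPI launchers have their own equivalents):

mpirun -np 2 -x NCCL_DEBUG=INFO --host node1,node2 /path/to/marian [options as before]

The NCCL INFO lines in the resulting log should indicate which transport (plain sockets vs. InfiniBand/RDMA) NCCL actually selected.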

To get the best parallelization gains, you would want to use the largest mini-batch size that still converges properly. You can tell Marian to dynamically adjust the mini-batch size so as to fully use your GPU RAM. These settings enable this for a 24 GB CUDA card:

mini-batch-fit: true
mini-batch-fit-step: 5
workspace: 17000
maxi-batch: 1000
mini-batch: 1000
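Since you seem to pass options on the command line rather than via a config file, the same settings as flags would look something like the line below (the names should map one-to-one, but please double-check against marian --help for your version):

--mini-batch-fit --mini-batch-fit-step 5 --workspace 17000 --maxi-batch 1000 --mini-batch 1000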

The workspace is given in MB; use the largest value that fits. It is roughly your GPU RAM size minus the space required for the parameters, the parameter gradients, the Adam state, and a buffer for exponential smoothing. So take your RAM size and subtract maybe ~5 x the model size as a starting point.
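As an illustration only (the numbers are assumptions, not measurements): on an 11 GB 1080-class card with a model that is about 300 MB on disk, the estimate would be roughly 11000 - 5 x 300 ≈ 9500 MB, so a workspace of around 9000 would be a sensible first try, reduced further if you run into out-of-memory errors.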

Depending on your task, this may give you much larger batch sizes, which will affect convergence, so you may need to retune your hyper-parameters.

mini batch size 64 means that my "global" batch is 128 doesn't exactly explain this (btw is this true? Is my "global" batch actually 128

Should be. Does your log show the number of samples? Then you should be able to verify it from that.