microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

allgather_bucket_size can significantly influence the communication cost #724

Open benywon opened 3 years ago

benywon commented 3 years ago

I have used ZeRO stage 2 to train a 2.5-billion-parameter BERT model on 4 nodes of 8x V100s, interconnected with RDMA InfiniBand. I found that the allgather_bucket_size parameter in the ZeRO optimizer can significantly influence the training speed, i.e., the optimizer_allgather time varies from 2.5s to 0.4s. Are there any rules of thumb for tuning this parameter? Thanks!
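For reference, this is the knob I mean, in the zero_optimization section of the DeepSpeed config. This is just a minimal sketch with illustrative values, not the exact settings from my runs:

```python
# Minimal ZeRO stage-2 config sketch (values are placeholders, not the
# settings used in this issue); it can be dumped to the JSON config file
# that is passed to the DeepSpeed launcher.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "allgather_bucket_size": 5e8,   # elements gathered per allgather round
        "reduce_bucket_size": 5e8,      # gradient reduce-scatter bucket
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```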

tjruwase commented 3 years ago

@benywon, thanks for your question. Your observation of the performance impact of allgather_bucket_size is on track. Generally, the bucket size determines the number of parameters that can be allgathered at once, so larger buckets require fewer rounds of communication than smaller buckets to allgather the entire model. This of course depends on your model size and communication hardware.
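As a rough back-of-the-envelope check (a sketch of the scaling, not DeepSpeed's internal bucketing logic), the number of allgather rounds is roughly the model size divided by the bucket size:

```python
import math

def allgather_rounds(num_params: float, bucket_size: float) -> int:
    """Approximate number of allgather rounds needed to gather all parameters."""
    return math.ceil(num_params / bucket_size)

# e.g. a 2.5B-parameter model with a few candidate bucket sizes
for bucket in (2e8, 5e8, 1e9):
    print(f"bucket={bucket:.0e} -> ~{allgather_rounds(2.5e9, bucket)} rounds")
```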

Hope that helps. If you want to analyze further, could you share the bucket sizes and the corresponding allgather times you have observed?

benywon commented 3 years ago

Thanks for your reply. In some applications, increasing the bucket size can indeed reduce the allgather time, but when my model is not very large, e.g. nearly 1 billion parameters, increasing allgather_bucket_size can conversely increase the optimizer_allgather time.

I also recently found that the best value is correlated with the number of nodes I use. When I train a 2.5-billion-parameter BERT model on four nodes with allgather_bucket_size set to 1e9, optimizer_allgather takes about 440 ms per step. However, when I add one more node to the cluster (i.e., 5 nodes), optimizer_allgather takes about 920 ms per step. After enumerating some values, setting allgather_bucket_size to 7.2e8 gives me the best result of 615 ms per step for optimizer_allgather.

I suppose the increase in communication time is more or less attributable to NCCL's double binary tree algorithm (https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/), which prefers powers of 2. But the right value of allgather_bucket_size is still hard to determine.
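One way I could imagine narrowing this down without full training runs is a standalone NCCL all-gather microbenchmark over candidate bucket sizes. The sketch below assumes a torch.distributed NCCL setup launched with torchrun (so LOCAL_RANK and the rendezvous env vars are set); the candidate sizes and iteration count are arbitrary, and it only approximates what the ZeRO optimizer allgather does:

```python
import os
import time

import torch
import torch.distributed as dist


def time_allgather(bucket_numel: int, iters: int = 10) -> float:
    """Time an fp16 all_gather of a bucket of ~bucket_numel elements, split across ranks."""
    world = dist.get_world_size()
    shard = torch.randn(bucket_numel // world, dtype=torch.float16, device="cuda")
    gathered = [torch.empty_like(shard) for _ in range(world)]

    # warm-up round so NCCL channel setup is not measured
    dist.all_gather(gathered, shard)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(iters):
        dist.all_gather(gathered, shard)
    torch.cuda.synchronize()
    return (time.time() - start) / iters


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun

    # candidate bucket sizes (elements), including values near powers of 2
    for bucket in (2**28, int(5e8), int(7.2e8), 2**30):
        t = time_allgather(bucket)
        if dist.get_rank() == 0:
            print(f"bucket={bucket:>12,} elements: {t * 1e3:7.1f} ms per all_gather")

    dist.destroy_process_group()
```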

tjruwase commented 3 years ago

@benywon, thanks for sharing your experience and insights. Your observations are correct and valuable, including the impact of the NCCL algorithms. I think you have already covered most of the analysis you need for your application. As you have observed, tuning the bucket sizes is both crucial to performance and challenging to get right. We have been thinking of ways to help users tune these and other parameters for their specific scenarios, but it remains an open problem.

Please let me know what you would like to do next, and I hope I can help.