microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Multi-computer training is slower than single-computer training #4281

Open RickMeow opened 1 year ago

RickMeow commented 1 year ago

Describe the bug

I'm training Llama-2-70B on 24 A100 (40 GB) GPUs. I previously ran into a lot of OOM issues with DeepSpeed ZeRO-3, so I'm currently fine-tuning with parameters partitioned across multiple GPUs, which compensates for the limited 40 GB of GPU memory per card.

I found the problem while fine-tuning Llama-2-70B: training on 2 nodes with 16 GPUs in total (2 x 8 A100 40 GB) is considerably slower than training on 1 node with 8 GPUs. This conclusion mainly comes from the estimated total training time, which is considerably higher for the two-node run than for the single-node run.
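Since ZeRO-3 all-gathers parameters across nodes on every forward and backward pass, a quick check of the raw inter-node all-reduce bandwidth can show whether the network link is the bottleneck. Below is a minimal sketch of such a micro-benchmark using plain torch.distributed; the script name allreduce_bench.py and the launch parameters are placeholders, not part of my actual setup.

    # Hedged sketch, not part of the repro: measure inter-node all-reduce
    # bandwidth with NCCL. Launch on both nodes, e.g. (addresses from this issue):
    #   torchrun --nnodes 2 --nproc_per_node 8 \
    #       --rdzv_backend c10d --rdzv_endpoint 192.168.0.32:9901 allreduce_bench.py
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    x = torch.randn(64 * 1024 * 1024, device="cuda")  # 64M fp32 elements, ~256 MB

    for _ in range(5):  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if dist.get_rank() == 0:
        n = dist.get_world_size()
        gb = x.numel() * x.element_size() / 1e9
        # ring all-reduce moves roughly 2*(n-1)/n of the payload per rank
        print(f"all_reduce of {gb:.2f} GB took {elapsed * 1000:.1f} ms "
              f"(~{2 * (n - 1) / n * gb / elapsed:.2f} GB/s bus bandwidth)")

    dist.destroy_process_group()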

To Reproduce

  1. I have configured the following project:
    git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning.git
    conda create -n llama_etuning python=3.10
    conda activate llama_etuning
    cd LLaMA-Efficient-Tuning
    pip install -r requirements.txt
  2. Use the DeepSpeed configuration file (deepspeed.json) provided below (a minimal sketch of passing an equivalent config directly to deepspeed.initialize is included after step 5).
    • deepspeed.json:

      {
        "bfloat16": {
          "enabled": false
        },
        "fp16": {
          "enabled": "auto"
        },
        "optimizer": {
          "type": "AdamW",
          "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
          }
        },
        "scheduler": {
          "type": "WarmupLR",
          "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
          }
        },
        "zero_optimization": {
          "stage": 3,
          "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
          },
          "offload_param": {
            "device": "cpu",
            "pin_memory": true
          },
          "overlap_comm": true,
          "contiguous_gradients": true,
          "sub_group_size": 1e9,
          "reduce_bucket_size": "auto",
          "stage3_prefetch_bucket_size": "auto",
          "stage3_param_persistence_threshold": "auto",
          "stage3_max_live_parameters": 1e9,
          "stage3_max_reuse_distance": 1e9,
          "stage3_gather_fp16_weights_on_model_save": true
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 1e5,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
      }

3. Run command for multiple machines (2 x 8 = 16 A100):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus 8 \
    --num_nodes 2 \
    --master_addr 192.168.0.32 \
    --master_port 9901 \
    --hostfile /mnt/download/configs/hostfile_1.txt \
    src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    --dataset zr_test_math \
    --finetuning_type lora \
    --output_dir /mnt/output/70B/ \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --plot_loss \
    --fp16 \
    --lora_target q_proj,v_proj \
    --template llama2 \
    --deepspeed "/mnt/deepspeed/deepspeed.json"

4. Single machine run command (8 A100):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    ...... (followed by the same arguments as the multi-node command)


5. Hostfile:
- hostfile_1.txt

192.168.0.32 slots=8
192.168.0.23 slots=8
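As mentioned in step 2, here is a minimal, self-contained sketch of consuming an equivalent ZeRO-3 + CPU-offload config directly with deepspeed.initialize. The "auto" placeholders are replaced with concrete values because "auto" is only resolved by the HuggingFace Trainer integration; the toy model, the learning rate, and the script name are assumptions, not the project's actual code.

    # Hedged sketch: the JSON config above, reduced to explicit values and
    # passed as a dict to deepspeed.initialize().
    # Launch with e.g.: deepspeed --num_gpus 8 sketch.py  (sketch.py is a placeholder)
    import torch
    import deepspeed

    ds_config = {
        "fp16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 5e-5}},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "train_micro_batch_size_per_gpu": 8,
        "gradient_accumulation_steps": 1,
    }

    model = torch.nn.Linear(4096, 4096)  # stand-in for the real 70B model

    # builds the ZeRO-3 engine and the (CPU-offloaded) optimizer from the config
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )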


**Expected behavior**
Multi-node training should be faster than single-node training; under ideal conditions, 16 GPUs should train roughly twice as fast as 8 GPUs. **(Is it possible to solve this kind of problem by changing the configuration and parameters?)**

**Screenshots**

- 8-GPU training: with the same dataset, after waiting a while for the speed to stabilize, the estimated total time is close to 52 hours and 10 minutes
<img width="510" alt="8_A100" src="https://github.com/microsoft/DeepSpeed/assets/122031049/caa63da7-f723-4116-a220-272b5bf72927">

- 16-GPU training: with the same dataset, after waiting a while for the speed to stabilize, the estimated total time is close to 79 hours and 31 minutes
<img width="1206" alt="16_A100" src="https://github.com/microsoft/DeepSpeed/assets/122031049/bc69f513-8164-4906-82d4-f00d141cd22e">

**System info (please complete the following information):**
- Server configuration: 8 A100 PCIe (no NVLink) GPUs per node with 40 GB of GPU memory each and 600 GB of CPU memory; inter-node communication bandwidth is 25G over 10 Gigabit Ethernet, without InfiniBand or RDMA
- OS: Ubuntu 20.04.6 LTS
- Python = 3.10
- 16 A100 training parameters:

Running training
  Num examples = 56,318
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 880
  Number of trainable parameters = 16,384,000


- 8 A100 training parameters:

Running training
  Num examples = 56,318
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 1,760
  Number of trainable parameters = 16,384,000
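Putting the two runs side by side (rough numbers, taken from the estimated times in the screenshots and the step counts above):

    # Rough throughput comparison from the numbers reported above.
    hours_8, steps_8, batch_8 = 52 + 10 / 60, 1760, 64      # 8 x A100 run
    hours_16, steps_16, batch_16 = 79 + 31 / 60, 880, 128   # 16 x A100 run

    sec_per_step_8 = hours_8 * 3600 / steps_8      # ~107 s per step
    sec_per_step_16 = hours_16 * 3600 / steps_16   # ~325 s per step

    samples_per_sec_8 = batch_8 / sec_per_step_8       # ~0.60 samples/s
    samples_per_sec_16 = batch_16 / sec_per_step_16    # ~0.39 samples/s

    # ~0.66: 16 GPUs deliver roughly two thirds of the 8-GPU throughput
    # instead of the ideal ~2x
    print(samples_per_sec_16 / samples_per_sec_8)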



**Looking forward to your response!**
mmaaz60 commented 1 year ago

Hi @RickMeow,

I am facing the same issue: training on multiple nodes is much slower than on a single node. Were you lucky enough to figure out the reason? Thanks

X-Buring commented 1 year ago

I met a similar problem while fine-tuning with DeepSpeed stage 3 offload, but multi-node training works well when I fine-tune with DeepSpeed stage 2 (without offload). Were you lucky enough to figure out the reason? Thanks a lot!!
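For comparison, a ZeRO stage-2 config without offload, like the one described here, might look roughly as follows; it is expressed as a Python dict for brevity, and the exact values (such as the allgather bucket size and the output filename) are illustrative assumptions rather than a verified configuration.

    # Hedged sketch of a ZeRO stage-2 config without CPU offload, written out
    # as the JSON file that would be passed via --deepspeed.
    import json

    zero2_config = {
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "allgather_bucket_size": 5e8,  # illustrative value
        },
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }

    with open("ds_zero2.json", "w") as f:  # placeholder filename
        json.dump(zero2_config, f, indent=2)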

SiriusWy commented 2 weeks ago

> I met a similar problem while fine-tuning with DeepSpeed stage 3 offload, but multi-node training works well when I fine-tune with DeepSpeed stage 2 (without offload). Were you lucky enough to figure out the reason? Thanks a lot!!

Hi, have you analyzed the cause of this problem?