microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Multi-computer training is slower than single-computer training #4281

Open RickMeow opened 1 year ago

RickMeow commented 1 year ago

Describe the bug

I'm training Llama-2-70B on 24 A100 (40 GB) GPUs. I previously ran into a lot of OOM issues with DeepSpeed ZeRO-3, so I'm currently fine-tuning with parameters partitioned across multiple GPUs, which compensates for the limited 40 GB of GPU memory per card.

I found the problem while fine-tuning Llama-2-70B: training on 2 nodes with 16 GPUs in total (2 x 8 A100 40 GB) is considerably slower than training on 1 node with 8 GPUs. This conclusion mainly comes from the estimated total training time, which is considerably higher for the two-node run than for the single-node run.
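Since ZeRO-3 all-gathers parameters across nodes on every forward and backward pass, a quick check of the raw inter-node all-reduce bandwidth can show whether the network link is the bottleneck. Below is a minimal sketch of such a micro-benchmark using plain torch.distributed; the script name allreduce_bench.py and the launch parameters are placeholders, not part of my actual setup.

    # Hedged sketch, not part of the repro: measure inter-node all-reduce
    # bandwidth with NCCL. Launch on both nodes, e.g. (addresses from this issue):
    #   torchrun --nnodes 2 --nproc_per_node 8 \
    #       --rdzv_backend c10d --rdzv_endpoint 192.168.0.32:9901 allreduce_bench.py
    import time
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    x = torch.randn(64 * 1024 * 1024, device="cuda")  # 64M fp32 elements, ~256 MB

    for _ in range(5):  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if dist.get_rank() == 0:
        n = dist.get_world_size()
        gb = x.numel() * x.element_size() / 1e9
        # ring all-reduce moves roughly 2*(n-1)/n of the payload per rank
        print(f"all_reduce of {gb:.2f} GB took {elapsed * 1000:.1f} ms "
              f"(~{2 * (n - 1) / n * gb / elapsed:.2f} GB/s bus bandwidth)")

    dist.destroy_process_group()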

To Reproduce

  1. I have configured the following project:
    git clone https://github.com/hiyouga/LLaMA-Efficient-Tuning.git
    conda create -n llama_etuning python=3.10
    conda activate llama_etuning
    cd LLaMA-Efficient-Tuning
    pip install -r requirements.txt
  2. Use the DeepSpeed configuration file (deepspeed.json) provided below (a minimal sketch of passing an equivalent config directly to deepspeed.initialize is included after step 5).
    • deepspeed.json:

      {
        "bfloat16": {
          "enabled": false
        },
        "fp16": {
          "enabled": "auto"
        },
        "optimizer": {
          "type": "AdamW",
          "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
          }
        },
        "scheduler": {
          "type": "WarmupLR",
          "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
          }
        },
        "zero_optimization": {
          "stage": 3,
          "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
          },
          "offload_param": {
            "device": "cpu",
            "pin_memory": true
          },
          "overlap_comm": true,
          "contiguous_gradients": true,
          "sub_group_size": 1e9,
          "reduce_bucket_size": "auto",
          "stage3_prefetch_bucket_size": "auto",
          "stage3_param_persistence_threshold": "auto",
          "stage3_max_live_parameters": 1e9,
          "stage3_max_reuse_distance": 1e9,
          "stage3_gather_fp16_weights_on_model_save": true
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 1e5,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": false
      }

3. Run command for multiple machines (2 x 8 = 16 A100):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus 8 \
    --num_nodes 2 \
    --master_addr 192.168.0.32 \
    --master_port 9901 \
    --hostfile /mnt/download/configs/hostfile_1.txt \
    src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    --dataset zr_test_math \
    --finetuning_type lora \
    --output_dir /mnt/output/70B/ \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --plot_loss \
    --fp16 \
    --lora_target q_proj,v_proj \
    --template llama2 \
    --deepspeed "/mnt/deepspeed/deepspeed.json"

4. Single machine run command (8 A100):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    ...... (followed by the same arguments as the multi-node command)


5. Hostfile:
- hostfile_1.txt

192.168.0.32 slots=8
192.168.0.23 slots=8
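As mentioned in step 2, here is a minimal, self-contained sketch of consuming an equivalent ZeRO-3 + CPU-offload config directly with deepspeed.initialize. The "auto" placeholders are replaced with concrete values because "auto" is only resolved by the HuggingFace Trainer integration; the toy model, the learning rate, and the script name are assumptions, not the project's actual code.

    # Hedged sketch: the JSON config above, reduced to explicit values and
    # passed as a dict to deepspeed.initialize().
    # Launch with e.g.: deepspeed --num_gpus 8 sketch.py  (sketch.py is a placeholder)
    import torch
    import deepspeed

    ds_config = {
        "fp16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 5e-5}},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "train_micro_batch_size_per_gpu": 8,
        "gradient_accumulation_steps": 1,
    }

    model = torch.nn.Linear(4096, 4096)  # stand-in for the real 70B model

    # builds the ZeRO-3 engine and the (CPU-offloaded) optimizer from the config
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )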


**Expected behavior**
Multi-node training should be faster than single-node training; under ideal conditions, 16 GPUs should train roughly twice as fast as 8 GPUs. **(Is it possible to solve this kind of problem by changing the configuration and parameters?)**

**Screenshots**

- 8-GPU training: with the same dataset, after waiting a while for the speed to stabilize, the estimated total time is close to 52 hours and 10 minutes
<img width="510" alt="8_A100" src="https://github.com/microsoft/DeepSpeed/assets/122031049/caa63da7-f723-4116-a220-272b5bf72927">

- 16-GPU training: with the same dataset, after waiting a while for the speed to stabilize, the estimated total time is close to 79 hours and 31 minutes
<img width="1206" alt="16_A100" src="https://github.com/microsoft/DeepSpeed/assets/122031049/bc69f513-8164-4906-82d4-f00d141cd22e">

**System info (please complete the following information):**
- Server configuration: 8 A100 PCIe (no NVLink) GPUs per node with 40 GB of GPU memory each and 600 GB of CPU memory; inter-node communication bandwidth is 25G over 10 Gigabit Ethernet, without InfiniBand or RDMA
- OS: Ubuntu 20.04.6 LTS
- Python = 3.10
- 16 A100 training parameters:

Running training
  Num examples = 56,318
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 880
  Number of trainable parameters = 16,384,000


- 8 A100 training parameters:

Running training
  Num examples = 56,318
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 1,760
  Number of trainable parameters = 16,384,000
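Putting the two runs side by side (rough numbers, taken from the estimated times in the screenshots and the step counts above):

    # Rough throughput comparison from the numbers reported above.
    hours_8, steps_8, batch_8 = 52 + 10 / 60, 1760, 64      # 8 x A100 run
    hours_16, steps_16, batch_16 = 79 + 31 / 60, 880, 128   # 16 x A100 run

    sec_per_step_8 = hours_8 * 3600 / steps_8      # ~107 s per step
    sec_per_step_16 = hours_16 * 3600 / steps_16   # ~325 s per step

    samples_per_sec_8 = batch_8 / sec_per_step_8       # ~0.60 samples/s
    samples_per_sec_16 = batch_16 / sec_per_step_16    # ~0.39 samples/s

    # ~0.66: 16 GPUs deliver roughly two thirds of the 8-GPU throughput
    # instead of the ideal ~2x
    print(samples_per_sec_16 / samples_per_sec_8)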



**Looking forward to your response!**
mmaaz60 commented 1 year ago

Hi @RickMeow,

I am facing the same issue: training on multiple nodes is much slower than on a single node. Were you lucky enough to figure out the reason? Thanks

X-Buring commented 1 year ago

I met a similar problem while fine-tuning with DeepSpeed stage 3 offload, but multi-node training works well when I fine-tune with DeepSpeed stage 2 (without offload). Were you lucky enough to figure out the reason? Thanks a lot!!
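For comparison, a ZeRO stage-2 config without offload, like the one described here, might look roughly as follows; it is expressed as a Python dict for brevity, and the exact values (such as the allgather bucket size and the output filename) are illustrative assumptions rather than a verified configuration.

    # Hedged sketch of a ZeRO stage-2 config without CPU offload, written out
    # as the JSON file that would be passed via --deepspeed.
    import json

    zero2_config = {
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "allgather_bucket_size": 5e8,  # illustrative value
        },
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }

    with open("ds_zero2.json", "w") as f:  # placeholder filename
        json.dump(zero2_config, f, indent=2)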

SiriusWy commented 2 weeks ago

> I met a similar problem while fine-tuning with DeepSpeed stage 3 offload, but multi-node training works well when I fine-tune with DeepSpeed stage 2 (without offload). Were you lucky enough to figure out the reason? Thanks a lot!!

Hi, have you analyzed the cause of this problem?