Open RickMeow opened 1 year ago
Hi @RickMeow,
I am facing the same issue with much slower training on multiple nodes compared to a single node. Were you lucky enough to figure out the reason? Thanks
I ran into a similar problem when fine-tuning with DeepSpeed stage 3 offload, but multi-node training works well when I fine-tune with DeepSpeed stage 2 (without offload). Were you lucky enough to figure out the reason? Thanks a lot!!
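For context, the gap described above (ZeRO stage 2 without offload vs. stage 3 with offload) usually comes down to the zero_optimization block of the DeepSpeed config. The actual JSON files used in this thread are not shown, so the following is only a minimal Python sketch of what the two variants typically look like; all values are illustrative, not taken from this issue.

import json

# ZeRO stage 2, no offload: only gradients and optimizer states are sharded,
# so parameters stay resident on every GPU and there is no per-step
# parameter all-gather across nodes.
zero2_no_offload = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# ZeRO stage 3 with CPU offload: parameters are sharded as well and, together
# with the optimizer states, paged out to host RAM. This adds parameter
# all-gathers and PCIe copies on every step -- the part that tends to scale
# poorly over a slow inter-node link.
zero3_cpu_offload = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Dump whichever variant you want to pass via --deepspeed as a JSON file.
with open("ds_config.json", "w") as f:
    json.dump(zero3_cpu_offload, f, indent=2)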
Hi, have you analyzed the cause of this problem?
Describe the bug
I'm using 24 A100 (40 GB) GPUs to train Llama-2-70B. I previously ran into many OOM issues with DeepSpeed ZeRO-3, so I'm currently fine-tuning with multi-GPU parameter parallelism, which compensates for the limited 40 GB of memory per card.
I found the problem while fine-tuning Llama-2-70B: training on 2 nodes with 16 cards in total (2 * 8 * A100 40 GB) is considerably slower than training on 1 node with 8 cards. This conclusion mainly comes from observing that the total training time of the former is considerably longer than that of the latter.
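As a rough sanity check on the memory reasoning above, here is a back-of-the-envelope sketch (my own estimate, not from the issue) of why 70B fp16 weights have to be sharded across many 40 GB cards. It counts only the base weights and ignores activations, the LoRA adapters, and optimizer states.

# Rough estimate of sharded fp16 weight memory per GPU for Llama-2-70B under
# ZeRO-3-style parameter partitioning. Illustrative only.
PARAMS = 70e9                # Llama-2-70B parameter count
BYTES_PER_PARAM = 2          # fp16
total_gib = PARAMS * BYTES_PER_PARAM / 2**30   # ~130 GiB of base weights

for num_gpus in (8, 16, 24):
    per_gpu = total_gib / num_gpus
    print(f"{num_gpus:2d} GPUs: ~{per_gpu:5.1f} GiB of fp16 weights per 40 GiB card")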
To Reproduce
Two-node launch command:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus 8 \
    --num_nodes 2 \
    --master_addr 192.168.0.32 \
    --master_port 9901 \
    --hostfile /mnt/download/configs/hostfile_1.txt \
    src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    --dataset zr_test_math \
    --finetuning_type lora \
    --output_dir /mnt/output/70B/ \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --plot_loss \
    --fp16 \
    --lora_target q_proj,v_proj \
    --template llama2 \
    --deepspeed "/mnt/deepspeed/deepspeed.json"
Single-node launch command:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "/mnt/model/Llama-2-70b-hf/" \
    --do_train \
    ...... (followed by the same configuration as the two-node command above)
Hostfile (/mnt/download/configs/hostfile_1.txt):

192.168.0.32 slots=8
192.168.0.23 slots=8
Two-node (16 GPUs) training log:

Running training
  Num examples = 56,318
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 880
  Number of trainable parameters = 16,384,000
Single-node (8 GPUs) training log:

Running training
  Num examples = 56,318
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 1,760
  Number of trainable parameters = 16,384,000
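One thing the two logs show is that both runs cover the same amount of data (880 * 128 = 1,760 * 64 = 112,640 samples, i.e. 2 epochs over 56,318 examples rounded up to full batches), so comparing total wall-clock time is a fair throughput comparison. Below is a small helper for turning the observed runtimes into samples per second; the wall-clock values are placeholders, not measurements from this issue.

def samples_per_second(total_steps: int, total_batch_size: int, wall_clock_s: float) -> float:
    """Overall training throughput for a finished run."""
    return total_steps * total_batch_size / wall_clock_s

# Placeholder runtimes -- substitute the real train_runtime values reported
# by the Trainer at the end of each run.
two_node_tput = samples_per_second(880, 128, wall_clock_s=20 * 3600)
one_node_tput = samples_per_second(1760, 64, wall_clock_s=12 * 3600)
print(f"2 nodes (16 GPUs): {two_node_tput:.2f} samples/s")
print(f"1 node  (8 GPUs):  {one_node_tput:.2f} samples/s")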