microsoft / DeepSpeedExamples

Example models using DeepSpeed

step3 uses the same memory when I increase GPUs #817

Open Little-rookie-ee opened 9 months ago

Little-rookie-ee commented 9 months ago

When I run step3 on 4x A100 80G with llama2-7b as the actor model and tiny-llama-1.1B as the ref model, it uses 53848 MB of memory during generation and 79610 MB during training. When I run on 8x A100 80G, it uses 55834 MB during generation and 78216 MB during training. The memory usage is almost the same, and increasing to 16x A100 80G gives the same result. Is using more GPUs useless?

ds config:

```bash
torchrun --nnodes ${tmp_nodes} --nproc_per_node ${tmp_nproc_per_node} \
    --master_addr ${tmp_master_addr} --node_rank ${tmp_node_rank} \
    --master_port ${tmp_master_port} ${PROJECT_PATH}/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
    --data_path ${PROJECT_PATH}/applications/DeepSpeed-Chat/data/Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_generation_batch_size 1 \
    --per_device_training_batch_size 1 \
    --generation_batches 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 2000 \
    --max_prompt_seq_len 16000 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --actor_weight_decay 0.1 \
    --critic_weight_decay 0.1 \
    --num_train_epochs 2 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --actor_gradient_checkpointing \
    --critic_gradient_checkpointing \
    --disable_actor_dropout \
    --num_warmup_steps 10 \
    --deepspeed --seed 1234 \
    --dtype bf16 \
    --offload \
    --offload_reference_model \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --enable_hybrid_engine \
    --output_dir $OUTPUT \
    --kl_ctl 0.1 | tee $tmp_log_file 2>&1
```
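To compare the 4-GPU and 8-GPU runs on equal terms, a small per-rank logging helper can be dropped around the generation and training phases. The sketch below is not part of the DeepSpeed-Chat scripts; the function name and suggested call sites are my own, and it only reports PyTorch's peak allocator statistics for the current rank.

```python
# Hypothetical helper (not in DeepSpeed-Chat): log this rank's peak GPU memory
# so that 4-GPU and 8-GPU runs can be compared phase by phase.
import torch
import torch.distributed as dist

def log_peak_memory(tag: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.max_memory_allocated() / 2**20   # MB
    reserved = torch.cuda.max_memory_reserved() / 2**20     # MB
    print(f"[rank {rank}] {tag}: allocated={allocated:.0f} MB, reserved={reserved:.0f} MB")
    torch.cuda.reset_peak_memory_stats()                     # start fresh for the next phase

# e.g. call log_peak_memory("generation") after the generate phase and
# log_peak_memory("training") after the PPO update in step3's training loop.
```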

EeyoreLee commented 8 months ago

@Little-rookie-ee This is in line with the design. Since you are presumably using data parallelism (DDP) rather than model parallelism, ZeRO-2 only partitions the optimizer states and gradients, and the batch size is set per device (the per_device_xxx arguments), so the global batch size grows automatically as you add GPUs. The per-GPU memory therefore looks the same when you increase the number of GPUs, but the overall training time goes down.
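For illustration, here is a toy stand-alone sketch of the two knobs the reply refers to; it is not the DeepSpeed-Chat code, and the linear layer, learning rate, and batch sizes are placeholder values. ZeRO stage 2 partitions only optimizer states and gradients across data-parallel ranks, stage 3 additionally partitions the parameters, and the effective global batch is per-device micro-batch x world size x gradient accumulation steps, so it grows with the number of GPUs even when per-GPU memory stays flat.

```python
# Toy example (placeholder model and hyperparameters), launched with e.g.
# `deepspeed --num_gpus 8 this_script.py`.
import torch
import deepspeed

model = torch.nn.Linear(4096, 4096)  # stand-in for the actor model

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # same role as --per_device_training_batch_size
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},  # placeholder lr
    "zero_optimization": {
        "stage": 3,                        # 2 = optimizer states + gradients, 3 = + parameters
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

# Global batch = train_micro_batch_size_per_gpu * world_size * gradient_accumulation_steps,
# so it doubles going from 4 to 8 GPUs even though per-GPU memory stays roughly flat.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```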