microsoft / DeepSpeedExamples

Example models using DeepSpeed
Apache License 2.0

Step 3: OOM #378

Closed MAJIN123 closed 1 year ago

MAJIN123 commented 1 year ago

Steps 1 and 2 run normally. When running step 3, I hit an OOM (out of GPU memory) error again. Even with the batch size set to 1, it still fails. Does anyone know what is going on?

Hardware: 4 × V100-40G

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Hello-SimpleAI/HC3-Chinese \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 128 \
   --max_prompt_seq_len 128 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log

EikeKohl commented 1 year ago

Hey @MAJIN123, what are the actor / critic model architectures?

MAJIN123 commented 1 year ago

Hi @EikeKohl, the actor model is LLaMA 7B and the critic model is facebook/opt-350m.
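
For a rough sense of scale, the sketch below estimates fp16 weight memory for the four models that step 3 keeps resident (actor, reference, critic, reward). It assumes the reference model is a copy of the actor and the reward model is the same size as the critic, uses nominal parameter counts (7e9 and 350e6), and ignores activations, optimizer state, LoRA adapters, and generation caches, so real usage will be higher.

```python
# Back-of-the-envelope fp16 weight memory for RLHF step 3.
# Parameter counts are nominal assumptions, not measured values.
BYTES_PER_FP16_PARAM = 2
GIB = 1024 ** 3

nominal_params = {
    "actor (LLaMA 7B)": 7e9,
    "reference (copy of actor)": 7e9,
    "critic (opt-350m)": 350e6,
    "reward (same size as critic)": 350e6,
}


def weight_gib(num_params: float) -> float:
    """Memory in GiB needed to hold the weights alone in fp16."""
    return num_params * BYTES_PER_FP16_PARAM / GIB


per_model = {name: weight_gib(n) for name, n in nominal_params.items()}
total = sum(per_model.values())

for name, gib in per_model.items():
    print(f"{name:32s} {gib:6.2f} GiB")
print(f"{'total (unsharded, per GPU)':32s} {total:6.2f} GiB")
```

Under these assumptions the weights alone come to roughly 27 GiB per GPU if nothing is sharded or offloaded, and about half of that is the frozen reference model.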

AltenLi commented 1 year ago

Try ZeRO stage 2 or 3.
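
ZeRO stages differ in what they partition across GPUs: stage 1 shards optimizer states, stage 2 also shards gradients, and stage 3 also shards the parameters themselves. The sketch below applies the usual mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state) to a fully trainable model. The function name `model_state_gib` is hypothetical, and because LoRA trains far fewer parameters, this is an upper-bound illustration rather than a prediction for this run.

```python
def model_state_gib(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Per-GPU model-state memory (GiB) for mixed-precision Adam under ZeRO."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= num_gpus    # stage 1: shard optimizer states
    if zero_stage >= 2:
        grads /= num_gpus    # stage 2: also shard gradients
    if zero_stage >= 3:
        weights /= num_gpus  # stage 3: also shard parameters
    return num_params * (weights + grads + optim) / 1024 ** 3


# Nominal 7e9-parameter model on 4 GPUs, one line per ZeRO stage.
for stage in range(4):
    print(stage, round(model_state_gib(7e9, num_gpus=4, zero_stage=stage), 1))
```

Only stage 3 shrinks the per-GPU weight term, which may be why lower stages do not help much when frozen weights dominate memory.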

MAJIN123 commented 1 year ago

@AltenLi Still no luck, bro. It still runs out of GPU memory, which is really strange.

yaozhewei commented 1 year ago

Hi, you can try to offload the reference model. Please take a look at the
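
As one illustration of what offloading a frozen reference model can look like, the sketch below builds a DeepSpeed config dict that uses ZeRO stage 3 with parameters offloaded to CPU. The keys `zero_optimization`, `stage`, `offload_param`, `device`, and `pin_memory` are standard DeepSpeed config options. The helper name `build_ref_model_config` and the chosen values are hypothetical and are not taken from the DeepSpeed-Chat source.

```python
import json


def build_ref_model_config(offload: bool, micro_batch_size: int = 1) -> dict:
    """Hypothetical helper: DeepSpeed config for an inference-only (frozen) model."""
    zero = {"stage": 3 if offload else 0}
    if offload:
        # Parameter offload requires ZeRO stage 3; weights are kept in CPU RAM
        # and fetched to the GPU as layers need them.
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "fp16": {"enabled": True},
        "zero_optimization": zero,
    }


config = build_ref_model_config(offload=True)
print(json.dumps(config, indent=2))
```

Offloading trades GPU memory for host-to-device transfer time, so the reference forward pass is likely to be slower in exchange for the freed memory.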

MAJIN123 commented 1 year ago

tks bro @yaozhewei 😯

iamsile commented 1 year ago

@MAJIN123 Hi, I also hit OOM when running step 3 on V100s. How did you solve it in the end? I have also turned everything I can adjust down to the minimum.