Hey @MAJIN123, what are the actor / critic model architectures?
Hi @EikeKohl, actor model: LLaMA 7B; critic model: facebook/opt-350m.
Try ZeRO stage 2 or 3.
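For concreteness, a minimal sketch of that suggestion, using the `--actor_zero_stage` / `--critic_zero_stage` flags that already appear in the command further down:

```bash
# ZeRO stage 2 partitions optimizer state and gradients across GPUs;
# stage 3 additionally partitions the parameters themselves, which
# matters most for the 7B actor. These values feed the flags below.
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3
```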
@AltenLi Still doesn't work, bro. Still not enough GPU memory, which is really strange.
Hi, you can try to offload the reference model. Please take a look at the example training scripts.
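A minimal sketch of that advice, assuming your DeepSpeed-Chat checkout's step-3 `main.py` exposes an `--offload_reference_model` flag (ZeRO-Offload for the frozen reference model); verify with `python main.py --help`:

```bash
# Assumed flag -- confirm with `python main.py --help` in your checkout.
# Offloading keeps the frozen reference copy of the actor in CPU RAM
# instead of GPU memory, at the cost of slower experience generation.
EXTRA_ARGS="--offload_reference_model"
# Then append $EXTRA_ARGS to the deepspeed launch command below.
```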
Thanks bro @yaozhewei 😯
@MAJIN123 您好,我在用v100跑第三步的时候也遇到了oom的情况,请问您最后是怎么解决的哈,我这边也是把能调的都调到最小了
Steps 1 and 2 run normally, but when running step 3 I hit an OOM (out of memory) error again. Even with every batch size set to 1 it still fails. Does anyone know what's going on?
Hardware: 4 × V100-40G
```bash
Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Hello-SimpleAI/HC3-Chinese \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 128 \
   --max_prompt_seq_len 128 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```
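To see where in step 3 the memory actually blows up (during generation / experience collection vs. during the PPO update), it can help to watch per-GPU usage alongside the training log; this uses only standard `nvidia-smi` query options:

```bash
# Poll per-GPU memory once a second while step 3 runs; correlate any
# spike with the phase being printed in $OUTPUT/training.log.
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```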