l294265421 / alpaca-rlhf

Finetuning LLaMA with RLHF (Reinforcement Learning with Human Feedback) based on DeepSpeed Chat
https://88aeeb3aef5040507e.gradio.live/
MIT License

Reward model training gets stuck on V100 #4

Closed iamsile closed 1 year ago

iamsile commented 1 year ago

step: 82 loss:0.83251953125, correct_predictions: 0.0, reward: -0.50390625 r_reward: -0.487060546875
step: 83 loss:0.76611328125, correct_predictions: 0.0, reward: -0.492919921875 r_reward: -0.492431640625
step: 84 loss:0.7578125, correct_predictions: 0.0, reward: -0.5439453125 r_reward: -0.5361328125
step: 85 loss:0.83251953125, correct_predictions: 1.0, reward: -0.464111328125 r_reward: -0.467529296875
step: 86 loss:1.537109375, correct_predictions: 1.0, reward: -0.509765625 r_reward: -0.51708984375
step: 87 loss:0.6142578125, correct_predictions: 0.0, reward: -0.5087890625 r_reward: -0.48291015625
step: 88 loss:0.5380859375, correct_predictions: 0.0, reward: -0.451171875 r_reward: -0.44921875
[2023-05-17 14:28:38,358] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=7, lr=[0.0004994156634161006], mom=[(0.9, 0.95)]
[2023-05-17 14:28:38,359] [INFO] [timer.py:199:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=14.808713117576154, CurrSamplesPerSec=14.958361511223531, MemAllocated=12.34GB, MaxMemAllocated=22.87GB
step: 89 loss:0.67333984375, correct_predictions: 1.0, reward: -0.435302734375 r_reward: -0.43994140625
step: 90 loss:0.35107421875, correct_predictions: 1.0, reward: -0.421875 r_reward: -0.457275390625
step: 91 loss:0.7763671875, correct_predictions: 1.0, reward: -0.439453125 r_reward: -0.442138671875
step: 92 loss:0.69091796875, correct_predictions: 1.0, reward: -0.440185546875 r_reward: -0.46826171875
step: 93 loss:0.355712890625, correct_predictions: 1.0, reward: -0.432373046875 r_reward: -0.455078125
step: 94 loss:0.607421875, correct_predictions: 1.0, reward: -0.425537109375 r_reward: -0.427734375
step: 95 loss:0.87060546875, correct_predictions: 0.0, reward: -0.4775390625 r_reward: -0.468017578125
step: 96 loss:0.7841796875, correct_predictions: 1.0, reward: -0.39013671875 r_reward: -0.404541015625
step: 97 loss:1.23828125, correct_predictions: 0.0, reward: -0.40869140625 r_reward: -0.36572265625
step: 98 loss:0.87890625, correct_predictions: 0.0, reward: -0.445556640625 r_reward: -0.42333984375
[2023-05-17 14:28:43,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=7, lr=[0.0004992664502959351], mom=[(0.9, 0.95)]
[2023-05-17 14:28:43,805] [INFO] [timer.py:199:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=14.80846616069343, CurrSamplesPerSec=14.749032318751523, MemAllocated=12.34GB, MaxMemAllocated=22.87GB
step: 99 loss:0.7666015625, correct_predictions: 0.0, reward: -0.384033203125 r_reward: -0.382080078125


Hello, sorry to bother you. When I train the reward model on V100 GPUs, it gets stuck at step 99 every time: the program neither throws an error nor exits. nvidia-smi shows the GPUs are still fully occupied. Could you please take a look? PS: the program runs quickly up to that point, but after step 99 it makes no further progress.

This is the command I ran:
nohup deepspeed --num_gpus 8 /home/rlhf/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/main.py --data_output_path /home/rlhf/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/data_output --model_name_or_path decapoda-research/llama-7b-hf --num_padding_at_beginning 0 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 5e-4 --num_train_epochs 1 --gradient_accumulation_steps 1 --num_warmup_steps 0 --zero_stage 2 --deepspeed --output_dir /home/rlhf/alpaca_rlhf/deepspeed_chat/training/step2_reward_model_finetuning/data_output --lora_dim 2 --lora_module_name q_proj,k_proj --only_optimize_lora > nohup1.txt &

l294265421 commented 1 year ago


Comment out the following piece of code: [screenshot of the evaluation call, not preserved here]

That code runs an evaluation over the validation set every 100 steps.
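Since the screenshot did not survive, here is a minimal, self-contained sketch of the pattern being described: a full validation pass triggered from inside the training loop every 100 steps. The helper name evaluation_reward mirrors the DeepSpeed Chat step-2 script, but the toy model, data, and exact step condition are assumptions, not the actual code in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

# Toy stand-ins for the pairwise (chosen/rejected) reward-model data.
train_loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(300)]
eval_loader = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(50)]


def evaluation_reward(model, eval_loader):
    """One full pass over the validation set; with a 7B reward model and a
    large eval split this can take hours, which looks like a hang."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for x, y in eval_loader:
            total += F.mse_loss(model(x), y).item()
    model.train()
    return total / len(eval_loader)


for step, (x, y) in enumerate(train_loader):
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # This is the kind of hook the maintainer suggests commenting out: the
    # training log stops at step 99 because step 100 triggers the eval pass,
    # and training only resumes once the whole validation set has been scored.
    if (step + 1) % 100 == 0:
        eval_loss = evaluation_reward(model, eval_loader)
        print(f"step {step + 1}: eval loss {eval_loss:.4f}")
```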

iamsile commented 1 year ago

Hello, after commenting it out as you suggested, the problem is gone. I'd still like to understand, though, why the evaluation is so time-consuming: I let it run for 3 hours and it still hadn't finished a single evaluation pass.