Loss does not converge when training the RW (reward model) with ChatGLM-6B

GUORUIWANG commented 1 year ago

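I did a quick run with this data and the loss does not converge. Has anyone else run into this? Also, which base model should be used for the RW (reward model)?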

TRAIN_FILENAME="dev_data_external_v1.jsonl"

deepspeed --num_gpus 2 train_reward.py \
  --data_dir ' ' \
  --output_dir 'RLHF/output' \
  --model_name_or_path 'chatglm-6b' \
  --tokenizer_path 'chatglm-6b' \
  --max_length 512 \
  --logging_steps 1 \
  --save_steps 100 \
  --learning_rate 1e-6 \
  --do_train \
  --train_filename $TRAIN_FILENAME \
  --train_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --num_epochs 5 \
  --gradient_checkpointing \
  --deepspeed_config "stage-3.json"

sunzeyeah commented 1 year ago

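Hi, what does the loss curve look like?

Currently the reward model is trained on top of the SFT model, and in that setup it converges fairly quickly. I have not tried training the reward model directly from the original pretrained model.

When training SFT and Reward with ChatGLM-6B, I observed that the loss decreases rather slowly. Raising the learning rate to 1e-5 speeds up the decrease.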

GUORUIWANG commented 1 year ago

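Hi, thanks for the reply. My loss curve is a wavy line. I thought the RW model was trained on its own; does it have to be trained on top of the SFT model? The open-source DeepSpeed-Chat code trains the reward model on OPT-350M, a fairly small model, and uses OPT-1.3B for SFT. Thanks for the guidance!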

GUORUIWANG commented 1 year ago

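Hi, one more question. With the parameters here, batch_size = train_batch_size * gradient_accumulation_steps = 32, but their source code is a bit different: for multi-GPU training it is batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus = 32 * 2 = 64. Are we also using data parallelism here? Thanks for the guidance.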

sunzeyeah commented 1 year ago

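Judging from the loss curve, there is a spike in the middle. That may be because the model is fairly large, because convergence under fp16 is unstable, or because the learning rate is too high. It does look like it converged in the end, though.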

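In the InstructGPT paper, the reward model is trained on top of the SFT model. My understanding is that this should converge faster than starting directly from the pretrained model.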
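For readers judging whether a reward loss has "converged", a small sketch may help. It assumes the pairwise ranking loss described in the InstructGPT paper, -log(sigmoid(r_chosen - r_rejected)); whether train_reward.py uses exactly this form is an assumption, and the function name `pairwise_reward_loss` is made up for illustration.

```python
import math


def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(chosen_reward - rejected_reward)).

    Illustrative helper only (hypothetical name, not taken from the repo).
    """
    diff = chosen_reward - rejected_reward
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch to avoid overflow in exp().
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# A model that scores both answers the same has no preference yet.
untrained = pairwise_reward_loss(0.0, 0.0)
# Chosen answer scored higher than the rejected one: small loss.
good = pairwise_reward_loss(2.0, -1.0)
# Ranking is inverted: large loss.
bad = pairwise_reward_loss(-1.0, 2.0)

print(f"untrained={untrained:.4f} good={good:.4f} bad={bad:.4f}")
```

Under this loss, a value that hovers around log 2 (about 0.693) means the model is not yet separating chosen from rejected answers, which gives a reference point when reading a noisy curve.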

sunzeyeah commented 1 year ago

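Meaning of the parameters in train_reward.py:

train_batch_size: corresponds to train_micro_batch_size_per_gpu in the deepspeed config
gradient_accumulation_steps: corresponds to gradient_accumulation_steps in the deepspeed config

train_batch_size in the deepspeed config is defined as train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus.

Because the transformers Trainer and TrainingArguments classes are used, setting "train_batch_size": "auto" in the deepspeed config lets it be computed automatically from the formula above. See DeepSpeed Configuration for details.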
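As a quick check of that formula against the command in the original post (--train_batch_size 8, --gradient_accumulation_steps 4, --num_gpus 2), here is a minimal sketch. The helper name `effective_batch_size` is hypothetical and is not part of train_reward.py or DeepSpeed.

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Global batch size per optimizer step under data parallelism.

    Hypothetical helper mirroring the DeepSpeed definition:
    train_batch_size = train_micro_batch_size_per_gpu
                       * gradient_accumulation_steps * num_gpus
    """
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# Values taken from the command in this issue.
total = effective_batch_size(micro_batch_per_gpu=8, grad_accum_steps=4, num_gpus=2)
print(total)  # 64
```

So the 32 in the question is the per-GPU amount (8 * 4), and the global batch across both GPUs is 64.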

GUORUIWANG commented 1 year ago

Got it, thanks for the explanation.

xikaluo commented 1 year ago

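Hello, I'd like to ask: in a normal training run, what should the reward model's loss curve look like? Is it an upward-bulging curve like y = log x, or a downward-bowing curve like y = 1/x? Also, in the loss curve you quoted, the loss values at the start and end of training are basically the same. Is that common when training a reward model? Thanks~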

sunzeyeah / RLHF

Loss does not converge when training the RW (reward model) with ChatGLM-6B #10