[Closed] bigcash closed this issue 4 days ago
root@gpu# llamafactory-cli env
[2024-06-21 07:55:43,744] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
llamafactory version: 0.8.2
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train examples/train_full/qwen2_7b_full_pt.yaml
I was doing full-parameter pre-training (stage: pt) fine-tuning of the Qwen2-7B-Instruct model when the loss suddenly jumped and then stayed high. I therefore interrupted training and resumed from the last step that still had a normal loss, while also lowering the learning rate to 1e-7.
I found two problems:
Did I change something incorrectly?
Screenshot of the loss suddenly increasing: ![image](https://github.com/hiyouga/LLaMA-Factory/assets/15136420/5e867ceb-1847-4bdc-bc05-aec8e414dce7)
Screenshot after resuming training: ![image](https://github.com/hiyouga/LLaMA-Factory/assets/15136420/29bb3c04-2971-4efb-acdc-7dae7ccb3937)
After interrupting training I made the following two changes:
- in checkpoint-600/trainer_state.json, set jsondata['log_history'][-2]['learning_rate'] to 1e-07
- in qwen2_7b_full_pt.yaml, added resume_from_checkpoint: saves/qwen2_7b/full/pt/checkpoint-600 and set learning_rate: 1e-7
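For reference, the first of those two edits can be scripted. This is a minimal sketch with a made-up `log_history`; the real trainer_state.json written by the transformers Trainer contains many more fields, and the checkpoint path here is illustrative:

```python
import json
import os
import tempfile

# Hypothetical minimal trainer_state.json; a real one has many more fields.
state = {
    "global_step": 600,
    "log_history": [
        {"step": 550, "learning_rate": 9.8e-06, "loss": 2.1},
        {"step": 600, "learning_rate": 9.7e-06, "loss": 2.0},
        {"step": 600, "eval_loss": 2.05},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "trainer_state.json")
with open(path, "w") as f:
    json.dump(state, f)

# The edit described above: set the next-to-last log entry's learning_rate.
with open(path) as f:
    data = json.load(f)
data["log_history"][-2]["learning_rate"] = 1e-07
with open(path, "w") as f:
    json.dump(data, f, indent=2)

with open(path) as f:
    print(json.load(f)["log_history"][-2]["learning_rate"])  # 1e-07
```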
The full contents of qwen2_7b_full_pt.yaml:
```yaml
### model
model_name_or_path: /data2/lb/models/Qwen2-7B-Instruct
resume_from_checkpoint: saves/qwen2_7b/full/pt/checkpoint-600

### method
stage: pt
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json
flash_attn: sdpa

### dataset
dataset: skypile10B,edu_10B,mathpile,web_pt,mm
template: qwen
cutoff_len: 8192
overwrite_cache: true
preprocessing_num_workers: 256

### output
output_dir: saves/qwen2_7b/full/pt
logging_steps: 50
save_steps: 50
plot_loss: true
overwrite_output_dir: true
report_to: tensorboard

### train
per_device_train_batch_size: 3
gradient_accumulation_steps: 64
learning_rate: 1.0e-7
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.0005
bf16: true

### eval
val_size: 0.0001
per_device_eval_batch_size: 3
eval_strategy: steps
eval_steps: 50
```
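For context on the training scale, the effective global batch size implied by this config can be computed directly (assuming all 8 GPUs from CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 are used for data parallelism):

```python
# Effective global batch size implied by the config above.
per_device_train_batch_size = 3
gradient_accumulation_steps = 64
num_gpus = 8  # from CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

global_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch)  # 1536 sequences (each up to cutoff_len=8192 tokens) per optimizer step
```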
I suggest retraining from scratch; the lr cannot be modified when resuming from a checkpoint.
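The reason changing the lr on resume has no effect: the Trainer restores the scheduler state from the checkpoint, so the lr at the resumed step is still derived from the original peak lr and total step count, not from a `learning_rate` edited in the yaml afterwards. A sketch of the standard cosine-with-warmup formula (mirroring the behavior of transformers' `get_cosine_schedule_with_warmup`, not LLaMA-Factory's exact code; the step counts below are illustrative):

```python
import math

def cosine_lr(step, max_lr, warmup_steps, total_steps):
    """Cosine decay with linear warmup, as in lr_scheduler_type: cosine."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    # Cosine decay from max_lr down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# On resume, the restored scheduler recomputes the lr at step 600 from the
# ORIGINAL max_lr and total_steps, ignoring a later yaml edit.
print(cosine_lr(600, max_lr=1e-5, warmup_steps=5, total_steps=1000))
```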