Reminder

System Info

- GPU: A100_SXM_80GB * 8
- llamafactory version: 0.8.3.dev0

Reproduction
llamafactory-cli train \
    --model_name_or_path llama3 \
    --stage sft \
    --do_train true \
    --finetuning_type lora \
    --lora_target all \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --dataset xxxxx \
    --template llama3 \
    --cutoff_len 2048 \
    --max_samples 1000 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --output_dir saves/xxx \
    --logging_steps 10 \
    --save_steps 500 \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --learning_rate 1.0e-5 \
    --num_train_epochs 3.0 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --fp16 true \
    --ddp_timeout 180000000 \
    --val_size 0.1 \
    --per_device_eval_batch_size 1 \
    --eval_strategy steps \
    --eval_steps 500
training_eval_loss: ![image](https://github.com/hiyouga/LLaMA-Factory/assets/13595509/3fb2b9b3-b68b-46a5-b153-c9ecd134e3e4)

The training_eval_loss plot comes out without a curve, yet eval_results.json does contain values.

Expected behavior

The eval loss plot should show a curve.

Others

No response
Resolution: reduce eval_steps. With max_samples 1000 and val_size 0.1 there are roughly 900 training samples; at per_device_train_batch_size 1 on 8 GPUs with gradient_accumulation_steps 2, each optimizer step consumes 16 samples, so 3 epochs amount to only about 170 steps. An eval_steps of 500 therefore never triggers an evaluation during training, leaving plot_loss with no eval points to draw; only the final evaluation runs, which is why eval_results.json still contains values.
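A minimal fix is to rerun with a smaller evaluation interval so that several eval points land inside those ~170 steps. The interval of 20 below is an illustrative value, not one taken from the issue; everything else is unchanged from the command above:

```sh
# Same run as above, with the evaluation interval lowered so evaluation
# fires several times within the ~170 total optimizer steps.
# eval_steps=20 is an illustrative choice, not from the issue.
llamafactory-cli train \
    --model_name_or_path llama3 \
    --stage sft \
    --do_train true \
    --finetuning_type lora \
    --lora_target all \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --dataset xxxxx \
    --template llama3 \
    --cutoff_len 2048 \
    --max_samples 1000 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --output_dir saves/xxx \
    --logging_steps 10 \
    --save_steps 500 \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --learning_rate 1.0e-5 \
    --num_train_epochs 3.0 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --fp16 true \
    --ddp_timeout 180000000 \
    --val_size 0.1 \
    --per_device_eval_batch_size 1 \
    --eval_strategy steps \
    --eval_steps 20
```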
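To verify that in-training evaluations actually happened, you can count the eval entries in the step log before expecting a curve. This is a sketch that assumes LLaMA-Factory wrote its per-step log to trainer_log.jsonl under the output_dir from the issue, with evaluation steps carrying an eval_loss field:

```sh
# Each line of trainer_log.jsonl is one logged step; evaluation steps
# carry an "eval_loss" field. Zero matches means there was nothing for
# plot_loss to draw.
grep -c '"eval_loss"' saves/xxx/trainer_log.jsonl
```

A count of zero matches the behavior reported above: eval_results.json gets its values from the final evaluation, which runs regardless of whether eval_steps ever divided the step count during training.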