hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs
Apache License 2.0

deepspeed zero3: training_eval_loss plot is blank #4459

Closed ycjcl868 closed 4 days ago

ycjcl868 commented 4 days ago

System Info

A100_SXM_80GB * 8

Reproduction

llamafactory-cli train --model_name_or_path llama3 \
  --stage sft \
  --do_train true \
  --finetuning_type lora \
  --lora_target all \
  --deepspeed examples/deepspeed/ds_z3_config.json \
  --dataset xxxxx \
  --template llama3 \
  --cutoff_len 2048 \
  --max_samples 1000 \
  --overwrite_cache true \
  --preprocessing_num_workers 16 \
  --output_dir saves/xxx \
  --logging_steps 10 \
  --save_steps 500 \
  --plot_loss true \
  --overwrite_output_dir true \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --learning_rate 1.0e-5 \
  --num_train_epochs 3.0 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.1 \
  --fp16 true \
  --ddp_timeout 180000000 \
  --val_size 0.1 \
  --per_device_eval_batch_size 1 \
  --eval_strategy steps \
  --eval_steps 500
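
For reference, a back-of-the-envelope step count for this config (a sketch, not taken from the logs: it assumes all 1000 samples survive preprocessing, a 10% validation split, and all 8 GPUs training):

import math

train_samples = 1000 * (1 - 0.1)   # max_samples minus the 10% val_size split = 900
effective_batch = 1 * 8 * 2        # per_device_train_batch_size * GPUs * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)  # ~57 optimizer steps
total_steps = steps_per_epoch * 3  # num_train_epochs = 3 -> ~171 steps
print(total_steps < 500)           # True: the run ends before eval_steps=500 ever fires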

training_eval_loss: [screenshot: the training_eval_loss plot is empty]

eval_results.json does contain values.

Expected behavior

The eval-loss plot should show a curve.

Others

No response

hiyouga commented 4 days ago

Reduce eval_steps.
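
With the configuration above the whole run takes roughly 170 optimizer steps (see the sketch under Reproduction), so with eval_steps=500 no in-training evaluation ever fires and the curve has no points to draw; eval_results.json still has values, presumably from the final evaluation pass after training. Lowering eval_steps below the total step count (e.g. 50) should give the plot data to show.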