[Open] HackGiter opened this issue 1 month ago
Thanks for your effort. May I ask how you solved it? I have the same problem.
Actually, I changed my model and used a different dataset to work around it, and it still happens occasionally.
Describe the bug I tried to use pipeline parallelism with a transformer model, but training gets stuck partway through.
To Reproduce Steps to reproduce the behavior:
Expected behavior Training exits successfully.
Additional Information I checked the execution and found that it gets stuck in LoadMicroBatch in PipelineEngine.
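One common cause of a hang at the micro-batch loading step is that ranks do not agree on how many micro-batches each optimizer step needs: if one rank's dataloader runs out of samples early, the other ranks block waiting on a collective. The sketch below is a hypothetical sanity check (the function name and arithmetic are my own assumptions, not DeepSpeed API) for estimating how many full gradient-accumulation steps each data-parallel rank can serve:

```python
# Hypothetical sanity check: every data-parallel rank must supply the same
# number of micro-batches per optimizer step; an uneven split can leave some
# pipeline stages stuck waiting for data.
def full_steps_per_rank(num_samples, per_device_batch, grad_accum, dp_world_size):
    """Estimate how many complete gradient-accumulation steps each rank can serve."""
    samples_per_rank = num_samples // dp_world_size      # even split across ranks
    microbatches = samples_per_rank // per_device_batch  # micro-batches per rank
    return microbatches // grad_accum                    # full accumulation steps

# Illustrative numbers loosely based on the launch flags below
# (max_samples 1000, per-device batch 2, gradient accumulation 8),
# assuming a hypothetical data-parallel degree of 4:
print(full_steps_per_rank(1000, 2, 8, 4))
```

If this value differs from what the training loop expects (e.g. because `--max_samples` truncates the dataset unevenly, or the validation split changes the count), checking it per rank is a cheap first diagnostic before digging into the pipeline engine itself.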
System info (please complete the following information):
Launcher context
#!/bin/bash
NNODES=1 deepspeed \
  --include localhost:4,5,6,7 \
  general_distributed_training.py \
  --model_name_or_path /data0/pretrained-models/deepseek-llm-7b-base \
  --stage sft \
  --do_train true \
  --finetuning_type full \
  --output_dir saves/llama3-8b/full/sft \
  --logging_steps 100 \
  --save_steps 500 \
  --plot_loss true \
  --overwrite_output_dir true \
  --deepspeed \
  --deepspeed_config general_distributed_training.json \
  --dataset_dir /home/hzli/LLaMA-Factory/data \
  --dataset identity,alpaca_gpt4_en \
  --template llama3 \
  --cutoff_len 2048 \
  --max_samples 1000 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --bf16 true \
  --learning_rate 0.0001 \
  --num_train_epochs 3.0 \
  --lr_scheduler_type cosine \
  --val_size 0.1 \
  --per_device_eval_batch_size 1 \
  --evaluation_strategy steps \
  --eval_steps 500 \
  --disable_tqdm false \
  --p 4 \
  --steps 1000