amoyplane opened this issue 2 weeks ago
Reminder
System Info
llamafactory version: 0.9.0
Platform: Linux
Python version: 3.10.14
PyTorch version: 2.4.1
Transformers version: 4.44.2
Accelerate version: 0.34.2
GPU type: NVIDIA H20-100GB
DeepSpeed version: 0.15.1
Reproduction
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train \
    --stage sft \
    --do_predict \
    --model_name_or_path /home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/ \
    --eval_dataset ****** \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type full \
    --output_dir ./saves/sft_demo/lora/predict/aug_cot_new_tmplate_qwen_qw2514bpre_full \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_eval_batch_size 1 \
    --max_samples 2000 \
    --predict_with_generate

Error message:

W0927 10:09:19.925000 140389975775040 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 181470 closing signal SIGTERM
E0927 10:09:21.341000 140389975775040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -8) local_rank: 0 (pid: 181469) of binary: /home/admin/anaconda3/envs/llama_factory/bin/python3
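For context: torch.distributed.elastic reports a negative exitcode when a worker process is killed by a signal, so exitcode: -8 means the rank-0 worker died from signal 8 (SIGFPE) rather than from an ordinary Python exception. The signal name can be checked from a bash shell:

# elastic's "exitcode: -N" means the worker was killed by signal N;
# bash's kill builtin prints the name for signal number 8:
kill -l 8
# -> FPE  (arithmetic / floating-point exception)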
Expected behavior
No obvious GPU, CPU, or memory bottleneck was observed, and full-parameter SFT training runs normally; only the do_predict stage fails. What could be the cause? Thanks.

Note: running do_predict with the original base model (before fine-tuning) shows the same failure.
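A debugging sketch (a suggestion about where to look, not a confirmed fix): rerunning the identical prediction on a single GPU takes the torch.distributed.elastic launcher and NCCL out of the picture, which helps localize the SIGFPE. All flags below are unchanged from the Reproduction command; only CUDA_VISIBLE_DEVICES differs:

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_predict \
    --model_name_or_path /home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/ \
    --eval_dataset ****** \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type full \
    --output_dir ./saves/sft_demo/lora/predict/aug_cot_new_tmplate_qwen_qw2514bpre_full \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_eval_batch_size 1 \
    --max_samples 2000 \
    --predict_with_generate

If this single-GPU run completes, the crash is likely in the distributed generation path; if it still dies with SIGFPE, the problem is in prediction/generation itself.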
Others
No response