hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

H20-100GB*8, Qwen-14B full SFT: run terminates during the do_predict stage, failed (exitcode: -8) #5559

Open amoyplane opened 2 weeks ago

amoyplane commented 2 weeks ago

System Info

llamafactory version: 0.9.0
Platform: Linux
Python version: 3.10.14
PyTorch version: 2.4.1
Transformers version: 4.44.2
Accelerate version: 0.34.2
GPU type: NVIDIA H20-100GB
DeepSpeed version: 0.15.1

Reproduction

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train \
    --stage sft \
    --do_predict \
    --model_name_or_path /home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/ \
    --eval_dataset ****** \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type full \
    --output_dir ./saves/sft_demo/lora/predict/aug_cot_new_tmplate_qwen_qw2514bpre_full \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_eval_batch_size 1 \
    --max_samples 2000 \
    --predict_with_generate

Error message

W0927 10:09:19.925000 140389975775040 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 181470 closing signal SIGTERM
E0927 10:09:21.341000 140389975775040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -8) local_rank: 0 (pid: 181469) of binary: /home/admin/anaconda3/envs/llama_factory/bin/python3
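For reference (my note, not part of the original log): a negative exitcode from torch.distributed.elastic means the worker process was killed by a signal, and -8 maps to SIGFPE (an arithmetic fault, e.g. an integer division by zero in native code). The mapping can be confirmed from Python:

import signal

# elastic reports signal-killed workers as a negative exit code;
# exitcode -8 therefore means the worker received signal 8.
print(signal.Signals(8).name)  # -> SIGFPE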

Expected behavior

No obvious GPU, CPU, or memory bottleneck was observed, and the full SFT training stage runs normally; only do_predict fails. What could the cause be? Thanks.
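One way to narrow this down (my suggestion, not from the original report) is to run generation on the same checkpoint with plain transformers in a single process, which takes llamafactory-cli, DeepSpeed, and torch.distributed out of the loop; if this also dies with SIGFPE, the fault is in the model/driver stack rather than the launcher. A minimal sketch, reusing the masked checkpoint path from the command above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Masked checkpoint path copied from the repro command above.
ckpt = "/home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# If generation itself triggers SIGFPE, this single-process run crashes too.
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))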

Others

No response

amoyplane commented 2 weeks ago

Note: running do_predict with the base model (before fine-tuning) shows the same behavior.
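Since the untrained base model fails the same way, a further isolation step worth trying (my suggestion, not from the original report) is to rerun the identical arguments on a single GPU, which removes the 8-way elastic launch and NCCL from the picture. A sketch with a hypothetical output_dir, decoding a signal exit the same way torch.distributed.elastic does:

import os
import signal
import subprocess

# Same arguments as the repro command above, restricted to one GPU
# so torch.distributed.elastic and NCCL are not involved.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
cmd = [
    "llamafactory-cli", "train",
    "--stage", "sft",
    "--do_predict",
    "--model_name_or_path", "/home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/",
    "--eval_dataset", "******",
    "--dataset_dir", "./data",
    "--template", "qwen",
    "--finetuning_type", "full",
    "--output_dir", "./saves/sft_demo/lora/predict/single_gpu_test",  # hypothetical
    "--overwrite_cache",
    "--overwrite_output_dir",
    "--cutoff_len", "1024",
    "--preprocessing_num_workers", "16",
    "--per_device_eval_batch_size", "1",
    "--max_samples", "2000",
    "--predict_with_generate",
]
proc = subprocess.run(cmd, env=env)
if proc.returncode < 0:
    # A negative return code from subprocess also means "killed by signal".
    print("worker killed by", signal.Signals(-proc.returncode).name)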