hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

H20-100GB*8, Qwen-14B full SFT: run terminates during the do_predict stage, failed (exitcode: -8) #5559

Open amoyplane opened 2 weeks ago

amoyplane commented 2 weeks ago

System Info

llamafactory version: 0.9.0
Platform: Linux
Python version: 3.10.14
PyTorch version: 2.4.1
Transformers version: 4.44.2
Accelerate version: 0.34.2
GPU type: NVIDIA H20-100GB
DeepSpeed version: 0.15.1

Reproduction

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train \
    --stage sft \
    --do_predict \
    --model_name_or_path /home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/ \
    --eval_dataset ****** \
    --dataset_dir ./data \
    --template qwen \
    --finetuning_type full \
    --output_dir ./saves/sft_demo/lora/predict/aug_cot_new_tmplate_qwen_qw2514bpre_full \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_eval_batch_size 1 \
    --max_samples 2000 \
    --predict_with_generate

Error message

W0927 10:09:19.925000 140389975775040 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 181470 closing signal SIGTERM
E0927 10:09:21.341000 140389975775040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -8) local_rank: 0 (pid: 181469) of binary: /home/admin/anaconda3/envs/llama_factory/bin/python3
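For reference (my note, not part of the original log): a negative exitcode from torch.distributed.elastic means the worker process was killed by a signal, and -8 maps to SIGFPE (an arithmetic fault, e.g. an integer division by zero in native code). The mapping can be confirmed from Python:

import signal

# elastic reports signal-killed workers as a negative exit code;
# exitcode -8 therefore means the worker received signal 8.
print(signal.Signals(8).name)  # -> SIGFPE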

Expected behavior

No obvious GPU, CPU, or memory bottleneck was observed, and the full SFT training stage runs normally; only do_predict fails. What could the cause be? Thanks.
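One way to narrow this down (my suggestion, not from the original report) is to run generation on the same checkpoint with plain transformers in a single process, which takes llamafactory-cli, DeepSpeed, and torch.distributed out of the loop; if this also dies with SIGFPE, the fault is in the model/driver stack rather than the launcher. A minimal sketch, reusing the masked checkpoint path from the command above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Masked checkpoint path copied from the repro command above.
ckpt = "/home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
# If generation itself triggers SIGFPE, this single-process run crashes too.
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))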

Others

No response

amoyplane commented 2 weeks ago

Note: running do_predict with the base model (before fine-tuning) shows the same behavior.
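Since the untrained base model fails the same way, a further isolation step worth trying (my suggestion, not from the original report) is to rerun the identical arguments on a single GPU, which removes the 8-way elastic launch and NCCL from the picture. A sketch with a hypothetical output_dir, decoding a signal exit the same way torch.distributed.elastic does:

import os
import signal
import subprocess

# Same arguments as the repro command above, restricted to one GPU
# so torch.distributed.elastic and NCCL are not involved.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")
cmd = [
    "llamafactory-cli", "train",
    "--stage", "sft",
    "--do_predict",
    "--model_name_or_path", "/home/admin/******/LLaMA-Factory/saves/sft_demo/full/sft/aug_cot_new_tmplate_qwen_qw2514bpre/checkpoint-7500/",
    "--eval_dataset", "******",
    "--dataset_dir", "./data",
    "--template", "qwen",
    "--finetuning_type", "full",
    "--output_dir", "./saves/sft_demo/lora/predict/single_gpu_test",  # hypothetical
    "--overwrite_cache",
    "--overwrite_output_dir",
    "--cutoff_len", "1024",
    "--preprocessing_num_workers", "16",
    "--per_device_eval_batch_size", "1",
    "--max_samples", "2000",
    "--predict_with_generate",
]
proc = subprocess.run(cmd, env=env)
if proc.returncode < 0:
    # A negative return code from subprocess also means "killed by signal".
    print("worker killed by", signal.Signals(-proc.returncode).name)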