hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Cannot SFT Qwen2.5-0.5B, exitcode: -8, traceback: Signal 8 (SIGFPE) #5713

Closed: XLydia closed this issue 5 days ago

XLydia commented 5 days ago

Reminder

System Info

Reproduction

Launch script

source activate
cd /root/LLaMA-Factory
conda activate llama-factory
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /mnt/data/models/Qwen/Qwen2.5-0.5B \
    --preprocessing_num_workers 16 \
    --finetuning_type full \
    --template qwen \
    --flash_attn fa2 \
    --dataset_dir data \
    --dataset alpaca_en_demo \
    --cutoff_len 4096 \
    --learning_rate 5e-06 \
    --num_train_epochs 5.0 \
    --max_samples 400000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 5000 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir /mnt/data/models/run/Qwen2.5-0.5B \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --save_only_model \
    --deepspeed /root/LLaMA-Factory/examples/deepspeed/ds_z2_config.json
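
For context, llamafactory-cli dispatches multi-GPU runs to torchrun with src/llamafactory/launcher.py as the entrypoint (this is visible in the traceback below). A roughly equivalent explicit launch would be the sketch below; the worker count of 8 is an assumption inferred from the eight worker processes in the log, and the remaining arguments are the ones listed above:

torchrun --nproc_per_node 8 /root/LLaMA-Factory/src/llamafactory/launcher.py \
    --stage sft --do_train True \
    --model_name_or_path /mnt/data/models/Qwen/Qwen2.5-0.5B \
    --deepspeed /root/LLaMA-Factory/examples/deepspeed/ds_z2_config.json
    # plus the remaining arguments from the llamafactory-cli command above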

Error message

W1015 09:36:51.588000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176359 closing signal SIGTERM
W1015 09:36:51.588000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176360 closing signal SIGTERM
W1015 09:36:51.589000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176361 closing signal SIGTERM
W1015 09:36:51.590000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176362 closing signal SIGTERM
W1015 09:36:51.591000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176363 closing signal SIGTERM
W1015 09:36:51.592000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176365 closing signal SIGTERM
W1015 09:36:51.592000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 176366 closing signal SIGTERM
E1015 09:36:53.273000 139670032881472 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -8) local_rank: 5 (pid: 176364) of binary: /opt/miniconda3/envs/llama-factory/bin/python
Traceback (most recent call last):
  File "/opt/miniconda3/envs/llama-factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-15_09:36:51
  host      : c5b6aa04dab1
  rank      : 5 (local_rank: 5)
  exitcode  : -8 (pid: 176364)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 176364
============================================================

Expected behavior

Run SFT on Qwen2.5 with the dataset.

Others

In addition, with one of my own datasets, Qwen2.5-0.5B and Qwen2.5-3B can be fine-tuned normally while Qwen2.5-1.5B cannot. With another dataset, Qwen2.5-0.5B can be fine-tuned but neither 1.5B nor 3B can. The behavior seems to differ across datasets as well.
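
One way to narrow down this kind of dataset- and model-size-dependent failure (a suggested debugging step, not something from the original report) is to rerun on a single GPU without DeepSpeed, which takes the distributed and ZeRO-2 code paths out of the picture. A minimal sketch reusing the arguments above, with the output directory as a placeholder:

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft --do_train True \
    --model_name_or_path /mnt/data/models/Qwen/Qwen2.5-0.5B \
    --finetuning_type full --template qwen \
    --dataset_dir data --dataset alpaca_en_demo \
    --cutoff_len 4096 --per_device_train_batch_size 4 \
    --bf16 True --output_dir /mnt/data/models/run/debug-single-gpu
    # keep the other flags from the original command, but drop --deepspeed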

XLydia commented 5 days ago

Solved. The cause was that the installed torch build did not match the local CUDA version. Since this machine has CUDA 12.4 or later, torch should be installed with pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124.
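
A quick way to confirm that the installed torch build really matches the local CUDA setup (generic checks, not specific to LLaMA-Factory):

nvidia-smi        # shows the driver and the highest CUDA runtime it supports
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -m torch.utils.collect_env   # full environment report, handy when filing issues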

BeerTai commented 5 days ago

(Quoting the fix above: the torch build has to match the local CUDA version.) After you fine-tuned the 0.5B model, does it generate repetitive output when no system prompt is added and the input content is short?
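
In case it helps answer the repetition question, a fine-tuned checkpoint can be probed interactively with the chat subcommand; a small sketch, assuming the output directory from the command at the top of the issue holds the final checkpoint:

llamafactory-cli chat \
    --model_name_or_path /mnt/data/models/run/Qwen2.5-0.5B \
    --template qwen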