modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 250+ LLMs or 35+ MLLMs. (Qwen2, GLM4, Internlm2, Yi, Llama3, Llava, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://github.com/modelscope/swift/blob/main/docs/source/LLM/index.md
Apache License 2.0

failed (exitcode: -9) local_rank: 0 (pid: 1760809) of binary: /home/xxx/miniconda3/envs/swift/bin/python #1208

Open zengzwww opened 1 week ago

zengzwww commented 1 week ago

Describe the bug

Hello, we encountered an error when fine-tuning the GLM-4V-9B model. Our fine-tuning command is:

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path ../../models/ZhipuAI/glm-4v-9b \
    --dataset ../../data/substation/sft/train.json \
    --val_dataset ../../data/substation/sft/test.json \
    --num_train_epochs 3 \
    --sft_type lora \
    --ddp_find_unused_parameters true \
    --eval_steps 150 \
    --save_steps 1 \
    --output_dir ../../experiments/glm-4v-9b-sft \
    --deepspeed zero3-offload
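
Note: the zero3-offload preset typically means DeepSpeed ZeRO-3 with CPU offload of parameters and optimizer states, so besides GPU memory the run also needs a lot of free host RAM. A minimal pre-flight check of both, as a sketch assuming psutil is installed (illustrative only, not part of ms-swift):

# Minimal sketch: check free host RAM and GPU memory before launching.
import psutil
import torch

vm = psutil.virtual_memory()
print(f"host RAM: {vm.available / 2**30:.1f} GiB free of {vm.total / 2**30:.1f} GiB")

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")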

When global_step reaches 1, the program saves the model (we set --save_steps 1), and then the following error appears:

{'loss': 6.52929688, 'acc': 0.03333334, 'grad_norm': 23.93275214, 'learning_rate': 0.0, 'memory(GiB)': 10.31, 'train_speed(iter/s)': 0.01689, 'epoch': 0.01, 'global_step': 1}             
Train:   0%|▎                                                                                                                                            | 1/531 [00:56<8:16:19, 56.19s/it]
W0622 19:47:13.438884 140373504014144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 1760810 closing signal SIGTERM
E0622 19:47:17.565751 140373504014144 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 1760809) of binary: /home/xxx/miniconda3/envs/swift/bin/python
Traceback (most recent call last):
  File "/home/xxx/miniconda3/envs/swift/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/miniconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/xxx/miniconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/xxx/miniconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/xxx/miniconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxx/miniconda3/envs/swift/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/xxx/miniconda3/envs/swift/lib/python3.9/site-packages/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-22_19:47:13
  host      : industai-Super-Server
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 1760809)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1760809
============================================================

How to solve this problem?

Your hardware and system info

Additional context

tastelikefeet commented 3 days ago

The log line "Sending process 1760810 closing signal SIGTERM" and the exit code -9 (SIGKILL) usually mean the process was killed by the operating system, most often by the OOM killer when host memory runs out. Can you check your memory status?
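
To make that easy to see, a lightweight monitor run alongside training can show whether host RAM is exhausted right when the first checkpoint is written at step 1. A minimal sketch assuming psutil is installed (the one-second interval and the log file name are arbitrary choices):

# monitor_mem.py: print and log host memory usage once per second.
# Run in a second terminal while the fine-tuning job is running.
import time
import psutil

with open("mem_usage.log", "w") as log:
    while True:
        vm = psutil.virtual_memory()
        line = (f"{time.strftime('%H:%M:%S')} "
                f"used={vm.used / 2**30:.1f}GiB "
                f"available={vm.available / 2**30:.1f}GiB "
                f"percent={vm.percent}%")
        print(line)
        log.write(line + "\n")
        log.flush()
        time.sleep(1)

If available memory collapses just as the step-1 checkpoint is saved, that would point to host-memory pressure from the ZeRO-3 offload combined with saving every step.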