lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #1200

Open whk6688 opened 1 year ago

whk6688 commented 1 year ago

The parameters are:

torchrun --nproc_per_node=1 --master_port=20001 \
    FastChat/fastchat/train/train_mem.py \
    --model_name_or_path /home/wanghaikuan/vicuna-7b \
    --data_path /home/wanghaikuan/chat/playground_data_dummy.json \
    --bf16 False \
    --output_dir output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess False

platform: V100 with 2 cards
memory: 256G
python: 3.9
cuda: 11.7

whk6688 commented 1 year ago

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 243859) of binary: /home/wanghaikuan/anaconda3/envs/python39/bin/python
Traceback (most recent call last):
  File "/home/wanghaikuan/anaconda3/envs/python39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wanghaikuan/anaconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wanghaikuan/anaconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/wanghaikuan/anaconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/wanghaikuan/anaconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wanghaikuan/anaconda3/envs/python39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
FastChat/fastchat/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-11_16:59:10
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 243859)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 243859
=====================================================
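
For anyone hitting the same exit code -11 (SIGSEGV), a quick environment check can help rule out a torch / CUDA / flash-attn build mismatch, which is one common cause of segfaults when launching train_mem.py. This is only a diagnostic sketch, not part of FastChat; the package names below are examples:

import torch
from importlib.metadata import PackageNotFoundError, version

# Print the build and runtime versions that matter for train_mem.py.
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

for pkg in ("flash-attn", "transformers", "fschat"):
    try:
        print(pkg + ":", version(pkg))
    except PackageNotFoundError:
        print(pkg + ": not installed")
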
whk6688 commented 1 year ago

Can anyone help me?

whk6688 commented 1 year ago

@merrymercy I would appreciate your answer!

Minxiangliu commented 1 year ago

I also encountered a similar issue when fine-tuning LLaMA, and I hope someone can assist in answering it!

/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
  warnings.warn(
Loading checkpoint shards:   0%|                                | 0/2 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 6269) of binary: /root/miniconda3/envs/vicuna/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/vicuna/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
FastChat/fastchat/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-16_07:09:12
  host      : mx-69977d7b58-zrz6r
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 6269)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 6269
=====================================================
Void-fun commented 1 year ago

I have the same problem, please help. I need to finish my final qualifying work in one week :(

Minxiangliu commented 1 year ago

I have the same problem, please help. I need to finish my final qualifying work in one week :(

Hi @whk6688 @Void-fun, in the end I used the following configuration to perform fine-tuning locally.

platform: Ubuntu 18.04, A100 (40GB) with 2 cards
memory: 256G
python: 3.10
cuda: 11.6
nvidia driver: 510.73
torch: 2.0.1+cu117
flash-attn: 1.0.5
fschat: 0.2.10

Installation order for the environment: CUDA -> new Python 3.10 environment -> torch 2.0.1+cu117 -> flash-attn -> fschat

Modify the following file: ..../site-packages/torch/distributed/fsdp/_state_dict_utils.py

At line 309:

# state_dict[fqn] = state_dict[fqn].clone().detach()
state_dict[fqn] = state_dict[fqn].cpu().clone().detach()
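
For context, the patch above moves the gathered full state dict to CPU before it is cloned, which avoids running out of GPU memory while saving. A roughly equivalent, hedged alternative (assuming torch 2.0's FSDP API, with model standing in for the FSDP-wrapped module) is to request CPU offload through FullStateDictConfig instead of editing site-packages:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

# Gather the full state dict on CPU and materialize it only on rank 0, so the
# checkpoint does not have to fit in GPU memory. `model` is assumed to be the
# FSDP-wrapped module.
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    cpu_state_dict = model.state_dict()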

The command I executed:

export NCCL_IB_DISABLE=1;
export NCCL_P2P_DISABLE=1;
export NCCL_DEBUG=INFO;
export NCCL_SOCKET_IFNAME=en,eth,em,bond;
export CXX=g++;
torchrun --nproc_per_node=2 --master_port=20001 \
    ...../fastchat/train/train_mem.py \
    --model_name_or_path llama-7b \
    --data_path datasets/dummy.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "shard_grad_op auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

The training can be completed within 1 to 2 hours. I hope this can be helpful to you.

whk6688 commented 1 year ago

Thanks, I will try it later.

CSerxy commented 1 year ago

(quoted @Minxiangliu's full configuration and command from above)

Thanks for your detailed instructions! One question: this script is used to fine-tune Vicuna from the LLaMA model, right? If I want to fine-tune the Vicuna model itself, do you know how to do it?

Minxiangliu commented 1 year ago

Hi @CSerxy , I have not attempted fine-tuning the Vicuna model. You can replace --model_name_or_path with the path to the pre-trained Vicuna model. Since both Llama and Vicuna models are converted into the Hugging Face Transformers format, their formats should be the same.
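
As a quick check that a converted checkpoint is in the expected format, either set of weights should load with the standard Transformers API. This is just a sketch; the path below is a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at either converted LLaMA or Vicuna weights.
model_path = "/path/to/vicuna-7b"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path)
print(type(model).__name__)  # LlamaForCausalLM for both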

CSerxy commented 1 year ago

Many thanks @Minxiangliu !

Hzzhang-nlp commented 1 year ago

I want to know if you have tried the method given above. Was it effective?

ghost commented 1 year ago

(quoted @Minxiangliu's full configuration and command from above)

I used 4 A100 (40GB) cards for training and succeeded. RAM usage was about 103GB. Thanks a lot!

Hzzhang-nlp commented 1 year ago

I wonder if this problem has been solved?

ghost commented 1 year ago

I also tried the DeepSpeed method with the official command, but it ran out of memory (OOM). I just modified the "stage" from 3 to 1 in the configuration file, and it works. Here is my command:

torchrun --nproc_per_node=4 --master_port=61234 \
    train.py \
    --model_name_or_path <> \
    --data_path <> \
    --bf16 True \
    --output_dir output_data \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --deepspeed default_offload_opt_param.json \
    --tf32 False \
    --model_max_length 512
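
For reference, a small hedged sketch of the stage change described above, written as a one-off Python script. It assumes the config file named in --deepspeed uses DeepSpeed's standard zero_optimization block, and it leaves every other field untouched:

import json

# Switch DeepSpeed ZeRO from stage 3 to stage 1 in the existing config file.
cfg_path = "default_offload_opt_param.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["zero_optimization"]["stage"] = 1  # was 3
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)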

brucewlee commented 1 year ago

(quoted @Minxiangliu's configuration and @CSerxy's follow-up question from above)

The only fix for me was: pip install packaging flash-attn --no-build-isolation

yukiontheiceberg commented 1 year ago

I ran into the same problem and worked around it using the solution proposed above. Wondering if anyone knows the root cause and could explain why?

Minxiangliu commented 1 year ago

I ran into the same problem and worked around it using the solution proposed above. Wondering if anyone knows the root cause and could explain why?

Here is my speculation. First, the recommended training script is configured by default for multiple GPUs; without the offload parameter, it can run out of GPU memory. Sufficient main (host) memory is also required. Modifying the original code is another way to prevent GPU out-of-memory (OOM) errors when the checkpoint is gathered for saving. These adjustments are for limited hardware; if your hardware support is better, the recommended original script may be the best option.
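
For what it's worth, exit code -9 (SIGKILL) usually points at the kernel OOM killer, i.e. host RAM rather than GPU memory, while -11 (SIGSEGV) is a crash inside the process itself. A small hedged snippet like the one below, dropped into the training script, shows how much GPU headroom is left; host-RAM pressure has to be checked on the machine itself (for example with free or dmesg):

import torch

# Print per-GPU memory usage so you can tell whether you are close to GPU OOM.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    total = props.total_memory / 2**30
    print(f"cuda:{i} {props.name}: {allocated:.1f} GiB allocated, "
          f"{reserved:.1f} GiB reserved, {total:.1f} GiB total")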