whk6688 opened 1 year ago
Can anyone help me?
@merrymercy I would appreciate your answer!
I also encountered a similar issue when fine-tuning LLaMA, and I hope someone can help answer it!
/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/transformers/training_args.py:1388: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
warnings.warn(
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 6269) of binary: /root/miniconda3/envs/vicuna/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/vicuna/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/vicuna/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
FastChat/fastchat/train/train_mem.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-16_07:09:12
host : mx-69977d7b58-zrz6r
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 6269)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 6269
=====================================================
I have the same problem, please help. I need to finish my final qualifying work in one week :(
Hi @whk6688 @Void-fun, in the end I used the following configuration to run fine-tuning locally.
platform: Ubuntu 18.04, 2x A100 (40GB)
memory: 256 GB
python: 3.10
cuda: 11.6
nvidia driver: 510.73
torch: 2.0.1+cu117
flash-attn: 1.0.5
fschat: 0.2.10
Installation order for the environment:
CUDA -> new Python 3.10 environment -> torch 2.0.1+cu117 -> flash-attn -> fschat
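A minimal sketch of the corresponding install commands, assuming a fresh conda environment named vicuna (the index URL below is the standard PyTorch wheel index for CUDA 11.7; versions match the list above, adjust to your setup):
# new Python 3.10 environment
conda create -n vicuna python=3.10 -y
conda activate vicuna
# torch 2.0.1 built against CUDA 11.7
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu117
# flash-attn typically needs packaging installed first and --no-build-isolation to compile
pip install packaging
pip install flash-attn==1.0.5 --no-build-isolation
# FastChat
pip install fschat==0.2.10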
Modify the following file:
..../site-packages/torch/distributed/fsdp/_state_dict_utils.py
At line 309, replace:
# state_dict[fqn] = state_dict[fqn].clone().detach()
state_dict[fqn] = state_dict[fqn].cpu().clone().detach()
The full command I executed:
export NCCL_IB_DISABLE=1;
export NCCL_P2P_DISABLE=1;
export NCCL_DEBUG=INFO;
export NCCL_SOCKET_IFNAME=en,eth,em,bond;
export CXX=g++;
torchrun --nproc_per_node=2 --master_port=20001 \
...../fastchat/train/train_mem.py \
--model_name_or_path llama-7b \
--data_path datasets/dummy.json \
--bf16 True \
--output_dir finetune_output \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "shard_grad_op auto_wrap offload" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
The training can be completed within 1 to 2 hours. I hope this can be helpful to you.
Thanks, I will try it later.
Thanks for your detailed instructions! One question I have: this script is used to fine-tune Vicuna from the LLaMA model, right? If I want to fine-tune a Vicuna model instead, do you know how to do it?
Hi @CSerxy,
I have not attempted fine-tuning the Vicuna model. You can replace --model_name_or_path with the path to the pre-trained Vicuna model. Since both the LLaMA and Vicuna models are converted into the Hugging Face Transformers format, their formats should be the same.
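For example, keeping every other flag in the torchrun command above exactly the same, the only change would be the model path (the path below is just a placeholder for wherever your converted Vicuna weights live):
--model_name_or_path /path/to/vicuna-7b \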
Many thanks @Minxiangliu !
I want to know whether you have tried the method given above. Is it effective?
i have the same problem, help pls, i need to end my final qualifying work in one week :(
Hi @whk6688 @Void-fun , In the end, I used the following configuration to perform fine-tuning locally.
platform: Ubuntu18.04 A100 (40GB) with 2 cards memory: 256G python:3.10 cuda:11.6 nvidia:510.73 torch:2.0.1+cu117 flash-attn:1.0.5 fschat:0.2.10
Installation sequence of the environment:
CUDA -> new python environment for 3.10 -> torch2.0.1+cu117 -> flash-attn -> fschat
Modify the following files:
..../site-packages/torch/distributed/fsdp/_state_dict_utils.py
In 309 line:
# state_dict[fqn] = state_dict[fqn].clone().detach( ) state_dict[fqn] = state_dict[fqn].cpu().clone().detach( )
Contents of the executed command:
export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1; export NCCL_DEBUG=INFO; export NCCL_SOCKET_IFNAME=en,eth,em,bond; export CXX=g++; torchrun --nproc_per_node=2 --master_port=20001 \ ...../fastchat/train/train_mem.py \ --model_name_or_path llama-7b \ --data_path datasets/dummy.json \ --bf16 True \ --output_dir finetune_output \ --num_train_epochs 3 \ --per_device_train_batch_size 2 \ --per_device_eval_batch_size 2 \ --gradient_accumulation_steps 16 \ --evaluation_strategy "no" \ --save_strategy "steps" \ --save_steps 1200 \ --save_total_limit 10 \ --learning_rate 2e-5 \ --weight_decay 0. \ --warmup_ratio 0.03 \ --lr_scheduler_type "cosine" \ --logging_steps 1 \ --fsdp "shard_grad_op auto_wrap offload" \ --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \ --tf32 True \ --model_max_length 2048 \ --gradient_checkpointing True \ --lazy_preprocess True
The training can be completed within 1 to 2 hours. I hope this can be helpful to you.
I used 4 A100s (40 GB) for training and succeeded; RAM usage was about 103 GB. Thanks a lot.
I wonder if this problem has been solved?
I also tried the DeepSpeed method using the official command, but it hit OOM. I just modified the "stage" from 3 to 1 in the configuration file, and it works. Here is my command:
torchrun --nproc_per_node=4 --master_port=61234 train.py \
--model_name_or_path <> \
--data_path <> \
--bf16 True \
--output_dir output_data \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--deepspeed default_offload_opt_param.json \
--tf32 False \
--model_max_length 512
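For reference, the stage change amounts to a one-field edit in the DeepSpeed JSON; a minimal sketch (your default_offload_opt_param.json will have more fields around this) looks like:
"zero_optimization": {
    "stage": 1
}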
The only fix I needed was pip install packaging flash-attn --no-build-isolation
I ran into the same problem and worked around it using the solution proposed above. Wondering if anyone knows the root cause and could explain why?
Here is my speculation. First, the recommended training script is configured by default for multiple GPUs; without adding the offload parameter, GPU memory may be insufficient. Sufficient main (CPU) memory is also required. Modifying the original code is another way to prevent GPU out-of-memory (OOM) errors. These adjustments are workarounds for limited hardware; if your hardware support is better, using the originally recommended script may be the best option.
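If anyone wants to confirm the root cause on their own machine: exit code -9 means the process received SIGKILL (as shown in the traceback above), and a quick way to check whether it came from the kernel's OOM killer is to look at the kernel log around the time the run died, for example:
dmesg -T | grep -i -E "out of memory|killed process"
# or, on systemd hosts
journalctl -k | grep -i "killed process"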
My parameters are:
torchrun --nproc_per_node=1 --master_port=20001 FastChat/fastchat/train/train_mem.py \
--model_name_or_path /home/wanghaikuan/vicuna-7b \
--data_path /home/wanghaikuan/chat/playground_data_dummy.json \
--bf16 False \
--output_dir output \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess False
platform: V100 with 2 cards
memory: 256 GB
python: 3.9
cuda: 11.7