hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Multi-GPU continued pretraining hangs #4740

Closed Universe-Sun closed 3 months ago

Universe-Sun commented 3 months ago

Reminder

System Info

MASTER_PORT=$(shuf -n 1 -i 10000-65535)
export CUDA_VISIBLE_DEVICES=0,1,2,3
torchrun --nproc_per_node=4 --master_port=$MASTER_PORT src/train.py \
    --model_name_or_path /gpt/model/Qwen1.5-14B \
    --stage pt \
    --do_train \
    --template qwen \
    --dataset pre \
    --finetuning_type lora \
    --lora_target all \
    --output_dir saves/Qwen1.5-14B/pt_lora/2024-07-09-18-00-00 \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 16 \
    --lr_scheduler_type cosine \
    --cutoff_len 8192 \
    --logging_steps 5 \
    --save_steps 1000 \
    --learning_rate 1e-4 \
    --num_train_epochs 3 \
    --max_grad_norm 0.5 \
    --lora_rank 32 \
    --plot_loss \
    --ddp_timeout 180000000 \
    --fp16 \
    --warmup_ratio 0.1 \
    --deepspeed examples/deepspeed/ds_z2_config.json
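A common first step for narrowing down this kind of multi-GPU hang (not part of the original command, only a suggested diagnostic) is to rerun with NCCL and torch.distributed debug logging enabled, and optionally disable peer-to-peer transfers, which are a frequent cause of NCCL hangs on some PCIe/driver setups:

export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# If the hang disappears with P2P disabled, the problem is likely the NCCL peer-to-peer transport:
export NCCL_P2P_DISABLE=1
torchrun --nproc_per_node=4 --master_port=$MASTER_PORT src/train.py ...   # same arguments as above

With NCCL_DEBUG=INFO, the last NCCL message printed by each rank usually shows which collective or connection setup the processes are waiting on.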

Reproduction

W0709 17:58:57.596000 140044774704960 torch/distributed/run.py:757]
W0709 17:58:57.596000 140044774704960 torch/distributed/run.py:757]
W0709 17:58:57.596000 140044774704960 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0709 17:58:57.596000 140044774704960 torch/distributed/run.py:757]
[2024-07-09 17:59:01,274] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-09 17:59:01,286] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-09 17:59:01,302] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-09 17:59:01,340] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-09 17:59:03,061] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-09 17:59:03,064] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-09 17:59:03,064] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-09 17:59:03,123] [INFO] [comm.py:637:init_distributed] cdb=None
07/09/2024 17:59:03 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
07/09/2024 17:59:03 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2159] 2024-07-09 17:59:03,204 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 17:59:03,204 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2159] 2024-07-09 17:59:03,204 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 17:59:03,204 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 17:59:03,204 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2159] 2024-07-09 17:59:03,204 >> loading file tokenizer_config.json
[2024-07-09 17:59:03,331] [INFO] [comm.py:637:init_distributed] cdb=None
07/09/2024 17:59:03 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
07/09/2024 17:59:03 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.float16
07/09/2024 17:59:03 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
07/09/2024 17:59:03 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.float16
[WARNING|logging.py:313] 2024-07-09 17:59:03,431 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/09/2024 17:59:03 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
07/09/2024 17:59:03 - INFO - llamafactory.data.loader - Loading dataset pre_data_62424.json...
07/09/2024 17:59:03 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
07/09/2024 17:59:03 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.float16
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/09/2024 17:59:03 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/09/2024 17:59:03 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/09/2024 17:59:03 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
Converting format of dataset (num_proc=16): 100%|████████████████████████████████████████| 62424/62424 [00:00<00:00, 117497.67 examples/s]
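Since the output stops right after the dataset-conversion progress bar, one way to see what each rank is actually doing (a diagnostic sketch, not something from the original report) is to attach py-spy to the four hung training processes and dump their Python stacks:

pip install py-spy
# find the four torchrun worker processes
ps aux | grep src/train.py
# dump the Python stack of each rank; <PID> is a placeholder for the real process id
py-spy dump --pid <PID>

If every rank is blocked inside a distributed collective (e.g. a barrier or all_reduce), the hang is a communication problem; if one rank is still inside tokenization/preprocessing, the other ranks are simply waiting for it to finish.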

Expected behavior

Training hangs at this point indefinitely.

Others

No response

Universe-Sun commented 3 months ago

1

chengjl19 commented 1 month ago

How did you solve this?

ShuoZhang2003 commented 1 week ago

How did you solve this?