hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Out of NPU memory when training on 8 Ascend 910B NPUs #5491

Open LtroiNGU opened 1 month ago

LtroiNGU commented 1 month ago

Reminder

System Info

The environment information is as follows:

Reproduction

The training parameters are as follows:

### model
model_name_or_path: /model-data
model_name_or_path: /mnt/0913/Qwen1.572B
do_sample: false

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z0_config.json

### dataset
dataset: alpaca_en_demo.json  # alpha_zh_demo
# video: mllm_video_demo
template: qwen
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
preprocessing_num_workers: 8

### output
output_dir: /mnt/llama-facotry-store/output
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
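Note that model_name_or_path and preprocessing_num_workers each appear twice in the config above. With typical YAML loaders (e.g. PyYAML) the last occurrence of a duplicate key wins, which is consistent with the logs below loading the model from /mnt/0913/Qwen1.572B. A minimal deduplicated sketch of the affected keys (an editorial assumption, not the author's verified intent):

model_name_or_path: /mnt/0913/Qwen1.572B  # last duplicate wins under PyYAML-style loading
do_sample: false
preprocessing_num_workers: 8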

Expected behavior

Error message:

root@hw-osc-ai:/mnt/llama-facotry-store/LLaMA-Factory-main# ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=4 llamafactory-cli train /mnt/llama-facotry-store/LLaMA-Factory-main/examples/train_lora/qwen2_lora_sft_test.yaml

[2024-09-20 01:18:56,433] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
09/20/2024 01:19:00 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:24287
[2024-09-20 01:19:02,616] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-20 01:19:18,204] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:19:19,052] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:19:19,310] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:19:19,479] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-20 01:19:19,551] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
09/20/2024 01:19:19 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:19:19 - INFO - llamafactory.hparams.parser - Process rank: 3, device: npu:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
09/20/2024 01:19:20 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
[2024-09-20 01:19:20,317] [INFO] [comm.py:637:init_distributed] cdb=None
09/20/2024 01:19:20 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:19:20 - INFO - llamafactory.hparams.parser - Process rank: 2, device: npu:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[2024-09-20 01:19:20,572] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-20 01:19:20,572] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
09/20/2024 01:19:20 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:19:20 - INFO - llamafactory.hparams.parser - Process rank: 0, device: npu:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:731] 2024-09-20 01:19:20,588 >> loading configuration file /mnt/0913/Qwen1.572B/config.json
[INFO|configuration_utils.py:800] 2024-09-20 01:19:20,590 >> Model config Qwen2Config { "_name_or_path": "/mnt/0913/Qwen1.572B", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 8192, "initializer_range": 0.02, "intermediate_size": 24576, "max_position_embeddings": 32768, "max_window_layers": 70, "model_type": "qwen2", "num_attention_heads": 64, "num_hidden_layers": 80, "num_key_value_heads": 64, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.44.0", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }

[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,592 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,592 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,592 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,592 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,592 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,592 >> loading file tokenizer_config.json
[2024-09-20 01:19:20,829] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2513] 2024-09-20 01:19:20,858 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:731] 2024-09-20 01:19:20,859 >> loading configuration file /mnt/0913/Qwen1.572B/config.json
[INFO|configuration_utils.py:800] 2024-09-20 01:19:20,861 >> Model config Qwen2Config { ...identical to the config dump above... }

[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,862 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,862 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,862 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,862 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,862 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:19:20,862 >> loading file tokenizer_config.json
09/20/2024 01:19:21 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:19:21 - INFO - llamafactory.hparams.parser - Process rank: 1, device: npu:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
09/20/2024 01:19:21 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
[INFO|tokenization_utils_base.py:2513] 2024-09-20 01:19:21,116 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/20/2024 01:19:21 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
09/20/2024 01:19:21 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>

All four ranks then raise the same exception (their tracebacks are interleaved in the raw output; shown once here):

Traceback (most recent call last):
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/data/loader.py", line 248, in get_dataset
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/data/loader.py", line 154, in _get_merged_dataset
    for dataset_attr in get_dataset_list(dataset_names, data_args.dataset_dir):
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/data/parser.py", line 107, in get_dataset_list
    raise ValueError("Undefined dataset {} in {}.".format(name, DATA_CONFIG))
ValueError: Undefined dataset alpaca_en_demo.json in dataset_info.json.

[2024-09-20 01:19:37,663] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 21112) of binary: /usr/local/python3.10.13/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/python3.10.13/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/launcher.py FAILED

Failures:
[1]:
  time      : 2024-09-20_01:19:37
  host      : hw-osc-ai
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 21113)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-09-20_01:19:37
  host      : hw-osc-ai
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 21114)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-09-20_01:19:37
  host      : hw-osc-ai
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 21115)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-09-20_01:19:37
  host      : hw-osc-ai
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 21112)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
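This first failure is a dataset lookup error rather than an OOM: LLaMA-Factory resolves the `dataset` field against the keys of data/dataset_info.json, so the value should be a registered dataset name (e.g. alpaca_en_demo) rather than a file name. A minimal sketch of the registration pattern, assuming the demo entry that ships with the repo:

{
  "alpaca_en_demo": {
    "file_name": "alpaca_en_demo.json"
  }
}

With such an entry in place, `dataset: alpaca_en_demo` (without the .json suffix) should resolve.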

^C
root@hw-osc-ai:/mnt/llama-facotry-store/LLaMA-Factory-main# vi /mnt/llama-facotry-store/LLaMA-Factory-main/examples/train_lora/qwen2_lora_sft_test.yaml
root@hw-osc-ai:/mnt/llama-facotry-store/LLaMA-Factory-main# clear
root@hw-osc-ai:/mnt/llama-facotry-store/LLaMA-Factory-main# ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=4 llamafactory-cli train /mnt/llama-facotry-store/LLaMA-Factory-main/examples/train_lora/qwen2_lora_sft_test.yaml

[2024-09-20 01:20:51,901] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
09/20/2024 01:20:56 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:21770
[2024-09-20 01:20:58,065] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-20 01:21:14,111] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:21:14,295] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:21:15,114] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:21:15,159] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[2024-09-20 01:21:15,351] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-20 01:21:15,351] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
09/20/2024 01:21:15 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:21:15 - INFO - llamafactory.hparams.parser - Process rank: 0, device: npu:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:731] 2024-09-20 01:21:15,368 >> loading configuration file /mnt/0913/Qwen1.572B/config.json
[INFO|configuration_utils.py:800] 2024-09-20 01:21:15,369 >> Model config Qwen2Config { ...identical to the config dump above... }

[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,371 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,371 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,371 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,371 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,371 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,371 >> loading file tokenizer_config.json
[2024-09-20 01:21:15,540] [INFO] [comm.py:637:init_distributed] cdb=None
[INFO|tokenization_utils_base.py:2513] 2024-09-20 01:21:15,638 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:731] 2024-09-20 01:21:15,639 >> loading configuration file /mnt/0913/Qwen1.572B/config.json
[INFO|configuration_utils.py:800] 2024-09-20 01:21:15,640 >> Model config Qwen2Config { ...identical to the config dump above... }

[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,641 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,641 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,641 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,641 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,641 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2024-09-20 01:21:15,641 >> loading file tokenizer_config.json
09/20/2024 01:21:15 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:21:15 - INFO - llamafactory.hparams.parser - Process rank: 2, device: npu:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2513] 2024-09-20 01:21:15,910 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/20/2024 01:21:15 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
09/20/2024 01:21:15 - INFO - llamafactory.data.loader - Loading dataset alpha_zh_demo.json...
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
Converting format of dataset (num_proc=4): 100%|██████████| 4/4 [00:00<00:00, 23.85 examples/s]
09/20/2024 01:21:16 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
[2024-09-20 01:21:16,384] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-20 01:21:16,427] [INFO] [comm.py:637:init_distributed] cdb=None
09/20/2024 01:21:16 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:21:16 - INFO - llamafactory.hparams.parser - Process rank: 1, device: npu:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
09/20/2024 01:21:16 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
09/20/2024 01:21:16 - INFO - llamafactory.hparams.parser - Process rank: 3, device: npu:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
09/20/2024 01:21:17 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
09/20/2024 01:21:17 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
09/20/2024 01:21:23 - INFO - llamafactory.data.loader - Loading dataset alpha_zh_demo.json...
09/20/2024 01:21:23 - INFO - llamafactory.data.loader - Loading dataset alpha_zh_demo.json...
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
09/20/2024 01:21:23 - INFO - llamafactory.data.loader - Loading dataset alpha_zh_demo.json...
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
Running tokenizer on dataset (num_proc=4): 100%|███████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4.21 examples/s] training example: input_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 102450, 62926, 104136, 89012, 22382, 44177, 101047, 100369, 99891, 101911, 5122, 102150, 101911, 33108, 8903, 63109, 36587, 1773, 151645, 198, 151644, 77091, 198, 102150, 101911, 20412, 100206, 99891, 104111, 101911, 3837, 99652, 100140, 55338, 100702, 31914, 100132, 67071, 48934, 30709, 105166, 106251, 8545, 102150, 31838, 104384, 1773, 100346, 114651, 104111, 99896, 101911, 3837, 100140, 102150, 20412, 55338, 100206, 105166, 100166, 33108, 98380, 75317, 3837, 104152, 100206, 100132, 67071, 46944, 57191, 101213, 102150, 101286, 3837, 102150, 101097, 67338, 102150, 110935, 100394, 100676, 102150, 1773, 100147, 101911, 67071, 106929, 22382, 120806, 5373, 99330, 101190, 31843, 33108, 100167, 100809, 34204, 16, 23, 18, 24, 7948, 104181, 101080, 3407, 8903, 63109, 36587, 104442, 101281, 20412, 101281, 38176, 9370, 99488, 3837, 105884, 3837, 113837, 102074, 101281, 108215, 9370, 101911, 1773, 99487, 101911, 112479, 105062, 29490, 63109, 36587, 101313, 3837, 100140, 102493, 102095, 105339, 9370, 99488, 1773, 8903, 63109, 36587, 9370, 101080, 28946, 20412, 99685, 99470, 72225, 13935, 99826, 99243, 99685, 3837, 104677, 16, 21, 101186, 84607, 102098, 108124, 101712, 26940, 35727, 31914, 104001, 67831, 87243, 109268, 34187, 101281, 38176, 113837, 102074, 101281, 104001, 9370, 104949, 3837, 17714, 35727, 104179, 103949, 107759, 102334, 102007, 1773, 151645] inputs: <|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user 识别并解释给定列表中的两个科学理论:细胞理论和日心说。<|im_end|> <|im_start|>assistant 细胞理论是生物科学的一个理论,它认为所有生命体都是由微小的基本单元——细胞所构成。这是生物学的一个基础理论,认为细胞是所有生物的基本结构和功能单位,所有的生物都是由一个或多个细胞组成,细胞只能通过细胞分裂产生新的细胞。这一理论由薛定谔、施瓦内和雪莱于1839年首次提出。

日心说是指太阳是太阳系的中心,也就是说,行星围绕太阳旋转的理论。这个理论打破了传统的地心说观点,认为地球并不是宇宙的中心。日心说的提出者是尼古拉·哥白尼,他在16世纪初发表了他的著作《天体运行论》,阐述了太阳系行星围绕太阳运行的模型,为天文学的发展做出了巨大贡献。<|im_end|> label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 102150, 101911, 20412, 100206, 99891, 104111, 101911, 3837, 99652, 100140, 55338, 100702, 31914, 100132, 67071, 48934, 30709, 105166, 106251, 8545, 102150, 31838, 104384, 1773, 100346, 114651, 104111, 99896, 101911, 3837, 100140, 102150, 20412, 55338, 100206, 105166, 100166, 33108, 98380, 75317, 3837, 104152, 100206, 100132, 67071, 46944, 57191, 101213, 102150, 101286, 3837, 102150, 101097, 67338, 102150, 110935, 100394, 100676, 102150, 1773, 100147, 101911, 67071, 106929, 22382, 120806, 5373, 99330, 101190, 31843, 33108, 100167, 100809, 34204, 16, 23, 18, 24, 7948, 104181, 101080, 3407, 8903, 63109, 36587, 104442, 101281, 20412, 101281, 38176, 9370, 99488, 3837, 105884, 3837, 113837, 102074, 101281, 108215, 9370, 101911, 1773, 99487, 101911, 112479, 105062, 29490, 63109, 36587, 101313, 3837, 100140, 102493, 102095, 105339, 9370, 99488, 1773, 8903, 63109, 36587, 9370, 101080, 28946, 20412, 99685, 99470, 72225, 13935, 99826, 99243, 99685, 3837, 104677, 16, 21, 101186, 84607, 102098, 108124, 101712, 26940, 35727, 31914, 104001, 67831, 87243, 109268, 34187, 101281, 38176, 113837, 102074, 101281, 104001, 9370, 104949, 3837, 17714, 35727, 104179, 103949, 107759, 102334, 102007, 1773, 151645] labels: 细胞理论是生物科学的一个理论,它认为所有生命体都是由微小的基本单元——细胞所构成。这是生物学的一个基础理论,认为细胞是所有生物的基本结构和功能单位,所有的生物都是由一个或多个细胞组成,细胞只能通过细胞分裂产生新的细胞。这一理论由薛定谔、施瓦内和雪莱于1839年首次提出。

日心说是指太阳是太阳系的中心,也就是说,行星围绕太阳旋转的理论。这个理论打破了传统的地心说观点,认为地球并不是宇宙的中心。日心说的提出者是尼古拉·哥白尼,他在16世纪初发表了他的著作《天体运行论》,阐述了太阳系行星围绕太阳运行的模型,为天文学的发展做出了巨大贡献。<|im_end|>
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
[INFO|configuration_utils.py:731] 2024-09-20 01:21:25,218 >> loading configuration file /mnt/0913/Qwen1.572B/config.json
num_proc must be <= 4. Reducing num_proc to 4 for dataset of size 4.
[INFO|configuration_utils.py:800] 2024-09-20 01:21:25,219 >> Model config Qwen2Config { ...identical to the config dump above... }

[INFO|modeling_utils.py:3653] 2024-09-20 01:21:25,280 >> loading weights file /mnt/0913/Qwen1.572B/model.safetensors.index.json
[INFO|modeling_utils.py:1584] 2024-09-20 01:21:25,281 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-09-20 01:21:25,282 >> Generate config GenerationConfig { "bos_token_id": 151643, "eos_token_id": 151645 }

Loading checkpoint shards:  42%|████▏     | 16/38 [00:12<00:17, 1.24it/s]

Each of the four ranks then fails with the same traceback while loading shard 16/38 (shown once; the raw log repeats it per rank):

Traceback (most recent call last):
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/train/sft/workflow.py", line 48, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/model/loader.py", line 162, in load_model
    model = load_class.from_pretrained(**init_kwargs)
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4415, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 936, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 400, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: NPU out of memory. Tried to allocate 386.00 MiB (NPU 3; 60.97 GiB total capacity; 59.57 GiB already allocated; 59.57 GiB current active; 16.99 MiB free; 60.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

The corresponding errors on the other ranks:

RuntimeError: NPU out of memory. Tried to allocate 386.00 MiB (NPU 0; 60.97 GiB total capacity; 59.57 GiB already allocated; 59.57 GiB current active; 25.36 MiB free; 60.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 386.00 MiB (NPU 2; 60.97 GiB total capacity; 59.57 GiB already allocated; 59.57 GiB current active; 16.54 MiB free; 60.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
RuntimeError: NPU out of memory. Tried to allocate 386.00 MiB (NPU 1; 60.97 GiB total capacity; 59.57 GiB already allocated; 59.57 GiB current active; 17.39 MiB free; 60.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

[2024-09-20 01:22:03,132] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 24246 closing signal SIGTERM
[2024-09-20 01:22:03,132] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 24247 closing signal SIGTERM
[2024-09-20 01:22:03,396] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 24248) of binary: /usr/local/python3.10.13/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/python3.10.13/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/mnt/llama-facotry-store/LLaMA-Factory-main/src/llamafactory/launcher.py FAILED

Failures:
[1]:
  time      : 2024-09-20_01:22:03
  host      : hw-osc-ai
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 24249)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-09-20_01:22:03
  host      : hw-osc-ai
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 24248)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Others

I have tried both Qwen2 72B and Qwen1.5 72B. Checking the devices with npu-smi info confirms that NPU memory really is being exhausted, even though the training dataset is tiny (around 4k in size). Could you advise what the cause is and how to fix it?
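One likely contributor (an editorial reading of the logs above, not a confirmed diagnosis): the config points `deepspeed` at examples/deepspeed/ds_z0_config.json, i.e. ZeRO stage 0, under which every rank materializes a full bf16 replica of the model. For a 72B model that is roughly 72e9 parameters × 2 bytes ≈ 144 GB per rank, far beyond the 60.97 GiB of a single 910B, which is why loading aborts at checkpoint shard 16/38 regardless of dataset size. A minimal sketch of the change, reusing the ZeRO-3 config bundled with the repo so that model states are sharded across ranks:

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json  # ZeRO-3: shard parameters/gradients/optimizer states across NPUs

Note also that the command sets ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 but NPROC_PER_NODE=4, so only four of the eight visible NPUs actually join the job; launching with NPROC_PER_NODE=8 would halve each rank's share under ZeRO-3.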

RY-lxf commented 4 weeks ago

I ran into a similar problem and would also like to know why.

Linuxstyle commented 1 week ago

+1