Reminder
System Info
root@autodl-container-40b74f9912-1ab26877:~# llamafactory-cli env
[2024-11-23 13:16:23,920] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
llamafactory version: 0.9.1.dev0

Reproduction
[INFO|2024-11-23 13:17:00] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:26797
[2024-11-23 13:17:04,905] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-23 13:17:04,980] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-23 13:17:04,994] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING|2024-11-23 13:17:06] llamafactory.hparams.parser:162 >> `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:677] 2024-11-23 13:17:06,228 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-23 13:17:06,230 >> Model config Qwen2VLConfig {
  "_name_or_path": "/root/autodl-tmp/Qwen2-VL-7B-Instruct",
  "architectures": ["Qwen2VLForConditionalGeneration"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [16, 24, 24],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "in_chans": 3,
    "model_type": "qwen2_vl",
    "spatial_patch_size": 14
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file tokenizer_config.json
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2475] 2024-11-23 13:17:06,472 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:373] 2024-11-23 13:17:06,473 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:373] 2024-11-23 13:17:06,475 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:429] 2024-11-23 13:17:06,475 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,476 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,476 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-11-23 13:17:06,705 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:755] 2024-11-23 13:17:07,088 >> Processor Qwen2VLProcessor:
image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}
tokenizer: Qwen2TokenizerFast(name_or_path='/root/autodl-tmp/Qwen2-VL-7B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
  151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
{ "processor_class": "Qwen2VLProcessor" }
[INFO|2024-11-23 13:17:07] llamafactory.data.loader:157 >> Loading dataset deepseek.json...
[INFO|configuration_utils.py:677] 2024-11-23 13:17:09,864 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-23 13:17:09,865 >> Model config Qwen2VLConfig {
  "_name_or_path": "/root/autodl-tmp/Qwen2-VL-7B-Instruct",
  "architectures": ["Qwen2VLForConditionalGeneration"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [16, 24, 24],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "in_chans": 3,
    "model_type": "qwen2_vl",
    "spatial_patch_size": 14
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
[INFO|modeling_utils.py:3934] 2024-11-23 13:17:09,875 >> loading weights file /root/autodl-tmp/Qwen2-VL-7B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1670] 2024-11-23 13:17:09,876 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1096] 2024-11-23 13:17:09,877 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
[INFO|modeling_utils.py:1670] 2024-11-23 13:17:09,877 >> Instantiating Qwen2VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
[WARNING|logging.py:168] 2024-11-23 13:17:09,890 >> `Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.32it/s]
[INFO|modeling_utils.py:4800] 2024-11-23 13:17:13,974 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.
[INFO|modeling_utils.py:4808] 2024-11-23 13:17:13,974 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at /root/autodl-tmp/Qwen2-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-23 13:17:13,977 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-23 13:17:13,978 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [151645, 151643],
  "pad_token_id": 151643,
  "temperature": 0.01,
  "top_k": 1,
  "top_p": 0.001
}
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.attention:157 >> Using torch SDPA for faster training and inference.
[INFO|2024-11-23 13:17:13] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2024-11-23 13:17:13] llamafactory.model.adapter:157 >> Fine-tuning method: LoRA
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.misc:157 >> Found linear modules: k_proj,q_proj,o_proj,up_proj,v_proj,down_proj,gate_proj
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.32it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.28it/s]
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
[INFO|2024-11-23 13:17:15] llamafactory.model.loader:157 >> trainable params: 20,185,088 || all params: 8,311,560,704 || trainable%: 0.2429
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
[INFO|trainer.py:698] 2024-11-23 13:17:15,186 >> Using auto half precision backend
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
[INFO|trainer.py:2313] 2024-11-23 13:17:15,653 >> ***** Running training *****
[INFO|trainer.py:2314] 2024-11-23 13:17:15,653 >>   Num examples = 46
[INFO|trainer.py:2315] 2024-11-23 13:17:15,653 >>   Num Epochs = 3
[INFO|trainer.py:2316] 2024-11-23 13:17:15,653 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2319] 2024-11-23 13:17:15,654 >>   Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:2320] 2024-11-23 13:17:15,654 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-11-23 13:17:15,654 >>   Total optimization steps = 3
[INFO|trainer.py:2322] 2024-11-23 13:17:15,657 >>   Number of trainable parameters = 20,185,088
  0%|          | 0/3 [00:00<?, ?it/s]
E1123 13:17:21.332000 140454999893184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -8) local_rank: 0 (pid: 5090) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures:
[1]:
  time      : 2024-11-23_13:17:21
  host      : autodl-container-40b74f9912-1ab26877
  rank      : 1 (local_rank: 1)
  exitcode  : -8 (pid: 5091)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 5091
[2]:
  time      : 2024-11-23_13:17:21
  host      : autodl-container-40b74f9912-1ab26877
  rank      : 2 (local_rank: 2)
  exitcode  : -8 (pid: 5092)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 5092
Root Cause (first observed failure):
[0]:
  time      : 2024-11-23_13:17:21
  host      : autodl-container-40b74f9912-1ab26877
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 5090)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 5090
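For what it's worth, the numbers in the training banner above are self-consistent, so this does not look like a simple batch-size misconfiguration: 3 DDP ranks (cuda:0 to cuda:2 in the log) x per-device batch 2 x gradient accumulation 8 gives 48, and with 46 examples that is one optimizer step per epoch, hence 3 total steps. A minimal sketch of that arithmetic (variable names are mine, not LLaMA-Factory's):

```python
import math

# Figures copied from the "Running training" banner in the log above.
num_examples = 46
per_device_batch = 2
grad_accum_steps = 8
world_size = 3  # ranks cuda:0, cuda:1, cuda:2
num_epochs = 3

effective_batch = per_device_batch * grad_accum_steps * world_size
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch)               # 48 -> matches "Total train batch size"
print(steps_per_epoch * num_epochs)  # 3  -> matches "Total optimization steps"
```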
Expected behavior
Training worked fine before, but this problem appeared when I launched it again, and I have no idea why. Any help would be much appreciated! @hiyouga
Others
I tried matching the torch version to the CUDA version, but it did not help.
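To double-check the pairing, a quick way to see which CUDA build the installed torch actually uses is plain PyTorch introspection (this snippet is a sketch added for clarity, not output from my environment):

```python
import torch

# Show the torch build, the CUDA toolkit it was compiled against,
# and the GPUs it can see -- useful when matching torch and CUDA versions.
print("torch version :", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```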