Reminder
System Info
root@autodl-container-40b74f9912-1ab26877:~# llamafactory-cli env
[2024-11-23 13:16:23,920] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
llamafactory version: 0.9.1.dev0

Reproduction
[INFO|2024-11-23 13:17:00] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:26797
[2024-11-23 13:17:04,905] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-23 13:17:04,980] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-23 13:17:04,994] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING|2024-11-23 13:17:06] llamafactory.hparams.parser:162 >> `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:677] 2024-11-23 13:17:06,228 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-23 13:17:06,230 >> Model config Qwen2VLConfig {
  "_name_or_path": "/root/autodl-tmp/Qwen2-VL-7B-Instruct",
  "architectures": ["Qwen2VLForConditionalGeneration"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [16, 24, 24],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "in_chans": 3,
    "model_type": "qwen2_vl",
    "spatial_patch_size": 14
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file tokenizer_config.json
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2475] 2024-11-23 13:17:06,472 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:373] 2024-11-23 13:17:06,473 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:373] 2024-11-23 13:17:06,475 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:429] 2024-11-23 13:17:06,475 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,476 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,476 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-11-23 13:17:06,705 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:755] 2024-11-23 13:17:07,088 >> Processor Qwen2VLProcessor:
image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}
tokenizer: Qwen2TokenizerFast(name_or_path='/root/autodl-tmp/Qwen2-VL-7B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
  151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
{ "processor_class": "Qwen2VLProcessor" }
[INFO|2024-11-23 13:17:07] llamafactory.data.loader:157 >> Loading dataset deepseek.json...
[INFO|configuration_utils.py:677] 2024-11-23 13:17:09,864 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-23 13:17:09,865 >> Model config Qwen2VLConfig {
  "_name_or_path": "/root/autodl-tmp/Qwen2-VL-7B-Instruct",
  "architectures": ["Qwen2VLForConditionalGeneration"],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [16, 24, 24],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "in_chans": 3,
    "model_type": "qwen2_vl",
    "spatial_patch_size": 14
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
[INFO|modeling_utils.py:3934] 2024-11-23 13:17:09,875 >> loading weights file /root/autodl-tmp/Qwen2-VL-7B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1670] 2024-11-23 13:17:09,876 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1096] 2024-11-23 13:17:09,877 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
[INFO|modeling_utils.py:1670] 2024-11-23 13:17:09,877 >> Instantiating Qwen2VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
[WARNING|logging.py:168] 2024-11-23 13:17:09,890 >> `Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.32it/s]
[INFO|modeling_utils.py:4800] 2024-11-23 13:17:13,974 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.
[INFO|modeling_utils.py:4808] 2024-11-23 13:17:13,974 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at /root/autodl-tmp/Qwen2-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-23 13:17:13,977 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-23 13:17:13,978 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [151645, 151643],
  "pad_token_id": 151643,
  "temperature": 0.01,
  "top_k": 1,
  "top_p": 0.001
}
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.attention:157 >> Using torch SDPA for faster training and inference.
[INFO|2024-11-23 13:17:13] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2024-11-23 13:17:13] llamafactory.model.adapter:157 >> Fine-tuning method: LoRA
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.misc:157 >> Found linear modules: k_proj,q_proj,o_proj,up_proj,v_proj,down_proj,gate_proj
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.32it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:03<00:00, 1.28it/s]
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
[INFO|2024-11-23 13:17:15] llamafactory.model.loader:157 >> trainable params: 20,185,088 || all params: 8,311,560,704 || trainable%: 0.2429
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
[INFO|trainer.py:698] 2024-11-23 13:17:15,186 >> Using auto half precision backend
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
[INFO|trainer.py:2313] 2024-11-23 13:17:15,653 >> ***** Running training *****
[INFO|trainer.py:2314] 2024-11-23 13:17:15,653 >>   Num examples = 46
[INFO|trainer.py:2315] 2024-11-23 13:17:15,653 >>   Num Epochs = 3
[INFO|trainer.py:2316] 2024-11-23 13:17:15,653 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2319] 2024-11-23 13:17:15,654 >>   Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:2320] 2024-11-23 13:17:15,654 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-11-23 13:17:15,654 >>   Total optimization steps = 3
[INFO|trainer.py:2322] 2024-11-23 13:17:15,657 >>   Number of trainable parameters = 20,185,088
  0%|          | 0/3 [00:00<?, ?it/s]
E1123 13:17:21.332000 140454999893184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -8) local_rank: 0 (pid: 5090) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures:
[1]:
  time      : 2024-11-23_13:17:21
  host      : autodl-container-40b74f9912-1ab26877
  rank      : 1 (local_rank: 1)
  exitcode  : -8 (pid: 5091)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 5091
[2]:
  time      : 2024-11-23_13:17:21
  host      : autodl-container-40b74f9912-1ab26877
  rank      : 2 (local_rank: 2)
  exitcode  : -8 (pid: 5092)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 5092
Root Cause (first observed failure):
[0]:
  time      : 2024-11-23_13:17:21
  host      : autodl-container-40b74f9912-1ab26877
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 5090)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 5090
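For what it's worth, the numbers in the training banner above are self-consistent, so this does not look like a simple batch-size misconfiguration: 3 DDP ranks (cuda:0 to cuda:2 in the log) x per-device batch 2 x gradient accumulation 8 gives 48, and with 46 examples that is one optimizer step per epoch, hence 3 total steps. A minimal sketch of that arithmetic (variable names are mine, not LLaMA-Factory's):

```python
import math

# Figures copied from the "Running training" banner in the log above.
num_examples = 46
per_device_batch = 2
grad_accum_steps = 8
world_size = 3  # ranks cuda:0, cuda:1, cuda:2
num_epochs = 3

effective_batch = per_device_batch * grad_accum_steps * world_size
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch)               # 48 -> matches "Total train batch size"
print(steps_per_epoch * num_epochs)  # 3  -> matches "Total optimization steps"
```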
Expected behavior
Training worked fine before, but this problem appeared when I launched it again, and I have no idea why. Any help would be much appreciated! @hiyouga
Others
I tried matching the torch version to the CUDA version, but it did not help.
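To double-check the pairing, a quick way to see which CUDA build the installed torch actually uses is plain PyTorch introspection (this snippet is a sketch added for clarity, not output from my environment):

```python
import torch

# Show the torch build, the CUDA toolkit it was compiled against,
# and the GPUs it can see -- useful when matching torch and CUDA versions.
print("torch version :", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"device {i}:", torch.cuda.get_device_name(i))
```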