hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

CUDA device error during multi-GPU fine-tuning #5352

Closed: d223302 closed this issue 2 months ago

d223302 commented 2 months ago

System Info

- `llamafactory` version: 0.8.4.dev0
- Platform: Linux-6.6.13-1-lts-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0+cu121 (GPU)
- Transformers version: 4.45.0.dev0
- Datasets version: 2.21.0
- Accelerate version: 0.33.0
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA RTX 6000 Ada Generation

Reproduction

When fine-tuning Qwen2-VL with LoRA (not QLoRA) on a machine with two GPUs, using the following command:

llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml

I get the error below, which appears to occur during the evaluation loop.

I changed eval_steps in the YAML file to 2 so that evaluation is triggered after the second optimization step.
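
For reference, the evaluation-related settings presumably look roughly like this. This is a sketch reconstructed from the training log below, not a verbatim copy of examples/train_lora/qwen2vl_lora_sft.yaml; only eval_steps was changed:

per_device_train_batch_size: 1   # log: instantaneous batch size per device = 1
gradient_accumulation_steps: 8   # log: 1 per device x 2 GPUs x 8 steps = total batch 16
val_size: 0.1                    # assumed; consistent with the 5/1 train/eval split in the log
per_device_eval_batch_size: 1    # log: eval batch size = 1
eval_strategy: steps             # may be named evaluation_strategy on older transformers
eval_steps: 2                    # changed from the example default to trigger eval early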

Full error:

09/04/2024 13:54:35 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:28746
W0904 13:54:36.696000 139591565502272 torch/distributed/run.py:779] 
W0904 13:54:36.696000 139591565502272 torch/distributed/run.py:779] *****************************************
W0904 13:54:36.696000 139591565502272 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0904 13:54:36.696000 139591565502272 torch/distributed/run.py:779] *****************************************
09/04/2024 13:54:43 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
09/04/2024 13:54:43 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
09/04/2024 13:54:43 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
09/04/2024 13:54:43 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:43,979 >> loading file vocab.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/vocab.json
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:43,979 >> loading file merges.txt from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/merges.txt
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:43,979 >> loading file tokenizer.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/tokenizer.json
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:43,979 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:43,979 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:43,980 >> loading file tokenizer_config.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/tokenizer_config.json
[INFO|tokenization_utils_base.py:2426] 2024-09-04 13:54:44,238 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:375] 2024-09-04 13:54:45,239 >> loading configuration file preprocessor_config.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/preprocessor_config.json
[INFO|image_processing_base.py:375] 2024-09-04 13:54:45,584 >> loading configuration file preprocessor_config.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/preprocessor_config.json
[INFO|image_processing_base.py:429] 2024-09-04 13:54:45,585 >> Image processor Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:45,905 >> loading file vocab.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/vocab.json
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:45,905 >> loading file merges.txt from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/merges.txt
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:45,905 >> loading file tokenizer.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/tokenizer.json
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:45,905 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:45,905 >> loading file special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:2182] 2024-09-04 13:54:45,905 >> loading file tokenizer_config.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/tokenizer_config.json
[INFO|tokenization_utils_base.py:2426] 2024-09-04 13:54:46,156 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:722] 2024-09-04 13:54:47,143 >> Processor Qwen2VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "max_pixels": 12845056,
    "min_pixels": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2-VL-7B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

{
  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
  "processor_class": "Qwen2VLProcessor"
}

09/04/2024 13:54:47 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
09/04/2024 13:54:47 - INFO - llamafactory.data.loader - Loading dataset mllm_demo.json...
09/04/2024 13:54:47 - INFO - llamafactory.data.template - Replace eos token: <|im_end|>
num_proc must be <= 6. Reducing num_proc to 6 for dataset of size 6.
Converting format of dataset (num_proc=6): 100%|█| 6/6 [00:00<00:00, 
09/04/2024 13:54:48 - INFO - llamafactory.data.loader - Loading dataset mllm_demo.json...
num_proc must be <= 6. Reducing num_proc to 6 for dataset of size 6.
Running tokenizer on dataset (num_proc=6):  17%|▏| 1/6 [00:00<00:02, num_proc must be <= 6. Reducing num_proc to 6 for dataset of size 6.
Running tokenizer on dataset (num_proc=6): 100%|█| 6/6 [00:00<00:00, 
training example:
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151652, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151655, 151653, 15191, 525, 807, 30, 151645, 198, 151644, 77091, 198, 6865, 2299, 45556, 323, 87552, 89, 4554, 504, 55591, 46204, 13, 151645, 198, 151644, 872, 198, 3838, 525, 807, 3730, 30, 151645, 198, 151644, 77091, 198, 6865, 525, 31589, 389, 279, 22174, 2070, 13, 151645]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|vision_end|>Who are they?<|im_end|>
<|im_start|>assistant
They're Kane and Gretzka from Bayern Munich.<|im_end|>
<|im_start|>user
What are they doing?<|im_end|>
<|im_start|>assistant
They are celebrating on the soccer field.<|im_end|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 6865, 2299, 45556, 323, 87552, 89, 4554, 504, 55591, 46204, 13, 151645, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 6865, 525, 31589, 389, 279, 22174, 2070, 13, 151645]
labels:
They're Kane and Gretzka from Bayern Munich.<|im_end|>They are celebrating on the soccer field.<|im_end|>
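
Aside on the label_ids above: positions set to -100 are ignored by the loss, so only the assistant turns are trained on. A minimal illustration of this standard PyTorch behavior (my own sketch, not LLaMA-Factory code):

import torch
import torch.nn.functional as F

logits = torch.randn(3, 10)             # 3 token positions, vocab size 10
labels = torch.tensor([-100, 4, -100])  # only the middle position contributes
# cross_entropy skips targets equal to ignore_index (default -100),
# which is how the masked prompt tokens above drop out of the loss.
loss = F.cross_entropy(logits, labels, ignore_index=-100)
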
num_proc must be <= 6. Reducing num_proc to 6 for dataset of size 6.
[INFO|configuration_utils.py:668] 2024-09-04 13:54:50,452 >> loading configuration file config.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/config.json
[INFO|configuration_utils.py:735] 2024-09-04 13:54:50,458 >> Model config Qwen2VLConfig {
  "_name_or_path": "Qwen/Qwen2-VL-7B-Instruct",
  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "type": "mrope"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.0.dev0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "in_chans": 3,
    "model_type": "qwen2_vl",
    "spatial_patch_size": 14
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}

[INFO|modeling_utils.py:3674] 2024-09-04 13:54:50,501 >> loading weights file model.safetensors from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/model.safetensors.index.json
[INFO|modeling_utils.py:1607] 2024-09-04 13:54:50,506 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1060] 2024-09-04 13:54:50,509 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}

Loading checkpoint shards: 100%|███████| 5/5 [00:04<00:00,  1.22it/s]
[INFO|modeling_utils.py:4503] 2024-09-04 13:54:55,663 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.

[INFO|modeling_utils.py:4511] 2024-09-04 13:54:55,663 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at Qwen/Qwen2-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
Loading checkpoint shards: 100%|███████| 5/5 [00:04<00:00,  1.22it/s]
[INFO|configuration_utils.py:1015] 2024-09-04 13:54:56,015 >> loading configuration file generation_config.json from cache at /home/dcml0714/.cache/huggingface/hub/models--Qwen--Qwen2-VL-7B-Instruct/snapshots/cacb254f5b750fa289048fe807983c9e02e0a028/generation_config.json
[INFO|configuration_utils.py:1060] 2024-09-04 13:54:56,015 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}

09/04/2024 13:54:56 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/04/2024 13:54:56 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/04/2024 13:54:56 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/04/2024 13:54:56 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/04/2024 13:54:56 - INFO - llamafactory.model.model_utils.misc - Found linear modules: k_proj,v_proj,down_proj,gate_proj,q_proj,up_proj,o_proj
09/04/2024 13:54:56 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
09/04/2024 13:54:56 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
09/04/2024 13:54:56 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
09/04/2024 13:54:56 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
09/04/2024 13:54:56 - INFO - llamafactory.model.model_utils.misc - Found linear modules: down_proj,up_proj,v_proj,gate_proj,k_proj,o_proj,q_proj
09/04/2024 13:54:56 - INFO - llamafactory.model.loader - trainable params: 20,185,088 || all params: 8,311,560,704 || trainable%: 0.2429
[INFO|trainer.py:667] 2024-09-04 13:54:56,437 >> Using auto half precision backend
09/04/2024 13:54:56 - WARNING - llamafactory.train.callbacks - Previous trainer log in this folder will be deleted.
09/04/2024 13:54:56 - INFO - llamafactory.model.loader - trainable params: 20,185,088 || all params: 8,311,560,704 || trainable%: 0.2429
[INFO|trainer.py:2187] 2024-09-04 13:54:57,701 >> ***** Running training *****
[INFO|trainer.py:2188] 2024-09-04 13:54:57,701 >>   Num examples = 5
[INFO|trainer.py:2189] 2024-09-04 13:54:57,701 >>   Num Epochs = 3
[INFO|trainer.py:2190] 2024-09-04 13:54:57,701 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2193] 2024-09-04 13:54:57,702 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2194] 2024-09-04 13:54:57,702 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2195] 2024-09-04 13:54:57,702 >>   Total optimization steps = 3
[INFO|trainer.py:2196] 2024-09-04 13:54:57,706 >>   Number of trainable parameters = 20,185,088
  0%|                                          | 0/3 [00:00<?, ?it/s]/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
 67%|██████████████████████▋           | 2/3 [00:02<00:01,  1.25s/it][INFO|trainer.py:3960] 2024-09-04 13:55:00,999 >> 
***** Running Evaluation *****
[INFO|trainer.py:3962] 2024-09-04 13:55:00,999 >>   Num examples = 1
[INFO|trainer.py:3965] 2024-09-04 13:55:00,999 >>   Batch size = 1
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 93, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 1991, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2409, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2857, in _maybe_log_save_evaluate
[rank0]:     metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2814, in _evaluate
[rank0]:     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank0]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 3807, in evaluate
[rank0]:     output = eval_loop(
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 4016, in evaluation_loop
[rank0]:     labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/accelerator.py", line 2507, in pad_across_processes
[rank0]:     return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 411, in wrapper
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 678, in pad_across_processes
[rank0]:     return recursively_apply(
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
[rank0]:     return func(data, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 659, in _pad_across_processes
[rank0]:     sizes = gather(size).cpu()
[rank0]:             ^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 375, in wrapper
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 436, in gather
[rank0]:     return _gpu_gather(tensor)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 355, in _gpu_gather
[rank0]:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
[rank0]:     return func(data, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 345, in _gpu_gather_one
[rank0]:     gather_op(output_tensors, tensor)
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3200, in all_gather_into_tensor
[rank0]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: Tensors must be CUDA and dense
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]:   File "/home/dcml0714/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 93, in run_sft
[rank1]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 1991, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2409, in _inner_training_loop
[rank1]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2857, in _maybe_log_save_evaluate
[rank1]:     metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 2814, in _evaluate
[rank1]:     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 3807, in evaluate
[rank1]:     output = eval_loop(
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/transformers/trainer.py", line 4016, in evaluation_loop
[rank1]:     labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/accelerator.py", line 2507, in pad_across_processes
[rank1]:     return pad_across_processes(tensor, dim=dim, pad_index=pad_index, pad_first=pad_first)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 411, in wrapper
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 678, in pad_across_processes
[rank1]:     return recursively_apply(
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
[rank1]:     return func(data, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 659, in _pad_across_processes
[rank1]:     sizes = gather(size).cpu()
[rank1]:             ^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 375, in wrapper
[rank1]:     return function(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 436, in gather
[rank1]:     return _gpu_gather(tensor)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 355, in _gpu_gather
[rank1]:     return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
[rank1]:     return func(data, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/accelerate/utils/operations.py", line 345, in _gpu_gather_one
[rank1]:     gather_op(output_tensors, tensor)
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3200, in all_gather_into_tensor
[rank1]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: ValueError: Tensors must be CUDA and dense
 67%|██████████████████████▋           | 2/3 [00:03<00:01,  1.61s/it]
W0904 13:55:02.654000 139591565502272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1754709 closing signal SIGTERM
E0904 13:55:02.869000 139591565502272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 1754710) of binary: /home/dcml0714/miniconda3/envs/llama-factory/bin/python
Traceback (most recent call last):
  File "/home/dcml0714/miniconda3/envs/llama-factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dcml0714/miniconda3/envs/llama-factory/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/dcml0714/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-04_13:55:02
  host      : s06.speech
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1754710)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
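
Reading the traceback, the proximate failure is a CPU tensor reaching an NCCL collective: in the eval loop, pad_across_processes gathers each rank's label-tensor size, and the NCCL backend only accepts dense CUDA tensors. A minimal standalone sketch of the same failure (my own illustration, not LLaMA-Factory code; launch with torchrun --nproc_per_node=2 repro.py):

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

size = torch.tensor([4])  # CPU tensor, like the size gathered in _pad_across_processes
out = torch.empty(dist.get_world_size(), dtype=size.dtype)

# NCCL collectives require dense CUDA tensors, so this raises
# "ValueError: Tensors must be CUDA and dense":
dist.all_gather_into_tensor(out, size)

# Moving both tensors onto the current CUDA device avoids the error:
# dist.all_gather_into_tensor(out.cuda(), size.cuda())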

Expected behavior

I expect the code to execute without error when running on multiple GPUs. When I run the same code on a single GPU, there is no error and training finishes normally.
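
A single-GPU run can be pinned to one device with the standard CUDA environment variable, e.g.:

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/train_lora/qwen2vl_lora_sft.yaml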

Others

No response

hiyouga commented 2 months ago

Please update the code to the latest version.
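
For anyone hitting this later: with a source checkout, updating presumably amounts to pulling and reinstalling. A sketch following the README's install extras; adjust to your environment:

cd LLaMA-Factory
git pull
pip install -e ".[torch,metrics]"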

d223302 commented 2 months ago

Thank you for the quick reply! I updated to the latest version and the issue is resolved. I also noticed that data preprocessing and training are much faster in the latest version. Thanks!