hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

IndexError: tuple index out of range #5209

Closed · ybdesire closed this issue 1 month ago

ybdesire commented 1 month ago

Reminder

System Info

Latest LLaMA-Factory. CUDA 12.4. GPU server with 8×A800 (80 GB).

Reproduction

The default config runs successfully with the command below:

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

No error for this command.

But I get an error after adding CUDA_VISIBLE_DEVICES:

CUDA_VISIBLE_DEVICES=4,5,6,7 llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

The error message is below:

08/19/2024 09:09:24 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:20691
W0819 09:09:25.504000 140188021503808 torch/distributed/run.py:779]
W0819 09:09:25.504000 140188021503808 torch/distributed/run.py:779] *****************************************
W0819 09:09:25.504000 140188021503808 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0819 09:09:25.504000 140188021503808 torch/distributed/run.py:779] *****************************************
08/19/2024 09:09:30 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/19/2024 09:09:30 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2287] 2024-08-19 09:09:30,033 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-19 09:09:30,034 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-19 09:09:30,034 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-19 09:09:30,034 >> loading file tokenizer_config.json
08/19/2024 09:09:30 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/19/2024 09:09:30 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
08/19/2024 09:09:30 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/19/2024 09:09:30 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
08/19/2024 09:09:30 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
08/19/2024 09:09:30 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2533] 2024-08-19 09:09:30,309 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.loader - Loading dataset decompile/train_synth_compilable_combined.json...
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/19/2024 09:09:30 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
Converting format of dataset (num_proc=16): 100%|████████████████████████████████████████████████████████████| 637/637 [00:00<00:00, 4055.25 examples/s]
08/19/2024 09:09:32 - INFO - llamafactory.data.loader - Loading dataset decompile/train_synth_compilable_combined.json...
08/19/2024 09:09:32 - INFO - llamafactory.data.loader - Loading dataset decompile/train_synth_compilable_combined.json...
08/19/2024 09:09:32 - INFO - llamafactory.data.loader - Loading dataset decompile/train_synth_compilable_combined.json...
Running tokenizer on dataset (num_proc=16): 100%|█████████████████████████████████████████████████████████████| 637/637 [00:02<00:00, 275.75 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 5207, 279, 432, 4094, 97478, 12470, 2082, 369, 551, 5732, 23, 1925, 4431, 696, 517, 220, 9711, 446, 9906, 4435, 99264, 220, 471, 220, 15, 280, 92, 128009, 128006, 78191, 128007, 271, 396, 1925, 1577, 12107, 11, 1181, 3146, 6645, 340, 517, 4192, 4530, 82, 1734, 2247, 9906, 4435, 99264, 471, 220, 15, 280, 92, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Output the Refined Decompile code for : undefined8 main(void)

{
  puts("Hello World!!");
  return 0;
}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

int main(int argc, char **argv)
{
 printf("%s\n","Hello World!!");
 return 0;
}<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 396, 1925, 1577, 12107, 11, 1181, 3146, 6645, 340, 517, 4192, 4530, 82, 1734, 2247, 9906, 4435, 99264, 471, 220, 15, 280, 92, 128009]
labels:
int main(int argc, char **argv)
{
 printf("%s\n","Hello World!!");
 return 0;
}<|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-19 09:09:34,732 >> loading configuration file /data/bbb/models/meta-llama-3.1-8b-instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-19 09:09:34,735 >> Model config LlamaConfig {
  "_name_or_path": "/data/bbb/models/meta-llama-3.1-8b-instruct/",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3641] 2024-08-19 09:09:34,777 >> loading weights file /data/bbb/models/meta-llama-3.1-8b-instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-19 09:09:34,778 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-19 09:09:34,779 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ]
}

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:42<00:00, 10.59s/it]
08/19/2024 09:10:17 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/19/2024 09:10:17 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
08/19/2024 09:10:17 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/19/2024 09:10:17 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/19/2024 09:10:17 - INFO - llamafactory.model.model_utils.misc - Found linear modules: gate_proj,up_proj,down_proj,o_proj,v_proj,k_proj,q_proj
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:42<00:00, 10.62s/it]
[INFO|modeling_utils.py:4473] 2024-08-19 09:10:17,398 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4481] 2024-08-19 09:10:17,398 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /data/bbb/models/meta-llama-3.1-8b-instruct/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-19 09:10:17,401 >> loading configuration file /data/bbb/models/meta-llama-3.1-8b-instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-19 09:10:17,401 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

08/19/2024 09:10:17 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/19/2024 09:10:17 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
08/19/2024 09:10:17 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/19/2024 09:10:17 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/19/2024 09:10:17 - INFO - llamafactory.model.model_utils.misc - Found linear modules: down_proj,o_proj,k_proj,q_proj,up_proj,gate_proj,v_proj
08/19/2024 09:10:17 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
08/19/2024 09:10:17 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
[INFO|trainer.py:648] 2024-08-19 09:10:17,754 >> Using auto half precision backend
[INFO|trainer.py:2526] 2024-08-19 09:10:17,755 >> Loading model from saves/llm4d/llama3-8b/lora/sft/checkpoint-85000/.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:42<00:00, 10.75s/it]
08/19/2024 09:10:18 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/19/2024 09:10:18 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
08/19/2024 09:10:18 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/19/2024 09:10:18 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/19/2024 09:10:18 - INFO - llamafactory.model.model_utils.misc - Found linear modules: gate_proj,up_proj,down_proj,o_proj,q_proj,v_proj,k_proj
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:43<00:00, 10.79s/it]
08/19/2024 09:10:18 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/19/2024 09:10:18 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
08/19/2024 09:10:18 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/19/2024 09:10:18 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/19/2024 09:10:18 - INFO - llamafactory.model.model_utils.misc - Found linear modules: up_proj,k_proj,gate_proj,q_proj,v_proj,down_proj,o_proj
08/19/2024 09:10:18 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
08/19/2024 09:10:18 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:3098: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:3098: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:3098: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:3098: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
[INFO|trainer.py:2134] 2024-08-19 09:10:19,833 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-08-19 09:10:19,833 >>   Num examples = 573
[INFO|trainer.py:2136] 2024-08-19 09:10:19,833 >>   Num Epochs = 30,000
[INFO|trainer.py:2137] 2024-08-19 09:10:19,833 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2024-08-19 09:10:19,833 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:2141] 2024-08-19 09:10:19,833 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-08-19 09:10:19,833 >>   Total optimization steps = 540,000
[INFO|trainer.py:2143] 2024-08-19 09:10:19,838 >>   Number of trainable parameters = 20,971,520
[INFO|trainer.py:2165] 2024-08-19 09:10:19,853 >>   Continuing training from checkpoint, will skip to saved global_step
[INFO|trainer.py:2166] 2024-08-19 09:10:19,853 >>   Continuing training from epoch 4722
[INFO|trainer.py:2167] 2024-08-19 09:10:19,853 >>   Continuing training from global step 85000
[INFO|trainer.py:2169] 2024-08-19 09:10:19,853 >>   Will skip the first 4722 epochs then the first 32 batches in the first epoch.
  0%|                                                                                                                        | 0/540000 [00:00<?, ?it/s]/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
[rank3]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2260, in _inner_training_loop
[rank3]:     self._load_rng_state(resume_from_checkpoint)
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2839, in _load_rng_state
[rank3]:     torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank3]:     set_rng_state(state, i)
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank3]:     _lazy_call(cb)
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank3]:     callable()
[rank3]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 72, in cb
[rank3]:     default_generator = torch.cuda.default_generators[idx]
[rank3]:                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
[rank3]: IndexError: tuple index out of range
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
[rank1]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2260, in _inner_training_loop
[rank1]:     self._load_rng_state(resume_from_checkpoint)
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2839, in _load_rng_state
[rank1]:     torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank1]:     set_rng_state(state, i)
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank1]:     _lazy_call(cb)
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank1]:     callable()
[rank1]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 72, in cb
[rank1]:     default_generator = torch.cuda.default_generators[idx]
[rank1]:                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
[rank1]: IndexError: tuple index out of range
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2260, in _inner_training_loop
[rank0]:     self._load_rng_state(resume_from_checkpoint)
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2839, in _load_rng_state
[rank0]:     torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank0]:     set_rng_state(state, i)
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank0]:     _lazy_call(cb)
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank0]:     callable()
[rank0]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 72, in cb
[rank0]:     default_generator = torch.cuda.default_generators[idx]
[rank0]:                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
[rank0]: IndexError: tuple index out of range
/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py:2833: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
[rank2]: Traceback (most recent call last):
[rank2]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank2]:     launch()
[rank2]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank2]:     run_exp()
[rank2]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank2]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank2]:   File "/data/bbb/projects/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 94, in run_sft
[rank2]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2260, in _inner_training_loop
[rank2]:     self._load_rng_state(resume_from_checkpoint)
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/transformers/trainer.py", line 2839, in _load_rng_state
[rank2]:     torch.cuda.random.set_rng_state_all(checkpoint_rng_state["cuda"])
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 85, in set_rng_state_all
[rank2]:     set_rng_state(state, i)
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 75, in set_rng_state
[rank2]:     _lazy_call(cb)
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
[rank2]:     callable()
[rank2]:   File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/cuda/random.py", line 72, in cb
[rank2]:     default_generator = torch.cuda.default_generators[idx]
[rank2]:                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
[rank2]: IndexError: tuple index out of range
  0%|                                                                                                                        | 0/540000 [00:00<?, ?it/s]
W0819 09:10:20.975000 140188021503808 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1541263 closing signal SIGTERM
W0819 09:10:20.975000 140188021503808 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1541264 closing signal SIGTERM
W0819 09:10:20.975000 140188021503808 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1541265 closing signal SIGTERM
E0819 09:10:21.316000 140188021503808 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1541262) of binary: /home/aaa/anaconda3/envs/env_llama_factory_py311/bin/python
Traceback (most recent call last):
  File "/home/aaa/anaconda3/envs/env_llama_factory_py311/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aaa/anaconda3/envs/env_llama_factory_py311/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/data/bbb/projects/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-19_09:10:20
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1541262)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How can I set CUDA_VISIBLE_DEVICES to select specific GPUs?
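
From the traceback, the error does not seem to come from CUDA_VISIBLE_DEVICES itself: the run resumes from saves/llm4d/llama3-8b/lora/sft/checkpoint-85000/, whose CUDA RNG states were saved by the earlier run that saw all 8 GPUs, while only 4 GPUs are now visible. Below is a minimal sketch of that reading, not LLaMA-Factory code; the "8 saved states" figure is an assumption based on the earlier 8-GPU run:

```python
import torch

# Sketch of the failure mode in the traceback above. The checkpoint's
# rng_state_*.pth files hold one CUDA RNG state per GPU that was visible when
# they were saved, and Trainer._load_rng_state() restores them with
# torch.cuda.random.set_rng_state_all(), which looks up one per-device
# generator per saved state. With CUDA_VISIBLE_DEVICES=4,5,6,7 only 4
# generators exist, so the 5th state triggers IndexError: tuple index out of range.
saved_rng_states = 8                         # assumed: states written by the 8-GPU run
visible_devices = torch.cuda.device_count()  # 4 under CUDA_VISIBLE_DEVICES=4,5,6,7

for idx in range(saved_rng_states):
    if idx >= visible_devices:
        print(f"saved state {idx} has no visible device -> IndexError on resume")
        break
    print(f"saved state {idx} -> cuda:{idx} OK")
```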

Expected behavior

No response

Others

No response

hiyouga commented 1 month ago

Please change `output_dir` or use `overwrite_output_dir`.
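
For illustration (the directory name here is made up): pointing the YAML's `output_dir` at a fresh directory such as `saves/llm4d/llama3-8b/lora/sft_4gpu`, or adding `overwrite_output_dir: true`, should stop the Trainer from auto-resuming `checkpoint-85000`, whose CUDA RNG state was saved for 8 GPUs and cannot be restored onto only 4 visible devices.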