meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supporting a number of candid inference solutions such as HF TGI, VLLM for local or cloud deployment. Demo apps to showcase Meta Llama for WhatsApp & Messenger.

[Solved] [Llama-11B-Vision] [Lora Finetune] [PEFT] `IndexError: list index out of range` when saving checkpoints #780

Closed · yifan-gao-dev closed this issue 3 days ago

yifan-gao-dev commented 1 week ago

System Info

pytorch: 2.2.0
cuda: 11.8
gpu: V100
num of gpus: 4

Information

šŸ› Describe the bug

Hi! Thanks for your great work! I got an `IndexError: list index out of range` when running the command provided in `recipes/quickstart/finetuning/finetune_vision_model.md`:

 torchrun --nnodes 1 --nproc_per_node 4  recipes/quickstart/finetuning/finetuning.py --enable_fsdp --lr 1e-5  --num_epochs 3 --batch_size_training 2 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder fine-tuned  --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "recipes/quickstart/finetuning/datasets/ocrvqa_dataset.py"  --run_validation True --batching_strategy padding  --use_peft --peft_method lora

The model is Llama-3.2-11B-Vision-Instruct.

I successfully finished one epoch of training and evaluation. However, when it came to saving the model, an `IndexError: list index out of range` occurred.


The program then shut down, leaving a PATH/to/save/PEFT/model directory behind in the root directory.

Error logs

The full error log is below:

```
Starting epoch 0/3
train_config.max_train_step: 0
/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/cuda/memory.py:330: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
Training Epoch: 1:   0%| | 0/225 [00:00<?, ?it/s]
/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/utils/checkpoint.py:90: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False.
Training Epoch: 1/3, step 224/225 completed (loss: 0.12238156795501709): 100%|ā–ˆ| 225/225 [34:02<00:
Training Epoch: 1/3, step 224/225 completed (loss: 0.11427164822816849): 100%|ā–ˆ| 225/225 [34:04<00:
Training Epoch: 1/3, step 224/225 completed (loss: 0.020826280117034912): 100%|ā–ˆ| 225/225 [34:04<00
Training Epoch: 1/3, step 224/225 completed (loss: 0.32271334528923035): 100%|ā–ˆ| 225/225 [34:05<00:
Max CUDA memory allocated was 16 GB
Max CUDA memory reserved was 19 GB
Peak active CUDA memory was 16 GB
CUDA Malloc retries : 0
CPU Total Peak Memory consumed during the train (max): 4 GB
evaluating Epoch: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 50/50 [03:00<00:00, 3.61s/it]
evaluating Epoch: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 50/50 [03:00<00:00, 3.61s/it]
evaluating Epoch: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 50/50 [03:00<00:00, 3.61s/it]
evaluating Epoch: 100%|ā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆā–ˆ| 50/50 [03:00<00:00, 3.61s/it]
eval_ppl=tensor(1.2279, device='cuda:0') eval_epoch_loss=tensor(0.2053, device='cuda:0')
we are about to save the PEFT modules
Repo card metadata block was not found. Setting CardData to empty.
Traceback (most recent call last):
  File "/home/gaoyifan/llama-recipes/recipes/quickstart/finetuning/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/gaoyifan/llama-recipes/src/llama_recipes/finetuning.py", line 391, in main
    results = train(
  File "/home/gaoyifan/llama-recipes/src/llama_recipes/utils/train_utils.py", line 241, in train
    save_peft_checkpoint(model, train_config.output_dir)
  File "/home/gaoyifan/llama-recipes/src/llama_recipes/model_checkpointing/checkpoint_handler.py", line 276, in save_peft_checkpoint
    state_dict = get_model_state_dict(model, options=options)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py", line 600, in get_model_state_dict
    model_state_dict = _get_model_state_dict(model, info)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py", line 326, in _get_model_state_dict
    fqns = _get_fqns(model, key)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py", line 160, in _get_fqns
    if obj_names[i + 1] == FLAT_PARAM:
IndexError: list index out of range
[2024-11-10 07:14:07,091] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4919 closing signal SIGTERM
[2024-11-10 07:14:07,092] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4922 closing signal SIGTERM
[2024-11-10 07:14:07,092] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4925 closing signal SIGTERM
[2024-11-10 07:14:07,609] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4918) of binary: /home/gaoyifan/miniconda3/envs/llama-11b/bin/python
Traceback (most recent call last):
  File "/home/gaoyifan/miniconda3/envs/llama-11b/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/gaoyifan/miniconda3/envs/llama-11b/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

recipes/quickstart/finetuning/finetuning.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2024-11-10_07:14:07
  host : ubuntu
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 4918)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

Expected behavior

The reloadable checkpoint should be saved correctly. I would appreciate a timely response. Thanks!
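
For reference, the failing call is the PEFT checkpoint save, which asks `torch.distributed.checkpoint.state_dict.get_model_state_dict` for a consolidated state dict of the FSDP-wrapped PEFT model and then hands it to `save_pretrained`. A minimal sketch of that pattern, assuming the torch 2.2 distributed-checkpoint API (the function body is illustrative, not the verbatim llama-recipes source):

```python
# Illustrative sketch of the failing save path, assuming the torch >= 2.2
# distributed-checkpoint API; not the verbatim llama-recipes implementation.
from torch.distributed.checkpoint.state_dict import StateDictOptions, get_model_state_dict


def save_peft_checkpoint(model, output_dir: str) -> None:
    """Gather a full, CPU-offloaded state dict from the FSDP-wrapped PEFT model
    and save the adapter weights via PEFT's save_pretrained."""
    options = StateDictOptions(full_state_dict=True, cpu_offload=True)
    # This is the call that ends up in torch's _get_fqns and raises
    # `IndexError: list index out of range` on torch 2.2.0 (see traceback above).
    state_dict = get_model_state_dict(model, options=options)
    model.save_pretrained(output_dir, state_dict=state_dict)
```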

HamidShojanazeri commented 3 days ago

cc : @wukaixingxp

yifan-gao-dev commented 3 days ago

@HamidShojanazeri Sorry guys, it's not your problem; it is a PyTorch bug. I am using PyTorch 2.2.0, and in that version `_get_fqns` in `torch/distributed/checkpoint/state_dict.py` accesses `obj_names[i + 1]` without an index-out-of-bounds check (the line 160 shown in the traceback).

In newer versions of PyTorch this bug is fixed by adding a bounds check before the lookup, but the fix was never backported to torch==2.2.0.
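
To make the root cause concrete, here is a minimal, self-contained sketch of the indexing pattern involved. The constant, the helper names, and the example FQN below are illustrative stand-ins for what `_get_fqns` does, not the verbatim PyTorch source; the point is only that the fixed version guards the `obj_names[i + 1]` lookup while torch 2.2.0 does not.

```python
# Illustrative reproduction of the indexing pattern in _get_fqns; the constant,
# the helpers, and the example FQN are stand-ins, not the real PyTorch source.
FLAT_PARAM = "_flat_param"


def next_is_flat_param_torch220(obj_names: list[str], i: int) -> bool:
    # torch 2.2.0 style: no bounds check before peeking at the next component,
    # so an FQN whose *last* component reaches this branch raises IndexError.
    return obj_names[i + 1] == FLAT_PARAM


def next_is_flat_param_fixed(obj_names: list[str], i: int) -> bool:
    # newer torch style: guard the lookup so the last component is handled safely.
    return i < len(obj_names) - 1 and obj_names[i + 1] == FLAT_PARAM


# Hypothetical FQN ending in the FSDP wrapper component, just to show the crash.
names = "base_model.model._fsdp_wrapped_module".split(".")
i = len(names) - 1                            # wrapper name is the last component
print(next_is_flat_param_fixed(names, i))     # False -> saving can continue
try:
    next_is_flat_param_torch220(names, i)
except IndexError as e:
    print(f"torch 2.2.0 pattern crashes: {e}")  # list index out of range
```

In practice, since the bounds check is already present in newer releases, upgrading PyTorch beyond 2.2.0 to a version that includes the fix should avoid the crash without patching anything locally.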