haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Finetune error. #610

Open ltttpku opened 11 months ago

ltttpku commented 11 months ago

Question

Hi, thanks for your great work!

When finetuning the model on my own custom dataset, I encountered the following error:

[2023-10-19 00:51:08,556] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-19 00:51:10,671] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-10-19 00:51:10,672] [INFO] [runner.py:555:main] cmd = /home/leiting/.conda/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/zero3.json --model_name_or_path /network_space/storage43/lttt/huggingface/llava-v1.5-7b --version v1 --data_path ./playground/data/IKCEST/ft_data.json --image_folder ./playground/data --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length False --bf16 True --output_dir ./checkpoints/llava-v1.5-7b --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 500 --save_total_limit 3 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb --freeze_backbone
[2023-10-19 00:51:11,932] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-19 00:51:14,020] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [2]}
[2023-10-19 00:51:14,020] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-10-19 00:51:14,020] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-10-19 00:51:14,020] [INFO] [launch.py:163:main] dist_world_size=1
[2023-10-19 00:51:14,020] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=2
[2023-10-19 00:51:16,892] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-19 00:51:17,554] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-19 00:51:17,554] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-19 00:51:17,554] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-10-19 00:51:20,490] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.76B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.14s/it]
[2023-10-19 00:51:33,422] [WARNING] [partition_parameters.py:836:_post_init_method] param `class_embedding` in CLIPVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-10-19 00:51:33,569] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.06B parameters
Formatting inputs...Skip in lazy mode
Parameter Offload: Total persistent parameters: 599040 in 312 params
Traceback (most recent call last):
  File "/home/leiting/LLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/home/leiting/LLaVA/llava/train/train.py", line 926, in train
    trainer.train(resume_from_checkpoint=True)
  File "/home/leiting/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/leiting/.conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1676, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/home/leiting/.conda/envs/llava/lib/python3.10/site-packages/transformers/deepspeed.py", line 383, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/home/leiting/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2604, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/home/leiting/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2663, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/home/leiting/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2480, in load_module_state_dict
    param.ds_tensor.data.copy_(saved_frozen_params[name].data)
RuntimeError: The size of tensor a (131072000) must match the size of tensor b (21845334) at non-singleton dimension 0
[2023-10-19 00:51:38,046] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 89082
[2023-10-19 00:51:38,046] [ERROR] [launch.py:321:sigkill_handler] ['/home/leiting/.conda/envs/llava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=0', '--deepspeed', './scripts/zero3.json', '--model_name_or_path', '/network_space/storage43/lttt/huggingface/llava-v1.5-7b', '--version', 'v1', '--data_path', './playground/data/IKCEST/ft_data.json', '--image_folder', './playground/data', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'False', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.5-7b', '--num_train_epochs', '1', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '3', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb', '--freeze_backbone'] exits with return code = 1

Here's the script that I used for finetuning:

#!/bin/bash

deepspeed --include localhost:2 llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /network_space/storage43/lttt/huggingface/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/IKCEST/ft_data.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --freeze_backbone

Any idea why this could happen?

haotian-liu commented 11 months ago

It seems that you already have some model checkpoints saved in your experiment dir (with "--save_steps 500" the first checkpoint is reached quite quickly), and that you changed something in the configuration compared to that previous run. The trainer therefore fails when it tries to automatically resume from that checkpoint.
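If that is the case, a quick fix is to either remove the stale checkpoint-* folders or point --output_dir at a fresh directory so the trainer does not try to resume from an incompatible run. A minimal sketch, using the paths from the script above (only delete the old checkpoints if you no longer need that run):

# Inspect the output dir for checkpoints left over from a previous run
ls ./checkpoints/llava-v1.5-7b

# Option 1: remove the stale checkpoints so training starts fresh
rm -rf ./checkpoints/llava-v1.5-7b/checkpoint-*

# Option 2: keep the old run and write the new one elsewhere, e.g.
#     --output_dir ./checkpoints/llava-v1.5-7b-custom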

nj159 commented 11 months ago

Have you solved your problem? I have a similar issue.

haotian-liu commented 11 months ago

Hi, there was an issue in an earlier version of the code base with finetuning from LLaVA-1.5. Please check out the latest code base, thanks.

https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md
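For completeness, a minimal sketch of updating an existing clone to the latest code base (this assumes LLaVA was installed from source in editable mode, as in the repo README):

# Pull the latest code and reinstall the package
cd LLaVA
git pull
pip install -e .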