haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

--model_name_or_path in the training workflow: should it be vicuna? #281

Closed YerongLi closed 1 year ago

YerongLi commented 1 year ago

When did you clone our code?

I cloned the code base after 5/1/23

Describe the issue

Issue: In scripts/deepspeed/finetune_lora.sh, I think --model_name_or_path in the training workflow should not be vicuna-7b, right? It reports an error: You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.

--model_name_or_path ./checkpoints/vicuna-7b-v1.1 \

I tried both --model_name_or_path LLaVA-7B-v0 and --model_name_or_path ./checkpoints/vicuna-7b-v1.1, but neither works. Command:

#!/bin/bash

WEIGHT_VERSION=v1-1
PROMPT_VERSION=v1
MODEL_VERSION="7b"

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/deepspeed/zero3.json \
    --lora_enable True \
    --model_name_or_path ./checkpoints/vicuna-7b-v1.1 \
    --version $PROMPT_VERSION \
    --data_path ./playground/data/llava_instruct_150k.json \
    --image_folder /scratch/yerong/Multimodal-GPT/data/coco/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/mm_projector/llava-7b-pretrain.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir ./checkpoints/deepspeed_llava-$MODEL_VERSION-$WEIGHT_VERSION-finetune_lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --report_to wandb

Log:

[2023-07-19 04:52:29,958] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.             
[2023-07-19 04:52:29,971] [INFO] [runner.py:541:main] cmd = /scratch/yerong/.conda/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/deepspeed/zero3.json --lora_enable True --model_name_or_path ./checkpoints/vicuna-7b-v1.1 --version v1 --data_path ./playground/data/llava_instruct_150k.json --image_folder /scratch/yerong/Multimodal-GPT/data/coco/train2017 --vision_tower openai/clip-vit-large-patch14 --pretrain_mm_mlp_adapter ./checkpoints/mm_projector/llava-7b-pretrain.bin --mm_vision_select_layer -2 --mm_use_im_start_end True --bf16 True --output_dir ./checkpoints/deepspeed_llava-7b-v1-1-finetune_lora --num_train_epochs 3 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --lazy_preprocess True --dataloader_num_workers 4 --report_to wandb
[2023-07-19 04:52:31,695] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}                                                                  
[2023-07-19 04:52:31,695] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0                                                                
[2023-07-19 04:52:31,696] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})                                  
[2023-07-19 04:52:31,696] [INFO] [launch.py:247:main] dist_world_size=2                                                                                       
[2023-07-19 04:52:31,696] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1                                                                        
[2023-07-19 04:52:35,338] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl                                      
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.        
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.        
[2023-07-19 04:52:42,501] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters                                     
Loading checkpoint shards:   0%|                                                                                                        | 0/3 [00:00<?, ?it/s]
[2023-07-19 04:52:54,728] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49686                                                                     
[2023-07-19 04:52:54,731] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49687                                                                     
[2023-07-19 04:52:55,035] [ERROR] [launch.py:434:sigkill_handler] ['/scratch/yerong/.conda/envs/llava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=1', '--deepspeed', './scripts/deepspeed/zero3.json', '--lora_enable', 'True', '--model_name_or_path', './checkpoints/vicuna-7b-v1.1', '--version', 'v1', '--data_path', './playground/data/llava_instruct_150k.json', '--image_folder', '/scratch/yerong/Multimodal-GPT/data/coco/train2017', '--vision_tower', 'openai/clip-vit-large-patch14', '--pretrain_mm_mlp_adapter', './checkpoints/mm_projector/llava-7b-pretrain.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'True', '--bf16', 'True', '--output_dir', './checkpoints/deepspeed_llava-7b-v1-1-finetune_lora', '--num_train_epochs', '3', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--lazy_preprocess', 'True', '--dataloader_num_workers', '4', '--report_to', 'wandb'] exits with return code = -7
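
As an aside, a negative return code from the DeepSpeed launcher is the number of the signal that killed the worker, so -7 here corresponds to SIGBUS on Linux; when it happens while loading checkpoint shards it often points at exhausted (shared) memory rather than at the command-line arguments. A minimal sketch of decoding it, using only the value from the log above:

# Minimal sketch: decode DeepSpeed's "exits with return code = -7".
# A negative return code means the worker process was killed by that signal number.
import signal

rc = -7  # value reported by the launcher in the log above
if rc < 0:
    print(f"worker killed by signal {-rc} ({signal.Signals(-rc).name})")  # SIGBUS on Linux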


YerongLi commented 1 year ago
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.        
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.        

With --model_name_or_path LLaVA-7B-v0 I got this error:

Traceback (most recent call last):
  File "/scratch/yerong/LLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/scratch/yerong/LLaVA/llava/train/train.py", line 656, in train
    model = LlavaLlamaForCausalLM.from_pretrained(
  File "/scratch/yerong/.conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2643, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/scratch/yerong/.conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2952, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "/scratch/yerong/.conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 431, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'LLaVA-7B-v0/pytorch_model-00001-of-00002.bin' at 'LLaVA-7B-v0/pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
[2023-07-19 04:54:54,848] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49958
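
That OSError usually means the shard file itself cannot be deserialized (for example a truncated download or a git-lfs pointer file that was never pulled), not that the training flags are wrong. A quick, hypothetical sanity check that the shard loads at all, using the path from the traceback above:

# Hypothetical sanity check: confirm the checkpoint shard is a readable PyTorch file.
# If torch.load raises here, the file is likely corrupted or an un-pulled git-lfs pointer.
import torch

shard = "LLaVA-7B-v0/pytorch_model-00001-of-00002.bin"  # path taken from the traceback above
state_dict = torch.load(shard, map_location="cpu")
print(f"loaded {len(state_dict)} tensors from {shard}")
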
YerongLi commented 1 year ago

Use --model_name_or_path ./checkpoints/vicuna-7b-v1.1. The warning "You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors." is expected.
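
For context, the warning comes from the from_pretrained call shown in the traceback: the base checkpoint's config declares model_type "llama", while the LLaVA model class expects "llava", and transformers flags the mismatch but still loads the weights. A minimal sketch of that step, assuming the import path and the local checkpoint path used in this thread (both may differ between repo versions):

# Minimal sketch, assuming the repo layout at the time of this issue.
# Loading a llama-type checkpoint (vicuna) into the LLaVA model class is what
# llava/train/train.py does, and it triggers the benign
# "model of type llama to instantiate a model of type llava" warning.
from llava.model import LlavaLlamaForCausalLM  # import path may differ by version

model = LlavaLlamaForCausalLM.from_pretrained(
    "./checkpoints/vicuna-7b-v1.1",  # base LLM; its config.json has model_type "llama"
)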