I will polish the README and release the weights this week. I will also check the error you mentioned.
After reviewing the code, I discovered that switching to Vicuna may resolve the error.
Thank you for your response. After switching to the vicuna-7b checkpoint per your suggestion, this error happens:
First, it reports that input_embeddings is referenced before it is assigned at this line: https://github.com/jshilong/GPT4RoI/blob/main/gpt4roi/models/spi_llava.py#L289
Then I edited L260-L263 from:
num_new_tokens = num_new_tokens + num_spi_tokens
if num_new_tokens > 0:
    input_embeddings = self.get_input_embeddings().weight.data
    output_embeddings = self.get_output_embeddings().weight.data
to
num_new_tokens = num_new_tokens + num_spi_tokens
input_embeddings = self.get_input_embeddings().weight.data
output_embeddings = self.get_output_embeddings().weight.data
if num_new_tokens > 0:
    ...
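For context, here is my rough understanding of what that part of initialize_vision_tokenizer does, written as a standalone sketch of the usual LLaVA-style flow rather than the exact GPT4RoI code (the token names and the model.embed_tokens.weight key are assumptions on my side):

import torch

def initialize_vision_tokenizer_sketch(model, tokenizer,
                                        pretrain_mm_mlp_adapter=None,
                                        new_tokens=("<im_start>", "<im_end>")):
    # Add the image special tokens; num_new_tokens is 0 if they already exist.
    num_new_tokens = tokenizer.add_tokens(list(new_tokens), special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    input_embeddings = model.get_input_embeddings().weight.data
    output_embeddings = model.get_output_embeddings().weight.data

    if num_new_tokens > 0:
        # New rows start as the mean of the existing embeddings.
        input_embeddings[-num_new_tokens:] = \
            input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings[-num_new_tokens:] = \
            output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

    if pretrain_mm_mlp_adapter is not None:
        weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
        # The LLaVA pretraining-stage projector .bin also stores the embedding
        # rows for the two image tokens (exact key name assumed here).
        embed_tokens_weight = weights["model.embed_tokens.weight"]
        if embed_tokens_weight.shape == input_embeddings.shape:
            input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
        elif embed_tokens_weight.shape[0] == num_new_tokens:
            input_embeddings[-num_new_tokens:] = embed_tokens_weight
        else:
            raise ValueError(
                f"Unexpected embed_tokens_weight shape. "
                f"Pretrained: {embed_tokens_weight.shape}. "
                f"Current: {input_embeddings.shape}. "
                f"Number of new tokens: {num_new_tokens}.")

If num_new_tokens ends up as 0 while the projector still carries a 2x4096 embed_tokens weight, neither branch matches and the ValueError is raised, which looks like exactly what happens in my run.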
However, even with this reordering it unfortunately raises the same error, and it still does not work with the vicuna-7b checkpoint from your link. Below is the full log; please help me take a look. I appreciate it!
torchrun --nnodes=1 --nproc_per_node=1 --master_port=25002 gpt4roi/train/train_mem.py \
    --model_name_or_path /vicuna-7b/ \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter /LLaVA/LLaVA-Pretrained-Projectors/LLaVA-7b-pretrain-projector-v0-CC3M-595K-original_caption.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --fp16 True \
    --output_dir ./exp/stage1 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.003 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to none \
    --seed 0
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████| 14/14 [00:20<00:00, 1.43s/it]
Some weights of SPILlavaMPTForCausalLM were not initialized from the model checkpoint at /vicuna-7b/ and are newly initialized: ['model.spi_module.mlvl_fuse.input_conv.1.weight', 'model.spi_module.roi_align.pconvs.3.bias', 'model.spi_module.mlvl_fuse.fuse_convs.3.gn.bias', 'model.spi_module.roi_align.flatten_linear.weight', 'model.spi_module.roi_align.pconvs.0.weight', 'model.spi_module.roi_align.pos_embedd.0.bias', 'model.spi_module.roi_align.pconvs.1.bias', 'model.spi_module.mlvl_fuse.input_conv.0.bias', 'model.spi_module.mlvl_fuse.input_conv.3.bias', 'model.spi_module.roi_align.pos_embedd.3.bias', 'model.spi_module.mlvl_fuse.fuse_convs.4.gn.bias', 'model.spi_module.mlvl_fuse.input_conv.1.bias', 'model.spi_module.mlvl_fuse.input_conv.2.weight', 'model.spi_module.roi_align.pos_embedd.5.weight', 'model.spi_module.roi_align.pos_embedd.2.bias', 'model.spi_module.roi_align.pconvs.2.bias', 'model.spi_module.mlvl_fuse.fuse_convs.2.gn.weight', 'model.spi_module.mlvl_fuse.fuse_convs.3.gn.weight', 'model.spi_module.mlvl_fuse.fuse_convs.0.gn.bias', 'model.spi_module.roi_align.pconvs.2.weight', 'model.spi_module.roi_align.pos_embedd.5.bias', 'model.spi_module.mlvl_fuse.input_conv.3.weight', 'model.spi_module.mlvl_fuse.fuse_convs.0.conv.weight', 'model.spi_module.roi_align.pos_embedd.3.weight', 'model.spi_module.mlvl_fuse.fuse_convs.4.conv.weight', 'model.spi_module.mlvl_fuse.fuse_convs.4.gn.weight', 'model.spi_module.roi_align.updims.weight', 'model.spi_module.mlvl_fuse.fuse_convs.2.gn.bias', 'model.spi_module.mlvl_fuse.fuse_convs.1.gn.weight', 'model.spi_module.roi_align.pos_embedd.2.weight', 'model.spi_module.mlvl_fuse.fuse_convs.0.gn.weight', 'model.spi_module.roi_align.pconvs.1.weight', 'model.spi_module.mlvl_fuse.fuse_convs.3.conv.weight', 'model.spi_module.roi_align.pconvs.3.weight', 'model.spi_module.roi_align.pos_embedd.0.weight', 'model.spi_module.mlvl_fuse.fuse_convs.2.conv.weight', 'model.spi_module.roi_align.flatten_linear.bias', 'model.spi_module.roi_align.updims.bias', 'model.spi_module.mlvl_fuse.fuse_convs.1.conv.weight', 'model.spi_module.mlvl_fuse.fuse_convs.1.gn.bias', 'model.spi_module.mlvl_fuse.input_conv.0.weight', 'model.spi_module.mlvl_fuse.input_conv.2.bias', 'model.spi_module.roi_align.pconvs.0.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at openai/clip-vit-large-patch14 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.4.mlp.fc1.bias', 'text_model.encoder.layers.7.self_attn.k_proj.bias', 'text_model.encoder.layers.9.layer_norm1.bias', 'text_model.encoder.layers.2.self_attn.v_proj.bias', 'text_model.encoder.layers.7.self_attn.out_proj.bias', 'text_model.encoder.layers.2.layer_norm1.weight', 'text_model.encoder.layers.1.layer_norm2.bias', 'text_model.encoder.layers.10.self_attn.q_proj.weight', 'text_model.encoder.layers.3.layer_norm2.bias', 'text_model.encoder.layers.4.self_attn.q_proj.weight', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.5.mlp.fc1.weight', 'text_model.encoder.layers.9.mlp.fc2.bias', 'text_model.encoder.layers.6.self_attn.out_proj.weight', 'text_model.encoder.layers.10.mlp.fc2.weight', 'text_model.encoder.layers.10.layer_norm2.bias', 'text_model.encoder.layers.4.mlp.fc1.weight', 'text_model.encoder.layers.7.layer_norm1.bias', 'text_model.encoder.layers.8.self_attn.out_proj.bias', 'text_model.encoder.layers.2.mlp.fc1.weight', 'text_model.encoder.layers.3.mlp.fc2.weight', 'text_model.embeddings.token_embedding.weight', 'text_model.encoder.layers.10.layer_norm1.bias', 'text_model.encoder.layers.0.self_attn.v_proj.bias', 'text_model.encoder.layers.0.layer_norm1.weight', 'text_model.encoder.layers.6.self_attn.v_proj.bias', 'text_model.encoder.layers.0.layer_norm2.weight', 'text_model.encoder.layers.4.self_attn.v_proj.bias', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.0.mlp.fc1.bias', 'text_model.encoder.layers.10.layer_norm1.weight', 'text_model.encoder.layers.9.layer_norm1.weight', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_projection.weight', 'text_model.encoder.layers.1.mlp.fc1.bias', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.embeddings.position_ids', 'text_model.encoder.layers.2.self_attn.q_proj.bias', 'text_model.encoder.layers.6.layer_norm1.bias', 'text_model.encoder.layers.9.self_attn.k_proj.bias', 'text_model.encoder.layers.5.layer_norm2.bias', 'text_model.encoder.layers.7.self_attn.k_proj.weight', 'text_model.encoder.layers.7.mlp.fc2.bias', 'text_model.encoder.layers.9.self_attn.out_proj.weight', 'logit_scale', 'text_model.encoder.layers.8.mlp.fc1.weight', 'text_model.encoder.layers.8.layer_norm1.weight', 'text_model.encoder.layers.7.mlp.fc2.weight', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.2.self_attn.k_proj.weight', 'text_model.encoder.layers.2.layer_norm1.bias', 'text_model.encoder.layers.4.self_attn.v_proj.weight', 'text_model.encoder.layers.5.mlp.fc1.bias', 'text_model.encoder.layers.9.layer_norm2.bias', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.5.self_attn.k_proj.bias', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.10.layer_norm2.weight', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.5.layer_norm1.weight', 'text_model.encoder.layers.8.self_attn.q_proj.weight', 'text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.k_proj.weight', 'text_model.encoder.layers.1.self_attn.k_proj.bias', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.5.self_attn.v_proj.weight', 
'text_model.encoder.layers.0.self_attn.out_proj.weight', 'text_model.encoder.layers.4.self_attn.out_proj.bias', 'text_model.encoder.layers.0.self_attn.q_proj.weight', 'text_model.encoder.layers.6.mlp.fc1.bias', 'text_model.encoder.layers.3.self_attn.out_proj.bias', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.1.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.out_proj.bias', 'text_model.encoder.layers.7.self_attn.v_proj.weight', 'text_model.encoder.layers.3.layer_norm1.weight', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.0.self_attn.q_proj.bias', 'text_model.encoder.layers.5.layer_norm2.weight', 'text_model.encoder.layers.2.self_attn.out_proj.weight', 'text_model.encoder.layers.5.self_attn.q_proj.bias', 'text_model.encoder.layers.8.self_attn.out_proj.weight', 'text_model.encoder.layers.3.mlp.fc2.bias', 'text_model.encoder.layers.6.self_attn.v_proj.weight', 'text_model.encoder.layers.5.self_attn.k_proj.weight', 'text_model.encoder.layers.3.layer_norm1.bias', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.encoder.layers.5.mlp.fc2.bias', 'text_model.encoder.layers.6.self_attn.q_proj.weight', 'text_model.encoder.layers.6.self_attn.k_proj.weight', 'text_model.encoder.layers.7.mlp.fc1.bias', 'text_model.encoder.layers.10.mlp.fc2.bias', 'text_model.encoder.layers.0.mlp.fc2.weight', 'text_model.encoder.layers.8.self_attn.v_proj.bias', 'text_model.encoder.layers.1.self_attn.k_proj.weight', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.6.layer_norm1.weight', 'text_model.encoder.layers.10.self_attn.out_proj.bias', 'text_model.encoder.layers.6.mlp.fc2.weight', 'text_model.encoder.layers.6.self_attn.k_proj.bias', 'text_model.encoder.layers.0.layer_norm1.bias', 'text_model.encoder.layers.4.self_attn.q_proj.bias', 'text_model.encoder.layers.2.layer_norm2.bias', 'text_model.encoder.layers.0.self_attn.out_proj.bias', 'text_model.encoder.layers.3.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.k_proj.weight', 'text_model.encoder.layers.1.mlp.fc2.weight', 'text_model.encoder.layers.4.layer_norm2.bias', 'text_model.encoder.layers.9.mlp.fc2.weight', 'text_model.final_layer_norm.bias', 'text_model.encoder.layers.7.self_attn.q_proj.weight', 'text_model.encoder.layers.9.self_attn.q_proj.weight', 'text_model.encoder.layers.8.layer_norm1.bias', 'text_model.encoder.layers.4.self_attn.k_proj.bias', 'text_model.encoder.layers.2.self_attn.k_proj.bias', 'visual_projection.weight', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.8.self_attn.v_proj.weight', 'text_model.encoder.layers.0.mlp.fc1.weight', 'text_model.encoder.layers.1.layer_norm1.weight', 'text_model.encoder.layers.4.mlp.fc2.bias', 'text_model.encoder.layers.3.layer_norm2.weight', 'text_model.encoder.layers.8.self_attn.q_proj.bias', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.9.self_attn.v_proj.bias', 'text_model.encoder.layers.9.self_attn.v_proj.weight', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.8.mlp.fc2.bias', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.2.self_attn.v_proj.weight', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.7.layer_norm2.weight', 
'text_model.encoder.layers.2.mlp.fc2.bias', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.3.self_attn.k_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.weight', 'text_model.encoder.layers.6.layer_norm2.bias', 'text_model.encoder.layers.4.self_attn.out_proj.weight', 'text_model.encoder.layers.5.self_attn.out_proj.bias', 'text_model.encoder.layers.6.self_attn.q_proj.bias', 'text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.2.mlp.fc2.weight', 'text_model.encoder.layers.10.self_attn.v_proj.weight', 'text_model.encoder.layers.10.mlp.fc1.bias', 'text_model.encoder.layers.1.self_attn.q_proj.weight', 'text_model.encoder.layers.8.layer_norm2.weight', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.6.mlp.fc2.bias', 'text_model.encoder.layers.5.layer_norm1.bias', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.5.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.k_proj.bias', 'text_model.encoder.layers.2.layer_norm2.weight', 'text_model.encoder.layers.10.self_attn.out_proj.weight', 'text_model.encoder.layers.7.layer_norm1.weight', 'text_model.embeddings.position_embedding.weight', 'text_model.encoder.layers.3.self_attn.v_proj.bias', 'text_model.final_layer_norm.weight', 'text_model.encoder.layers.2.self_attn.out_proj.bias', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.7.self_attn.v_proj.bias', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.q_proj.weight', 'text_model.encoder.layers.8.layer_norm2.bias', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.9.mlp.fc1.weight', 'text_model.encoder.layers.9.self_attn.q_proj.bias', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.encoder.layers.1.self_attn.v_proj.bias', 'text_model.encoder.layers.7.mlp.fc1.weight', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.encoder.layers.4.layer_norm2.weight', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.4.self_attn.k_proj.weight', 'text_model.encoder.layers.7.self_attn.q_proj.bias', 'text_model.encoder.layers.5.self_attn.v_proj.bias', 'text_model.encoder.layers.10.self_attn.k_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.bias', 'text_model.encoder.layers.3.mlp.fc1.bias', 'text_model.encoder.layers.5.mlp.fc2.weight', 'text_model.encoder.layers.9.layer_norm2.weight', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.7.self_attn.out_proj.weight', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.4.layer_norm1.bias', 'text_model.encoder.layers.8.mlp.fc1.bias', 'text_model.encoder.layers.9.self_attn.out_proj.bias', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 'text_model.encoder.layers.1.mlp.fc2.bias']
- This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/gpt4roi/gpt4roi/train/train_mem.py", line 16, in <module>
    train()
  File "/gpt4roi/gpt4roi/train/train.py", line 641, in train
    model.initialize_vision_tokenizer(mm_use_im_start_end=model_args.mm_use_im_start_end,
  File "/gpt4roi/gpt4roi/models/spi_llava.py", line 295, in initialize_vision_tokenizer
    raise ValueError(
ValueError: Unexpected embed_tokens_weight shape. Pretrained: torch.Size([2, 4096]). Current: torch.Size([32000, 4096]). Numer of new tokens: 0.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 185731) of binary: /bin/python
Traceback (most recent call last):
  File "/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
gpt4roi/train/train_mem.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-18_00:51:04
host : c1703
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 185731)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, I believe the mismatched weight comes from the LLaVA pretrained projector: https://github.com/jshilong/GPT4RoI/blob/main/gpt4roi/models/spi_llava.py#L284-L287
Do you have any idea which projector is the correct one to use?
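For what it's worth, this is how I am checking which weights the projector file actually carries (a quick sketch that only prints key names and tensor shapes; the path is the one from my command):

import torch

# Inspect the pretrained projector checkpoint passed via --pretrain_mm_mlp_adapter.
ckpt_path = ("/LLaVA/LLaVA-Pretrained-Projectors/"
             "LLaVA-7b-pretrain-projector-v0-CC3M-595K-original_caption.bin")
weights = torch.load(ckpt_path, map_location="cpu")
for name, tensor in weights.items():
    print(name, tuple(tensor.shape))

The "Pretrained: torch.Size([2, 4096])" part of the error suggests this file carries a 2x4096 embed_tokens entry alongside the mm_projector weights, but I am not sure whether this projector is the one you intended for this stage.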
I have published the checkpoint and polished the README today, so you can now pull them and use this repository more smoothly.
I have polished the README, especially on how to prepare data and checkpoints to launch the training. You can now pull the repository. Please feel free to reach out to me again if you encounter any further issues.
I can reproduce the training stage now. Thank you for your support!
Hi,
I appreciate the effort you have put into your framework, but I ran into some confusion while attempting to retrain it. The guidance suggests using the original LLaMA weights for training, but I noticed that in your script the model name input is set to vicuna-7b: /mnt/petrelfs/share_data/zhangshilong/vicuna-7b/. I attempted to use both the original LLaMA weights and the LLaVA Hugging Face format (I haven't applied your delta since it hasn't been released yet), but it always resulted in this error:
I would appreciate your guidance in resolving the error and making the code runnable. Could you please provide me with the necessary steps or adjustments to address the issue?
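In case it helps to narrow things down, this is the quick sanity check I run on whichever base checkpoint I pass as --model_name_or_path (a small sketch; the path is just a placeholder):

from transformers import AutoConfig, AutoTokenizer

# Placeholder path: point this at the base checkpoint used for --model_name_or_path.
base_path = "/path/to/base-checkpoint"
config = AutoConfig.from_pretrained(base_path)
tokenizer = AutoTokenizer.from_pretrained(base_path, use_fast=False)
print("config.vocab_size:", config.vocab_size)
print("len(tokenizer):", len(tokenizer))

My thinking is that if the base tokenizer already contains extra special tokens, the number of newly added tokens changes, which could affect how initialize_vision_tokenizer behaves.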