mbzuai-oryx / GeoChat

[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
https://mbzuai-oryx.github.io/GeoChat

[RuntimeError: Size Mismatch] #11

Closed. Luo-Z13 closed this issue 7 months ago.

Luo-Z13 commented 7 months ago

I used finetune_lora.sh to train, with the following command:

 deepspeed --master_port=$((RANDOM + 10000)) --include localhost:0,1 geochat/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --model_name_or_path /checkpoints/llava-v1.5-7b \
    --version $PROMPT_VERSION \
    --data_path /geochat/GeoChat_Instruct.json \
    --image_folder /Dataset/geochat/final_images_llava  \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter /checkpoints/llava-v1.5-7b/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --bf16 True \
    --output_dir ./out_checkpoints/geochat \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 7000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 16 

and got the following error (the same traceback was printed by each rank):

  File "/project/GeoChat/geochat/model/geochat_arch.py", line 96, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  ...
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 866, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 200, in forward
    embeddings = embeddings + self.position_embedding(self.position_ids)
RuntimeError: The size of tensor a (577) must match the size of tensor b (1297) at non-singleton dimension 1
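
For reference, the two lengths in the error are CLIP ViT-L/14 token counts (14x14 pixel patches plus one CLS token); a minimal sketch of the arithmetic:

    # CLIP ViT-L/14: (image_size // patch_size)**2 patches + 1 CLS token.
    def clip_tokens(image_size: int, patch_size: int = 14) -> int:
        return (image_size // patch_size) ** 2 + 1

    print(clip_tokens(336))  # 577  -> tokens produced by a 336x336 input
    print(clip_tokens(504))  # 1297 -> length of GeoChat's interpolated position embeddings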

KjAeRsTuIsK commented 7 months ago

Hi @Luo-Z13, thank you for your interest. You need to change the image size from 336 to 504:

    image = processor.preprocess(image, do_resize=True, crop_size={'height': 504, 'width': 504}, size={'shortest_edge': 504}, return_tensors='pt')['pixel_values'][0]

Can you please change this line in the train.py file (lines 690-691)? I have made the change in the codebase as well. Let me know if it works now.
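
A quick way to sanity-check the edit, assuming the same CLIPImageProcessor the training script loads (the dummy image below is only there to test output shapes):

    from PIL import Image
    from transformers import CLIPImageProcessor

    # Assumes the processor config can be fetched from the Hub.
    processor = CLIPImageProcessor.from_pretrained('openai/clip-vit-large-patch14-336')
    image = Image.new('RGB', (800, 600))  # placeholder RGB image
    pixels = processor.preprocess(image, do_resize=True,
                                  crop_size={'height': 504, 'width': 504},
                                  size={'shortest_edge': 504},
                                  return_tensors='pt')['pixel_values'][0]
    print(pixels.shape)  # expected: torch.Size([3, 504, 504]) -> (504 // 14)**2 = 1296 patches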

Luo-Z13 commented 7 months ago

> Hi @Luo-Z13, thank you for your interest. You need to change the image size from 336 to 504: image = processor.preprocess(image, do_resize=True, crop_size={'height': 504, 'width': 504}, size={'shortest_edge': 504}, return_tensors='pt')['pixel_values'][0] Can you please change this line in the train.py file (lines 690-691)? I have made the change in the codebase as well. Let me know if it works now.

Thank you for the response; the previous issue is now resolved. However, I am encountering an OOM error when training on 4xA100 (40 GB) GPUs. Details are as follows:

  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/peft/tuners/lora.py", line 822, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
      File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/peft/tuners/lora.py", line 822, in forward
return forward_call(*args, **kwargs)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/peft/tuners/lora.py", line 822, in forward
    return forward_call(*args, **kwargs)
  File "/miniconda-3/envs/geochat/lib/python3.10/site-packages/peft/tuners/lora.py", line 822, in forward
            self.lora_B[self.active_adapter](self.lora_B[self.active_adapter](self.lora_B[self.active_adapter](

torch.cudatorch.cudatorch.cuda...OutOfMemoryErrorOutOfMemoryErrorOutOfMemoryError: : : CUDA out of memory. Tried to allocate 1.04 GiB (GPU 3; 39.39 GiB total capacity; 29.67 GiB already allocated; 1.02 GiB free; 36.57 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFCUDA out of memory. Tried to allocate 1.07 GiB (GPU 1; 39.39 GiB total capacity; 30.12 GiB already allocated; 397.12 MiB free; 37.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONFCUDA out of memory. Tried to allocate 1.04 GiB (GPU 2; 39.39 GiB total capacity; 29.76 GiB already allocated; 911.12 MiB free; 36.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

    self.lora_B[self.active_adapter](
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.06 GiB (GPU 0; 39.39 GiB total capacity; 29.99 GiB already allocated; 719.12 MiB free; 36.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

  0%|          | 0/2413 [00:44<?, ?it/s]
[2024-03-04 22:06:58,942] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 176970
[2024-03-04 22:06:59,647] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 176971
[2024-03-04 22:06:59,665] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 176972
[2024-03-04 22:06:59,681] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 176973
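
Presumably the extra memory comes from the larger input: at 504x504 the vision tower emits 1296 image tokens per sample instead of 576, so the sequence fed to the LLM grows accordingly. A hedged sketch of one common mitigation, trading per-device batch size for gradient accumulation so the effective batch size stays at 32 (the values are untested guesses, not settings from the authors):

    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
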
Luo-Z13 commented 7 months ago

The script merge_lora_weights.py seems to have an issue with its imports at the beginning (from llava...?). After I changed them from

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

to

from geochat.model.builder import load_pretrained_model
from geochat.mm_utils import get_model_name_from_path

an error occurred:

Traceback (most recent call last):
  File "GeoChat/scripts/merge_lora_weights.py", line 24, in <module>
    merge_lora(args)
  File "GeoChat/scripts/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name, device_map='cpu')
  File "GeoChat/geochat/model/builder.py", line 110, in load_pretrained_model
    model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
  File "miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 461, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 998, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "miniconda-3/envs/geochat/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 710, in __getitem__
    raise KeyError(key)
KeyError: 'llava'

KjAeRsTuIsK commented 7 months ago

@Luo-Z13, can you please check what the "model_type" is in the config.json of your base model and of the saved checkpoint? Please change it to "geochat" if it is "llava". Let me know if that works.
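
If it helps, a minimal sketch of that edit as a script; the paths are placeholders for wherever your base model and checkpoint live:

    import json

    # Placeholder paths: point these at your base model and saved checkpoint.
    for cfg_path in ["checkpoints/llava-v1.5-7b/config.json",
                     "out_checkpoints/geochat/config.json"]:
        with open(cfg_path) as f:
            cfg = json.load(f)
        if cfg.get("model_type") == "llava":
            cfg["model_type"] = "geochat"  # lets AutoConfig resolve the GeoChat model class
            with open(cfg_path, "w") as f:
                json.dump(cfg, f, indent=2)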

Luo-Z13 commented 7 months ago

> @Luo-Z13, can you please check what the "model_type" is in the config.json of your base model and of the saved checkpoint? Please change it to "geochat" if it is "llava". Let me know if that works.

Thank you very much, it works now.

KjAeRsTuIsK commented 7 months ago

Closing this issue for now; please reopen it if you run into any further difficulties.