[Question] Only Tensors of floating point and complex dtype can require gradients

wangzhao0217 commented 6 months ago

Question

Dear Haotian,

Congratulations on your outstanding work! I am currently engaged in a project sponsored by the European Space Agency, aiming to utilize satellite imagery and machine learning to determine road surface conditions. I plan to assess the capabilities of the LLaVA model in this scenario by fine-tuning it with our dataset.

I ran the fine-tuning on Google Colab with the following command, and it completed successfully:

!deepspeed /content/LLaVA/llava/train/train_mem.py \
    --deepspeed /content/LLaVA/scripts/zero2.json \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --bits 4 \
    --model_name_or_path /content/LLaVA/llava-v1.5-7b \
    --version llava_llama_2 \
    --data_path /content/drive/MyDrive/ML_llava_train/train/dataset.json \
    --image_folder /content/drive/MyDrive/ML_llava_train/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --output_dir /content/LLaVA/llava/checkpoints/ESA_llava \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

However, I encountered an error when loading the fine-tuned model:

!python /content/LLaVA/llava/eval/run_llava.py --model-path /content/LLaVA/llava/checkpoints/ESA_llava  \
--model-base /content/LLaVA/llava-v1.5-7b \
--image-file /content/drive/MyDrive/ML_llava_test/test.jpg \
--query "What is the condition of this road"

Error:

[2024-03-03 12:22:49,116] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-03-03 12:22:50.450603: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-03 12:22:50.450653: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-03 12:22:50.452294: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-03 12:22:51.645656: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading LLaVA from base model...
Loading checkpoint shards:   0% 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100% 2/2 [00:04<00:00,  2.28s/it]
Traceback (most recent call last):
  File "/content/LLaVA/llava/eval/run_llava.py", line 145, in <module>
    eval_model(args)
  File "/content/LLaVA/llava/eval/run_llava.py", line 55, in eval_model
    tokenizer, model, image_processor, context_len = load_pretrained_model(
  File "/content/LLaVA/llava/model/builder.py", line 153, in load_pretrained_model
    model.resize_token_embeddings(len(tokenizer))
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 1811, in resize_token_embeddings
    model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 1847, in _resize_token_embeddings
    new_lm_head = self._get_resized_lm_head(old_lm_head, new_num_tokens)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2016, in _get_resized_lm_head
    new_lm_head = nn.Linear(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parameter.py", line 39, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients

Do you have any suggestions on how to resolve this issue?

Best wishes, Zhao
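
The traceback above ends in `nn.Linear.__init__`, where PyTorch refuses to create a trainable parameter in a dtype that is neither floating point nor complex. Why the resized `lm_head` ends up with such a dtype is not established in this thread; one plausible reading (an assumption) is that some weights were loaded in a packed integer dtype. A minimal sketch for listing such parameters follows. `can_require_grad` and `find_integer_params` are hypothetical helper names, and the check relies only on the `is_floating_point` and `is_complex` attributes that `torch.dtype` objects expose.

```python
def can_require_grad(dtype) -> bool:
    """Mirror PyTorch's rule: only floating point or complex dtypes may require gradients."""
    return bool(dtype.is_floating_point or dtype.is_complex)


def find_integer_params(named_params):
    """Return (name, dtype string) for every parameter that cannot require gradients."""
    return [
        (name, str(param.dtype))
        for name, param in named_params
        if not can_require_grad(param.dtype)
    ]


# Usage with a loaded model (not run here):
#   for name, dtype in find_integer_params(model.named_parameters()):
#       print(name, dtype)
```

If `lm_head.weight` shows up in that list, the resize step would be trying to build a new trainable layer in the same dtype, which matches the error message.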

wangzhao0217 commented 6 months ago

The issue was solved by adding: --tune_mm_mlp_adapter True

However, I got a new error:

Command:

!python /content/LLaVA/llava/eval/run_llava.py  --model-path /content/LLaVA/llava/checkpoints/ESA-llava-qlora \
--model-base /content/LLaVA/llava-v1.5-7b \
--image-file "/content/drive/MyDrive/train_rgb/Bad/R1C1 N1_103.tif" \
--query "What is the condition of this road"

Error:

[2024-03-08 16:07:29,340] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-03-08 16:07:30.682805: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-08 16:07:30.682860: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-08 16:07:30.684501: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-08 16:07:31.916176: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading LLaVA from base model...
Loading checkpoint shards:   0% 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100% 2/2 [00:04<00:00,  2.30s/it]
Loading additional LLaVA weights...
Loading LoRA weights...
Merging LoRA weights...
/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/bnb.py:272: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
  warnings.warn(
Model is loaded...
Traceback (most recent call last):
  File "/content/LLaVA/llava/eval/run_llava.py", line 145, in <module>
    eval_model(args)
  File "/content/LLaVA/llava/eval/run_llava.py", line 115, in eval_model
    output_ids = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/LLaVA/llava/model/language_model/llava_llama.py", line 137, in generate
    return super().generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1525, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2622, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/content/LLaVA/llava/model/language_model/llava_llama.py", line 91, in forward
    return super().forward(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1201, in forward
    logits = self.lm_head(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 414, in forward
    assert self.weight.shape[1] == 1
AssertionError
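
The assertion that fails is a shape check inside the bitsandbytes 4-bit linear layer's `forward`, reached through `lm_head`. To see what that layer actually holds after loading and merging, a small inspection helper could be run before `generate`. This is only a sketch: `describe_linear` is a hypothetical name, and reading a `quant_state` attribute from the weight is an assumption about how bitsandbytes stores its 4-bit metadata.

```python
def describe_linear(name, module):
    """Summarise a linear layer's weight without assuming a specific backend."""
    weight = getattr(module, "weight", None)
    shape = tuple(getattr(weight, "shape", ()))
    return {
        "name": name,
        "module_type": type(module).__name__,
        "weight_dtype": str(getattr(weight, "dtype", None)),
        "weight_shape": shape,
        # The failing assertion expects the second dimension to be 1.
        "second_dim_is_one": len(shape) > 1 and shape[1] == 1,
        # Assumption: 4-bit weights carry their metadata in `quant_state`.
        "has_quant_state": getattr(weight, "quant_state", None) is not None,
    }


# Usage with a loaded model (not run here):
#   print(describe_linear("lm_head", model.lm_head))
```

Comparing this summary for `lm_head` against another quantized layer in the same model may show whether `lm_head` is the odd one out.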

wangzhao0217 commented 5 months ago

Solved by using zero3.json.

LordUky commented 3 months ago

May I ask, did you just change zero2.json to zero3.json in finetune_qlora.sh? I tried, but the 'assert self.weight.shape[1] == 1' error still occurs.

wangzhao0217 commented 3 months ago

> May I ask, did you just change zero2.json to zero3.json in finetune_qlora.sh? I tried, but the 'assert self.weight.shape[1] == 1' error still occurs.

Yes, that worked for me.
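
Since two DeepSpeed configs are in play here, it can help to confirm which ZeRO stage the file passed to `--deepspeed` actually selects. The sketch below reads the `zero_optimization.stage` field of a DeepSpeed JSON config. `zero_stage` is a hypothetical helper name, and the contents of the repository's zero2.json and zero3.json are not shown in this thread, so the example writes its own minimal file.

```python
import json
import tempfile
from pathlib import Path


def zero_stage(config_path):
    """Return the ZeRO stage declared in a DeepSpeed JSON config, or None if absent."""
    config = json.loads(Path(config_path).read_text())
    return config.get("zero_optimization", {}).get("stage")


# Self-contained demo with a minimal, made-up config file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_zero3.json"
    demo.write_text(json.dumps({"zero_optimization": {"stage": 3}}))
    demo_stage = zero_stage(demo)
    print(demo_stage)
```

Pointing `zero_stage` at the path given to `--deepspeed` in the training script is a quick way to rule out a stale or mistyped config path.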

HadeerArafa commented 1 month ago

Can you help me? When I used zero3, I got this error: ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

LordUky commented 1 month ago

> Can you help me? When I used zero3, I got this error: ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

I think it can probably be solved by commenting out the 'device_map' parameter in the code, and either passing low_cpu_mem_usage=False or commenting that out as well.
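
That suggestion amounts to making sure neither `device_map` nor `low_cpu_mem_usage=True` reaches the model-loading call when ZeRO-3 is active. A minimal sketch, assuming the loading keyword arguments are collected in a dict before being passed on (the thread does not show where LLaVA builds them); `drop_zero3_incompatible` is a hypothetical helper name.

```python
def drop_zero3_incompatible(load_kwargs):
    """Return a copy of load_kwargs without the options the ValueError names."""
    cleaned = dict(load_kwargs)
    cleaned.pop("device_map", None)
    # Either omit the flag entirely or set it to False; here it is omitted.
    cleaned.pop("low_cpu_mem_usage", None)
    return cleaned


# Made-up example arguments, for illustration only.
example = {"device_map": "auto", "low_cpu_mem_usage": True, "cache_dir": None}
print(drop_zero3_incompatible(example))
```

The helper returns a copy, so the original dict is left untouched for code paths that do not use ZeRO-3.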

HadeerArafa commented 1 month ago

Thank you, @LordUky! I still get the assertion error. Do you have a solution for this? Also, could you please tell me which PyTorch version you used?

LordUky commented 1 month ago

Hi, I followed the provided commands exactly to set up the environment. Unfortunately, as mentioned in my previous comments, I still got this assertion error after switching to zero3.json. I am no longer working on this project, but good luck!

haotian-liu / LLaVA

[Question] Only Tensors of floating point and complex dtype can require gradients #1217
