haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] 7B Inference CUDA Out of Memory for RTX 4090 24GB VRAM #1570

Open 0010SS opened 1 week ago

0010SS commented 1 week ago

Issue: I am running inference with LLaVA-1.6 7B using the demo code on an RTX 4090 with 24 GB of VRAM. I consistently get a CUDA out-of-memory error, even though 24 GB should be enough for the 7B model, since many people run it with the same amount of VRAM. I've restarted the command line, but the issue persists. I've also checked the CUDA setup, and it all looks fine. When I use the 4-bit version of LLaVA (loaded as sketched after the demo code below), the program runs without exhausting CUDA memory. What might be preventing me from running the 7B model with 24 GB of VRAM?
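For reference, this is roughly how I checked that nothing else was holding VRAM before loading the model (a minimal sketch I added around the demo code; the exact numbers from my machine are omitted):

import torch

# Report free/total device memory before the model is loaded,
# to confirm no other process is already occupying VRAM.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.1f} GiB / total: {total_bytes / 1024**3:.1f} GiB")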

Command: This is the exact code from the Demo.

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "liuhaotian/llava-v1.5-7b"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

model_path = "liuhaotian/llava-v1.5-7b"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"

args = type('Args', (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()

eval_model(args)
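For comparison, this is roughly how I load the 4-bit variant that does fit in 24 GB (a minimal sketch; I'm assuming load_pretrained_model still accepts the load_4bit flag in llava.model.builder, which is what the CLI's --load-4bit option uses):

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "liuhaotian/llava-v1.5-7b"

# Same loader as above, but with 4-bit quantization enabled
# (assumes the load_4bit keyword of load_pretrained_model).
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    load_4bit=True,
)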

Log:

You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                                                                                                  | 0/2 [00:00<?, ?it/s]/home/user/miniconda3/envs/videolm/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.16it/s]
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.74it/s]
Traceback (most recent call last):
  File "/home/user/projects/videolm/LLaVA/test.py", line 31, in <module>
    eval_model(args)
  File "/home/user/projects/videolm/LLaVA/llava/eval/run_llava.py", line 115, in eval_model
    output_ids = model.generate(
  File "/home/user/miniconda3/envs/videolm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/projects/videolm/LLaVA/llava/model/language_model/llava_llama.py", line 125, in generate
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/home/user/projects/videolm/LLaVA/llava/model/llava_arch.py", line 202, in prepare_inputs_labels_for_multimodal
    image_features = self.encode_images(images)
  File "/home/user/projects/videolm/LLaVA/llava/model/llava_arch.py", line 141, in encode_images
    image_features = self.get_model().get_vision_tower()(images)
  File "/home/user/miniconda3/envs/videolm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/videolm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/videolm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/user/miniconda3/envs/videolm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/projects/videolm/LLaVA/llava/model/multimodal_encoder/clip_encoder.py", line 54, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
RuntimeError: CUDA error: out of memory