Describe the bug

I can run the built-in llava-13b multimodal pipeline in the web UI. However, when hitting the API, I get a VRAM out-of-memory (OOM) error. Even after reducing the max GPU memory from 10 to 6, I still get the same OOM error.

Logs
127.0.0.1 - - [29/Jun/2023 22:55:30] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
File "/home/yangyang/Workspace/llm/text-generation-webui/modules/callbacks.py", line 55, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/home/yangyang/Workspace/llm/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 2619, in sample
outputs = self(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 109, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/functional.py", line 1845, in softmax
ret = input.softmax(dim, dtype=dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 522.00 MiB (GPU 0; 11.72 GiB total capacity; 8.67 GiB already allocated; 71.19 MiB free; 10.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
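The error message itself suggests setting max_split_size_mb when reserved memory is much larger than allocated memory. A sketch of what I plan to try next (untested; the 128 MiB value is only an example), combining the PYTORCH_CUDA_ALLOC_CONF variable named in the error with my starting command:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --api --extensions multimodal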
Is there an existing issue for this?
Reproduction
Starting command:
python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --api --extensions multimodal
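A minimal sketch of a request against the endpoint shown in the log above (assumptions: the blocking API is on its default port 5000, and the parameter values are only illustrative; the actual multimodal request also embeds a base64-encoded image in the prompt, as the multimodal extension expects):

curl http://127.0.0.1:5000/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is shown in this image?", "max_new_tokens": 200}'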
Screenshot
No response
System Info