oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

CUDA out of memory when hitting multimodal API #2929

Closed: yangyang-nus-lv closed this issue 1 year ago

yangyang-nus-lv commented 1 year ago

Describe the bug

The built-in llava-13b multimodal model runs fine in the web UI. However, when hitting the API, I get a VRAM OOM error. Even after reducing the max GPU memory from 10 to 6, I still hit the same OOM.
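
Presumably the limit was reduced via the --gpu-memory flag (an assumption; the equivalent slider in the UI's Models tab would have the same effect). From the command line that would look like:

python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --api --extensions multimodal --gpu-memory 6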

Is there an existing issue for this?

I have searched the existing issues

Reproduction

Starting command: python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --api --extensions multimodal
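
A request along the following lines triggers the error (a minimal sketch, assuming the blocking API's default port 5000 and the multimodal extension's base64 <img> prompt syntax; the image path and prompt template are placeholders):

import base64
import requests

# Read a local image and embed it the way the multimodal extension expects;
# the extension replaces the <img> tag with the image's embeddings.
with open("example.jpg", "rb") as f:  # placeholder path
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "### Human: What is in this picture?\n"
              f'<img src="data:image/jpeg;base64,{img_b64}">\n'
              "### Assistant:",
    "max_new_tokens": 200,
}

response = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload)
print(response.json()["results"][0]["text"])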

Screenshot

No response

Logs

127.0.0.1 - - [29/Jun/2023 22:55:30] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
  File "/home/yangyang/Workspace/llm/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/home/yangyang/Workspace/llm/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 423, in generate
    return self.model.generate(**kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 109, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
  File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/functional.py", line 1845, in softmax
    ret = input.softmax(dim, dtype=dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 522.00 MiB (GPU 0; 11.72 GiB total capacity; 8.67 GiB already allocated; 71.19 MiB free; 10.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
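
The error message itself suggests experimenting with max_split_size_mb. One way to apply it (a sketch, not a confirmed fix: PYTORCH_CUDA_ALLOC_CONF must be set before torch initializes CUDA, and 512 is only an illustrative value) is a small launcher script:

import os
import subprocess

# The child process inherits this environment variable, so the allocator
# hint is in place before server.py imports torch and initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

subprocess.run([
    "python", "server.py",
    "--model", "wojtab_llava-13b-v0-4bit-128g",
    "--multimodal-pipeline", "llava-13b",
    "--api", "--extensions", "multimodal",
])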

System Info

Ubuntu 22.04
GPU: NVIDIA GeForce RTX 4070 Ti (12 GB)

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.