Describe the bug

I can run the built-in llava-13b multimodal pipeline in the web UI. However, when hitting the API, I get a VRAM out-of-memory (OOM) error. Even after reducing the max GPU memory from 10 to 6, I still get the same OOM error.

Logs
127.0.0.1 - - [29/Jun/2023 22:55:30] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
File "/home/yangyang/Workspace/llm/text-generation-webui/modules/callbacks.py", line 55, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/home/yangyang/Workspace/llm/text-generation-webui/modules/text_generation.py", line 289, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 1572, in generate
return self.sample(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/generation/utils.py", line 2619, in sample
outputs = self(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
layer_outputs = decoder_layer(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 109, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
File "/home/yangyang/.pyenv/versions/txt-webui/lib/python3.10/site-packages/torch/nn/functional.py", line 1845, in softmax
ret = input.softmax(dim, dtype=dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 522.00 MiB (GPU 0; 11.72 GiB total capacity; 8.67 GiB already allocated; 71.19 MiB free; 10.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
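The error message itself suggests setting max_split_size_mb when reserved memory is much larger than allocated memory. A sketch of what I plan to try next (untested; the 128 MiB value is only an example), combining the PYTORCH_CUDA_ALLOC_CONF variable named in the error with my starting command:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --api --extensions multimodal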
Is there an existing issue for this?
Reproduction
Starting command:
python server.py --model wojtab_llava-13b-v0-4bit-128g --multimodal-pipeline llava-13b --api --extensions multimodal
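A minimal sketch of a request against the endpoint shown in the log above (assumptions: the blocking API is on its default port 5000, and the parameter values are only illustrative; the actual multimodal request also embeds a base64-encoded image in the prompt, as the multimodal extension expects):

curl http://127.0.0.1:5000/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is shown in this image?", "max_new_tokens": 200}'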
Screenshot
No response
System Info