oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

4bit LLaMA-30B: Out of memory #297

Closed · alexl83 closed this issue 11 months ago

alexl83 commented 1 year ago

Dear all, I'm running LLaMA-30B in 4-bit on my 4090 (24 GB) with a Ryzen 7700X and 64 GB of RAM.

After generating some tokens when asked to produce code, I get out-of-memory errors; using `--gpu-memory` has no effect. Server command line: `python server.py --auto-devices --gpu-memory 20 --load-in-4bit --cai-chat --listen --extensions gallery llama_prompts --model llama-30b-4bit`
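
For reference, a per-GPU cap like `--gpu-memory 20` is normally expressed through accelerate's `max_memory` map at load time. Below is a minimal sketch of that mechanism only; the model path and limits are placeholders, and the webui loads 4-bit GPTQ checkpoints through its own code path, so this is not its actual loader:

```python
# Sketch: capping per-device memory with transformers + accelerate.
# The path and limits are placeholders, not the webui's actual loading code.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-30b",                        # placeholder path
    device_map="auto",                         # let accelerate place layers
    max_memory={0: "20GiB", "cpu": "64GiB"},   # cap GPU 0, spill the rest to system RAM
)
```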

Using a character and preset found on Reddit:

{
    "char_name": "LLaMA-Precise",
    "char_persona": "LLaMA-Precise is a helpful AI chatbot that always provides useful and detailed answers to User's requests and questions. LLaMA-Precise tries to be as informative and friendly as possible.",
    "char_greeting": "Hello! I am LLaMA-Precise, your informative assistant. How may I help you today?",
    "world_scenario": "",
    "example_dialogue": "{{user}}: Hi. Can you help me with something?\n{{char}}: Hello, this is LLaMA-Precise. How can I help?\n{{user}}: Have you heard of the latest nuclear fusion experiment from South Korea? I heard their experiment got hotter than the sun.\n{{char}}: Yes, I have heard about the experiment. Scientists in South Korea have managed to sustain a nuclear fusion reaction running at temperatures in excess of 100 million°C for 30 seconds for the first time and have finally been able to achieve a net energy gain when carrying out a nuclear fusion experiment. That's nearly seven times hotter than the core of the Sun, which has a temperature of 15 million degrees kelvins! That's exciting!\n{{user}}: Wow! That's super interesting to know. Change of topic, I plan to change to the iPhone 14 this year.\n{{char}}: I see. What makes you want to change to iPhone 14?\n{{user}}: My phone right now is too old, so I want to upgrade.\n{{char}}: That's always a good reason to upgrade. You should be able to save money by trading in your old phone for credit. I hope you enjoy your new phone when you upgrade."
}
temperature=0.7
repetition_penalty=1.1764705882352942
top_k=40
top_p=0.1
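
For reference, these sampling settings correspond roughly to the following `transformers` generation arguments (a minimal sketch; `model` and `tokenizer` are assumed to be already loaded, and the prompt and `max_new_tokens` value are placeholders):

```python
# Sketch: the preset above expressed as transformers sampling arguments.
# `model` and `tokenizer` are assumed to be loaded; prompt and token limit are placeholders.
inputs = tokenizer("Hi. Can you help me with something?", return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    repetition_penalty=1.1764705882352942,
    top_k=40,
    top_p=0.1,
    max_new_tokens=200,  # placeholder generation limit
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```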

I get this error:

Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
Loading llama-30b-4bit...
Loading model ...
Done.
Loaded the model in 6.55 seconds.
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 20.14 seconds (6.80 tokens/s, 137 tokens)
Output generated in 15.50 seconds (4.39 tokens/s, 68 tokens)
Output generated in 15.71 seconds (4.33 tokens/s, 68 tokens)
Exception in thread Thread-6 (gentask):
Traceback (most recent call last):
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 191, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 318, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 233, in forward
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.64 GiB total capacity; 21.75 GiB already allocated; 25.50 MiB free; 22.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
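
The allocator hint at the end of the message can be tried by exporting `PYTORCH_CUDA_ALLOC_CONF` before launching `server.py` (e.g. `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 python server.py ...`), or equivalently from Python before the first CUDA allocation. A minimal sketch; the 512 MiB split size is an arbitrary starting value, not something recommended in this thread:

```python
# Sketch: ask PyTorch's caching allocator to limit split sizes to reduce fragmentation.
# Must be set before the first CUDA allocation; 512 is an arbitrary starting value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch
x = torch.zeros(1, device="cuda")  # the allocator initializes here and picks up the setting
```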

Any help is appreciated. Thank you!

dnhkng commented 1 year ago

Are you on Windows or Linux?

alexl83 commented 1 year ago

Linux, Ubuntu 22.10 with a local miniconda environment

remghoost commented 1 year ago

This issue seems to be related to what you're experiencing. https://github.com/oobabooga/text-generation-webui/issues/256

It seems that --gpu-memory is bugged. I've also been having issues with --auto-devices. There might also be a memory leak somewhere.

I should feasibly be able to run the 13B model on my 1060 6GB with --auto-devices enabled, but I haven't had any luck with it.

alexl83 commented 1 year ago

I'm now working around it by lowering "Maximum prompt size in tokens": first to 1024, and I'm using 512 right now.
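
Lowering the prompt size helps because the KV cache (the `torch.cat` call in the traceback) grows linearly with context length, on top of the roughly 17.7 GB of weights. A back-of-the-envelope estimate, assuming LLaMA-30B's published dimensions (60 layers, hidden size 6656) and fp16 cache entries:

```python
# Rough KV-cache estimate for LLaMA-30B at batch size 1.
# Dimensions (60 layers, hidden size 6656) and fp16 storage are assumptions.
layers, hidden, bytes_fp16 = 60, 6656, 2

per_token = 2 * layers * hidden * bytes_fp16  # keys + values across all layers
print(f"~{per_token / 2**20:.2f} MiB per cached token")  # about 1.5 MiB

for ctx in (512, 1024, 2048):
    print(f"context {ctx}: ~{ctx * per_token / 2**30:.2f} GiB of KV cache")
# about 0.8 GiB at 512, 1.5 GiB at 1024, 3.0 GiB at 2048, on top of the weights
```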

ImpossibleExchange commented 1 year ago

@alexl83

Have you watched the behavior live while it is processing, either through the NVIDIA X Server settings tool or nvidia-smi (you would have to run it repeatedly, but it still works)? That would give you more information about what is happening as it happens and could lead to more specific answers. Just a suggestion.

alexl83 commented 1 year ago

@ImpossibleExchange - the model is consistently taking around 17.7 GB of VRAM, regardless of any command-line option. On top of that, on first launch it's loaded into RAM (around 33 GB) and then moved to VRAM; killing the model and reloading it seems to skip the RAM step.

ImpossibleExchange commented 1 year ago

@alexl83

Okay, I just spun up the 4-bit model and ran some text through it. For reference, I am running on 2x Ampere cards for 48 GB of total VRAM.

What I found by sitting and repeatedly running nvidia-smi is that total VRAM usage reached around 22-23 GB while the text was being generated. It would drop back down after it finished, but it was hitting numbers around where you were getting your "out of memory" error.

So I would assume that is perhaps "normal" behavior/usage for the time being. I also had the token length set to 200 tokens, not higher; with a higher generation limit you could be going higher still.

I don't really know how much "help" this is, but I can confirm the VRAM usage seems to be normal. I wasn't getting the out-of-memory message because I have two cards.

Trying a smaller model is likely the best suggestion I can give, sadly. =\ I didn't see anything different on my box (Xubuntu OS).

Peace and all the best.

alexl83 commented 1 year ago

Thanks @ImpossibleExchange, I appreciate your support investigating this :) Let's see, things are moving fast!

ImpossibleExchange commented 1 year ago

@alexl83

I also just ran the 30B on a single card and, yeah, got an out-of-memory error. So I guess that is that.

Ph0rk0z commented 1 year ago

For some reason I am not getting OOM at all in 4-bit. Yes, it takes 17 GB, but in my experience that is enough for 2k context.

ImpossibleExchange commented 1 year ago

Sorry about the slow response; I was battling to get Clear Linux working so I could try it out.

Are you able to generate, or are you crashing?

I was getting similar usage at the beginning, but VRAM usage spiked during generation of outputs. This was my experience on both Manjaro and Xubuntu.

MillionthOdin16 commented 1 year ago

@alexl83 Did you try the --auto-devices start argument? If the other flags aren't helping you, this one at least got rid of the OOM errors for me.

int19h commented 1 year ago

I'm seeing the same issue with llama-30b-4bit-128g, and it seems to be worse compared to older 4-bit .pt models, so perhaps there was some recent change (from the batch that added .safetensors support) that causes increased VRAM use?

int19h commented 1 year ago

Okay, so this is down to groupsize. If you don't want to constantly run out of VRAM with llama-30b on a 24 GB card, make sure that you use a model quantized with groupsize=1 rather than groupsize=128, e.g. one of these: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
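
For a rough sense of why group size matters on a 24 GB card, here is a back-of-the-envelope estimate of the 4-bit weight footprint plus per-group metadata (the ~32.5B parameter count and the fp16-scale-plus-4-bit-zero metadata layout are assumptions; actual usage depends on the GPTQ kernels and on which layers stay in fp16):

```python
# Back-of-the-envelope: 4-bit weight memory for LLaMA-30B, with and without
# per-group metadata. Parameter count and metadata layout are assumptions.
params = 32.5e9
weights_gib = params * 0.5 / 2**30  # 4 bits per weight
print(f"packed 4-bit weights: ~{weights_gib:.1f} GiB")  # about 15 GiB

groupsize = 128
metadata_gib = params / groupsize * (2 + 0.5) / 2**30  # fp16 scale + 4-bit zero per group
print(f"groupsize=128 metadata: ~{metadata_gib:.2f} GiB extra")  # about 0.6 GiB
```

The extra metadata is small on its own, but the OOM message above shows only about 25 MiB free at the point of failure, so a fraction of a gigabyte can be the difference between finishing a generation and running out.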

github-actions[bot] commented 11 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.