turboderp / exui

Web UI for ExLlamaV2
MIT License

VRAM Usage #39

Open ec111 opened 5 months ago

ec111 commented 5 months ago

VRAM usage is oddly much higher compared to ooba.

I have only tried Yi-34B-200K models. I have a 4090, and Yi-34B-200K at 55k context uses only 23 GB on ooba.

In exui, the model bugged out at 30k context: VRAM usage would spike and then the output would be gibberish. I suspect the issue is due to chunk size. I am not sure what chunk size is used for; possibly summarization?

turboderp commented 5 months ago

That sounds very strange. Which model are you using? (bitrate, etc.)

The chunk size is just how much VRAM is reserved at the end of the context when generating, and the step size for rolling the cache when the max context length is exceeded.
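
As a rough illustration only (this is not the actual exui/exllamav2 code, and the function name and logic here are assumptions), a "chunk size" used in this dual role might look something like this: space reserved at the tail of the context for the next generation, and the step size for rolling the cache once the maximum context length would be exceeded.

```python
# Illustrative sketch only, not exui/exllamav2 internals: "chunk_size" both
# reserves room at the end of the context and sets the step used to roll
# (truncate) the cached context once the max context length would be exceeded.
def tokens_to_roll(current_len: int, max_seq_len: int, chunk_size: int) -> int:
    """How many tokens to drop from the start of the cached context so that
    at least `chunk_size` positions stay free for the next generation."""
    overflow = current_len + chunk_size - max_seq_len
    if overflow <= 0:
        return 0                                          # still enough head room
    steps = (overflow + chunk_size - 1) // chunk_size     # whole chunk-sized steps
    return steps * chunk_size
```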

ec111 commented 5 months ago

This is what I'm using: https://huggingface.co/brucethemoose/Yi-34B-200K-RPMerge-exl2-40bpw

4bpw. Using Q4 cache with no speculative decoding. Upon loading the model in exui, I find Python consuming 23 GB of VRAM, a bit more than ooba.
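
(For reference, a minimal exllamav2 loading sketch that roughly matches this setup; the model path and context length are placeholders, and this is not the loading code exui actually runs.)

```python
# Rough sketch of loading a 4.0bpw exl2 model with a quantized Q4 KV cache
# directly via exllamav2; placeholder path and context length.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "models/Yi-34B-200K-RPMerge-exl2-40bpw"   # placeholder path
config.prepare()
config.max_seq_len = 55 * 1024                               # target context length

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy = True)                # Q4 quantized KV cache
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
```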

In two separate chat sessions, before crapping out, the prompt length of the last successful message was exactly 27338 tokens. Given that it is the exact same number, I don't believe it is a coincidence.

When I try to continue the conversation in these bugged sessions, it outputs either nothing or gibberish. Yet at the same time, VRAM usage rises to 23.6 GB and finally 24 GB after each additional attempt, indicating some sort of memory leak.

It is my understanding that VRAM usage should not rise significantly (or at all) after the model has loaded. That is certainly what I see with ooba, which can manage at least 55k tokens.

Is it related to chunk tokens? I don't see an equivalent option in ooba. Does exui have a built-in summarization step, or some special caching mechanism?

turboderp commented 5 months ago

I'm having a hard time reproducing this. With Q4 cache and a context length of 55k, that model sits at a consistent VRAM usage just under 22 GB for me. I've tested inference up to 100k tokens without issue.

Can you share a little more info about the system? Windows/Linux? CUDA version? Are you using flash-attn? Etc.

ec111 commented 5 months ago

UPDATE: I just installed flash-attention and the memory usage issue seems to be resolved; I don't see any spikes/leaks. If anyone needs it, they should use the pre-built wheels from ooba (https://github.com/oobabooga/flash-attention/releases/). However, I am still able to replicate the problem after pushing the context a bit further, and my old sessions remain broken. It seems to happen when the context limit is reached and the prompt-halving occurs while VRAM is nearly full; I then get "AssertionError: Total sequence length exceeds cache size in model.forward" (see Error 2 at the bottom for the full stack trace). What is the expected behavior if, after the model has been completely loaded, it needs more VRAM for whatever reason and none is available? What else persists in a session other than what's in the JSON?
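
(As an aside, a quick hedged sanity check that an installed flash-attn wheel actually matches the local torch/CUDA build; if the import fails, exllamav2 runs without flash attention, which seems consistent with the memory behavior described above.)

```python
# Sanity check: confirm the flash-attn wheel is importable against the local
# torch/CUDA build; if not, exllamav2 runs on its non-flash attention path.
import torch
print("torch", torch.__version__, "built for CUDA", torch.version.cuda)

try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "is importable")
except Exception as e:
    print("flash-attn not usable:", e)
```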

Windows 10. CUDA 12.1.

Haven't installed flash-attn. From the GitHub page, my understanding is that it is possible to install on Windows but difficult. This might be the issue.

The issue is definitely related to VRAM nearing its peak, so if you have a lot more VRAM (which I assume you must), you probably won't be able to reproduce it. I was able to reproduce the issue in a new session. I managed to get past 30k and reached about 33k when I finally edited part of the conversation. I am using it mainly for story-writing assistance, so the block was fairly large. After editing, I hit generate and noticed VRAM usage for python.exe spike to around 24.1 GB. (I am not sure if editing the block had anything to do with it; I may have edited after noticing an issue.)

Thus, I can only assume that the problem has to do with creating the cache. I confirmed the problem persists in the session even after restarting exui. The model loads, and Task Manager reports 23.1 GB after loading at 45k context. I hit generate a few times in the session and it reports 24 GB, and nothing is output.

Afterwards, I dumped a whole block of text of roughly 33k tokens into a new session. No response. Clicking generate again caused a VRAM spike. So it seems to be a matter of simply maxing out VRAM and then forcing the cache to be rebuilt.

I would note that ooba has a much more limited ability to modify the chat history, but I do not encounter any issues going into 50k+ context, apart from what I suppose is slower prompt processing. From what I understand, ooba automatically installs flash-attention for you... so that may be the reason.

Lastly, a couple of points that might help:

  1. If I make an extremely large prompt, such as an entire chapter of a novel, even at the start of the session, I will not get a response. However, if I issue a smaller prompt, it responds as if there were no problem. My theory is that it runs out of VRAM while generating the cache and creates an incomplete one (see the sketch after this list).
  2. I have noticed that even after deleting or modifying a part of the history, the next generation still seems to act as if I had not, or at least it retains some of the information that should not be in the context.
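
To test the theory in point 1, something like the hypothetical sketch below could log peak VRAM while the cache is prefilled in pieces. It is not part of exui; it assumes model/cache objects set up roughly as in the exllamav2 sketch earlier, and reuses the preprocess_only=True call that appears in the tracebacks below.

```python
# Hypothetical prefill-with-logging helper (not exui code): feed the prompt
# ids into the cache in pieces and report peak VRAM after each step, to see
# whether the spike happens while the cache is being built.
import torch

def prefill_with_logging(model, cache, ids, step = 2048):
    cache.current_seq_len = 0                      # start from an empty cache
    torch.cuda.reset_peak_memory_stats()
    n = ids.shape[-1] - 1                          # keep the last token for generation
    for i in range(0, n, step):
        j = min(i + step, n)
        model.forward(ids[:, i:j], cache, preprocess_only = True)
        peak = torch.cuda.max_memory_allocated() / 2**30
        print(f"prefilled {j} tokens, peak VRAM {peak:.2f} GiB")
```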

On occasions when I have tried continuing corrupted sessions, I would sometimes get errors like these:

Error 1:

ERROR:waitress:Exception while serving /api/generate
Traceback (most recent call last):
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\waitress\channel.py", line 428, in service
    task.service()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\waitress\task.py", line 168, in service
    self.execute()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\waitress\task.py", line 458, in execute
    for chunk in app_iter:
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\werkzeug\wsgi.py", line 256, in __next__
    return self._next()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\werkzeug\wrappers\response.py", line 32, in _iter_encoded
    for item in iterable:
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\flask\helpers.py", line 113, in generator
    yield from gen
  File "C:\Users\ec\Desktop\exui\backend\sessions.py", line 441, in generate
    generator.begin_stream(context_ids, gen_settings, token_healing = False)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\generator\streaming.py", line 157, in begin_stream
    self._gen_begin_reuse(input_ids, gen_settings)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\generator\streaming.py", line 406, in _gen_begin_reuse
    self._gen_begin(in_tokens, gen_settings)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\generator\streaming.py", line 392, in _gen_begin
    self.model.forward(self.sequence_ids[:, :-1], self.cache, preprocess_only = True, loras = self.active_loras, input_mask = self.input_mask, position_offsets = self.position_offsets)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\model.py", line 585, in forward
    r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\model.py", line 649, in _forward
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\attn.py", line 370, in forward
    k_states = batch_keys.narrow(0, 0, batch_size).narrow(1, past_len, q_len)
RuntimeError: start (44918) + length (93) exceeds dimension size (45000).

Error 2 (not sure if this is because I changed the model size; I have been doing a lot of different things to see if I can revive a corrupted session by reducing the context to save VRAM, with no success):

ERROR:waitress:Exception while serving /api/generate
Traceback (most recent call last):
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\waitress\channel.py", line 428, in service
    task.service()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\waitress\task.py", line 168, in service
    self.execute()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\waitress\task.py", line 458, in execute
    for chunk in app_iter:
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\werkzeug\wsgi.py", line 256, in __next__
    return self._next()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\werkzeug\wrappers\response.py", line 32, in _iter_encoded
    for item in iterable:
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\flask\helpers.py", line 113, in generator
    yield from gen
  File "C:\Users\ec\Desktop\exui\backend\sessions.py", line 492, in generate
    chunk, eos, tokens = generator.stream()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\generator\streaming.py", line 193, in stream
    chunk, eos, chunk_tokenids, probs, _, _, logits = self._stream()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\generator\streaming.py", line 249, in _stream
    next_token, next_ptokens, next_pprobs, next_prob, eos, next_logits = self._gen_single_token(self.settings)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\generator\streaming.py", line 452, in _gen_single_token
    logits = self.model.forward(self.sequence_ids[:, -1:], self.cache, loras = self.active_loras, input_mask = self.input_mask, position_offsets = self.position_offsets).float().cpu()
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ec\AppData\Local\Programs\Python\Python311\Lib\site-packages\exllamav2\model.py", line 553, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
AssertionError: Total sequence length exceeds cache size in model.forward
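
(For what it's worth, the assertion in this trace is just the bound past_len + q_len <= cache.max_seq_len. A hypothetical client-side guard like the sketch below would trim the oldest tokens so the ids handed to the generator always fit the cache, though it obviously wouldn't fix whatever is corrupting the session in the first place.)

```python
# Hypothetical guard based on the assertion above: make sure the token ids fit
# inside cache.max_seq_len, leaving some head room for newly generated tokens.
def fit_to_cache(ids, cache, reserve = 512):
    max_input = cache.max_seq_len - reserve
    if ids.shape[-1] > max_input:
        ids = ids[:, -max_input:]        # drop the oldest tokens
    return ids
```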

turboderp commented 5 months ago

The AssertionError you're getting is not due to VRAM limitations but some sort of bug in the context management.

One issue is that there's no feedback for extremely long prompt processing, and up until recently the client would time out waiting for the server to finish starting a generation. Then the server would eventually finish and silently add the response to the session, but nothing would show in the client until you switch to a different session and back again.

The timeout is much longer now, but there's still no visual feedback at the moment, so you can probably still end up with a confused context with very long prompts. And if a single block of text is longer than the model's whole max context length, there's no mechanism at the moment for cutting that up into smaller chunks.
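
(A rough sketch of what such a splitting mechanism could look like on the client side, cutting one oversized block into token-bounded pieces with an exllamav2 tokenizer instance; this is an illustration only, not something exui currently does.)

```python
# Illustration only: split a single oversized block of text into token-id
# chunks that each fit within a given token budget.
def split_block(text, tokenizer, max_tokens):
    ids = tokenizer.encode(text)                  # (1, n) tensor of token ids
    return [ids[:, i:i + max_tokens]
            for i in range(0, ids.shape[-1], max_tokens)]
```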