turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Low context bugs and errors. #382

Closed · ThomasBaruzier closed 6 months ago

ThomasBaruzier commented 6 months ago

Hello again.

I was trying to run Qwen 72B on my RTX 3090 at 2.4bpw (LoneStriker quant). To do so, I had to lower the context length to 200 tokens.

The issue is that exllama either crashes or gets stuck when the context window goes below roughly 315 tokens (the exact threshold varies heavily between models, and the behavior does not reproduce every time).

Here is an example of it getting stuck for more than 5 minutes:

screen -mS inference python examples/chat.py -m  ~/storage/gpu-models/LoneStriker_Qwen1.5-72B-Chat-2.4bpw-h6-exl2/ -mode chatml --length 200 --cache_8bit -amnesia 
 -- Model: /home/tyra/storage/gpu-models/LoneStriker_Qwen1.5-72B-Chat-2.4bpw-h6-exl2/                                                                           
 -- Options: ['length: 200']                                                    
 -- Loading model...                                                            
 -- Loading tokenizer...                                                        
 -- Prompt format: chatml                                                       
 -- System prompt:                                                              

You are Chatbort, a large language model. Answer as concisely as possible.      

User: hi                                                                        

^CTraceback (most recent call last):                                            
  File "/home/tyra/files/ai/exllama/examples/chat.py", line 268, in <module>    
    active_context = get_tokenized_context(model.config.max_seq_len - min_space_in_context)                                                                     
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                     
  File "/home/tyra/files/ai/exllama/examples/chat.py", line 179, in get_tokenized_context                                                                       
    for turn in range(len(user_prompts)):                                       
                ^^^^^^^^^^^^^^^^^^^^^^^^                                        
KeyboardInterrupt                 

And here is a crash example:

screen -mS inference python examples/chat.py -m  ~/storage/gpu-models/silicon-maid-7b-8.0bpw-h8-exl2/ -mode chatml --length 314 --cache_8bit -amnesia
 -- Model: /home/tyra/storage/gpu-models/silicon-maid-7b-8.0bpw-h8-exl2/        
 -- Options: ['length: 314']                                                    
 -- Loading model...                                                            
 -- Loading tokenizer...                                                        
 -- Prompt format: chatml                                                       
 -- System prompt:                                                              

You are Chatbort, a large language model. Answer as concisely as possible.      

User: hi                                                                        

Traceback (most recent call last):                                              
  File "/home/tyra/files/ai/exllama/examples/chat.py", line 289, in <module>    
    res = generator.stream_ex()                                                 
          ^^^^^^^^^^^^^^^^^^^^^                                                 
  File "/home/tyra/files/ai/envs/exllama/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 184, in stream_ex                                 
    chunk, eos, chunk_token_ids, probs, ptokens, pprobs, logits = self._stream()
                                                                  ^^^^^^^^^^^^^^
  File "/home/tyra/files/ai/envs/exllama/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 267, in _stream                                   
    next_token, next_ptokens, next_pprobs, next_prob, eos, next_logits = self._gen_single_token(self.settings)                                                  
                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                  
  File "/home/tyra/files/ai/envs/exllama/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 522, in _gen_single_token                         
    position_offsets = self.position_offsets).float().cpu()                     
                                              ^^^^^                             
AttributeError: 'NoneType' object has no attribute 'float' 
turboderp commented 6 months ago

The example chatbot has a fairly simplistic context management system. By default it reserves 250 tokens for the reply and doesn't have a robust mechanism for handling a total context length that isn't somewhat longer than that. You can work around it by running with e.g. --response_chunk 50, which reserves less of the overall context at a time, leaving more room for the prompt. There is a speed tradeoff, and the model will be somewhat unpredictable when used like this.
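
For example (an illustrative invocation, reusing the model path and options from the report above; the chunk size of 50 is just the value suggested here):

python examples/chat.py -m ~/storage/gpu-models/LoneStriker_Qwen1.5-72B-Chat-2.4bpw-h6-exl2/ -mode chatml --length 200 --cache_8bit -amnesia --response_chunk 50

With --length 200 this should leave roughly 150 tokens of prompt space per turn, at the cost of generating the reply in smaller chunks.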

Not sure about the situation when running Tabby with an extremely small context, but yeah it's probably related.

ThomasBaruzier commented 6 months ago

Alright, thanks!

I guess I'll have to quantize more 🫠 The goal here was to evaluate Qwen 72B, since llama.cpp simply crashes my whole computer (I guess a B450 board with a Ryzen 3950X is bad). Anyway, 200 tokens was just enough for LoneStriker's lowest quant of Qwen 72B for me to test it somewhat. It's more of an extreme case than anything useful.

Thanks for the reply, have a great day!

turboderp commented 6 months ago

If you want to squeeze a little extra VRAM out of it, probably enough for at least a couple hundred more tokens, you can also use the --cache_q4 and --low_mem options. Also, if you haven't already, make sure Flash Attention is installed.
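
For instance (an illustrative command based only on the options mentioned above; the model path is reused from the first report, the --length value is arbitrary, and --cache_q4 takes the place of --cache_8bit since they select different cache quantization):

python examples/chat.py -m ~/storage/gpu-models/LoneStriker_Qwen1.5-72B-Chat-2.4bpw-h6-exl2/ -mode chatml --length 400 --cache_q4 --low_mem -amnesia

Flash Attention itself is a separate package; it is usually installed from PyPI (pip install flash-attn), assuming a compatible CUDA toolchain is available.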