turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

How do you clear the cache in exllamav2? #163

Closed · Rajmehta123 closed this 4 months ago

Rajmehta123 commented 9 months ago

I am loading two 7B models. The cache problem only occurs when I use autosplit to load them. The first model prediction works fine, but when I ask subsequent questions, the model keeps returning the previous response to every new request.

Issue: Once the model has cached a response, subsequent requests don't clear the cache, so the model returns that previous response for all of the following questions.

Question: How do you clear the cache after every request to the model?

Code:

Model 1:
  from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
  from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2StreamingGenerator

  model_directory = "./TheBloke_Genz-70b-GPTQ"

  config = ExLlamaV2Config()
  config.model_dir = model_directory
  config.prepare()
  model = ExLlamaV2(config)
  model.load([1, 5, 20, 17])    # manual GPU split: per-device VRAM allocation in GB
  tokenizer = ExLlamaV2Tokenizer(config)
  cache = ExLlamaV2Cache(model)
  generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
  inferencer = ExLlamaV2BaseGenerator(model, cache, tokenizer)

Model 2:
  model_directory = "./TheBloke_dolphin-2.2.1-mistral-7B-GPTQ"

  config2 = ExLlamaV2Config()
  config2.model_dir = model_directory
  config2.prepare()
  model2 = ExLlamaV2(config2)
  cache2 = ExLlamaV2Cache(model2, lazy = True)   # lazy cache: allocated while the model loads
  model2.load_autosplit(cache2)                  # split layers across the available GPUs automatically
  tokenizer2 = ExLlamaV2Tokenizer(config2)
  generator2 = ExLlamaV2BaseGenerator(model2, cache2, tokenizer2)
  streamer2 = ExLlamaV2StreamingGenerator(model2, cache2, tokenizer2)
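
For reference, here is a minimal sketch of what "clearing the cache" amounts to in exllamav2: rewinding each cache's fill position so the next request starts from an empty context. This assumes the cache object tracks its position in a current_seq_len attribute, which should be verified against the installed version.

  # Hedged sketch: reset each cache before a new, independent request so it
  # does not continue from the previous sequence.
  cache.current_seq_len = 0
  cache2.current_seq_len = 0
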
Rajmehta123 commented 9 months ago

@turboderp Any suggestions?

turboderp commented 9 months ago

Could you share the code that calls the generator functions? The problem is likely with those rather than with how the models are loaded.
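
For illustration, a hedged sketch of the kind of per-request generation call being asked about, modeled on the exllamav2 example scripts. The sampler settings, prompt, and token count are placeholder values, and generate_simple is assumed to start each call from an empty cache rather than appending to the previous one.

  from exllamav2.generator import ExLlamaV2Sampler

  # Hypothetical per-request call: fresh sampler settings and a standalone
  # prompt, so nothing from the previous answer carries into the next one.
  settings = ExLlamaV2Sampler.Settings()
  settings.temperature = 0.8
  settings.top_p = 0.9

  prompt = "Placeholder question for the second model"
  output = generator2.generate_simple(prompt, settings, 200)
  print(output)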