turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.28k stars 243 forks

How to clear / reset the cache so that the model doesn't remember the earlier response? #212

Closed Rajmehta123 closed 7 months ago

Rajmehta123 commented 7 months ago

ExLlamaV2 appears to cache the responses it generates. If the second prompt is similar to one it answered before, it seems to use the earlier response as a reference and says "As mentioned before".

How to disable the cache?

I asked the first question, "What architecture you follow?"

prompt = f"""### User: What architecture you follow?

### Assistant:
"""

Model responded with: "I am a language model and I don't have an architecture. The underlying technology that powers me is called Transformer, which was introduced by Vaswani et al. in their 2017 paper titled "Attention Is All You Need." It uses self-attention mechanisms to process input data efficiently. The architecture consists of multiple layers of transformer blocks connected end-to-end, with each layer having its own set of weights. These layers are trained together using backpropagation through time (BPTT) algorithm during training."
Next, I asked a similar question, without any previous conversation in the prompt: "What is the architecture in your model?"

prompt = f"""### User: What is the architecture in your model?

### Assistant:
"""

It responded with: "As mentioned earlier, I am trained using the Transformer architecture developed by Vaswani et al. This architecture is designed specifically for sequence-to-sequence tasks such as machine translation and text generation. It relies heavily on self-attention mechanisms to capture long-range dependencies within sequences."

How does it know to say "As mentioned earlier" when the earlier exchange is not even in the prompt?

turboderp commented 7 months ago

It depends on how you're generating. You would reset the cache just by setting the sequence length to zero with cache.current_seq_len = 0.

If you're calling generate_simple, though, it does this automatically: it resets the cache with each request. So it's possible the model is just improvising.
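To illustrate why setting the sequence length to zero is enough, here is a minimal sketch using a toy cache. ToyCache, its slots list, and its append and visible methods are hypothetical stand-ins written for this example; they are not the exllamav2 classes. The only name taken from the reply above is current_seq_len. The assumption is that the cache is a preallocated buffer and only the first current_seq_len positions count as valid context.

```python
class ToyCache:
    """Hypothetical stand-in for a preallocated KV cache (not the exllamav2 class)."""

    def __init__(self, max_seq_len: int):
        self.slots = [None] * max_seq_len  # preallocated storage, never freed
        self.current_seq_len = 0           # how many slots count as valid context

    def append(self, tokens):
        # Write new tokens starting at the current position.
        for tok in tokens:
            if self.current_seq_len >= len(self.slots):
                raise RuntimeError("cache is full")
            self.slots[self.current_seq_len] = tok
            self.current_seq_len += 1

    def visible(self):
        # Only the first current_seq_len slots are treated as context.
        return self.slots[:self.current_seq_len]


cache = ToyCache(max_seq_len=16)

# First request fills the start of the buffer.
cache.append(["What", "architecture", "you", "follow?"])
first_context = cache.visible()

# Reset: stale values remain in memory but are no longer visible.
cache.current_seq_len = 0
after_reset = cache.visible()

# Second request overwrites the buffer from position zero.
cache.append(["What", "is", "the", "architecture?"])
second_context = cache.visible()
```

In this sketch, nothing from the first request is visible after the reset, so a phrase like "As mentioned earlier" in the second answer would have to come from the model itself rather than from leftover cache contents.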

Rajmehta123 commented 7 months ago

@turboderp Thanks. I'll close this, but will reopen it if it doesn't work.