Closed Rajmehta123 closed 7 months ago
It depends on how you're generating. You can reset the cache just by setting the sequence length to zero with `cache.current_seq_len = 0`.

If you're calling `generate_simple`, it does this automatically though, resetting the cache with each request. It's possible the model is just improvising.
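For anyone landing here later, the reset described above is just zeroing the cache's sequence-length counter. A minimal sketch of the pattern, using a stand-in class instead of a real `ExLlamaV2Cache` (which needs a loaded model); the attribute name `current_seq_len` is the only thing taken from the real API, everything else here is illustrative:

```python
# Stand-in for ExLlamaV2Cache: the real class holds key/value tensors,
# but the reset mechanism is the same -- zero the sequence-length counter.
class StubCache:
    def __init__(self, max_seq_len=2048):
        self.max_seq_len = max_seq_len
        self.current_seq_len = 0  # number of cached positions currently filled

def fake_generate(cache, prompt_tokens):
    """Pretend generation step: advances the cache like a forward pass would."""
    cache.current_seq_len += len(prompt_tokens)

cache = StubCache()
fake_generate(cache, [1, 2, 3, 4])
print(cache.current_seq_len)  # 4: the cache now holds stale context

# The fix from this thread: reset before an unrelated prompt, so the model
# cannot condition on the previous exchange.
cache.current_seq_len = 0
print(cache.current_seq_len)  # 0: cache is logically empty again
```

`generate_simple` performs this reset internally on each call, which is why the stale-context behavior only shows up when you drive the cache manually.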
@turboderp Thanks. Will close this but reopen if it doesn't work.
It looks like ExLlamaV2 caches the responses it generates. If a second prompt is similar to a response it generated before, it uses that as a reference and says "As mentioned before".

How do I disable the cache?

How does it know to say "As mentioned earlier" when I'm not even including that in the prompt?