Closed · lopuhin closed this issue 7 months ago
lopuhin: Thanks for a great library, and for providing the `multiple_caches.py` example; it's really helpful for building an in-flight batching HTTP server on top. I tried replacing `ExLlamaV2Cache` with `ExLlamaV2Cache_8bit` in `multiple_caches.py` to save memory, without making any other changes, and this resulted in an error. I'm using exllamav2 installed from the pre-built wheel at https://github.com/turboderp/exllamav2/releases/tag/v0.0.10, with Python 3.10 and CUDA 12.1 on Linux. I tried a few naive fixes, but they either crashed or produced incorrect generations. Do you think more changes are required in `multiple_caches.py` in order to use the 8-bit cache?

turboderp: Nope, sorry for the delay. But this was a bug in the attention function. Fixed with the latest commit.

lopuhin: Great, thanks for the fix, it works 👍 I confirm it allows fitting more cache into memory, at a slight runtime performance penalty.
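For reference, the intended change is just a drop-in constructor swap. Below is a minimal sketch of the loading path, not the full `multiple_caches.py`; the model directory and `max_seq_len` value are illustrative, and the exact loading calls may differ slightly between exllamav2 versions:

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_8bit,  # drop-in replacement for ExLlamaV2Cache
    ExLlamaV2Tokenizer,
)

# Illustrative model directory; point this at your own model.
config = ExLlamaV2Config()
config.model_dir = "/path/to/model"
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# One cache per in-flight request, as in multiple_caches.py. The 8-bit cache
# stores keys/values in one byte per element instead of two (FP16), so roughly
# twice as many caches fit in the same memory, at some conversion overhead
# per attention step.
cache = ExLlamaV2Cache_8bit(model, max_seq_len=256)
```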