turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

[BUG] How can we increase or reduce the cache size #665

Closed. royallavanya140 closed this issue 1 week ago

royallavanya140 commented 3 weeks ago

OS

Windows

GPU Library

CUDA 12.x

Python version

3.10

Pytorch version

2.4.0

Model

bartowski/Phi-3-medium-128k-instruct-exl2

Describe the bug

When I try to load this model I get an OOM error, even with the 2-bit quantization. Can you suggest how to increase or decrease the cache size (paged attention)? I have 40 GB of VRAM and it still runs out of memory.

Reproduction steps

Load this model with any of the example scripts in the repo.

Expected behavior

It raises an OOM error.

Logs

No response

Additional context

No response

Acknowledgements

turboderp commented 3 weeks ago

Set max_seq_len when creating the cache, e.g.:

cache = ExLlamaV2Cache(model, max_seq_len = 2048, lazy = True)
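For context, a minimal loading sketch in the style of the repo's example scripts, assuming a recent exllamav2 version and a local copy of the quant at a placeholder path (model_dir below is hypothetical). Capping the cache at 2048 tokens instead of the model's full 128k context is what keeps the allocation within VRAM:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

# Placeholder path to a local copy of the EXL2 quant -- adjust to your setup
model_dir = "/models/Phi-3-medium-128k-instruct-exl2"

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)

# Cap the cache at 2048 tokens rather than the model's full 128k context.
# lazy = True defers allocation until the weights are loaded and split.
cache = ExLlamaV2Cache(model, max_seq_len = 2048, lazy = True)

# Load weights across the available GPUs, reserving room for the cache.
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)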