OS
Windows
GPU Library
CUDA 12.x
Python version
3.10
Pytorch version
2.4.0
Model
bartowski/Phi-3-medium-128k-instruct-exl2
Describe the bug
When I try to load this model I get an OOM error even with the 2-bit quantization. Can you suggest how to increase/decrease the cache size (paged attention)? I have 40 GB of VRAM and it still runs out of memory.
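For reference, this is roughly how I am loading the model, adapted from the repo's example scripts. The `max_seq_len` override is my assumption about where the cache allocation is controlled (the model defaults to 128k context), and the local path is just a placeholder; exact names may differ by version.

```python
# Rough sketch of my loading code, based on the repo's examples.
# Assumption: capping config.max_seq_len well below 128k shrinks the KV cache.
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)

model_dir = "/path/to/Phi-3-medium-128k-instruct-exl2"  # local download of the 2-bit quant

config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
config.max_seq_len = 8192  # override the 128k default (assumption: this sizes the cache)

model = ExLlamaV2(config)

# Cache is allocated from config.max_seq_len; lazy=True defers allocation until autosplit.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```

I assume the full 128k FP16 cache is what is consuming most of the 40 GB rather than the 2-bit weights, but I have not verified the numbers. Is lowering `max_seq_len` (or switching to a quantized cache such as `ExLlamaV2Cache_Q4`, if that is the intended mechanism) the right way to control this?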
Reproduction steps
Load this model with any of the example scripts in the repo.
Expected behavior
It will raise an OOM error on load.
Logs
No response
Additional context
No response
Acknowledgements
[X] I have looked for similar issues before submitting this one.
[X] I understand that the developers have lives and my issue will be answered when possible.
[X] I understand the developers of this program are human, and I will ask my questions politely.