predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
https://loraexchange.ai
Apache License 2.0

Fix quant cache OOM #494

Closed · flozi00 closed this 4 months ago

flozi00 commented 4 months ago

What does this PR do?

Sometimes an OOM occurs during warmup when using quantized models. This PR attempts to patch it by using a larger dtype when calculating free blocks, so more free VRAM is kept available.
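For context, a minimal sketch of the idea as I understand it (names like `estimate_free_blocks` and the parameter list are illustrative, not the actual LoRAX API): when the model is quantized, size KV-cache blocks using a larger dtype than the quantized storage dtype, so each block appears more expensive, fewer blocks are allocated, and more VRAM headroom remains after warmup.

```python
import torch

def estimate_free_blocks(
    device: torch.device,
    num_layers: int,
    num_heads: int,
    head_dim: int,
    block_size: int,
    dtype: torch.dtype,
    quantized: bool,
) -> int:
    # Hypothetical helper, not the LoRAX implementation.
    # If the model is quantized, deliberately compute per-block cost with a
    # larger dtype (float16) instead of the quantized storage dtype. This
    # over-estimates the bytes per block, so fewer blocks are reserved and
    # more free VRAM is left as a safety margin against warmup OOM.
    sizing_dtype = torch.float16 if quantized else dtype
    dtype_bytes = torch.tensor([], dtype=sizing_dtype).element_size()

    # One KV-cache block holds key + value tensors for every layer.
    block_bytes = 2 * num_layers * num_heads * head_dim * block_size * dtype_bytes

    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return int(free_bytes // block_bytes)
```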

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.