RonanKMcGovern closed this issue 8 months ago.
That's not an OOM error though. I will look into why AWQ is throwing an indexing error. BTW, you do not need quantization if you have that much VRAM at your disposal.
Thanks!
So the KV cache is then stored on the GPU in bf16, not quantized?
Even so, shouldn't quantizing the weights leave more room for the KV cache, and therefore allow a longer context length? Or am I misunderstanding?
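To put rough numbers on that intuition (the model dimensions below are an assumption for illustration only, roughly a Llama-2-7B-style model with 32 layers, 32 KV heads, head dim 128, and the KV cache in bf16; the actual model behind the template may differ):

$$
\underbrace{2}_{K,\,V} \times 32 \times 32 \times 128 \times 2\,\text{bytes} \approx 0.5\,\text{MB/token}
\quad\Rightarrow\quad
32{,}000\,\text{tokens} \approx 16\,\text{GB of KV cache}
$$

Under those assumptions, 4-bit weight quantization (roughly 14 GB in bf16 vs. about 4 GB with AWQ for a 7B model) would indeed free on the order of 10 GB that could otherwise hold KV cache, which is the basis of the question above.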
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Is this still being investigated?
Unfortunately we are not able to reproduce this, and we don't really have an A6000 to test it on.
Launching with CUDA_LAUNCH_BLOCKING=1 should help diagnose this a bit better (in all likelihood it's AWQ that's causing the issue).
It's probably linked to compute_cap < 7.5 tbh, which is going to be hard to fix. Using a different quantization (GPTQ, EETQ, BNB), or running unquantized, should help there. Without a reproduction it's hard to fix, though.
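As a concrete illustration of both suggestions, here is a minimal sketch of a local launch; the image tag matches the 1.3.0 version mentioned in this issue, but the model IDs are placeholders, not taken from the RunPod template:

```bash
# Debug run: CUDA_LAUNCH_BLOCKING=1 makes CUDA kernels synchronous, so the
# failing op is reported at its call site instead of asynchronously later.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id <awq-model-id> --quantize awq

# To rule AWQ out, swap the quantization backend (or drop --quantize entirely):
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id <gptq-model-id> --quantize gptq
```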
System Info
Docker image: 1.3.0
Public RunPod template: https://runpod.io/gsc?template=3uvdgyo0yy&ref=jmfkcdio
Reproduction
Run the runpod template (which uses a docker image) on an A6000 (48 GB).
Expected behavior
The same model, on an A6000, can run with 32,000 tokens of input on vLLM. So, the GPU is capable.
Possibly TGI is over-reserving memory (or maybe paged attention is implemented differently)?
It would be great if this could be addressed, because TGI is faster than vLLM for longer contexts (possibly because of flash decoding?). But that benefit can't be brought to bear if setting a long context causes an OOM.
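If over-reservation is the issue, the launcher's memory knobs are the first thing to try. A sketch follows; the values are guesses for a 48 GB card and the model ID is a placeholder, not tested settings from the template:

```bash
# Sketch: cap how much VRAM TGI claims and how large the warmup prefill is,
# while still requesting a 32k context.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id <awq-model-id> --quantize awq \
  --max-input-length 32000 \
  --max-total-tokens 32768 \
  --max-batch-prefill-tokens 32000 \
  --cuda-memory-fraction 0.95
```

For comparison, vLLM's closest knob is --gpu-memory-utilization (default 0.9), which may be part of why the same 32,000-token setting fits there.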