huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference

TGI over-reserving memory #1338

Closed RonanKMcGovern closed 8 months ago

RonanKMcGovern commented 9 months ago

System Info

TGI Docker image 1.3.0

public runpod template: https://runpod.io/gsc?template=3uvdgyo0yy&ref=jmfkcdio

Reproduction

Run the RunPod template above (which wraps the TGI Docker image) on an A6000 (48 GB).
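For reference, here is a minimal sketch of the kind of `docker run` command the template wraps; the model ID and the length/limit flags are assumptions for illustration, since the template's exact arguments aren't shown here:

```bash
# Launch TGI 1.3.0 with an AWQ-quantized model on a single 48 GB A6000.
# The model ID and the context-length limits below are illustrative
# assumptions, not the template's exact settings.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id TheBloke/Llama-2-7B-AWQ \
  --quantize awq \
  --max-input-length 8000 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 8192
```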

Expected behavior

It would be great if this could be addressed, because TGI is faster than vLLM at longer contexts (possibly because of flash decoding?). But that advantage can't be brought to bear if setting a long context causes an OOM.

OlivierDehaene commented 9 months ago

That's not an OOM error though. I will look into why AWQ is throwing an indexing error. BTW, you do not need quantization if you have that much VRAM at your disposal.

RonanKMcGovern commented 9 months ago

Thanks!

So the KV cache is stored on the GPU in bf16 then, not quantized?

Even so, quantizing the weights should leave more room for the KV cache, and therefore allow longer context lengths? Or am I misunderstanding?
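(For rough scale, assuming a Llama-2-7B-class model purely for illustration: each token's K and V across 32 layers at hidden size 4096 in bf16 is 2 × 32 × 4096 × 2 bytes ≈ 512 KB, so a 4k-token sequence needs about 2 GB of KV cache and a 16k one about 8 GB. Meanwhile 4-bit AWQ weights take roughly 4 GB versus ~14 GB in fp16, so quantizing the weights should free on the order of 10 GB, which could otherwise hold roughly 20k more tokens of bf16 cache.)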

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

QLutz commented 8 months ago

Is this still being investigated?

Narsil commented 8 months ago

Unfortunately we are not able to reproduce this, and we don't really have an A6000 to test it on.

Launching with CUDA_LAUNCH_BLOCKING=1 should help diagnose a bit better (in all likelihood it's AWQ that's causing the issue).

It's probably linked to compute_cap < 7.5 tbh, which is going to be hard to fix. Using a different quantization (GPTQ, EETQ, BNB) or running non-quantized should help there. Without a reproduction it's hard to fix though.
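For anyone else hitting this, a rough sketch of the two suggestions above (the image tag, the model IDs, and any flags not named in this thread are assumptions):

```bash
# 1) Relaunch with synchronous CUDA kernel launches so the AWQ indexing error
#    surfaces with a usable stack trace (model ID is a placeholder).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id TheBloke/Llama-2-7B-AWQ --quantize awq

# 2) Try a different quantization path: EETQ quantizes on the fly from an
#    unquantized checkpoint, GPTQ needs a GPTQ checkpoint, and BNB is
#    --quantize bitsandbytes.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id NousResearch/Llama-2-7b-hf --quantize eetq
```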