huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Queue size increases indefinitely #2192

Closed QLutz closed 3 weeks ago

QLutz commented 2 months ago

System Info

- OS version: Linux
- Model being used (`curl 127.0.0.1:8080/info | jq`): TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
- Hardware used (GPUs, how many, on which cloud) (`nvidia-smi`): 1x L40S
- The current version being used: 2.0.4

Reproduction

Launch TGI with max_total_tokens=max_batch_prefill_tokens=16384, max_input_length=16383, and quantize=awq. After a few hundred requests, the pod starts returning empty packets, and it does so only a few seconds after each request is made. Monitoring shows that tgi_queue_size increases steadily and never goes down.
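
For concreteness, a launch along these lines might look as follows. This is a sketch, not the exact command used: the Docker image tag, volume, and port mapping are assumptions, while the flags map onto the text-generation-launcher options named above.

```shell
# Assumed Docker deployment; adjust image tag, volume, and ports to your setup.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
    --quantize awq \
    --max-input-length 16383 \
    --max-total-tokens 16384 \
    --max-batch-prefill-tokens 16384
```

The queue gauge can then be watched on the Prometheus endpoint TGI exposes:

```shell
# tgi_queue_size should normally drain back toward 0 between bursts of requests.
watch -n 5 "curl -s 127.0.0.1:8080/metrics | grep tgi_queue_size"
```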

Expected behavior

No stutters: requests keep being served and tgi_queue_size drains back down.

Hugoch commented 2 months ago

Hey @QLutz, I suspect it may be related to #2099. Can you try running TGI with --cuda-graphs 0 and see if you still see the hang?
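
For example, on a Docker deployment like the one sketched above, that would be (the image tag and model are carried over as assumptions):

```shell
# --cuda-graphs 0 disables CUDA graph capture entirely.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ \
    --quantize awq \
    --cuda-graphs 0
```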

HoKim98 commented 1 month ago

I had the same problem and was able to work around it with the --cuda-graphs 0 method. This obviously caused major performance problems, but it was at least a better option than being broken.

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.