ikb-a / vector-inference

Efficient LLM inference on Slurm clusters using vLLM.

Increasing the number of decodes causes call to hang as request is swapped #1

Open ikb-a opened 4 weeks ago

ikb-a commented 4 weeks ago

Describe the bug

Increasing the number of requested completions eventually causes the request to be "swapped", and the API call never returns.

To Reproduce

See https://github.com/ikb-a/vector-inference/tree/bug/swapped

Run this script to start up Llama 2 7B on a t4 partition. Once it is up, get the URL as usual.

You can test that the server is working with this script.

Increasing the number of requested completions to 10 causes the job to hang indefinitely. See this script, which is identical to the previous one but with n=10 (i.e., 10 completions requested); the failing call looks roughly like the sketch below.
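
For reference, a minimal sketch of the failing request against vLLM's OpenAI-compatible endpoint. The base URL, model name, and prompt here are placeholders, not the values from the linked scripts; use whatever the launch script and Slurm job report:

```python
from openai import OpenAI

# Placeholder URL and API key; use the values reported for your Slurm job.
client = OpenAI(base_url="http://<node>:8080/v1", api_key="EMPTY")

# With n=1 this returns normally; with n=10 the call never returns once the
# request is preempted/swapped on the server.
completion = client.completions.create(
    model="Meta-Llama-2-7B",  # placeholder; match whatever name the server registers
    prompt="The capital of Canada is",
    max_tokens=64,
    n=10,
)
print([choice.text for choice in completion.choices])
```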

Expected behavior

Ideally, the API call would raise some sort of error, or the server would terminate. I imagine there's an option for this somewhere; I think that would be a more useful default behaviour for research.
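
As a client-side stopgap (not a fix for the server behaviour), the OpenAI client can be given a request timeout so the call fails with an exception instead of blocking forever. A minimal sketch, reusing the placeholder URL and model name from above:

```python
from openai import OpenAI, APITimeoutError

# timeout is in seconds; max_retries=0 avoids silently re-submitting the request.
client = OpenAI(
    base_url="http://<node>:8080/v1",
    api_key="EMPTY",
    timeout=120.0,
    max_retries=0,
)

try:
    completion = client.completions.create(
        model="Meta-Llama-2-7B",  # placeholder
        prompt="The capital of Canada is",
        max_tokens=64,
        n=10,
    )
except APITimeoutError:
    # The request was likely swapped out server-side and will not complete.
    print("Request timed out; consider lowering n or giving the server more KV cache.")
```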

Example vLLM output:

WARNING 08-22 14:38:16 scheduler.py:1089] Sequence group cmpl-6fdf35b90e7f47bcbfb7d612b1a5dda4 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
INFO 08-22 14:38:26 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.3 tokens/s, Running: 0 reqs, Swapped: 1 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 8.0%.
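
For context, the warning above is vLLM reporting that it preempted the sequence group because the KV cache cannot hold all 10 parallel sequences, and it names the engine arguments that control this. A minimal offline sketch of those arguments is below; on the cluster they would instead be passed through the server launch script, so treat this as an illustration rather than the repo's actual configuration:

```python
from vllm import LLM, SamplingParams

# Raising gpu_memory_utilization leaves more room for KV cache; swap_space
# (GiB of CPU memory) gives preempted KV blocks somewhere to go.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder; match the served model
    gpu_memory_utilization=0.95,
    swap_space=8,
)

params = SamplingParams(n=10, max_tokens=64)
outputs = llm.generate(["The capital of Canada is"], params)
print(len(outputs[0].outputs))  # 10 completions for the single prompt
```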
ikb-a commented 3 weeks ago

Marshall is familiar with the behaviour, and is looking into whether there's a way to configure the vLLM server to exit instead.