Describe the bug
By increasing the number of requested completions, the request will eventually be "swapped" and the API call will never return.

To Reproduce
See https://github.com/ikb-a/vector-inference/tree/bug/swapped
Run this script to start up Llama2 7B on a t4 partition. Once it is up, get the URL as usual.
You can test that the server is working with this script.
If you increase the number of requested completions to 10, the job hangs indefinitely. See this script, which is identical to the previous one but with n=10 (i.e., 10 completions requested).
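For reference, the client side of these scripts boils down to a single OpenAI-compatible completions call. The sketch below is only illustrative; the base_url, api_key, and model name are placeholders, not the values used in the repo's scripts.

```python
# Illustrative sketch of the repro, not the actual test scripts from the repo.
# base_url, api_key, and model are placeholder assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://<server-node>:8080/v1",  # URL reported by the launch script
    api_key="EMPTY",                          # dummy key; the local server does not need a real one
)

# With n=1 the call returns normally.
resp = client.completions.create(
    model="Llama-2-7b",                       # placeholder model name
    prompt="The Vector Institute is",
    max_tokens=64,
    n=1,
)
print(resp.choices[0].text)

# Raising n to 10 is enough to trigger the swap/preemption described below,
# and this call then never returns.
resp = client.completions.create(
    model="Llama-2-7b",
    prompt="The Vector Institute is",
    max_tokens=64,
    n=10,
)
```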
Expected behavior
Ideally, the API call would raise some sort of error, or the server would terminate.
I imagine there's an option for this somewhere; I think this would be the more useful default behaviour for research.
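As a client-side stopgap (an assumption on my part, not something the repo's scripts currently do), the OpenAI client can be given a request timeout so a swapped request at least surfaces an error instead of blocking forever:

```python
# Possible client-side stopgap (assumption, not part of the repo's scripts):
# give the client a timeout so a stuck request raises instead of hanging.
from openai import OpenAI

client = OpenAI(
    base_url="http://<server-node>:8080/v1",  # placeholder
    api_key="EMPTY",
    timeout=120.0,  # seconds; a request exceeding this raises openai.APITimeoutError
)
```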
Example vLLM output:
```
WARNING 08-22 14:38:16 scheduler.py:1089] Sequence group cmpl-6fdf35b90e7f47bcbfb7d612b1a5dda4 is preempted by PreemptionMode.SWAP mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
INFO 08-22 14:38:26 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 29.3 tokens/s, Running: 0 reqs, Swapped: 1 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 8.0%.
```
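The warning suggests giving the engine more KV-cache headroom. Purely as an illustration (the launch scripts in the repo have their own interface, so the values below are assumptions), these are the engine arguments the warning refers to, shown via vLLM's offline LLM class:

```python
# Illustration only: the engine arguments the warning mentions, shown through
# vLLM's offline LLM class. Model id and values are assumptions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model id
    gpu_memory_utilization=0.95,       # default is 0.9; leaves more room for the KV cache
    swap_space=8,                      # GiB of CPU space for preempted (swapped) sequences
    # tensor_parallel_size=2,          # the other option from the warning, if more GPUs are available
)
```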