
vLLM stops all processing when CPU KV cache is used, has to be shut down and restarted. #546

Open · TheBloke opened this issue 1 year ago

TheBloke commented 1 year ago

Hi

The issue: with --swap-space X specified, as soon as the CPU KV cache starts being used, vLLM stops all processing. CPU and GPU usage drop to 0%, the request never returns, and any future requests are never answered either. There is no error.

I am testing the latest vLLM code (commit 6fc2a38) in a Docker container. I have experienced the issue since I first started using vLLM about 4 days ago, so it's not specific to the latest commits.

I am launching vLLM with the following args:

--model lmsys/vicuna-7b-v1.3 --host 0.0.0.0 --tokenizer hf-internal-testing/llama-tokenizer --swap-space 100 

I am currently testing on a 1 x 4090 system, but I have experienced it on all GPU types I've tried, including A6000 and H100.

The following test code will quickly trigger the issue on a 1 x 4090 system:

import time
import requests

# One request asking for n=125 long completions: enough parallel sequences
# to overflow the GPU KV cache and force swapping to the CPU KV cache.
url = 'http://localhost:8000/v1/completions'
headers = {'Content-Type': 'application/json'}
data = {
    "model": "lmsys/vicuna-7b-v1.3",
    "prompt": " Write a story about a cat named George." * 40,
    "max_tokens": 950,
    "temperature": 0.7,
    "n": 125,
}

start = time.time()
response = requests.post(url, headers=headers, json=data)
print(time.time() - start)

Here's a screenshot demonstrating the issue:

[Screenshot: server stats showing only 7.9% CPU KV cache usage at the moment all processing stops]

In the screenshot you can see that only 7.9% of the CPU KV cache is used, but this is enough to stop all processing. The server will never answer this request, nor any new requests. It is effectively dead.

If I leave out --swap-space, the server instead aborts with "RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error." That abort is what I'm trying to avoid: it would be nice to use CPU RAM as an overflow buffer for the occasional request that exceeds VRAM.

Thanks in advance.

syskn commented 1 year ago

I can confirm that this issue also occurs with the default 4 GB swap space, both in the first release and in the most recent versions.

Lawliet-Xie commented 1 year ago

I had the same problem, did you solve it?

TheBloke commented 1 year ago

No, I'm not sure it's something we can solve ourselves. Might need a code fix.

What I am doing now, as a workaround, is running without --swap-space and using a monitoring script that restarts vLLM whenever it aborts with "RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error."

Not ideal at all but it works for now.
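
For reference, here's a minimal sketch of the kind of watchdog I mean (the entrypoint module and args are just an assumption based on my setup; adjust to however you launch the server):

import subprocess
import sys

# Relaunch the server every time the process dies, e.g. after it aborts
# with "RuntimeError: Aborted due to the lack of CPU swap space."
CMD = [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", "lmsys/vicuna-7b-v1.3",
    "--host", "0.0.0.0",
    "--tokenizer", "hf-internal-testing/llama-tokenizer",
]

while True:
    exit_code = subprocess.call(CMD)  # blocks until the server exits
    print(f"vLLM exited with code {exit_code}; restarting...", flush=True)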

syskn commented 1 year ago

Might be related: https://github.com/vllm-project/vllm/issues/667

desperadoola commented 1 year ago

https://github.com/vllm-project/vllm/blob/66c54aa9c33555a6b41421d57d3ad6c1bf004ec9/vllm/engine/async_llm_engine.py#L67-L75

Commenting out this await asyncio.sleep(0) seems to work around the hang, at least temporarily.
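
For context, await asyncio.sleep(0) just yields control back to the event loop so other pending coroutines (such as new request handlers) get a turn to run. A toy illustration of that yielding behavior (not vLLM code):

import asyncio

async def busy_worker():
    for i in range(3):
        print(f"worker step {i}")
        # Yield to the event loop so other pending tasks get a turn;
        # a coroutine that never awaits can starve them.
        await asyncio.sleep(0)

async def other_task():
    print("other task ran")

async def main():
    await asyncio.gather(busy_worker(), other_task())

asyncio.run(main())

If commenting it out helps, it presumably changes when the engine loop yields relative to other tasks rather than fixing the root cause.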

SatoshiReport commented 1 year ago

Same issue. Cache fills up and then vLLM stops working.

tydia commented 1 year ago

this issue makes vllm impossible for production use

chi2liu commented 9 months ago

this issue makes vllm impossible for production use

For now, we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.

hmellor commented 6 months ago

@TheBloke are you still experiencing this issue?

shyringo commented 5 months ago

this issue makes vllm impossible for production use

For now, we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.

Wondering how I can set the swap space directly to 0?

hmellor commented 5 months ago

Pass --swap-space 0 on the command line (see the docs).
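
If you're using the Python API instead, the matching engine argument should be swap_space (a sketch, assuming the offline LLM entrypoint):

from vllm import LLM

# swap_space is measured in GiB; 0 disables the CPU KV cache entirely.
llm = LLM(model="lmsys/vicuna-7b-v1.3", swap_space=0)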

haoxiongliu commented 2 months ago

this issue makes vllm impossible for production use

For now, we have found a workaround: set the swap space directly to 0. This way the CPU swap space is never used and no error is raised. The number of CPU blocks also becomes 0, which may slow things down a bit, but at least the server does not hang and die.

When using vLLM 0.5.1, setting swap_space=0 causes the process to terminate as soon as vLLM tries to preempt a sequence group, even though # CPU blocks is 0:

ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._preempt_by_swap(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1145, in _preempt_by_swap
ERROR 07-11 00:49:14 async_llm_engine.py:53]     self._swap_out(seq_group, blocks_to_swap_out)
ERROR 07-11 00:49:14 async_llm_engine.py:53]   File "/opt/miniconda3/envs/working/lib/python3.11/site-packages/vllm/core/scheduler.py", line 1165, in _swap_out
ERROR 07-11 00:49:14 async_llm_engine.py:53] RuntimeError: Aborted due to the lack of CPU swap space. Please increase the swap space to avoid this error.

hmellor commented 2 months ago

Potentially this is a bug that's been fixed in BlockSpaceManagerV2?

You can enable it using --use-v2-block-manager
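
For example, via the Python API (a sketch; use_v2_block_manager is the engine argument behind that flag):

from vllm import LLM

# Enable BlockSpaceManagerV2 from the offline API; this mirrors the
# --use-v2-block-manager CLI flag on versions that still have the
# v1/v2 block manager split.
llm = LLM(model="lmsys/vicuna-7b-v1.3", use_v2_block_manager=True)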