vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

No response from OpenAI Chat API with vLLM #1879

Closed SafeyahShemali closed 6 months ago

SafeyahShemali commented 10 months ago

Hello, I have been trying to use the OpenAI Chat API with vLLM. I've launched the server as follows:

python -m vllm.entrypoints.openai.api_server \
  --model codellama/CodeLlama-13b-Instruct-hf \
  --tensor-parallel-size=2 \
  --dtype='float16' \
  --host hostname \
  --tokenizer='hf-internal-testing/llama-tokenizer' \
  --max-model-len=1600

This works with my limited GPU resources, and the server-side output is as follows:

WARNING 12-01 06:25:41 config.py:346] Casting torch.bfloat16 to torch.float16.
2023-12-01 06:25:44,410 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-01 06:25:45 llm_engine.py:72] Initializing an LLM engine with config: model='codellama/CodeLlama-13b-Instruct-hf', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1600, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
INFO 12-01 06:28:04 llm_engine.py:207] # GPU blocks: 37, # CPU blocks: 655
INFO:     Started server process [4527]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://instance-4:8000 (Press CTRL+C to quit)

On the client side, I tried two ways. First, querying the server directly:

curl http://instance-4:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama/CodeLlama-13b-Instruct-hf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'

However, I get neither a response nor an error, as shown in the image below:

[Screenshot 2023-12-01 at 1:40:20 AM]

I also tried it through a Python script, as follows:

"""Example Python client for vllm.entrypoints.api_server"""

import argparse
import json
from typing import Iterable, List

import requests


def clear_line(n: int = 1) -> None:
    LINE_UP = '\033[1A'
    LINE_CLEAR = '\x1b[2K'
    for _ in range(n):
        print(LINE_UP, end=LINE_CLEAR, flush=True)


def post_http_request(prompt: str,
                      api_url: str,
                      n: int = 1,
                      stream: bool = False) -> requests.Response:
    headers = {"User-Agent": "Test Client"}
    pload = {
        "model": "codellama/CodeLlama-13b-Instruct-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "n": n,
        "use_beam_search": True,
        "temperature": 0.0,
        "max_tokens": 16,
        "stream": stream,
    }
    response = requests.post(api_url, headers=headers, json=pload, stream=True)
    return response


def get_streaming_response(response: requests.Response) -> Iterable[List[str]]:
    for chunk in response.iter_lines(chunk_size=8192,
                                     decode_unicode=False,
                                     delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            print('lll', data, '\n\n')
            output = data["text"]
            yield output


def get_response(response: requests.Response) -> List[str]:
    data = json.loads(response.content)
    print('jjj', data, '\n\n')
    output = data["choices"]
    return output


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, default="instance-4")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--n", type=int, default=4)
    parser.add_argument("--prompt", type=str, default="why you are silly?")
    parser.add_argument("--stream", action="store_true")
    args = parser.parse_args()
    prompt = args.prompt
    api_url = f"http://{args.host}:{args.port}/v1/chat/completions"
    n = args.n
    stream = args.stream

    print(f"Prompt: {prompt!r}\n", flush=True)
    response = post_http_request(prompt, api_url, n, stream)

    if True:
        num_printed_lines = 0
        for h in get_streaming_response(response):
            clear_line(num_printed_lines)
            num_printed_lines = 0
            for i, line in enumerate(h):
                num_printed_lines += 1
                print(f"Beam candidate {i}: {line['text']}", flush=True)
    else:
        output = get_response(response)
        for i, line in enumerate(output):
            print(f"Beam candidate {i}: {line['text']}", flush=True)

Same problem here: no response and no error. I would like to know how to make this work, since the examples mentioned in the docs no longer work now that the OpenAI API has been updated.
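For reference, a minimal sketch of a client using the updated openai Python package (version 1.x) against the vLLM OpenAI-compatible server launched above; the host, port, and dummy api_key are assumptions, not values confirmed in this thread:

# Minimal sketch using the openai>=1.0 client; host/port/api_key are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://instance-4:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                       # vLLM does not validate the key by default
)

completion = client.chat.completions.create(
    model="codellama/CodeLlama-13b-Instruct-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    max_tokens=64,
)
print(completion.choices[0].message.content)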

simon-mo commented 10 months ago

Which hardware are you using? It looks like after processing the prompt, there's very little free space left for computing the generation tokens. See (# GPU blocks: 37). Maybe consider a 7B model to see whether the same issue occurs?

SafeyahShemali commented 10 months ago

Which hardware are you using? It looks like after processing the prompt, there's very little free space left for computing the generation tokens. See (# GPU blocks: 37). Maybe consider a 7B model to see whether the same issue occurs?

Hello Simon,

I am using a GCP VM with 2 Nvidia T4 GPUs (16 vCPUs, 8 cores, and 60 GB of memory each). I need to use the codellama/CodeLlama-13b-Instruct-hf model for my research. I had already faced a 'CUDA out of memory' issue before, but it works with this configuration.

So you think getting more compute (more GPUs or more memory) would help?

simon-mo commented 10 months ago

Yeah, it does look like two T4s give you 32 GB of GPU memory. The 13B model takes about 26 GB in parameters, which leaves very little for the KV cache. Maybe add one more T4, or consider an A10 or L4, which have more memory available.
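As a rough back-of-envelope check (my own approximate numbers, assuming float16 weights, vLLM's default 16-token block size, and ignoring activation and framework overhead):

# Rough KV-cache sizing; all figures are approximations, not measurements.
weight_bytes = 13e9 * 2            # 13B params in float16 -> ~26 GB of weights
gpu_bytes = 2 * 16e9               # two T4s -> ~32 GB of GPU memory in total
print(f"left over: {(gpu_bytes - weight_bytes) / 1e9:.0f} GB")  # ~6 GB for everything else

# The log above reports "# GPU blocks: 37"; with a 16-token block size that is
# only 37 * 16 = 592 tokens of KV cache shared by all running requests.
print("KV-cache capacity:", 37 * 16, "tokens")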

viktor-ferenczi commented 9 months ago

This also happens on vLLM 0.2.3, 0.2.4, 0.2.5, and main while running any model tensor-parallel on 2x RTX 4090. Once the GPU KV cache is full, vLLM hangs: it simply stops running any processing on the GPU. It does not even try to swap anything out to the CPU KV cache.

Last log lines before the freeze:

INFO 12-16 00:57:56 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 316.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 84.9%, CPU KV cache usage: 0.0%
INFO 12-16 00:58:01 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 311.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 91.5%, CPU KV cache usage: 0.0%
INFO 12-16 00:58:06 llm_engine.py:649] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 311.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 97.3%, CPU KV cache usage: 0.0%

On killing it with SIGTERM (or Ctrl-C):

^CINFO:     Shutting down
INFO:     Waiting for background tasks to complete. (CTRL+C to force quit)

Then it continues to hang.

Command:

python -O -u -m vllm.entrypoints.openai.api_server \
  --model=TheBloke/CodeLlama-34B-Instruct-AWQ \
  --chat-template=$HOME/bin/templates/llama-2-chat.jinja \
  --quantization=awq \
  --dtype=float16 \
  --served-model-name=model \
  --host=0.0.0.0 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --tensor-parallel-size=2 \
  --swap-space=8 \
  --gpu-memory-utilization=0.8 \
  --disable-log-requests

The chat template does not matter; it is only there to get the prompt format right for the CodeLlama model.

chi2liu commented 9 months ago

For now, we have found a workaround: set the swap space directly to 0. This way, the CPU swap space is never used and no errors are reported. The CPU blocks also become 0, which may slow things down a bit, but at least it does not hang and die.
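For reference, applied to the launch command from the top of this thread, the workaround amounts to passing --swap-space 0 (assuming that flag is available in your vLLM version):

python -m vllm.entrypoints.openai.api_server \
  --model codellama/CodeLlama-13b-Instruct-hf \
  --tensor-parallel-size=2 \
  --dtype='float16' \
  --max-model-len=1600 \
  --swap-space 0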