You can use the official OpenAI Python client to access vLLM's OpenAI-compatible server. The latest stable release supports the Models API, the Chat Completions API, and the Completions API.
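For example, here is a minimal sketch of calling a locally running vLLM server with the official client. It assumes the server was started on the default host and port (localhost:8000) with --model vicuna-7b-v1.5; the model name passed to the client must match what the server reports under /v1/models.

from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(
    api_key="EMPTY",                      # a placeholder; vLLM does not require a real OpenAI key
    base_url="http://localhost:8000/v1",
)

# Models API: list the model(s) the server is serving.
print([m.id for m in client.models.list().data])

# Chat Completions API
chat = client.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(chat.choices[0].message.content)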
I have the same problem. It is strange, because I could run it a few days ago. Have you solved it?
http://localhost:8000/generate
is only valid for the example server, not the OpenAI-compatible server. In the original post,
python -m vllm.entrypoints.openai.api_server --model vicuna-7b-v1.5 --trust-remote-code
runs the OpenAI-compatible server instead of the example server.
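With the OpenAI-compatible server started by that command, the request that was being sent to /generate should go through /v1/completions instead. A sketch using the openai client, with the same prompt as the curl call later in this thread:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Completions API call; the model name must match the --model argument.
completion = client.completions.create(
    model="vicuna-7b-v1.5",
    prompt="Below is an instruction that describes a task. "
           "Write a response that appropriately completes the request.\n\n"
           "### Instruction:write a bubble sort using python\n\n### Response:",
    max_tokens=256,
)
print(completion.choices[0].text)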
python -m vllm.entrypoints.api_server \
    --model /workspace/opt-125m \
    --port 9760 \
    --gpu-memory-utilization 0.5
This is the command I run the server with. I got a 404 when running both curl http://localhost:9760/v1/models and client.completions.create.
Can you try setting --host 0.0.0.0 instead of the default --host localhost?
I have tried; it does not seem to work.
Can you show the log of both the server and client?
INFO 06-04 02:31:05 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/workspace/opt-125m', speculative_config=None, tokenizer='/workspace/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/workspace/opt-125m)
INFO 06-04 02:31:06 model_runner.py:146] Loading model weights took 0.2389 GB
INFO 06-04 02:31:07 gpu_executor.py:83] # GPU blocks: 34192, # CPU blocks: 7281
INFO 06-04 02:31:10 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-04 02:31:10 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-04 02:31:13 model_runner.py:924] Graph capturing finished in 3 secs.
INFO: Started server process [486571]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9760 (Press CTRL+C to quit)
INFO: 127.0.0.1:33664 - "GET / HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:35442 - "GET / HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:56994 - "POST /v1/completions HTTP/1.1" 404 Not Found
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
Traceback (most recent call last):
  File "/workspace/api_inference.py", line 11, in <module>
    completion = client.completions.create(model="/workspace/opt-125m",
  File "/opt/conda/lib/python3.10/site-packages/openai/_utils/_utils.py", line 275, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/openai/resources/completions.py", line 516, in create
    return self._post(
  File "/opt/conda/lib/python3.10/site-packages/openai/_base_client.py", line 1233, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/opt/conda/lib/python3.10/site-packages/openai/_base_client.py", line 922, in request
    return self._request(
  File "/opt/conda/lib/python3.10/site-packages/openai/_base_client.py", line 1013, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'detail': 'Not Found'}
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://0.0.0.0:9760/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
for i in range(10):
    completion = client.completions.create(model="/workspace/opt-125m",
                                            prompt=f"San Francisco is a {i}")
    print("Completion result:", completion)
I could use this a few days ago, and running vLLM offline still works, so I wonder whether it is a network problem or an issue with the openai package version. I have tried several openai versions, but it still doesn't work. Could disabling ECC have an effect? ECC on our remote server has been disabled for other experiments.
Can you try using curl to send requests directly? (By the way, when you were hosting on localhost, did you also access the server via localhost?)
I have tried curl http://localhost:9760/v1/models and received {"detail":"Not Found"}, with INFO: 127.0.0.1:54932 - "GET /v1/models HTTP/1.1" 404 Not Found in the server's log.
For detailed debugging, I recommend attaching a debugger to the OpenAI-compatible server and adding a breakpoint in fastapi.FastAPI.__call__ (which handles the incoming request) so that you can step through the code and see where it breaks.
An example debugger config in VSCode would be:
{
    "name": "Python Debugger: OpenAI server",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/vllm/entrypoints/openai/api_server.py",
    "console": "integratedTerminal",
    "args": [
        "--port",
        "9760",
        "--model",
        "/workspace/opt-125m"
    ],
    "justMyCode": false
},
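Alternatively, a quick way to see what the running process actually responds with is to probe a couple of routes directly. A rough sketch using requests; the host and port are assumed to match the server above:

import requests

base = "http://localhost:9760"  # assumed host/port; adjust to your server

# The OpenAI-compatible server should answer /health and /v1/models with 200;
# a 404 on every route suggests the request is reaching a different process
# (or a different entrypoint) than you expect.
for path in ("/health", "/v1/models"):
    try:
        resp = requests.get(base + path, timeout=5)
        print(path, resp.status_code, resp.text[:200])
    except requests.RequestException as exc:
        print(path, "request failed:", exc)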
It's difficult for me to help you debug since I can't repro your network environment. Hope this helps!
@GanZhengha it seems you are using vllm.entrypoints.api_server, which is deprecated and does not support http://localhost:8000/v1/models.
You should use vllm.entrypoints.openai.api_server. It does not support /generate, though; typical usage is through the openai client.
Sorry for the confusion, but we have two API servers that respond to different URLs.
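Roughly, the split looks like this (an illustrative sketch using the model path and port from this thread; only the half that matches the entrypoint you actually launched will succeed):

import requests
from openai import OpenAI

# 1) Deprecated example server (python -m vllm.entrypoints.api_server ...):
#    it exposes /generate, and the /v1/* routes return 404.
resp = requests.post(
    "http://localhost:9760/generate",
    json={"prompt": "San Francisco is a", "max_tokens": 16},
    timeout=30,
)
print(resp.json())  # JSON body with a "text" field

# 2) OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server ...):
#    it exposes /v1/models, /v1/completions and /v1/chat/completions, and /generate returns 404.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:9760/v1")
completion = client.completions.create(model="/workspace/opt-125m",
                                        prompt="San Francisco is a")
print(completion.choices[0].text)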
Run with python -m vllm.entrypoints.openai.api_server --model vicuna-7b-v1.5 --trust-remote-code:
curl http://localhost:8000/generate -d '{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:write a bubble sort using python\n\n### Response:","max_tokens":256}'
INFO: 127.0.0.1:55766 - "POST /generate HTTP/1.1" 404 Not Found
INFO 05-28 15:37:03 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO: 127.0.0.1:45622 - "GET /generate HTTP/1.1" 404 Not Found
python cli.py
Prompt: 'San Francisco is a'
Traceback (most recent call last):
  File "/root/autodl-tmp/chat-main/vllm_cli.py", line 79, in
    output = get_response(response)
  File "/root/autodl-tmp/chat-main/vllm_cli.py", line 46, in get_response
    output = data["text"]
KeyError: 'text'
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.