You can use the official OpenAI Python client to access vLLM's OpenAI-compatible server. The latest stable release supports the Models API, the Chat Completions API, and the Completions API.
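For example, here is a minimal sketch of calling a locally running vLLM server with the official client. It assumes the server was started on the default host and port (localhost:8000) with --model vicuna-7b-v1.5; the model name passed to the client must match what the server reports under /v1/models.

from openai import OpenAI

# Point the official OpenAI client at the local vLLM server.
client = OpenAI(
    api_key="EMPTY",                      # a placeholder; vLLM does not require a real OpenAI key
    base_url="http://localhost:8000/v1",
)

# Models API: list the model(s) the server is serving.
print([m.id for m in client.models.list().data])

# Chat Completions API
chat = client.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(chat.choices[0].message.content)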
I have the same problem. It is strange, because I could run it a few days ago. Have you solved it?
http://localhost:8000/generate
is only valid for the example server, not the OpenAI-compatible server. In the original post,
python -m vllm.entrypoints.openai.api_server --model vicuna-7b-v1.5 --trust-remote-code
runs the OpenAI-compatible server instead of the example server.
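With the OpenAI-compatible server started by that command, the request that was being sent to /generate should go through /v1/completions instead. A sketch using the openai client, with the same prompt as the curl call later in this thread:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Completions API call; the model name must match the --model argument.
completion = client.completions.create(
    model="vicuna-7b-v1.5",
    prompt="Below is an instruction that describes a task. "
           "Write a response that appropriately completes the request.\n\n"
           "### Instruction:write a bubble sort using python\n\n### Response:",
    max_tokens=256,
)
print(completion.choices[0].text)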
python -m vllm.entrypoints.api_server \
    --model /workspace/opt-125m \
    --port 9760 \
    --gpu-memory-utilization 0.5
This is the command I run the server with. I got a 404 when running both curl http://localhost:9760/v1/models and client.completions.create.
Can you try setting --host 0.0.0.0 instead of the default --host localhost?
I have tried; it does not seem to work.
Can you show the log of both the server and client?
INFO 06-04 02:31:05 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/workspace/opt-125m', speculative_config=None, tokenizer='/workspace/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/workspace/opt-125m)
INFO 06-04 02:31:06 model_runner.py:146] Loading model weights took 0.2389 GB
INFO 06-04 02:31:07 gpu_executor.py:83] # GPU blocks: 34192, # CPU blocks: 7281
INFO 06-04 02:31:10 model_runner.py:854] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-04 02:31:10 model_runner.py:858] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-04 02:31:13 model_runner.py:924] Graph capturing finished in 3 secs.
INFO: Started server process [486571]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9760 (Press CTRL+C to quit)
INFO: 127.0.0.1:33664 - "GET / HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:35442 - "GET / HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:56994 - "POST /v1/completions HTTP/1.1" 404 Not Found
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
Traceback (most recent call last):
  File "/workspace/api_inference.py", line 11, in <module>
    completion = client.completions.create(model="/workspace/opt-125m",
  File "/opt/conda/lib/python3.10/site-packages/openai/_utils/_utils.py", line 275, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/openai/resources/completions.py", line 516, in create
    return self._post(
  File "/opt/conda/lib/python3.10/site-packages/openai/_base_client.py", line 1233, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/opt/conda/lib/python3.10/site-packages/openai/_base_client.py", line 922, in request
    return self._request(
  File "/opt/conda/lib/python3.10/site-packages/openai/_base_client.py", line 1013, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.NotFoundError: Error code: 404 - {'detail': 'Not Found'}
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://0.0.0.0:9760/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
for i in range(10):
    completion = client.completions.create(model="/workspace/opt-125m",
                                            prompt=f"San Francisco is a {i}")
    print("Completion result:", completion)
I could use this a few days ago, and running vLLM offline still works, so I wonder whether it is a network problem or an issue with the openai package version. I have tried several openai versions, but it still doesn't work. Could disabling ECC have an effect? ECC on our remote server has been disabled for other experiments.
Can you try using curl to send requests directly? (By the way, when you were hosting on localhost, did you also access the server via localhost?)
I have tried curl http://localhost:9760/v1/models and received {"detail":"Not Found"}, with INFO: 127.0.0.1:54932 - "GET /v1/models HTTP/1.1" 404 Not Found in the server's log.
For detailed debugging, I recommend attaching a debugger to the OpenAI-compatible server and adding a breakpoint in fastapi.FastAPI.__call__ (which handles the incoming request) so that you can step through the code and see where it breaks.
An example debugger config in VSCode would be:
{
    "name": "Python Debugger: OpenAI server",
    "type": "debugpy",
    "request": "launch",
    "program": "${workspaceFolder}/vllm/entrypoints/openai/api_server.py",
    "console": "integratedTerminal",
    "args": [
        "--port",
        "9760",
        "--model",
        "/workspace/opt-125m"
    ],
    "justMyCode": false
},
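Alternatively, a quick way to see what the running process actually responds with is to probe a couple of routes directly. A rough sketch using requests; the host and port are assumed to match the server above:

import requests

base = "http://localhost:9760"  # assumed host/port; adjust to your server

# The OpenAI-compatible server should answer /health and /v1/models with 200;
# a 404 on every route suggests the request is reaching a different process
# (or a different entrypoint) than you expect.
for path in ("/health", "/v1/models"):
    try:
        resp = requests.get(base + path, timeout=5)
        print(path, resp.status_code, resp.text[:200])
    except requests.RequestException as exc:
        print(path, "request failed:", exc)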
It's difficult for me to help you debug since I can't repro your network environment. Hope this helps!
@GanZhengha it seems you are using vllm.entrypoints.api_server, which is deprecated and does not support http://localhost:8000/v1/models.
You should use vllm.entrypoints.openai.api_server. It does not support /generate, though; typical usage is through the openai client.
Sorry for the confusion, but we have two API servers that respond to different URLs.
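Roughly, the split looks like this (an illustrative sketch using the model path and port from this thread; only the half that matches the entrypoint you actually launched will succeed):

import requests
from openai import OpenAI

# 1) Deprecated example server (python -m vllm.entrypoints.api_server ...):
#    it exposes /generate, and the /v1/* routes return 404.
resp = requests.post(
    "http://localhost:9760/generate",
    json={"prompt": "San Francisco is a", "max_tokens": 16},
    timeout=30,
)
print(resp.json())  # JSON body with a "text" field

# 2) OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server ...):
#    it exposes /v1/models, /v1/completions and /v1/chat/completions, and /generate returns 404.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:9760/v1")
completion = client.completions.create(model="/workspace/opt-125m",
                                        prompt="San Francisco is a")
print(completion.choices[0].text)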
Run with python -m vllm.entrypoints.openai.api_server --model vicuna-7b-v1.5 --trust-remote-code:
curl http://localhost:8000/generate -d '{"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:write a bubble sort using python\n\n### Response:","max_tokens":256}'
INFO: 127.0.0.1:55766 - "POST /generate HTTP/1.1" 404 Not Found
INFO 05-28 15:37:03 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO: 127.0.0.1:45622 - "GET /generate HTTP/1.1" 404 Not Found
python cli.py
Prompt: 'San Francisco is a'
Traceback (most recent call last):
  File "/root/autodl-tmp/chat-main/vllm_cli.py", line 79, in
    output = get_response(response)
  File "/root/autodl-tmp/chat-main/vllm_cli.py", line 46, in get_response
    output = data["text"]
KeyError: 'text'
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.