runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
MIT License

BadRequestError on runsync route, or what is the correct method to hit handler.py's locally run API? #65

Closed. dpkirchner closed this issue 3 months ago.

dpkirchner commented 4 months ago

I'm getting a BadRequestError when I try to test the vLLM worker locally.

I'm running my handler locally for testing, in a Docker image built using the instructions at https://github.com/runpod-workers/worker-vllm?tab=readme-ov-file#option-2-build-docker-image-with-model-inside:

MODEL_NAME=/models/stablelm-3b-4e1t python3 -u /src/handler.py --rp_serve_api --rp_api_port 8000 --rp_api_host 0.0.0.0

I'm trying to send test requests to the runsync route based on what is described here:

https://blog.runpod.io/workers-local-api-server-introduced-with-runpod-python-0-9-13/

I've tried using the API test forms on the http://localhost:8000/docs page, and I've also tried curl:

curl -H 'content-type: application/json' -d '{"input":{"message":"blah de blah"}}' http://localhost:8000/runsync

However, I always get this response:

{
  "id": "test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a",
  "status": "COMPLETED",
  "output": [
    {
      "error": {
        "object": "error",
        "message": "",
        "type": "BadRequestError",
        "param": null,
        "code": 400
      }
    }
  ]
}

I also tried the {"input": {"number": 123}} body shown in the blog post, with the same result.

What am I doing wrong?

Here's the full output from handler.py:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-12 23:19:33 llm_engine.py:87] Initializing an LLM engine with config: model='/models/stablelm-3b-4e1t', tokenizer='/models/stablelm-3b-4e1t', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir='/models/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-12 23:19:35 weight_utils.py:257] Loading safetensors took 1.01s
INFO 04-12 23:19:37 llm_engine.py:357] # GPU blocks: 1111, # CPU blocks: 819
WARNING 04-12 23:19:37 cache_engine.py:103] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 04-12 23:19:37 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-12 23:19:37 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-12 23:19:43 model_runner.py:756] Graph capturing finished in 7 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 04-12 23:19:44 serving_chat.py:306] No chat template provided. Chat API will not work.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
--- Starting Serverless Worker |  Version 1.6.2 ---
INFO   | Starting API server.
DEBUG  | Not deployed on RunPod serverless, pings will not be sent.
INFO:     Started server process [252]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
DEBUG  | test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a | Using Async Generator
DEBUG  | test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a | Async Generator output: {'error': {'object': 'error', 'message': '', 'type': 'BadRequestError', 'param': None, 'code': 400}}
INFO   | test-1b8405d8-3e00-438e-b3cd-4bae73fc5e7a | Finished running generator.
alpayariyak commented 3 months ago

For non-OpenAI-compatible usage, the input must include either messages or prompt: https://github.com/runpod-workers/worker-vllm/tree/0.3.2#request-input-parameters
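
For example, swapping the "message" key in the original curl call for "prompt" should get past the BadRequestError (a minimal sketch against the local server from the report above; the full set of accepted input fields and sampling parameters is listed in the linked README):

curl -H 'content-type: application/json' -d '{"input": {"prompt": "blah de blah"}}' http://localhost:8000/runsync

Note that the messages form requires a chat template, and the startup log above warns "No chat template provided. Chat API will not work.", so prompt is the appropriate choice for this model.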