Failed to parse JSON response from LLM served using vLLM

arpaiva commented 3 months ago

I want to try DSPy using a local LLM served using vLLM. I followed the instructions from https://dspy-docs.vercel.app/docs/deep-dive/language_model_clients/local_models/HFClientVLLM The model was downloaded previously and stored on a local folder and served with:

python -m vllm.entrypoints.api_server \
    --model /scratch/meta-llama/Meta-Llama-3-8B-Instruct \
    --port 12058 \
    --tensor-parallel-size=1 \
    --dtype=float16

but running

import dspy
model = dspy.HFClientVLLM(model="/scratch/meta-llama/Meta-Llama-3-8B-Instruct", port=12058)
model._generate(prompt='What is the capital of Paris?')

yields

Failed to parse JSON response: {"detail":"Not Found"}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/scratch/ap/repos/dspy/dsp/modules/hf_client.py in _generate(self, prompt, **kwargs)
    231                 json_response = response.json()
--> 232                 completions = json_response["choices"]
    233                 response = {

KeyError: 'choices'

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-44-72e96199a4c4> in <cell line: 1>()
----> 1 model._generate(prompt='What is the capital of Paris?')

/scratch/ap/repos/dspy/dsp/modules/hf_client.py in _generate(self, prompt, **kwargs)
    239             except Exception:
    240                 print("Failed to parse JSON response:", response.text)
--> 241                 raise Exception("Received invalid JSON response from server")
    242
    243 @CacheMemory.cache(ignore=['arg'])

Exception: Received invalid JSON response from server

I tried also calling model directly (i.e., without using ._generate and instantiating dspy.HFClientVLLM with model_type='chat'. All resulted in the same outcome.

On the server side I got:

INFO:     Started server process [223688]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on[ http://0.0.0.0:12058](http://0.0.0.0:12058/) (Press CTRL+C to quit)
INFO:     [127.0.0.1:49614](http://127.0.0.1:49614/) - "POST /v1/completions HTTP/1.1" 404 Not Found
INFO:     [127.0.0.1:49616](http://127.0.0.1:49616/) - "POST /v1/completions HTTP/1.1" 404 Not Found
INFO:     [127.0.0.1:49620](http://127.0.0.1:49620/) - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
INFO:     [192.168.246.11:50358](http://192.168.246.11:50358/) - "GET /v1/models/ HTTP/1.1" 404 Not Found
INFO:     [192.168.246.11:50360](http://192.168.246.11:50360/) - "GET /v1/ HTTP/1.1" 404 Not Found

Lastly, I also tried using the OpenAI API entrypoint:

python -m vllm.entrypoints.openai.api_server \
    --model /scratch/meta-llama/Meta-Llama-3-8B-Instruct \
    --served-model-name meta-llama/Meta-Llama-3-8B-Instruct \
    --port 12058 \
    --tensor-parallel-size=1 \
    --dtype=float16

but that also triggered the same error.

This is running from a cloning the repo with the latest commit of the main branch 55510ee`.

isaacbmiller commented 3 months ago

I had to deal with this error a little while ago. It had to do with detecting “chat” vs “instruct” in the model name. I will follow up with a fix

isaacbmiller commented 3 months ago

This also should be fixed by the backend refactor if you try that branch