stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Failed to parse JSON response from LLM served using vLLM #1242

Closed arpaiva closed 2 months ago

arpaiva commented 3 months ago

I want to try DSPy with a local LLM served using vLLM. I followed the instructions from https://dspy-docs.vercel.app/docs/deep-dive/language_model_clients/local_models/HFClientVLLM. The model was downloaded beforehand, stored in a local folder, and served with:

python -m vllm.entrypoints.api_server \
    --model /scratch/meta-llama/Meta-Llama-3-8B-Instruct \
    --port 12058 \
    --tensor-parallel-size=1 \
    --dtype=float16

but running

import dspy
model = dspy.HFClientVLLM(model="/scratch/meta-llama/Meta-Llama-3-8B-Instruct", port=12058)
model._generate(prompt='What is the capital of Paris?')

yields

Failed to parse JSON response: {"detail":"Not Found"}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/scratch/ap/repos/dspy/dsp/modules/hf_client.py in _generate(self, prompt, **kwargs)
    231                 json_response = response.json()
--> 232                 completions = json_response["choices"]
    233                 response = {

KeyError: 'choices'

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-44-72e96199a4c4> in <cell line: 1>()
----> 1 model._generate(prompt='What is the capital of Paris?')

/scratch/ap/repos/dspy/dsp/modules/hf_client.py in _generate(self, prompt, **kwargs)
    239             except Exception:
    240                 print("Failed to parse JSON response:", response.text)
--> 241                 raise Exception("Received invalid JSON response from server")
    242
    243 @CacheMemory.cache(ignore=['arg'])

Exception: Received invalid JSON response from server

I also tried calling the model directly (i.e., without using ._generate) and instantiating dspy.HFClientVLLM with model_type='chat'. All resulted in the same outcome.
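
For reference, the variants I tried looked roughly like this (same local path and port as above):

import dspy

# Variant 1: pass model_type='chat' explicitly at construction time.
model = dspy.HFClientVLLM(
    model="/scratch/meta-llama/Meta-Llama-3-8B-Instruct",
    port=12058,
    model_type="chat",
)

# Variant 2: call the client directly instead of going through ._generate().
model("What is the capital of Paris?")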

On the server side I got:

INFO:     Started server process [223688]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12058 (Press CTRL+C to quit)
INFO:     127.0.0.1:49614 - "POST /v1/completions HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:49616 - "POST /v1/completions HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:49620 - "POST /v1/chat/completions HTTP/1.1" 404 Not Found
INFO:     192.168.246.11:50358 - "GET /v1/models/ HTTP/1.1" 404 Not Found
INFO:     192.168.246.11:50360 - "GET /v1/ HTTP/1.1" 404 Not Found

Lastly, I also tried using the OpenAI API entrypoint:

python -m vllm.entrypoints.openai.api_server \
    --model /scratch/meta-llama/Meta-Llama-3-8B-Instruct \
    --served-model-name meta-llama/Meta-Llama-3-8B-Instruct \
    --port 12058 \
    --tensor-parallel-size=1 \
    --dtype=float16

but that also triggered the same error.

This is running from a clone of the repo at the latest commit on the main branch, 55510ee.

isaacbmiller commented 3 months ago

I had to deal with this error a little while ago. It had to do with detecting "chat" vs "instruct" in the model name. I will follow up with a fix.
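
In the meantime, a quick sanity check (not a fix, just a way to confirm where the 404 comes from) is to hit the OpenAI-compatible route directly, bypassing DSPy; the model field has to match whatever name the server reports under /v1/models:

import requests

# Query the vLLM OpenAI-compatible server directly to confirm /v1/completions
# exists on the port you launched (the logs above show it 404s when the model
# is served with the plain api_server entrypoint).
resp = requests.post(
    "http://localhost:12058/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # as set by --served-model-name
        "prompt": "What is the capital of France?",
        "max_tokens": 16,
    },
)
print(resp.status_code, resp.json())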

isaacbmiller commented 3 months ago

This should also be fixed by the backend refactor, if you try that branch.

JPonsa commented 2 months ago

+1

Hi. I am not sure what the root cause is. Sometimes I get the error, and sometimes the same script runs correctly on different datasets. I have also tried two instruct models, "meta-llama/Meta-Llama-3-8B-Instruct" and "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", and had the issue with Llama 3 but not Mixtral.

JPonsa commented 2 months ago

@isaacbmiller is there any workaround? Thanks. @arpaiva did the backend-refactor branch fix the issue?

JPonsa commented 2 months ago

@isaacbmiller I tried the "backend-refactor" branch and the issue persists.

JPonsa commented 2 months ago

Traceback (most recent call last):
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 231, in _generate
    completions = json_response["choices"]


KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/./src/rag/ReAct.py", line 712, in <module>
    main(args)
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/./src/rag/ReAct.py", line 633, in main
    result = react_module(question=question)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dspy/predict/react.py", line 116, in forward
    output = self.react[hop](**args)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dspy/predict/predict.py", line 69, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dspy/predict/predict.py", line 132, in forward
    x, C = dsp.generate(template, **config)(x, stage=self.stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dsp/primitives/predict.py", line 120, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dsp/modules/hf.py", line 190, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dsp/modules/lm.py", line 26, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 240, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

JPonsa commented 2 months ago

@isaacbmiller is this issue related to https://github.com/stanfordnlp/dspy/issues/1002 ?

Update: I tried implementing PR https://github.com/stanfordnlp/dspy/pull/1012 but the issue persists.

arpaiva commented 2 months ago

@isaacbmiller - Thank you so much for the comments. @JPonsa - In my recent tests, this seems to have been fixed by commit 110a282c from a few days ago, as long as I set vLLM to use the OpenAI entrypoint.
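
With vLLM launched via the OpenAI entrypoint (as in the openai.api_server command from my first comment, so the model is exposed under the name given by --served-model-name), the client side of my working setup looks roughly like this; treat it as a sketch rather than the canonical configuration:

import dspy

# The model name should match --served-model-name on the vLLM side,
# not the local filesystem path.
lm = dspy.HFClientVLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    port=12058,
)
dspy.settings.configure(lm=lm)
print(lm("What is the capital of France?"))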

JPonsa commented 2 months ago

@arpaiva I tried dspy 2.4.12 (updated using poetry) and the OpenAI entrypoint, but I got another error: https://github.com/stanfordnlp/dspy/issues/1276

Given that you had to replace HFClientVLLM with the OpenAI endpoint, is this fully resolved or a workaround? @isaacbmiller should we reopen this issue or should I open a new one?

isaacbmiller commented 2 months ago

I think you should still be able to use HFClientVLLM, so I will reopen the issue and investigate tomorrow. Sorry for the delay in looking into this.

JPonsa commented 2 months ago

Sorry, I think it is partly a user error. I am running this on a server, and some messages get written to different files, which makes them harder to track.

It seems there could be some sort of parsing error in the ReAct module: it fails to leave the loop and ends up filling the context window.

Function Response: The study population in clinical trial NCT00001109 is adult patients with HIV infection. # The answer is ready, so I should use the Finish action. # Action: Finish[The study population in clinical trial NCT00001109 is adult patients with HIV infection.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. 
# Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do. # Action: Finish[The task is now complete.] # The task is now complete. # Thought: There is no more work to do
Failed to parse JSON response: {"object":"error","message":"This model's maximum context length is 8192 tokens. However, you requested 9527 tokens (8527 in the messages, 1000 in the completion). Please reduce the length of the messages or completion.","type":"BadRequestError","param":null,"code":400}
arpaiva commented 2 months ago

@JPonsa - I think you misunderstood me. I did not have to change HFClientVLLM or anything in DSPy. vLLM can serve the LLM in several ways, which they call entrypoints. I was simply noting that vLLM should be set to serve the LLM with an OpenAI-style API, which you get via the vLLM OpenAI entrypoint.

JPonsa commented 2 months ago

Got it! I was already using the OpenAI entrypoint:

MODEL=meta-llama/Meta-Llama-3-8B-Instruct
MODEL_NAME=llama3_8b
PORT=8045

pip install poetry
poetry run python -m vllm.entrypoints.openai.api_server --model $MODEL --trust-remote-code --port $PORT --dtype half --enforce-eager \
--gpu-memory-utilization 0.80 &

JPonsa commented 2 months ago

@isaacbmiller, please feel free to close this issue. Mine was very likely a user error, and arpaiva's is solved. Sorry for the inconvenience.

brando90 commented 2 weeks ago

There are a lot of open issues about this, but no crisp answer on what solves it.