stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

dspy.HFClientVLLM broken since 2.4.9 #1025

Closed Wolfsauge closed 5 months ago

Wolfsauge commented 5 months ago

The dspy.HFClientVLLM client has been broken since dspy version 2.4.9.

I have reproduced the error with the script below. dspy versions 2.4.5, 2.4.6, and 2.4.7 work as expected, 2.4.8 doesn't seem to exist, and 2.4.9 is broken.

I've been doing these tests with Python 3.11.9.

The failing run looks like this:

$ /home/ns/.pyenv/versions/3.11.9/bin/python /home/ns/GitHub/Playground/dspy_examples/repro.py
/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
  table = cls._concat_blocks(blocks, axis=0)

### Generate Response ###

Failed to parse JSON response: {"object":"error","message":"[{'type': 'extra_forbidden', 'loc': ('body', 'port'), 'msg': 'Extra inputs are not permitted', 'input': 8000}, {'type': 'extra_forbidden', 'loc': ('body', 'url'), 'msg': 'Extra inputs are not permitted', 'input': ['http://mnemosyne.local:8000']}]","type":"BadRequestError","param":null,"code":400}
Traceback (most recent call last):
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 199, in _generate
    completions = json_response["choices"]
                  ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ns/GitHub/Playground/dspy_examples/repro.py", line 40, in <module>
    pred = generate_answer(question=example.question)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dspy/predict/predict.py", line 61, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dspy/predict/predict.py", line 103, in forward
    x, C = dsp.generate(template, **config)(x, stage=self.stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/primitives/predict.py", line 77, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf.py", line 190, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/lm.py", line 26, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 208, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

When it works, it looks like this:

$ /home/ns/.pyenv/versions/3.11.9/bin/python /home/ns/GitHub/Playground/dspy_examples/repro.py
/home/ns/.pyenv/versions/3.11.9/lib/python3.11/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
  table = cls._concat_blocks(blocks, axis=0)

### Generate Response ###

Question: What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?
Predicted Answer: American

Question: What is the name of the popular cooking competition show hosted by Gordon Ramsay?
Answer: Hell's Kitchen

Question: What is the name of the popular cooking competition show hosted by Bobby Flay?
Answer: Beat Bobby Flay

Question: What is the name of the popular cooking competition show hosted by Giada De Laurentiis?
Answer: Diners, Drive-Ins and Dives

Question: What is the name of the popular cooking competition show hosted by Alton Brown?
Answer: Good Eats

Question: What is the name of the popular cooking competition show hosted by Padma Lakshmi?
Answer: Top Chef

Question: What is the name of the popular cooking competition show hosted by Guy Fieri

Example script to reproduce the error:

#!/usr/bin/env python3
"""Example dspy script"""

import dspy
from dspy.datasets import HotPotQA

local_llm = dspy.HFClientVLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    port=8000,
    url="http://mnemosyne.local",
)
# local_llm = dspy.OpenAI(
#     api_base="http://mnemosyne.local:8000/v1/",
#     api_key="sk-111111",
#     model="meta-llama/Meta-Llama-3-8B-Instruct",
# )
colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"
)
dspy.settings.configure(lm=local_llm, rm=colbertv2_wiki17_abstracts)
dataset = HotPotQA(
    train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0
)

trainset = [x.with_inputs("question") for x in dataset.train]
devset = [x.with_inputs("question") for x in dataset.dev]

example = devset[18]

class BasicQA(dspy.Signature):  # A. Signature
    """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

print("\n### Generate Response ###\n")
generate_answer = dspy.Predict(BasicQA)
pred = generate_answer(question=example.question)
print(f"Question: {example.question}\nPredicted Answer: {pred.answer}")

Please note the downside of using the dspy.OpenAI client (the commented-out section in the script above) as a workaround in 2.4.9: only one request goes to vLLM at any time. It's possible to use the openai-python module with a self-created httpx client, which can then have arbitrary settings for the connection pool size ... HTH
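
For reference, here is a minimal sketch of that httpx-based approach, assuming the openai-python 1.x API and the same vLLM endpoint as in the script above; the pool sizes and test prompt are illustrative, not prescriptive:

```python
# Sketch: openai-python 1.x with a custom httpx client, so requests towards
# the vLLM server are not limited by the default connection pool.
import httpx
from openai import OpenAI

# Allow more parallel connections to the vLLM server (values are illustrative).
http_client = httpx.Client(
    limits=httpx.Limits(max_connections=32, max_keepalive_connections=16),
    timeout=httpx.Timeout(120.0),
)

client = OpenAI(
    base_url="http://mnemosyne.local:8000/v1/",
    api_key="sk-111111",  # vLLM ignores the key, but the client requires one
    http_client=http_client,
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(completion.choices[0].message.content)
```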

Wolfsauge commented 5 months ago

Reproduced with the latest version of the vLLM backend in podman (0.4.2) and the latest model (meta-llama/Meta-Llama-3-8B-Instruct, commit c4a54320a52ed5f88b7a2f84496903ea4ff07b45, 2024-05-13) downloaded from HF with the latest config updates.

$ podman run \
    --replace \
    --device nvidia.com/gpu=all \
    --name=vllm \
    -e TRANSFORMERS_OFFLINE=1 -e HOST_IP=192.168.0.185 \
    --ipc=host \
    -dit \
    -p 8000:8000 \
    -v /models:/workspace/models \
    docker.io/vllm/vllm-openai:v0.4.2 \
    --model /workspace/models/meta-llama/Meta-Llama-3-8B-Instruct \
    --served-model-name meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.95

Example curl on the same endpoint:

$ curl -s http://mnemosyne.local:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Compose a poem that explains the concept of recursion in programming."
      }
    ]
  }' | jq .
{
  "id": "cmpl-7dcf3aa6399a48a2ba42da95dfc748dc",
  "object": "chat.completion",
  "created": 1715720212,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "What a fascinating topic! Here's a poem that attempts to explain recursion in programming:\n\nIn code, a function calls itself, it's true,\nA recursive dance, with loops anew.\nA method calls its own name, a clever trick,\nTo solve a problem, with a clever kick.\n\nIt's a function within a function, a nest,\nA recursive call, to pass the test.\nThe function calls itself, with parameters in tow,\nTo solve a problem, that's too complex to know.\n\nThe function calls itself, again and again,\nUntil the base case, is reached and sustained.\nThe recursion unwinds, like a coiled spring,\nUntil the solution, is finally brought to sing.\n\nThe function returns, with its value in hand,\nThe recursive call, is unwound, like a thread in a strand.\nThe stack is cleared, the memory is freed,\nThe recursive function, has solved the problem, indeed.\n\nRecursion's magic, is a wondrous thing,\nA way to solve problems, that would otherwise sting.\nIt's a clever tool, in a programmer's kit,\nTo tackle complexity, with a clever hit.\n\nSo here's to recursion, a programming delight,\nA way to solve problems, with a recursive might.\nIt's a concept that's tricky, but oh so grand,\nA way to solve problems, in a programming land.\n\nNote: I've used a few technical terms in the poem, such as \"function\", \"parameters\", \"base case\", and \"stack\", to make it more accurate and accessible to programmers. I hope it helps to explain the concept of recursion in a poetic way!"
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 128009
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 352,
    "completion_tokens": 329
  }
}

Currently known workaround:

$ pip uninstall dspy-ai -y
Found existing installation: dspy-ai 2.4.9
Uninstalling dspy-ai-2.4.9:
  Successfully uninstalled dspy-ai-2.4.9
$ pip install dspy-ai==2.4.7
Collecting dspy-ai==2.4.7
  Using cached dspy_ai-2.4.7-py3-none-any.whl.metadata (36 kB)
[...]
WARNING: The candidate selected for download or install is a yanked version: 'dspy-ai' candidate (version 2.4.7 at https://files.pythonhosted.org/packages/b3/78/e383bf31195aa9ced4748b1a012d9be928dfda782e7abae02858898afa6f/dspy_ai-2.4.7-py3-none-any.whl (from https://pypi.org/simple/dspy-ai/) (requires-python:>=3.9))
Reason for being yanked: <none given>
Using cached dspy_ai-2.4.7-py3-none-any.whl (200 kB)
Installing collected packages: dspy-ai
Successfully installed dspy-ai-2.4.7
Wolfsauge commented 5 months ago

$ pip list|grep openai
openai             1.30.1
Wolfsauge commented 5 months ago

When dumping the traffic while reproducing the problem, the relevant parts of the request packets look like the following.

Dump command:

$ sudo tcpdump -nq -X -i enp4s0 port 8000

Bad traffic:

23:17:23.671993 IP 192.168.0.158.51483 > 192.168.0.185.8000: tcp 479
        0x0000:  4500 0213 0000 4000 4006 b63d c0a8 009e  E.....@.@..=....
        0x0010:  c0a8 00b9 c91b 1f40 c913 aef2 86c5 7202  .......@......r.
        0x0020:  8018 080a 4fe3 0000 0101 080a c48d 9569  ....O..........i
        0x0030:  3571 d7a5 7b22 6d6f 6465 6c22 3a20 226d  5q..{"model":."m
        0x0040:  6574 612d 6c6c 616d 612f 4d65 7461 2d4c  eta-llama/Meta-L
        0x0050:  6c61 6d61 2d33 2d38 422d 496e 7374 7275  lama-3-8B-Instru
        0x0060:  6374 222c 2022 7072 6f6d 7074 223a 2022  ct",."prompt":."
        0x0070:  416e 7377 6572 2071 7565 7374 696f 6e73  Answer.questions
        0x0080:  2077 6974 6820 7368 6f72 7420 6661 6374  .with.short.fact
        0x0090:  6f69 6420 616e 7377 6572 732e 5c6e 5c6e  oid.answers.\n\n
        0x00a0:  2d2d 2d5c 6e5c 6e46 6f6c 6c6f 7720 7468  ---\n\nFollow.th
        0x00b0:  6520 666f 6c6c 6f77 696e 6720 666f 726d  e.following.form
        0x00c0:  6174 2e5c 6e5c 6e51 7565 7374 696f 6e3a  at.\n\nQuestion:
        0x00d0:  2024 7b71 7565 7374 696f 6e7d 5c6e 416e  .${question}\nAn
        0x00e0:  7377 6572 3a20 6f66 7465 6e20 6265 7477  swer:.often.betw
        0x00f0:  6565 6e20 3120 616e 6420 3520 776f 7264  een.1.and.5.word
        0x0100:  735c 6e5c 6e2d 2d2d 5c6e 5c6e 5175 6573  s\n\n---\n\nQues
        0x0110:  7469 6f6e 3a20 5768 6174 2069 7320 7468  tion:.What.is.th
        0x0120:  6520 6e61 7469 6f6e 616c 6974 7920 6f66  e.nationality.of
        0x0130:  2074 6865 2063 6865 6620 616e 6420 7265  .the.chef.and.re
        0x0140:  7374 6175 7261 7465 7572 2066 6561 7475  staurateur.featu
        0x0150:  7265 6420 696e 2052 6573 7461 7572 616e  red.in.Restauran
        0x0160:  743a 2049 6d70 6f73 7369 626c 653f 5c6e  t:.Impossible?\n
        0x0170:  416e 7377 6572 3a22 2c20 2274 656d 7065  Answer:",."tempe
        0x0180:  7261 7475 7265 223a 2030 2e30 2c20 226d  rature":.0.0,."m
        0x0190:  6178 5f74 6f6b 656e 7322 3a20 3135 302c  ax_tokens":.150,
        0x01a0:  2022 746f 705f 7022 3a20 312c 2022 6672  ."top_p":.1,."fr
        0x01b0:  6571 7565 6e63 795f 7065 6e61 6c74 7922  equency_penalty"
        0x01c0:  3a20 302c 2022 7072 6573 656e 6365 5f70  :.0,."presence_p
        0x01d0:  656e 616c 7479 223a 2030 2c20 226e 223a  enalty":.0,."n":
        0x01e0:  2031 2c20 2270 6f72 7422 3a20 3830 3030  .1,."port":.8000
        0x01f0:  2c20 2275 726c 223a 205b 2268 7474 703a  ,."url":.["http:
        0x0200:  2f2f 6d6e 656d 6f73 796e 653a 3830 3030  //mnemosyne:8000
        0x0210:  225d 7d                                  "]}

Good traffic:

23:23:43.681564 IP 192.168.0.158.53008 > 192.168.0.185.8000: tcp 364
        0x0000:  4500 01a0 0000 4000 4006 b6b0 c0a8 009e  E.....@.@.......
        0x0010:  c0a8 00b9 cf10 1f40 476b aab6 c420 032a  .......@Gk.....*
        0x0020:  8018 080a 9519 0000 0101 080a 8bda dacb  ................
        0x0030:  3577 a40e 7b22 6d6f 6465 6c22 3a20 226d  5w..{"model":."m
        0x0040:  6574 612d 6c6c 616d 612f 4d65 7461 2d4c  eta-llama/Meta-L
        0x0050:  6c61 6d61 2d33 2d38 422d 496e 7374 7275  lama-3-8B-Instru
        0x0060:  6374 222c 2022 7072 6f6d 7074 223a 2022  ct",."prompt":."
        0x0070:  416e 7377 6572 2071 7565 7374 696f 6e73  Answer.questions
        0x0080:  2077 6974 6820 7368 6f72 7420 6661 6374  .with.short.fact
        0x0090:  6f69 6420 616e 7377 6572 732e 5c6e 5c6e  oid.answers.\n\n
        0x00a0:  2d2d 2d5c 6e5c 6e46 6f6c 6c6f 7720 7468  ---\n\nFollow.th
        0x00b0:  6520 666f 6c6c 6f77 696e 6720 666f 726d  e.following.form
        0x00c0:  6174 2e5c 6e5c 6e51 7565 7374 696f 6e3a  at.\n\nQuestion:
        0x00d0:  2024 7b71 7565 7374 696f 6e7d 5c6e 416e  .${question}\nAn
        0x00e0:  7377 6572 3a20 6f66 7465 6e20 6265 7477  swer:.often.betw
        0x00f0:  6565 6e20 3120 616e 6420 3520 776f 7264  een.1.and.5.word
        0x0100:  735c 6e5c 6e2d 2d2d 5c6e 5c6e 5175 6573  s\n\n---\n\nQues
        0x0110:  7469 6f6e 3a20 5768 6174 2069 7320 7468  tion:.What.is.th
        0x0120:  6520 6e61 7469 6f6e 616c 6974 7920 6f66  e.nationality.of
        0x0130:  2074 6865 2063 6865 6620 616e 6420 7265  .the.chef.and.re
        0x0140:  7374 6175 7261 7465 7572 2066 6561 7475  staurateur.featu
        0x0150:  7265 6420 696e 2052 6573 7461 7572 616e  red.in.Restauran
        0x0160:  743a 2049 6d70 6f73 7369 626c 653f 5c6e  t:.Impossible?\n
        0x0170:  416e 7377 6572 3a22 2c20 226d 6178 5f74  Answer:",."max_t
        0x0180:  6f6b 656e 7322 3a20 3135 302c 2022 7465  okens":.150,."te
        0x0190:  6d70 6572 6174 7572 6522 3a20 302e 307d  mperature":.0.0}
Wolfsauge commented 5 months ago

I am now using the workaround to hf_client.py shown below.

I don't fully understand the concept behind what is happening here, but vLLM definitely chokes on unexpected parameters in the payload, such as port and url. By "unexpected" I mean anything not defined in the SamplingParams class in vllm/sampling_params.py.

I'm not sure adding top_p with a value of 0.1 is the way to go. There doesn't seem to be any logic that removes top_p when, for example, min_p is defined. However, I haven't touched it for this workaround.
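
As a rough sketch of that idea (not the actual patch, which is the diff below): the allowed keys could in principle be derived from vLLM's own SamplingParams instead of being hardcoded, assuming inspect.signature works on SamplingParams.__init__ in the installed vLLM version (it does for 0.4.2; later releases may differ):

```python
# Sketch: derive the allowed request keys from vLLM's SamplingParams so that
# client-side extras such as "port" and "url" never reach the server.
import inspect
from vllm import SamplingParams

# Argument names accepted by SamplingParams.__init__ (minus "self"); roughly
# the set of sampling parameters the vLLM server is willing to accept.
ALLOWED_KEYS = set(inspect.signature(SamplingParams.__init__).parameters) - {"self"}

def filter_sampling_kwargs(kwargs: dict) -> dict:
    """Drop any kwargs that are not vLLM sampling parameters."""
    return {k: v for k, v in kwargs.items() if k in ALLOWED_KEYS}

print(filter_sampling_kwargs({
    "temperature": 0.0,
    "max_tokens": 150,
    "port": 8000,                               # dropped
    "url": ["http://mnemosyne.local:8000"],     # dropped
}))
# -> {'temperature': 0.0, 'max_tokens': 150}
```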

$ diff -U 2 dspy/dsp/modules/hf_client.py ../../.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf_client.py
--- dspy/dsp/modules/hf_client.py       2024-05-15 00:10:49.383824119 +0200
+++ ../../.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf_client.py  2024-05-15 00:12:37.948489659 +0200
@@ -149,5 +149,8 @@
         url = self.urls.pop(0)
         self.urls.append(url)
-        
+
+        list_of_elements_to_exclude = [ "port", "url" ]
+        req_kwargs = {k: v for k, v in kwargs.items() if k not in list_of_elements_to_exclude}
+
         if self.model_type == "chat":
             system_prompt = kwargs.get("system_prompt",None)
@@ -158,5 +161,5 @@
                 "model": self.kwargs["model"],
                 "messages": messages,
-                **kwargs,
+                **req_kwargs,
             }
             response = send_hfvllm_request_v01_wrapped(
@@ -184,5 +187,5 @@
                 "model": self.kwargs["model"],
                 "prompt": prompt,
-                **kwargs,
+                **req_kwargs,
             }

If you want me to, I can create a branch and push it to your repo so you can easily merge a fix.

Wolfsauge commented 5 months ago

Updated.

$ diff -U 2 dspy/dsp/modules/hf_client.py ../../.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf_client.py
--- dspy/dsp/modules/hf_client.py       2024-05-15 00:10:49.383824119 +0200
+++ ../../.pyenv/versions/3.11.9/lib/python3.11/site-packages/dsp/modules/hf_client.py  2024-05-15 00:30:33.058225538 +0200
@@ -149,5 +149,37 @@
         url = self.urls.pop(0)
         self.urls.append(url)
-        
+
+        list_of_elements_to_allow = [
+            "n",
+            "best_of",
+            "presence_penalty",
+            "frequency_penalty",
+            "repetition_penalty",
+            "temperature",
+            "top_p",
+            "top_k",
+            "min_p",
+            "seed",
+            "use_beam_search",
+            "length_penalty",
+            "early_stopping",
+            "stop",
+            "stop_token_ids",
+            "include_stop_str_in_output",
+            "ignore_eos",
+            "max_tokens",
+            "min_tokens",
+            "logprobs",
+            "prompt_logprobs",
+            "detokenize",
+            "skip_special_tokens",
+            "spaces_between_special_tokens",
+            "logits_processors",
+            "truncate_prompt_tokens",
+        ]
+        req_kwargs = {
+            k: v for k, v in kwargs.items() if k in list_of_elements_to_allow
+        }
+
         if self.model_type == "chat":
             system_prompt = kwargs.get("system_prompt",None)
@@ -158,5 +190,5 @@
                 "model": self.kwargs["model"],
                 "messages": messages,
-                **kwargs,
+                **req_kwargs,
             }
             response = send_hfvllm_request_v01_wrapped(
@@ -184,5 +216,5 @@
                 "model": self.kwargs["model"],
                 "prompt": prompt,
-                **kwargs,
+                **req_kwargs,
             }
Wolfsauge commented 5 months ago

PR #1012 will fix the issue.

Wolfsauge commented 5 months ago

I hadn't found issue 974 and PR #1012 (awaiting approval) before fixing this myself. Too bad!

Wolfsauge commented 5 months ago

Linked a more thorough (but still not great) draft above: https://github.com/stanfordnlp/dspy/pull/1029.

rakataprime commented 5 months ago

The vLLM client is still broken on the main branch right now.


File /opt/conda/lib/python3.11/site-packages/dsp/modules/hf_client.py:232, in HFClientVLLM._generate(self, prompt, **kwargs)
    231 json_response = response.json()
--> 232 completions = json_response["choices"]
    233 response = {
    234     "prompt": prompt,
    235     "choices": [{"text": c["text"]} for c in completions],
    236 }

KeyError: 'choices' 
timchen0618 commented 3 months ago

I am also getting this error. Is this going to be fixed any time soon?

brando90 commented 1 month ago

similar issue:

(uutils) brando9@skampere1~ $ python ~/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py

  0%|                                                                                                                   | 0/3 [00:00<?, ?it/s]Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T00:02:54.083214Z [error    ] Failed to run or to evaluate example Example({'question': 'What is the capital of France?', 'answer': 'Paris'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f4ccde3c0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T00:02:54.087907Z [error    ] Failed to run or to evaluate example Example({'question': "Who wrote '1984'?", 'answer': 'George Orwell'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f4ccde3c0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T00:02:54.091655Z [error    ] Failed to run or to evaluate example Example({'question': 'What is the boiling point of water?', 'answer': '100°C'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f4ccde3c0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 169.86it/s]
Bootstrapped 0 full traces after 3 examples in round 0.
Failed to parse JSON response: {"detail":"Not Found"}
Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 243, in _generate
    completions = json_response["choices"]
                  ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py", line 61, in <module>
    pred = compiled_simple_qa(my_question)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py", line 46, in forward
    prediction = self.generate_answer(question=question)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/chain_of_thought.py", line 36, in forward
    return self._predict(signature=signature, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 91, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 129, in forward
    completions = old_generate(demos, signature, kwargs, config, self.lm, self.stage)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 156, in old_generate
    x, C = dsp.generate(template, **config)(x, stage=stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/primitives/predict.py", line 73, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf.py", line 193, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/lm.py", line 27, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 252, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

Code:

"""
ref: https://chatgpt.com/g/g-cH94JC5NP-dspy-guide-v2024-2-7

python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080
"""
import dspy
from dspy.teleprompt import BootstrapFewShot
from dspy.evaluate.evaluate import Evaluate

# Step 1: Configure DSPy to use the local LLaMA model running on a vLLM server.
# The server is hosted locally at port 8080.
vllm_llama2 = dspy.HFClientVLLM(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
dspy.settings.configure(lm=vllm_llama2)

# Step 2: Define a small, high-quality hardcoded dataset (3-5 examples).
train_data = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote '1984'?", "answer": "George Orwell"},
    {"question": "What is the boiling point of water?", "answer": "100°C"},
]

# Dev set for evaluating model generalization on unseen examples.
dev_data = [
    {"question": "Who discovered penicillin?", "answer": "Alexander Fleming"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
]

# Convert the dataset into DSPy examples with input/output fields.
trainset = [dspy.Example(question=x["question"], answer=x["answer"]).with_inputs('question') for x in train_data]
devset = [dspy.Example(question=x["question"], answer=x["answer"]).with_inputs('question') for x in dev_data]

# Step 3: Define the Simple QA program using DSPy.
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    question = dspy.InputField()
    answer = dspy.OutputField()

class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought generates answers using the configured local LLaMA LM via vLLM.
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Pass the question through the local LM (LLaMA) to generate an answer.
        prediction = self.generate_answer(question=question)
        return dspy.Prediction(answer=prediction.answer)

# Step 4: Metric to evaluate exact match between predicted and expected answer.
def exact_match_metric(example, pred, trace=None):
    return example['answer'].lower() == pred.answer.lower()

# Step 5: Use the teleprompter (BootstrapFewShot) to optimize few-shot examples.
teleprompter = BootstrapFewShot(metric=exact_match_metric)

# Compile the SimpleQA program with optimized few-shots from the train set.
compiled_simple_qa = teleprompter.compile(SimpleQA(), trainset=trainset)

# Step 6: Test with a sample question and evaluate the performance.
my_question = "What is the capital of Japan?"
pred = compiled_simple_qa(my_question)

# Output the predicted answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")

# Evaluate the compiled program on the dev set using the exact match metric.
evaluate_on_dev = Evaluate(devset=devset, num_threads=1, display_progress=False)
evaluation_score = evaluate_on_dev(compiled_simple_qa, metric=exact_match_metric)

print(f"Evaluation Score on Dev Set: {evaluation_score}")

Launch:

# vllm with dspy

```bash
# this installed flash-attn, but vllm didn't say in its output that it was using it
pip install torch==2.4.0
pip install vllm==0.5.4
pip install flash-attn==2.6.3
pip install vllm-flash-attn==2.6.1

# later try with py 3.10
# python3xxx -m venv ~/.virtualenvs/flash_attn_test_py10
# source ~/.virtualenvs/flash_attn_test_py10/bin/activate
pip install --upgrade pip
pip install torch==2.4.0
pip install vllm==0.5.4
pip install flash-attn==2.6.3
export CUDA_VISIBLE_DEVICES=5
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080

python ~/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py
```