stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Unexpected return value from dspy.Predict() #977

Closed nattaylor closed 5 months ago

nattaylor commented 5 months ago

In the following program, calling dspy.Predict() at first produces the expected "yes" or "no", but then it goes haywire and starts putting some of the prompt into the returned dspy.Prediction(). For example, below is what dspy.Predict() returns where I expect "yes" or "no":

Prediction(
    assessment_answer='Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"\nAssessment Question: Does the assessed tweet make for a self-contained, engaging tweet?\nAssessment Answer: Yes'
)

In those instances, passing the prompt directly to the LM produces "yes" or "no". Any tips on how to troubleshoot this?

I have tried changing the hint or the question slightly, and still, across the 6 calls to the LM, there is a mucked-up dspy.Prediction() in the mix, although it is not always the last one.

Below, in order, I've put:

  1. The program
  2. The output dspy.Prediction()s
  3. The result of inspect_history()
  4. The result of dspy.OpenAI()() of the prompt

Thank you

import dspy
import dspy.evaluate
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

topics = [dspy.Example(topic=t).with_inputs('topic') for t in ['artificial intelligence', 'language models', 'structured generation']]

class Tweet(dspy.Module):
    def __init__(self):
        super().__init__()
        self.tweet = dspy.ChainOfThought('topic->tweet')

    def forward(self, topic):
        return self.tweet(topic=topic)

class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    assessed_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_answer = dspy.OutputField(desc="Yes or No")

def tweet_score(example, pred, trace=None):
    topic, tweet = example.topic, pred.tweet

    engaging = "Does the assessed tweet make for a self-contained, engaging tweet?"
    relevant = f"Is the tweet relevant to {topic}?"

    relevant = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=relevant)
    engaging = dspy.Predict(Assess)(assessed_text=tweet, assessment_question=engaging)
    print(relevant)
    print(engaging)
    relevant, engaging = [m.assessment_answer.lower() == 'yes' for m in [relevant, engaging]]

    print()
    score = relevant + engaging

    if trace is not None: return score >= 2
    return score / 2.0

dspy.evaluate.Evaluate(devset=topics, display_table=10)(Tweet(), metric=tweet_score)
Prediction(
    assessment_answer='Yes'
)
Prediction(
    assessment_answer='Yes'
)

Prediction(
    assessment_answer='Yes'
)
Prediction(
    assessment_answer='Yes'
)

Prediction(
    assessment_answer='Yes'
)
Prediction(
    assessment_answer='Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"\nAssessment Question: Does the assessed tweet make for a self-contained, engaging tweet?\nAssessment Answer: Yes'
)
lm.inspect_history(n=1)
Assess the quality of a tweet along the specified dimension.

---

Follow the following format.

Assessed Text: ${assessed_text}
Assessment Question: ${assessment_question}
Assessment Answer: Yes or No

---

Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
dspy.OpenAI()("""Assess the quality of a tweet along the specified dimension.

---

Follow the following format.

Assessed Text: ${assessed_text}
Assessment Question: ${assessment_question}
Assessment Answer: Yes or No

---

Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer:""")

[' Yes']
nattaylor commented 5 months ago

If I use 2 threads, things go off the rails more severely:

dspy.evaluate.Evaluate(devset=topics, num_threads=2, display_table=10)(Tweet(), metric=tweet_score)

Prediction(
    assessment_answer='Assessed Text: "Artificial intelligence is revolutionizing industries by enhancing efficiency and decision-making. However, it\'s crucial to prioritize ethical considerations in AI development. #AI #ethics"\nAssessment Question: Is the tweet relevant to artificial intelligence?\nAssessment Answer: Yes'
)
Prediction(
    assessment_answer='Yes'
)

Prediction(
    assessment_answer='Assessed Text: Language models have transformed the way we interact with technology, enabling machines to understand and generate human language like never before. From chatbots to translation services, these models are paving the way for a more intelligent and connected world. #LanguageModels #NLP #AI\nAssessment Question: Is the tweet relevant to language models?\nAssessment Answer: Yes'
)
Prediction(
    assessment_answer='Assessed Text: Language models have transformed the way we interact with technology, enabling machines to understand and generate human language like never before. From chatbots to translation services, these models are paving the way for a more intelligent and connected world. #LanguageModels #NLP #AI\nAssessment Question: Does the assessed tweet make for a self-contained, engaging tweet?\nAssessment Answer: Yes'
)

Prediction(
    assessment_answer='Yes'
)
Prediction(
    assessment_answer='Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"\nAssessment Question: Does the assessed tweet make for a self-contained, engaging tweet?\nAssessment Answer: Yes'
)
mikeedjones commented 5 months ago

As you can see from the history, the problem is that in one case the LM reproduces the entire "Follow the following format." template, while in the others it produces only the final element that is not already included in the context. Output from LMs is stochastic if temperature is above 0.
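One way to make the behaviour reproducible while debugging is to pin the sampling temperature - a minimal sketch, assuming the keyword is simply forwarded to the OpenAI completion call:

import dspy

# Sketch: temperature=0 makes repeated calls (near-)deterministic.
# This makes the failure reproducible; it does not fix the formatting issue.
lm = dspy.OpenAI(model="gpt-3.5-turbo", temperature=0)
dspy.settings.configure(lm=lm)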

I think this is a larger problem with "chat-tuned" models and the assumptions of the dspy meta-prompt. Maybe it is something that should be reconciled in the deserialization of the completion in Template.extract, which could check across fields and return the minimal answer?
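Roughly the kind of defensive extraction I mean - a sketch only, with a made-up helper name, keeping just the text after the last occurrence of the output field's label:

import re

def extract_last_field(completion: str, label: str = "Assessment Answer:") -> str:
    # Hypothetical helper: if the LM echoed the template back, keep only
    # the text following the final occurrence of the output field's label.
    matches = re.findall(rf"{re.escape(label)}\s*(.*)", completion)
    return matches[-1].strip() if matches else completion.strip()

# extract_last_field('... Assessment Answer: Yes') -> 'Yes'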

nattaylor commented 5 months ago

Thanks @mikeedjones - I needed to prove to myself that it wasn't some funky extraction error, but you are right, it's the LM 😄

In fact, when I pass this directly to OAI it rarely responds with yes or no. Argh!

In my case, I think I can and should resolve it with a dspy.Assert() within my metric function (see the sketch after the outputs below).

Thanks for sanity checking me.

from openai import OpenAI
client = OpenAI()

for i in range(10):
    print(client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "user", "content": """Assess the quality of a tweet along the specified dimension.

    ---

    Follow the following format.

    Assessed Text: ${assessed_text}
    Assessment Question: ${assessment_question}
    Assessment Answer: Yes or No

    ---

    Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
    Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
    Assessment Answer:"""}
      ]
    ).choices[0].message.content)

Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
No
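For reference, the dspy.Assert() guard mentioned above would look roughly like this inside tweet_score - a sketch, untested, assuming the assertion machinery is active in the metric's context:

# Fail loudly if the LM echoed the template instead of answering Yes/No.
answer = dspy.Predict(Assess)(
    assessed_text=tweet, assessment_question=engaging
).assessment_answer

dspy.Assert(
    answer.strip().lower() in ("yes", "no"),
    "assessment_answer must be exactly 'Yes' or 'No'",
)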
jonasdebeukelaer commented 3 months ago

Also had a bit of a nightmare with this behaviour when more than one output field is expected. The best solution for now was to give an example in the signature docstring, e.g.

class SummariserSig(dspy.Signature):
    """
    Your task is to answer the question, using only the information available in the extracts provided.
    You may not use any other sources or your own intuition.

    Follow the format by only completing the fields which are not already filled in.

    ---

    An example input:

        Question: "What is the capital of France?"

        Extracts: [1] «"Grenouille is the capital of France. from 'data/wiki'"»

        Answer: 

    A correct output:

        Grenouille is the capital of France.

        Source documents used: - data/wiki [1]

    """

    question: str = dspy.InputField()
    extracts: str = dspy.InputField()

    answer: str = dspy.OutputField()
    source_documents_used: str = dspy.OutputField()

naturally, using the optimisation tools in dspy would be better, but this works as a simple single-shot prompt :)
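For completeness, the optimiser route would look roughly like this - a sketch where trainset and my_metric are placeholders for your own data and scoring function:

import dspy
from dspy.teleprompt import BootstrapFewShot

# Sketch: let DSPy bootstrap few-shot demos instead of hand-writing an
# example into the signature docstring. trainset / my_metric are stand-ins.
optimizer = BootstrapFewShot(metric=my_metric, max_bootstrapped_demos=2)
summariser = optimizer.compile(dspy.Predict(SummariserSig), trainset=trainset)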

mikeedjones commented 3 months ago

Looking at the content of the predictions in the OP, and after investigating https://github.com/stanfordnlp/dspy/issues/1232, I think this issue might have been related to the extend-generation logic - were you entering that part of generate before adding the example, @jonasdebeukelaer?

Feels like what you had to include was very much prompting not programming!

jonasdebeukelaer commented 3 months ago

@mikeedjones sorry I'm not sure I understand "were you entering that part of generate before adding the example"?

I did also notice that a low max_tokens was part of the issue somehow; it seems to reliably break the response format when made very low.
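If that's the case, giving the LM more headroom might be the simplest workaround - a sketch, assuming max_tokens is forwarded to the underlying completion call:

import dspy

# Sketch: raise max_tokens so the completion isn't truncated mid-template,
# which is when the format seemed to break.
lm = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=1000)
dspy.settings.configure(lm=lm)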

And yes sorry I feel like I'm polluting this project by even suggesting prompt engineering 🙈

mikeedjones commented 3 months ago

Ah sorry, badly phrased - basically "were you hitting the token limit for a single call to the LM" - sounds like you sometimes were?

jonasdebeukelaer commented 3 months ago

@mikeedjones Ahhh then yes yes I think this was either wholly or partially the problem. It is indeed a bit of a bug, but looks like it's already being looked at then 👍

(And you'll be happy to know I got rid of the prompt engineering and setup the optimisation instead 😉)