Closed: nattaylor closed this issue 5 months ago.
If I use 2 threads, things go off the rails more severely:

```python
dspy.evaluate.Evaluate(devset=topics, num_threads=2, display_table=10)(Tweet(), metric=tweet_score)
```
```
Prediction(
    assessment_answer='Assessed Text: "Artificial intelligence is revolutionizing industries by enhancing efficiency and decision-making. However, it\'s crucial to prioritize ethical considerations in AI development. #AI #ethics"\nAssessment Question: Is the tweet relevant to artificial intelligence?\nAssessment Answer: Yes'
)
Prediction(
    assessment_answer='Yes'
)
Prediction(
    assessment_answer='Assessed Text: Language models have transformed the way we interact with technology, enabling machines to understand and generate human language like never before. From chatbots to translation services, these models are paving the way for a more intelligent and connected world. #LanguageModels #NLP #AI\nAssessment Question: Is the tweet relevant to language models?\nAssessment Answer: Yes'
)
Prediction(
    assessment_answer='Assessed Text: Language models have transformed the way we interact with technology, enabling machines to understand and generate human language like never before. From chatbots to translation services, these models are paving the way for a more intelligent and connected world. #LanguageModels #NLP #AI\nAssessment Question: Does the assessed tweet make for a self-contained, engaging tweet?\nAssessment Answer: Yes'
)
Prediction(
    assessment_answer='Yes'
)
Prediction(
    assessment_answer='Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"\nAssessment Question: Does the assessed tweet make for a self-contained, engaging tweet?\nAssessment Answer: Yes'
)
```
As you can see from the history, the problem is that in one case the LM reproduces the entire "Follow the following format." template, while in the others it produces only the final field, which is not included in the context. Output from LMs is stochastic when temperature is above 0.
I think this is a larger problem with "chat-tuned" models and the assumptions of the dspy meta-prompt. Maybe it is something that should be reconciled in the deserialization of the completion in `Template.extract`, which could check across fields and return the minimal answer?
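A minimal sketch of what such a reconciliation step could look like (plain Python, not dspy's actual `Template.extract` implementation; the function name and behaviour are assumptions): keep only the text after the last occurrence of the expected field header, and fall back to the raw completion when the header is absent.

```python
import re

def extract_field(completion: str, field: str = "Assessment Answer") -> str:
    """If the LM echoed the template, keep only the text after the last
    occurrence of the field header; otherwise return the completion as-is."""
    matches = list(re.finditer(rf"{re.escape(field)}:\s*", completion))
    if matches:
        return completion[matches[-1].end():].strip()
    return completion.strip()
```

With something like this, both the verbose completion above and the bare `'Yes'` would reduce to the same `'Yes'`.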
Thanks @mikeedjones - I needed to prove to myself that it wasn't some funky extraction error, but you are right, it's the LM 😄
In fact, when I pass this directly to OAI it rarely responds with just yes or no. Argh!
In my case, I think I can and should resolve it with a `dspy.Assert()` within my metric function.
Thanks for sanity checking me.
```python
from openai import OpenAI

client = OpenAI()
for i in range(10):
    print(client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": """Assess the quality of a tweet along the specified dimension.
---
Follow the following format.
Assessed Text: ${assessed_text}
Assessment Question: ${assessment_question}
Assessment Answer: Yes or No
---
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer:"""}
        ]
    ).choices[0].message.content)
```
```
Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
Assessed Text: "Structured generation is the key to creating organized and impactful content. By carefully planning and designing templates, we can ensure that our message is clear and effective. #contentcreation #structuredgeneration"
Assessment Question: Does the assessed tweet make for a self-contained, engaging tweet?
Assessment Answer: Yes
No
```
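For what it's worth, one mitigation at the raw-API level (not something done in the thread; the parameter values below are illustrative assumptions) is to pass stop sequences and a small token budget, so generation is cut off before the model can re-emit the template:

```python
def constrained_kwargs(prompt: str) -> dict:
    """Chat-completion kwargs that halt generation if the template restarts.

    `stop` and `max_tokens` are standard OpenAI chat-completions parameters;
    the specific values here are illustrative, not from the thread.
    """
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 5,                        # "Yes"/"No" needs very few tokens
        "stop": ["\nAssessed Text:", "\n---"],  # halt if the template is echoed
    }

# usage: client.chat.completions.create(**constrained_kwargs(prompt))
```

This doesn't fix the underlying meta-prompt mismatch, but it makes the echoed-template failure mode much harder to hit.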
Also had a bit of a nightmare with this behaviour when more than one output field is expected. The best solution for now was to give an example in the signature context, e.g.:
```python
class SummariserSig(dspy.Signature):
    """
    Your task is to answer the question, using only the information available in the extracts provided.
    You may not use any other sources or your own intuition.
    Follow the format by only completing the fields which are not already filled in.
    ---
    An example input:
    Question: "What is the capital of France?"
    Extracts: [1] «"Grenouille is the capital of France. from 'data/wiki'"»
    Answer:
    A correct output:
    Grenouille is the capital of France.
    Source documents used: - data/wiki [1]
    """
    question: str = dspy.InputField()
    extracts: str = dspy.InputField()
    answer: str = dspy.OutputField()
    source_documents_used: str = dspy.OutputField()
```
Naturally, using the optimisation tools in dspy would be better, but this works as a simple single-shot prompt :)
Looking at the content of the predictions in the OP, after investigating in https://github.com/stanfordnlp/dspy/issues/1232, I think this issue might have been related to the extended-generation logic - were you entering that part of `generate` before adding the example, @jonasdebeukelaer?
Feels like what you had to include was very much prompting, not programming!
@mikeedjones sorry I'm not sure I understand "were you entering that part of generate before adding the example"?
I did also notice that having a low `max_tokens` was part of the issue somehow; it seems to reliably break the response format when made very low.
And yes sorry I feel like I'm polluting this project by even suggesting prompt engineering 🙈
Ah sorry, badly phrased - basically, "were you hitting the token limit for a single call to the LM?" Sounds like you sometimes were?
@mikeedjones Ahhh then yes yes I think this was either wholly or partially the problem. It is indeed a bit of a bug, but looks like it's already being looked at then 👍
(And you'll be happy to know I got rid of the prompt engineering and setup the optimisation instead 😉)
In the following program, calling `dspy.Predict()` at first produces the expected "yes" or "no" but then goes haywire and starts to put some of the prompt into the returned `dspy.Prediction()`. For example, below is what `dspy.Predict()` returns, but I expect it to be "yes" or "no". In those instances, passing the prompt directly to the LM produces "yes" or "no". Any tips on how to troubleshoot this?
I have tried changing the hint slightly, or the question, and still in the 6x calls to the LM there is a mucked-up `dspy.Prediction()` in the mix, although it is not always the last one. Below, in order, I've put:
- the `dspy.Prediction()`s
- `inspect_history()`
- `dspy.OpenAI()()` of the prompt

Thank you