paulacanva opened this issue 6 days ago
Hey @paulacanva ! Thanks for opening this detailed issue.
My understanding is that you're now passing your prompt to a DSPy signature via the line AdviceSignature.__doc__ = self._config.system_prompt, but I can't see what the prompt originally was. I do see that the generated instruction asks for "valid JSON format" and you mention "valid XML". Does your original instruction describe output formatting? If so, please don't do that; let DSPy handle formatting details.
When you say "all of your classes are removed", what are you referring to? I see that you get a prompt that has four categories, e.g. Typography and Color Usage. Are these the wrong categories? Are you saying it's missing some categories?
Overall, I can't help directly since I don't have access to your original instruction or categories. However, this is a simple task so I recommend starting much, much simpler. Build a small thing that works before adding more complexity:
Program:
import dspy
from typing import Literal

ADVICE_CATEGORIES = ("Typography", "Color Usage", )  # TODO (paulacanva): Add the rest

class AdviceSignature(dspy.Signature):
    """
    Evaluate the visual quality of a design description.
    Provide advice on how to improve it, then categorize your advice.
    """

    design_description: str = dspy.InputField()
    advice: str = dspy.OutputField()
    category: Literal[ADVICE_CATEGORIES] = dspy.OutputField()
Usage:
# Assumes an LM has already been configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o")).
design_description = """
I have a design description that I need feedback on. It's a website design for a new online store. The design includes a homepage, product pages, and a checkout process. The goal is to create a visually appealing and user-friendly experience for customers. Here are the key elements of the design:
""".strip()

dspy.ChainOfThought(AdviceSignature)(design_description=design_description)
Produces:
Prediction(
reasoning='The design description outlines a clear structure for the website, focusing on key elements like the homepage, product pages, and checkout process. However, it lacks specific details about typography and color usage, which are crucial for creating a visually appealing and user-friendly experience. The effectiveness of the design will heavily depend on how these elements are implemented.',
advice='Consider specifying the typography choices (font styles, sizes, and weights) and color palette (primary, secondary, and accent colors) that will be used throughout the website. This will help ensure consistency and enhance the overall visual appeal. Additionally, ensure that the text is legible against the background colors and that there is sufficient contrast to improve readability.',
category='Color Usage'
)
This works fine. Now we can iterate from there.
Hey @okhat, thanks for the fast response. Unfortunately, I can't share the prompt due to company policies. My prompt uses XML tags to separate sections, as it's a very long prompt in the field of design. It does indeed define an expected JSON output format. Thanks for letting me know this is not advised.
My prompt defines 23 possible classes (it's a multi-class problem). The final prompt contains only a subset of those, although the examples I pass to trainset contain annotated examples of all of them.
I guess this might help: category: Literal[ADVICE_CATEGORIES] = dspy.OutputField(). I was just defining the output field as I've seen in the docs: categories = dspy.OutputField(desc="Categories for the generated advice, representing well-known designing guidelines. Must be provided."). I'll try adding that.
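For reference, a minimal sketch of what combining the two could look like (the category names below are placeholders for the real 23, and the desc is adapted from the docs snippet above):
from typing import Literal

import dspy

ADVICE_CATEGORIES = ("Typography", "Color Usage")  # placeholder subset of the 23 classes

class AdviceSignature(dspy.Signature):
    """Evaluate the visual quality of a design description and provide categorized advice."""

    design_description: str = dspy.InputField()
    advice: str = dspy.OutputField()
    category: Literal[ADVICE_CATEGORIES] = dspy.OutputField(
        desc="Category for the generated advice, representing well-known design guidelines. Must be provided."
    )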
Thanks a lot! I hope we can share good results leveraging DSPy soon.
Just FYI: that "worked" in terms of respecting the expected categories, but now the whole prompt remains the same after optimization, with only a minor extended instruction added.
How are you optimizing? What's the metric? The Assess metric I saw above talks about a "tweet", so it's a bit weird.
Happy to help you set this up. Conceptually, what are you trying to maximize here? I think you probably want the advice to be "good" (what's good? what's bad? what are you looking for in the output?) and want the category to be correct (do you have labels? can we just do exact match for this?).
That must be a typo from copying from the docs. I'll fix it. Basically, the prompt asks for categories that represent issues in a design, and for design advice that guides a user on how to improve the design. The categories are related to the advice provided. For evaluation, I'm doing it like this:
with dspy.context(lm=self._eval_lm):
    # Judge whether the predicted categories cover the expected design principles.
    principles_eval = dspy.Predict(Assess)(
        input_expected_text=example.categories,
        output_predicted_text=prediction.categories,
        assessment_question=self._eval_config.principles_criteria,
    )
    # Judge whether the predicted advice matches the intent of the expected advice.
    advice_eval = dspy.Predict(Assess)(
        input_expected_text=example.advice,
        output_predicted_text=prediction.advice,
        assessment_question=self._eval_config.advice_criteria,
    )
Where the criteria are:
principles_criteria: str = "Evaluate whether all key principles in the input are covered in the output. Missing principles or irrelevant additions will affect the score."
advice_criteria: str = "Evaluate whether the output provides the same actionable advice as the input, allowing flexibility in phrasing but ensuring the intent remains the same."
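(A minimal sketch of an Assess signature matching the field names above, plus one way to fold the two judgments into a single metric; the bool output type and the equal weighting are assumptions:)
import dspy

class Assess(dspy.Signature):
    """Assess a predicted text against an expected text according to an assessment question."""

    input_expected_text: str = dspy.InputField()
    output_predicted_text: str = dspy.InputField()
    assessment_question: str = dspy.InputField()
    assessment_answer: bool = dspy.OutputField()

PRINCIPLES_CRITERIA = "Evaluate whether all key principles in the input are covered in the output. Missing principles or irrelevant additions will affect the score."
ADVICE_CRITERIA = "Evaluate whether the output provides the same actionable advice as the input, allowing flexibility in phrasing but ensuring the intent remains the same."

def combined_metric(example, prediction, trace=None):
    principles_eval = dspy.Predict(Assess)(
        input_expected_text=example.categories,
        output_predicted_text=prediction.categories,
        assessment_question=PRINCIPLES_CRITERIA,
    )
    advice_eval = dspy.Predict(Assess)(
        input_expected_text=example.advice,
        output_predicted_text=prediction.advice,
        assessment_question=ADVICE_CRITERIA,
    )
    # Average the two boolean judgments into a score in [0, 1].
    score = (int(principles_eval.assessment_answer) + int(advice_eval.assessment_answer)) / 2.0
    # During compilation (trace is not None), require both checks to pass.
    return score >= 1.0 if trace is not None else score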
So what we want to maximize is both the classification accuracy and the content of the advice given the expected one. A match could be done for the categories (guaranteeing order doesn't matter), but not for the advice, which is in natural language.
Again, thanks for all the feedback and guidance.
Thank you @paulacanva ! In the spirit of starting simple, could you optimize just the category first?
Here's one way to do it. First, define the program using the signature above:
advisor = dspy.ChainOfThought(AdviceSignature)
Second, evaluate the baseline. Do not proceed if anything seems wrong or the score is unexpectedly low here.
def correct_category(example, pred, trace=None):
    # Exact match between the gold category and the predicted category.
    return example.category == pred.category

evaluate = dspy.Evaluate(devset=devset, metric=correct_category, num_threads=16, display_progress=True, display_table=5)
evaluate(advisor)
Next, optimize it:
tp = dspy.MIPROv2(metric=correct_category, auto="medium", num_threads=16)
optimized_advisor = tp.compile(advisor, trainset=trainset, requires_permission_to_run=False)
Then, evaluate it. Make sure your trainset has somewhere in the vicinity of 100–500 examples. Same for the devset.
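If it helps, building trainset and devset from labeled rows can look roughly like this (labeled_rows and its field names are illustrative):
import random

import dspy

# labeled_rows is assumed to be a list of dicts like
# {"design_description": ..., "advice": ..., "category": ...}.
examples = [dspy.Example(**row).with_inputs("design_description") for row in labeled_rows]

random.Random(0).shuffle(examples)
split = int(0.8 * len(examples))
trainset, devset = examples[:split], examples[split:]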
Follow-up Question: What do you mean by "guaranteeing order doesn't matter"? Order of what?
Do you need the output of category: Literal[...] to actually be categories: list[Literal[...]]?
Yes, that's correct -> categories: list[Literal[...]]. This is what I have now. However, the prompt remains the same, with just a 2-line addition in the extended_signature. I'm checking with my manager if there's a way to collaborate with DSPy and share more, as I believe leveraging this framework could help the whole company. I'll get back to you, if that's okay, once I have something more concrete.
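(For the order-insensitive category comparison discussed above, a sketch of a set-based metric; the names are illustrative:)
def correct_categories(example, pred, trace=None):
    # Ignore ordering and duplicates when comparing predicted and gold category lists.
    return set(example.categories) == set(pred.categories)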
Thanks @paulacanva ! Either way, keep us posted. Happy to help here.
I've been trying to use DSPy in different contexts where I see fit, but I've been unsuccessful in obtaining any good results. I have a very long prompt for a classification task that needs to describe the classes in depth. When I use this prompt for benchmark evaluation, I get a low accuracy of 20% using G-Eval. So I thought DSPy could be the way to go here. I used it with both MIPROv2 and COPRO, yet I can't get anything that makes sense. For MIPROv2 with Predictor, all my classes are removed and I get a very generic prompt:
When using MIPROv2 with Chain of Thought, the output is a garbled prompt that uses XML tags, and the result is not valid XML at all. On top of that, many of the classes are removed. I used all my data (30 examples per class) for this task, and yet it isn't giving me anything useful. It basically copies half of my prompt, removes a lot of classes, and mangles all the XML tags. All of this is with GPT-4o.
Overall, it seems to always be trying to "summarize" the prompt.