stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Teleprompter removing classes #1717

Open paulacanva opened 6 days ago

paulacanva commented 6 days ago

I've been trying to use DSPy in different contexts where I see fit, but I've been unable to obtain any good results. I have a very long prompt for a classification task that needs to describe the classes in depth. When I use this prompt for benchmark evaluation, I get a low 20% accuracy using G-Eval, so I thought DSPy could be the way to go here. I used it with both MIPROv2 and COPRO, yet I can't get anything that makes sense. For MIPROv2 with Predict, all my classes are removed and I get a very generic prompt:

{
  "program": {
    "lm": null,
    "traces": [],
    "train": [],
    "demos": [],
    "signature": {
      "instructions": "<prompt>\n    As a design consultant, your role is to offer constructive feedback on visual designs, ensuring they align with key design principles. Your task is to assess a provided design description, focusing on its use of typography, color, layout, and visual elements. Provide clear, actionable advice to improve the design's aesthetic appeal and communicative efficacy.\n\n    You will receive a detailed design description. Analyze it based on the following categories:\n\n    1. **Typography**: Evaluate the font choices, sizes, colors, and overall text hierarchy. Offer suggestions for improving readability and visual harmony.\n    2. **Color Usage**: Assess the color palette for consistency and effectiveness. Recommend changes that enhance visual interest and support the design's message.\n    3. **Layout**: Examine the arrangement of elements, use of white space, and overall balance. Suggest improvements for better visual flow and organization.\n    4. **Design Elements**: Review the imagery, icons, and illustrations for relevance and style consistency. Provide feedback on how these elements contribute to the design's message.\n\n    Use the following principles to guide your evaluation:\n\n    - **Typography Design Principles**: Focus on font pairing, legibility, and text hierarchy.\n    - **Color Design Principles**: Ensure a cohesive and purposeful use of colors.\n    - **Layout Design Principles**: Aim for a balanced and well-organized design.\n    - **Design Message Principle**: Align visual elements with the intended communication.\n\n    Provide your feedback in a structured format, offering specific, actionable advice where necessary. If the design is already effective, explain why it works well without suggesting unnecessary changes.\n\n    Your response should be in valid JSON format, adhering to the specified structure for advice or a positive evaluation.\n<\/prompt>",
      "fields": [
        {
          "prefix": "Design Description:",
          "description": "A thoroughly description of the design, used to evaluate its visual quality."
        },
        {
          "prefix": "Advice:",
          "description": "Advice on how to improve the design, if necessary to do so. Must be provided."
        },
        {
          "prefix": "Categories:",
          "description": "Categories of advice provided, representing well-known designing guidelines. Must be provided."
        }
      ]
    }
  }
}

When using MIPROv2 with Chain of Thought, the output is a garbled prompt that uses XML tags but is not valid XML at all, and many of the classes are removed as well. I used all my data (30 examples per class) for this task, yet it still gives me nothing useful: it basically copies half of my prompt, drops a lot of classes, and mangles the XML tags. All of this is with GPT-4o.

import os
import re
from datetime import datetime

import config
import constants
import data_provider as dp
import dspy
import pandas as pd
import weave
from dspy.teleprompt import MIPROv2, COPRO
from loguru import logger
from tqdm import tqdm
from utils import helpers

class Assess(dspy.Signature):
    """Assess the quality of a tweet along the specified dimension."""

    input_expected_text = dspy.InputField()
    output_predicted_text = dspy.InputField()
    assessment_question = dspy.InputField()
    assessment_score = dspy.OutputField(
        desc="Score only. Float ranging from 0 (worse) to 1 (optimal)."
    )

class AdviceSignature(dspy.Signature):
    """Defines the input-output signature for design advice generation."""

    design_description = dspy.InputField(
        desc="A thoroughly description of the design, used to evaluate its visual quality."
    )
    advice = dspy.OutputField(
        desc="Advice on how to improve the design, if necessary to do so. Must be provided."
    )
    categories = dspy.OutputField(
        desc="Categories of advice provided, representing well-known designing guidelines. Must be provided."
    )

class DesignAdviceProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.program = dspy.ChainOfThought(AdviceSignature)

    def forward(self, design_description: str):
        return self.program(design_description=design_description)

class PromptOptimization:
    NAME: str = "design-advice-prompt-optimization"

    def __init__(
        self,
        llm_model: str,
        image_dir: str,
        annotation_dir: str,
        data_provider: dp.DataProvider,
    ):
        if llm_model == config.LLMModelType.GPT4o.name.lower():
            self._config = config.GPTConfig(
                running_mode=config.RunningMode.ADVICE_GENERATION
            )
        else:
            raise ValueError(f"Unsupported model: {llm_model}")

        self._human_labeled_df = data_provider.retrieve_labelled_data(
            image_folder=image_dir, annotation_folder=annotation_dir
        )
        self._eval_config = config.EvalConfig()
        self._eval_lm = dspy.OpenAI(model="gpt-4o", max_tokens=self._config.max_tokens)

        # Cheating 'cause I have no idea how to do this without copying a huge prompt.
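        # (A Signature's instructions come from its class docstring, so this swaps in the full system prompt.)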
        AdviceSignature.__doc__ = self._config.system_prompt
        logger.debug(f"Using prompt: {AdviceSignature.__doc__}")

        weave.init(self.NAME)

    @staticmethod
    def _create_training_set_examples(
        human_labeled_df: pd.DataFrame, llm_annotated_df: pd.DataFrame
    ) -> list[dspy.Example]:
        llm_annotated_df.columns = llm_annotated_df.columns.str.lower().str.replace(
            " ", "_"
        )
        human_labeled_df[
            constants.HumanLabeledDataset.DESIGN_ID.name
        ] = human_labeled_df[constants.HumanLabeledDataset.DESIGN_ID.name].apply(
            lambda x: f"{x}_input"
        )

        llm_annotated_df = llm_annotated_df.drop_duplicates(
            subset=[constants.HumanLabeledDataset.DESIGN_ID.name]
        )
        llm_annotated_df = llm_annotated_df.merge(
            human_labeled_df, on=constants.HumanLabeledDataset.DESIGN_ID.name
        )

        examples: list[dspy.Example] = []
        for _, row in tqdm(llm_annotated_df.iterrows()):
            categories = ",\n ".join(
                [
                    row[col.name]
                    for col in [
                        constants.HumanLabeledDataset.ADVICE_CATEGORY_1,
                        constants.HumanLabeledDataset.ADVICE_CATEGORY_2,
                        constants.HumanLabeledDataset.ADVICE_CATEGORY_3,
                    ]
                    if pd.notna(row[col.name])
                ]
            )

            if not categories:
                logger.warning(f"Skipping row due to missing 'categories': {row}")
                continue

            if pd.isna(
                row.get(constants.HumanLabeledDataset.ACTIONABLE_ADVICE.name, None)
            ):
                logger.warning(f"Skipping row due to missing 'advice': {row}")
                continue

            examples.append(
                dspy.Example(
                    {
                        "design_description": row[
                            constants.LLMDesignDescriptionColumns.ANNOTATION.name
                        ],
                        "categories": categories,
                        "advice": row[
                            constants.HumanLabeledDataset.ACTIONABLE_ADVICE.name
                        ],
                    }
                ).with_inputs("design_description")
            )

        return examples

    @staticmethod
    def _extract_score(score_str: str) -> float:
        try:
            # Try to convert the score directly to a float (normal case)
            return float(score_str)
        except ValueError:
            # If direct conversion fails, use regex to extract the numeric score from the text
            match = re.search(r"Assessment Score:\s*(\d*\.\d+|\d+)", score_str)
            if match:
                return float(match.group(1))
            else:
                # If no score is found, return 0.0 as a default
                return 0.0

    @helpers.log_exceptions(Exception, default_return=0.0)
    def _validate_advice(
        self, example: dspy.Example, prediction: dspy.Prediction, trace=None
    ) -> float:
        with dspy.context(lm=self._eval_lm):
            principles_eval = dspy.Predict(Assess)(
                input_expected_text=example.categories,
                output_predicted_text=prediction.categories,
                assessment_question=self._eval_config.principles_criteria,
            )
            advice_eval = dspy.Predict(Assess)(
                input_expected_text=example.advice,
                output_predicted_text=prediction.advice,
                assessment_question=self._eval_config.advice_criteria,
            )

        principles_score, advice_score = (
            m.assessment_score for m in [principles_eval, advice_eval]
        )
        principles_score = self._extract_score(principles_score)
        advice_score = self._extract_score(advice_score)

        return (principles_score + advice_score) / 2

    @weave.op()
    def run_prompt_optimization(
        self,
        llm_annotated_df: pd.DataFrame,
    ) -> None:
        with weave.attributes(
            {
                "model": self._config.model_name,
                "date": datetime.date(datetime.now()),
            }
        ):  # Adding attributes to the Weave context, for easy UI filtering
            training_set = self._create_training_set_examples(
                human_labeled_df=self._human_labeled_df,
                llm_annotated_df=llm_annotated_df,
            )

            dspy_config = config.DSPyConfig()
            program = DesignAdviceProgram()
            lm = dspy.LM(
                "openai/gpt-4o",
                temperature=self._config.temperature,
                max_tokens=self._config.max_tokens,
            )
            dspy.settings.configure(lm=lm)

            # Initialize optimizer
            teleprompter = MIPROv2(
                metric=self._validate_advice,  # Custom metric for evaluating advice generation
                prompt_model=None,  # Optional if you are not using a specific prompt model
                task_model=None,  # Optional if you are not using a specific task model
                max_bootstrapped_demos=0,  # Since we are not using few-shot learning, set this to 0
                max_labeled_demos=0,  # No labeled demos needed for instruction-only optimization
                num_candidates=dspy_config.num_candidates,
                num_threads=dspy_config.num_threads,
                max_errors=50,  # Maximum number of errors allowed during optimization
                verbose=True,  # Enable verbosity for detailed output
                track_stats=True,  # Track stats to analyze optimization progress
            )

            # teleprompter = COPRO(
            #     metric=self._validate_advice,
            #     num_threads=dspy_config.num_threads,
            #     verbose=True,
            #     track_stats=True,  # Track stats to analyze optimization progress
            # )

            # Optimize program
            logger.info("Optimizing program with MIPROv2...")
            zeroshot_optimized_program = teleprompter.compile(
                program.deepcopy(),
                trainset=training_set,
                max_bootstrapped_demos=0,
                max_labeled_demos=0,
                requires_permission_to_run=False,
            )
            logger.info("Program optimized.")
            logger.debug(f"Optimized program: {zeroshot_optimized_program}")

            # Save the optimized prompt for future use
            os.makedirs(config.OPTIMIZATION, exist_ok=True)
            zeroshot_optimized_program.save(
                config.OPTIMIZATION / f"optimized_design_prompt_{config.NOW}"
            )
            logger.info("Optimized program saved.")

Overall, the optimizer always seems to be trying to "summarize" the prompt.

okhat commented 6 days ago

Hey @paulacanva ! Thanks for opening this detailed issue.

My understanding is that you're now passing your prompt to a DSPy signature in the line, AdviceSignature.__doc__ = self._config.system_prompt, but I can't see what it originally was. I do see that the generated instruction is asking for "valid JSON format" and you mention "valid XML". Does your original instruction describe formatting? If so, please don't do that. Let DSPy handle formatting details.

When you say "all of your classes are removed", what are you referring to? I see that you get a prompt that has four categories, e.g. Typography and Color Usage. Are these the wrong categories? Are you saying it's missing some categories?

Overall, I can't help directly since I don't have access to your original instruction or categories. However, this is a simple task so I recommend starting much, much simpler. Build a small thing that works before adding more complexity:

Program:

from typing import Literal

ADVICE_CATEGORIES = ("Typography", "Color Usage", )  # TODO (paulacanva): Add the rest

class AdviceSignature(dspy.Signature):
    """
    Evaluate the visual quality of a design description.
    Provide advice on how to improve it then categorize your advice.
    """

    design_description: str = dspy.InputField()
    advice: str = dspy.OutputField()
    category: Literal[ADVICE_CATEGORIES] = dspy.OutputField()

Usage:

design_description = """
I have a design description that I need feedback on. It's a website design for a new online store. The design includes a homepage, product pages, and a checkout process. The goal is to create a visually appealing and user-friendly experience for customers. Here are the key elements of the design:
""".strip()

dspy.ChainOfThought(AdviceSignature)(design_description=design_description)

Produces:

Prediction(
    reasoning='The design description outlines a clear structure for the website, focusing on key elements like the homepage, product pages, and checkout process. However, it lacks specific details about typography and color usage, which are crucial for creating a visually appealing and user-friendly experience. The effectiveness of the design will heavily depend on how these elements are implemented.',
    advice='Consider specifying the typography choices (font styles, sizes, and weights) and color palette (primary, secondary, and accent colors) that will be used throughout the website. This will help ensure consistency and enhance the overall visual appeal. Additionally, ensure that the text is legible against the background colors and that there is sufficient contrast to improve readability.',
    category='Color Usage'
)

This works fine. Now we can iterate from there.

paulacanva commented 6 days ago

Hey @okhat, thanks for the fast response. Unfortunately, I can't share the prompt due to company policies. My prompt uses XML tags to separate sections, as it's a very long prompt in the field of design. It does indeed define an expected JSON output format. Thanks for letting me know this is not advised.

My prompt defines 23 possible classes (it's a multi-class problem). The final prompt contains only a subset of those, although the examples I pass to trainset contain annotated examples of all of them. I guess this might help: category: Literal[ADVICE_CATEGORIES] = dspy.OutputField(). I was just defining the output fields as I'd seen in the docs: categories = dspy.OutputField(desc="Categories for the generated advice, representing well-known design guidelines. Must be provided."). I'll try adding that.

Thanks a lot! I hope we can share good results leveraging DSPy soon.

paulacanva commented 6 days ago

Just FYI: while that "worked" in terms of respecting the expected categories, the whole prompt now remains the same after optimization, with only a minor extended instruction added.

okhat commented 6 days ago

How are you optimizing? What's the metric? The Assess metric I saw above talks about a "tweet", so it's a bit weird.

Happy to help you set this up. Conceptually, what are you trying to maximize here? I think you probably want the advice to be "good" (what's good? what's bad? what are you looking for in the output?) and want the category to be correct (do you have labels? can we just do exact match for this?).

paulacanva commented 5 days ago

That must be a typo from copying from the docs; I'll fix it. Basically, the prompt asks for categories that represent issues in a design, plus design advice that guides a user on how to improve the design. The categories are related to the advice provided. For evaluation, I'm doing it like this:

with dspy.context(lm=self._eval_lm):
    principles_eval = dspy.Predict(Assess)(
        input_expected_text=example.categories,
        output_predicted_text=prediction.categories,
        assessment_question=self._eval_config.principles_criteria,
    )
    advice_eval = dspy.Predict(Assess)(
        input_expected_text=example.advice,
        output_predicted_text=prediction.advice,
        assessment_question=self._eval_config.advice_criteria,
    )

Where the criteria are:

    principles_criteria: str = "Evaluate whether all key principles in the input are covered in the output. Missing principles or irrelevant additions will affect the score."
    advice_criteria: str = "Evaluate whether the output provides the same actionable advice as the input, allowing flexibility in phrasing but ensuring the intent remains the same."

So what we want to maximize is both the classification accuracy and the quality of the advice relative to the expected one. A match could be done for the categories (as long as order doesn't matter), but not for the advice, which is in natural language.
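For the categories, that order-insensitive match could be as simple as the sketch below. It's only an illustration: the helper assumes each side is either a list of category strings or a comma-separated string like the one built in _create_training_set_examples above.

def _as_category_set(value) -> set[str]:
    # Accept either a list of categories or a comma-separated string.
    if isinstance(value, str):
        value = value.split(",")
    return {c.strip().lower() for c in value if c and c.strip()}

def categories_match(example, pred, trace=None) -> bool:
    # Order-insensitive exact match of the expected vs. predicted category sets.
    return _as_category_set(example.categories) == _as_category_set(pred.categories)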

Again, thanks for all the feedback and guidance.

okhat commented 5 days ago

Thank you @paulacanva ! In the spirit of starting simple, could you optimize just the category first?

Here's one way to do it. First, define the program using the signature above:

advisor = dspy.ChainOfThought(AdviceSignature)

Second, evaluate the baseline. Do not proceed if anything seems wrong or the score is unexpectedly low here.

def correct_category(example, pred, trace=None):
    # Exact match: the predicted category must equal the labeled category.
    return example.category == pred.category

evaluate = dspy.Evaluate(devset=devset, metric=correct_category, num_threads=16, display_progress=True, display_table=5)
evaluate(advisor)

Next, optimize it:

tp = dspy.MIPROv2(metric=correct_category, auto="medium", num_threads=16)
optimized_advisor = tp.compile(advisor, trainset=trainset, requires_permission_to_run=False)

Then, evaluate it. Make sure your trainset has somewhere in the vicinity of 100--500 examples. Same for the devset.
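If it helps, the split itself can be a simple shuffle. A minimal sketch, assuming examples is your full list of labeled dspy.Example objects (the 80/20 ratio and fixed seed are arbitrary choices):

import random

# Shuffle once with a fixed seed for reproducibility, then split into train and dev sets.
random.Random(0).shuffle(examples)
split_point = int(0.8 * len(examples))
trainset, devset = examples[:split_point], examples[split_point:]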

okhat commented 5 days ago

Follow-up Question: What do you mean by "guaranteeing order doesn't matter"? Order of what?

Do you need the output of category: Literal[....] to actually be categories: list[Literal[...]] ?

paulacanva commented 5 days ago

Yes, that's correct -> categories: list[Literal[...]]. This is what I have now. However, the prompt remains the same, with only a two-line addition in the extended_signature. I'm checking with my manager whether there's a way to collaborate with DSPy and share more, as I believe leveraging this framework could help the whole company. I'll get back to you, if that's okay, once I have something more concrete.
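For reference, a minimal sketch of what that revised signature looks like on my side (the two category names here are placeholders; the real tuple has all 23 classes):

from typing import Literal

import dspy

ADVICE_CATEGORIES = ("Typography", "Color Usage")  # placeholder subset of the 23 classes

class AdviceSignature(dspy.Signature):
    """
    Evaluate the visual quality of a design description.
    Provide advice on how to improve it, then categorize your advice.
    """

    design_description: str = dspy.InputField()
    advice: str = dspy.OutputField()
    categories: list[Literal[ADVICE_CATEGORIES]] = dspy.OutputField()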

okhat commented 5 days ago

Thanks @paulacanva ! Either way, keep us posted. Happy to help here.