Optimized prompt for multi-class classification contains only a subset of classifiers

aaronbriel commented 1 week ago

I followed the tutorials for optimizing a DSPy program for the task of multi-class classification and the "optimized" prompt resulted in a small subset of the available classifiers, making it unsuitable for consideration in a production environment.

I'll provide the relevant chunks of notebook code but I won't be able to actually show the prompt itself as it contains production data. Hopefully this is sufficient for identification of what may be the issue.

ISSUE 1: The main issue is that the final "optimized" prompt only contains single few-shot samples for 8 of the 41 classifiers (with one of the classifiers having 2 samples). I expected it to contain multiple few-shot samples for each of the 41 classifiers.

ISSUE 2: The secondary issue was that the evaluation metric showed a rather low score of 64.34. I expected this to be much higher since I trained with a decent-sized ground truth dataset (manually curated for accuracy) of 50 samples per classifier.

I'm guessing this is related to my optimizer configuration but I'm not sure what to adjust. Please advise. Thank you!

# source .env file
import os
import sys
from dotenv import load_dotenv
load_dotenv()

# Add the project directory (and PYTHONPATH, if it is set) to sys.path
sys.path.append('/Users/abriel/repos/projectname/')
if os.getenv('PYTHONPATH'):
    sys.path.append(os.getenv('PYTHONPATH'))

import re
import dspy
from dspy import Predict
from dspy.datasets import DataLoader
from dspy.signatures import ensure_signature
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
import pandas as pd

# Load the intent keys from an external source
from src.variables import intent_keys

# Set up the model using OpenAI's GPT
gpt4o = dspy.OpenAI(model=os.environ['DEFAULT_MODEL'])
dspy.configure(lm=gpt4o)

# Define the Intent Classifier Signature
class IntentClassifier(dspy.Signature):
    """
    Classifies a person's response into one of the given intents based on the conversation
    between two people, person1 and person2.
    """
    conversation = dspy.InputField(
        desc="A conversation between person1 and person2.",
        prefix="Conversation: "
    )
    question = dspy.InputField(
        desc="Person1 question.",
        prefix="Question: "
    )
    response = dspy.InputField(
        desc="Person2's response to the question from person1.",
        prefix="Response: "
    )
    intent = dspy.OutputField(desc="One of the following intents: " + ", ".join(intent_keys))

# Create the IntentClassifierModule that incorporates ChainOfThought
class IntentClassifierModule(dspy.Module):
    """
    A module that defines the intent classification process.
    """
    def __init__(self):
        super().__init__()
        self.signature = IntentClassifier
        self.predictor = dspy.ChainOfThought(signature=self.signature)

    def forward(self, conversation, question, response):
        """
        Runs the forward pass for classifying intents.
        """
        result = self.predictor(
            conversation=conversation,
            question=question,
            response=response
        )
        return dspy.Prediction(
            intent=result.intent
        )

# Load and split datasets
dl = DataLoader()

full_dataset = dl.from_csv(
    "dataset_name.csv",
    fields=("conversation", "question", "response", "intent"),
    input_keys=("conversation", "question", "response")
)
splits = dl.train_test_split(full_dataset, train_size=0.8)
train_dataset = splits['train']
test_dataset = splits['test']

# Validation function to compare predicted and actual intents
def validate_answer(example, pred, trace=None):
    """
    Validates the prediction by comparing it to the actual intent.
    """
    return example.intent.lower() == pred.intent.lower()

# Configure the optimizer
config_ = {
    "max_bootstrapped_demos": 8,
    "max_labeled_demos": 8,
    "num_candidate_programs": 10,
    "num_threads": 4
}

# Use BootstrapFewShotWithRandomSearch to optimize the prompt
teleprompter = BootstrapFewShotWithRandomSearch(
    metric=validate_answer,
    **config_
)

# Compile and save the optimized program
optimized_program = teleprompter.compile(IntentClassifierModule(), trainset=train_dataset)
optimized_program.save('/Users/abriel/repos/projectname/optimized_intent_classifier.json')

This resulted in successful "training", running in 8 sets. I then completed an evaluation:

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=test_dataset, num_threads=1, display_progress=True, display_table=5)
evaluator(optimized_program, metric=validate_answer)

I then checked the optimized prompt by doing:

gpt4o.inspect_history(n=1)

ISSUE 1: The resulting optimized_intent_classifier.json had single few-shot samples for only 8 intents, with one of the intents having 2 samples. There are 41 intents, so I expected multiple few-shot samples for each of the 41 intents.

ISSUE 2: This showed a final score of 64.34, which was admittedly far lower than expected as I provided a ground truth dataset of 50 samples per intent.

arnavsinghvi11 commented 1 week ago

Hi @aaronbriel ,

The optimized_program currently includes few-shot examples for only 8 of the classifiers because the BootstrapFewShotWithRandomSearch configuration caps the number of demos: "max_bootstrapped_demos": 8, "max_labeled_demos": 8.

To get unique few-shot examples for all 41 classifiers, you can increase these parameters to 41.

However, note that the few-shot selection in BootstrapFewShot doesn't guarantee that the 41 demos cover 41 distinct classifiers; the optimizer just selects a set of 41 few-shot examples that pass the metric.
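A rough back-of-the-envelope check suggests why raising the limits to 41 is unlikely to cover every class by itself. The sketch below assumes, purely for illustration, that demos are drawn uniformly at random with replacement from equally sized classes; the real optimizer also filters by the metric, so treat the numbers as an approximation. The function name `expected_distinct_classes` is made up for this example and is not part of DSPy.

```python
def expected_distinct_classes(num_classes: int, num_demos: int) -> float:
    """Expected number of distinct classes among `num_demos` uniform random draws.

    Assumption (not taken from DSPy): every draw picks one of `num_classes`
    equally likely classes, independently of the other draws.
    """
    # Probability that one particular class is never drawn.
    p_missed = (1 - 1 / num_classes) ** num_demos
    return num_classes * (1 - p_missed)


for k in (8, 41, 100):
    print(f"{k:>3} demos -> about {expected_distinct_classes(41, k):.1f} of 41 classes")
```

Under these assumptions, 41 randomly chosen demos cover only about 26 of the 41 classes on average, so full coverage needs either many more demos or a selection step that looks at the class label.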

Some potential solutions for this could be:

1. Adjust the metric to include a global check per unique classifier, i.e. modify the validate_answer function so that only examples unique to each classifier are selected and none are repeated (e.g. return example.intent.lower() == pred.intent.lower() and global_class_check(example)).
2. Filter the train_dataset by the 41 classifier types, and then run the optimizer on each of the 41 train sets (bootstrapping 41x!):

bootstrap_program_0 = teleprompter.compile(IntentClassifierModule(), trainset=train_dataset_0)
bootstrap_program_1 = teleprompter.compile(bootstrap_program_0, trainset=train_dataset_1)
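The per-intent training sets (`train_dataset_0`, `train_dataset_1`, ...) are not defined in the snippet above. One way to build them, assuming each example exposes its label as an `intent` attribute as in the original `validate_answer`, is a plain group-by. The helper name `split_by_intent` is hypothetical, and `SimpleNamespace` objects stand in for real examples.

```python
from collections import defaultdict
from types import SimpleNamespace


def split_by_intent(trainset):
    """Group examples by their lowercased `intent` label, preserving input order."""
    groups = defaultdict(list)
    for example in trainset:
        groups[example.intent.lower()].append(example)
    return dict(groups)


# Stand-in data; in the thread this would be `train_dataset`.
demo_trainset = [
    SimpleNamespace(response="yes please", intent="Confirm"),
    SimpleNamespace(response="no thanks", intent="Decline"),
    SimpleNamespace(response="sure", intent="confirm"),
]

per_intent = split_by_intent(demo_trainset)
for intent, examples in per_intent.items():
    print(intent, len(examples))
```

Each value in `per_intent` could then play the role of one `train_dataset_i` in the loop.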

The 2nd solution is likely more expensive, but it may ensure more diversity by providing multiple sets of few-shot examples for the individual classifiers, which can potentially raise performance.
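For the first option, a metric with a global per-class check might look like the sketch below. It is not an official DSPy recipe: `make_unique_intent_metric` and its `reset` hook are hypothetical names, and the code assumes the optimizer passes a non-None `trace` only while it is collecting demos and calls the metric without a trace during evaluation.

```python
import threading
from types import SimpleNamespace


def make_unique_intent_metric():
    """Build a metric that accepts each intent at most once while bootstrapping.

    Assumption: `trace` is non-None only during demo collection, so plain
    accuracy is returned whenever `trace` is None.
    """
    seen = set()
    lock = threading.Lock()  # the optimizer in this thread runs with num_threads=4

    def metric(example, pred, trace=None):
        correct = example.intent.lower() == pred.intent.lower()
        if trace is None or not correct:
            # Evaluation mode, or a wrong prediction that should never be a demo.
            return correct
        with lock:
            intent = example.intent.lower()
            if intent in seen:
                return False  # this class already has a demo
            seen.add(intent)
            return True

    def reset():
        with lock:
            seen.clear()

    metric.reset = reset  # hypothetical hook: call before each new bootstrap run
    return metric


# Stand-in objects instead of real DSPy examples and predictions.
metric = make_unique_intent_metric()
example = SimpleNamespace(intent="Confirm")
prediction = SimpleNamespace(intent="confirm")
first = metric(example, prediction, trace=[])   # first correct demo for this class
second = metric(example, prediction, trace=[])  # same class again while bootstrapping
evaluation = metric(example, prediction)        # no trace: plain accuracy
print(first, second, evaluation)
```

Because the `seen` set persists between calls, an optimizer that bootstraps several candidate programs would find every class already taken after the first candidate unless the state is cleared, which is what the `reset` hook is meant for.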

Let me know if this helps!

aaronbriel commented 1 week ago

@arnavsinghvi11 thanks for the quick response! I will try this and let you know the results. Thanks!

aaronbriel commented 1 week ago

@arnavsinghvi11 I keep running into the error below. I thought I had resolved it by adding format=str to each of the signature InputFields. It progressed a bit further but failed again several intent iterations later. I'm not seeing anything that jumps out in the data for that specific intent, as the text data across all intents contains special characters.

Do you know of any other tricks people have used to resolve this?

Traceback (most recent call last):
  File "/home/ubuntu/repos/project/experiments/dspy/build_intent_classifier_prompt.py", line 260, in <module>
    optimize_intent_classifier()
  File "/home/ubuntu/repos/project/experiments/dspy/build_intent_classifier_prompt.py", line 237, in optimize_intent_classifier
    bootstrap_program = teleprompter.compile(bootstrap_program, trainset=training_data_intent)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/project-venv/lib/python3.12/site-packages/dspy/teleprompt/random_search.py", line 95, in compile
    program2 = program.compile(student, teacher=teacher, trainset=trainset2)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/envs/project-venv/lib/python3.12/site-packages/dspy/teleprompt/bootstrap.py", line 82, in compile
    self._prepare_student_and_teacher(student, teacher)
  File "/home/ubuntu/anaconda3/envs/project-venv/lib/python3.12/site-packages/dspy/teleprompt/bootstrap.py", line 99, in _prepare_student_and_teacher
    assert getattr(self.student, "_compiled", False) is False, "Student must be uncompiled."
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Student must be uncompiled.

aaronbriel commented 1 week ago

Using the recommended solution (1) above, the resulting prompt was still missing 20 intents, so that is not a feasible solution for a production release. The "Student must be uncompiled" error may not have occurred this time only because the data in a missed intent that triggers it was never encountered.

I'm going to have to hold off on leveraging this tool until I or another person can find a solution to said issue.

chiragshah285 commented 1 week ago

@aaronbriel this may be helpful https://github.com/KarelDO/xmc.dspy

stanfordnlp / dspy

Optimized prompt for multi-class classification contains only a subset of classifiers #1509