stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License

[Insights & Questions] About MIPROV2 #1596

Closed. SauceCat closed this issue 1 month ago.

SauceCat commented 1 month ago

Firstly, thanks for the amazing package! I gave a sharing session with senior tech managers at my company, and they were quite convinced. I have some interesting insights to share and some questions to discuss.

Background

My task is quite simple: I need to optimize a signature for intention classification (only a couple of classes).

Maybe I can contribute an example use case, since text classification is one of the most common daily use cases? Unfortunately, I found that the link to https://github.com/stanfordnlp/dspy/blob/main/CONTRIBUTING.md is missing...

I've done some simple ablation studies, varying the optimization setting (MIPROv2 0-Shot vs. Few Shots) and the LM used for prompt optimization (gpt-4o-mini vs. gpt-4o).

After the optimization, I ran inference with the optimized prompts using gpt-4o-mini or gpt-4o and evaluated the results, with accuracy as the metric.
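For reference, here's roughly what my setup looks like. This is a simplified sketch rather than my exact code: the signature, labels, and toy data below are placeholders, and the kwargs follow the MIPROv2 docs as I understand them.

```python
import dspy
from dspy.teleprompt import MIPROv2
from dspy.evaluate import Evaluate

# Task model: the LM that actually runs the classification at inference time.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class IntentSignature(dspy.Signature):
    """Classify the intent of a user message."""
    message: str = dspy.InputField()
    intent: str = dspy.OutputField(desc="one of: billing, refund, other")

program = dspy.Predict(IntentSignature)

def accuracy(example, pred, trace=None):
    # Exact-match accuracy on the predicted label.
    return example.intent == pred.intent

# Toy data; my real train/dev sets are larger.
trainset = [
    dspy.Example(message="I was charged twice this month",
                 intent="billing").with_inputs("message"),
    dspy.Example(message="Please cancel my order and give me my money back",
                 intent="refund").with_inputs("message"),
]
devset = trainset  # placeholder; I use a held-out split in practice

# prompt_model is the LM that proposes instructions and demos; this is the
# knob I varied between gpt-4o-mini and gpt-4o in the ablations.
optimizer = MIPROv2(
    metric=accuracy,
    prompt_model=dspy.LM("openai/gpt-4o-mini"),
    auto="light",
)

# "Few Shots" setting; for "0-Shot" I set both demo caps to 0.
optimized = optimizer.compile(
    program,
    trainset=trainset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

evaluate = Evaluate(devset=devset, metric=accuracy, display_progress=True)
evaluate(optimized)
```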

Insights and Questions

  1. Generally speaking, MIPROv2 Few Shots yields better results than MIPROv2 0-Shot.

I was a bit confused here: the final optimized prompt from MIPROv2 Few Shots actually didn't contain any few-shot examples, regardless of the LM used (see the inspection snippet after this list). Although the two settings yielded almost the same results for gpt-4o-mini inference, MIPROv2 Few Shots yielded significantly better results than MIPROv2 0-Shot when running inference with gpt-4o.

It's quite weird: if there were no few-shot examples in the final prompt, why would optimizing under the few-shot setting yield better instructions?

  2. When running inference with gpt-4o-mini: using gpt-4o-mini for prompt optimization yields significantly better results than using gpt-4o, regardless of the optimizer.

  3. When running inference with gpt-4o: under the 0-Shot setting, using gpt-4o-mini for prompt optimization is significantly better; under the Few Shots setting, using gpt-4o-mini or gpt-4o for prompt optimization yields similar results.

This could be a significant finding: it means we can potentially use a more cost-effective model for prompt optimization and still get prompts that perform well on a much larger model.

My guess is that if you can write instructions clear enough for a small model, they are naturally clear for a larger model too, but not the reverse. I don't know whether that makes sense. I will continue the study with Llama 3.2 to gain more insights.
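For what it's worth, here's the inspection snippet I mentioned above, continuing from the setup sketch. It prints the instruction and demo count that the optimizer attached to each predictor (attribute names per DSPy's Predict/Module API):

```python
# Inspect what the optimizer actually produced: the proposed instruction
# and any few-shot demos attached to each predictor.
for name, predictor in optimized.named_predictors():
    print(name)
    print("instruction:", predictor.signature.instructions)
    print("num demos:", len(predictor.demos))  # 0 here, despite Few Shots

# Saving the program also makes it easy to diff prompts across runs.
optimized.save("miprov2_fewshot.json")
```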

okhat commented 1 month ago

Hey, thanks so much for the thoughtful questions and exploration! This is super cool. I'm tagging @XenonMolecule on this (MIPRO's co-first author).

XenonMolecule commented 1 month ago

This is really great! Thanks for trying out these settings!

It's quite weird, because if there were no few-shot examples in the final prompt, why would optimization using few-shot settings yield better instructions?

MIPROv2 fewshot should still include fewshot examples in your prompt unless (1) your data has too few examples compared to the number of fewshot demos you specified, or (2) our search process finds that your program works better zeroshot than with any demos we include. Perhaps the latter is the case for your program, which would explain why it was left without fewshot demos.

In this case, optimizing with fewshot demos can still be better because we include the fewshot demos we've bootstrapped in the instruction proposal step. If you are doing zeroshot, we still try to bootstrap 3 demos to inform our instruction proposal model, but if you are doing fewshot, we use the max number of demos you specified. We loop over each of these demo sets and propose an instruction with each, so this can create a greater number and diversity of instructions.
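Concretely, the two settings differ only in the demo caps passed at compile time. Roughly, reusing the optimizer/program/trainset names from the sketch earlier in this thread:

```python
# 0-Shot: the final prompt carries no demos, but MIPROv2 still bootstraps
# a few demos internally to ground the instruction proposal step.
zeroshot_program = optimizer.compile(
    program, trainset=trainset,
    max_bootstrapped_demos=0, max_labeled_demos=0,
    requires_permission_to_run=False,
)

# Fewshot: demo sets are searched up to these caps, and each bootstrapped
# demo set also seeds a candidate instruction, so the search sees more
# (and more diverse) instruction candidates.
fewshot_program = optimizer.compile(
    program, trainset=trainset,
    max_bootstrapped_demos=4, max_labeled_demos=4,
    requires_permission_to_run=False,
)
```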

When running inference with gpt-4o-mini: using gpt-4o-mini for prompt optimization yields significantly better results than using gpt-4o, regardless of the optimizer.

Great finding! It doesn't fully surprise me that a model would write prompts that are good for itself. Someone put it nicely when I was talking to them the other day: prompts written by the same model you test on put that model in a "low perplexity" space, essentially making the prompt more "in-distribution" for it.

When running inference with gpt-4o: under the 0-Shot setting, using gpt-4o-mini for prompt optimization is significantly better; under the Few Shots setting, using gpt-4o-mini or gpt-4o for prompt optimization yields similar results.

This is cool! Some people have told me that gpt-4o-mini is better at prompt proposal than gpt-4o. I'm not certain why, but it's fascinating! We are building a benchmark that sweeps across models both as task models and as prompt proposal models, so hopefully we'll soon have a benchmark to point to with findings like this!

SauceCat commented 1 month ago

@XenonMolecule thanks for the comprehensive explanation! I learned a lot. 😸 Looking forward to the benchmark; it would be extremely helpful for real-world applications.