Hey, thanks so much for the thoughtful questions and exploration! This is super cool. I'm tagging @XenonMolecule on this (MIPRO's co-first author).
This is really great! Thanks for trying out these settings!
It's quite weird: if there were no few-shot examples in the final prompt, why would optimization under few-shot settings yield better instructions?
MIPROv2 fewshot should still include fewshot examples in your prompt unless (1) your data has too few examples compared to the number of fewshot demos that you specified, or (2) our search process finds that your program works better zeroshot than with any demos we include. Perhaps that is the case for your program, which would explain why it was left without fewshot demos.
In this case, optimizing with fewshot demos can still be better because we include the fewshot demos we've bootstrapped in the instruction proposal step. If you are doing zeroshot we still try to bootstrap 3 demos to inform our instruction proposal model, but if you are doing fewshot we use the max number of demos you specified. We loop over each of these demos and propose an instruction with each, so this can create a greater number and a greater diversity of instructions.
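For concreteness, here's roughly how the two settings map onto the optimizer's knobs. This is just a sketch: exact argument names can vary across DSPy versions, and `intent_program`, `trainset`, and `accuracy_metric` are placeholders for your own program, data, and metric.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Placeholder program, data, and metric -- substitute your own.
optimizer = MIPROv2(metric=accuracy_metric, num_candidates=10)

# Few-shot optimization: bootstrapped demos can end up in the prompt *and* are
# fed to the instruction-proposal step, which tends to diversify proposals.
fewshot_program = optimizer.compile(
    intent_program,
    trainset=trainset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)

# Zero-shot optimization: no demos in the final prompt, but a handful are
# still bootstrapped internally to ground the instruction proposals.
zeroshot_program = optimizer.compile(
    intent_program,
    trainset=trainset,
    max_bootstrapped_demos=0,
    max_labeled_demos=0,
)
```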
When running inference with gpt-4o-mini: using gpt-4o-mini for prompt optimization yields significantly better results than using gpt-4o, regardless of the optimizer.
Great finding! It doesn't fully surprise me that a model would write prompts that work well for itself. Someone put it nicely when I was talking to them the other day: prompts written by the same model you test on put that model in a "low perplexity" space, essentially making the prompt more "in-distribution" for that model.
When running inference with gpt-4o: under the 0-shot setting, using gpt-4o-mini for prompt optimization is significantly better; under the few-shot setting, using gpt-4o-mini or gpt-4o for prompt optimization yields similar results.
This is cool! Some people have told me that gpt-4o-mini is better at prompt proposal than gpt-4o. I'm not certain why, but this is fascinating! We are building out a benchmark to sweep across models both as task models and prompt proposal models, so hopefully we will soon have a benchmark to point to with findings like this!
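In case it helps anyone following along, MIPROv2 already lets you split the two roles, so you can mix and match. A rough sketch (the `dspy.LM` constructor and exact keyword names depend on your DSPy version; `intent_program`, `trainset`, and `accuracy_metric` are placeholders):

```python
import dspy
from dspy.teleprompt import MIPROv2

task_lm = dspy.LM("openai/gpt-4o")         # model that runs the task at inference time
prompt_lm = dspy.LM("openai/gpt-4o-mini")  # model that writes candidate instructions

dspy.configure(lm=task_lm)

optimizer = MIPROv2(
    metric=accuracy_metric,
    prompt_model=prompt_lm,  # proposes instructions during optimization
    task_model=task_lm,      # executes the program during the search
)
optimized_program = optimizer.compile(intent_program, trainset=trainset)
```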
@XenonMolecule thanks for the comprehensive explanation! I've really learned a lot. 😸 Looking forward to the benchmark; it would be extremely helpful for real-world applications.
First of all, thanks for the amazing package! I ran a sharing session with senior tech managers at my company, and they were quite convinced. I have some interesting insights to share and some questions to discuss.
Background
My task is quite simple: I need to optimize a signature for intention classification (only a couple of classes).
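For reference, the signature looks roughly like this (a simplified sketch; the field names and class labels here are placeholders, not my actual task):

```python
import dspy

class ClassifyIntent(dspy.Signature):
    """Classify the user's message into one of the supported intents."""

    message = dspy.InputField(desc="the user's message")
    intent = dspy.OutputField(desc="one of: 'billing', 'support', 'other'")

intent_program = dspy.Predict(ClassifyIntent)
```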
I've done some simple ablation studies, varying the optimizer, the 0-shot vs. few-shot setting, and the model used for prompt optimization (gpt-4o-mini vs. gpt-4o). After the optimization, I ran inference with the optimized prompts using gpt-4o-mini or gpt-4o and evaluated the results, using accuracy as the metric.
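The evaluation itself was just exact-match accuracy over a held-out set, roughly like this (a sketch; `devset`, `optimized_program`, and the field names are placeholders):

```python
from dspy.evaluate import Evaluate

# Exact-match accuracy for the classification task.
def accuracy_metric(example, pred, trace=None):
    return example.intent == pred.intent

evaluate = Evaluate(devset=devset, metric=accuracy_metric,
                    num_threads=8, display_progress=True)
score = evaluate(optimized_program)  # overall score across the devset
```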
Insights and Questions
When running inference with gpt-4o-mini: using gpt-4o-mini for prompt optimization yields significantly better results than using gpt-4o, regardless of the optimizer.
When running inference with gpt-4o: under the 0-shot setting, using gpt-4o-mini for prompt optimization is significantly better; under the few-shot setting, using gpt-4o-mini or gpt-4o for prompt optimization yields similar results.