stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License
19.06k stars 1.46k forks

Different optimization results between 2.5.16 -> 2.5.20 #1722

Open nielsgl opened 3 weeks ago

nielsgl commented 3 weeks ago

Hi!

I created a simple module and a set of 10 questions and answers to evaluate against a single PDF loaded into ChromaDB. When evaluating with DSPy version 2.5.16 like this:

evaluate = dspy.Evaluate(
    devset=data, metric=metric, num_threads=24, display_progress=True, display_table=3
)
evaluate(rag)

I get a semantic F1 score of 69. Then, when I run the optimization (which takes about 15 minutes) and evaluate the optimized program, I get a score of about 79.

tp = dspy.MIPROv2(
    metric=metric, auto="medium", num_threads=24
)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(
    RAG(),
    trainset=data[:7],
    valset=data[7:],
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    requires_permission_to_run=False,
    seed=0
)

evaluate(optimized_rag)

However, when I run this with version 2.5.20, I first get a score of 61 and, after optimization, a score of 69. These results are quite different from each other and significantly lower, even though everything is the same except the upgraded DSPy library. Interestingly, the optimization now finishes in about 2 minutes, which is significantly faster. Any thoughts on these differences?

okhat commented 3 weeks ago

Hey @nielsgl ! We adjusted the adapters layer (which sits between signatures and LMs) in DSPy 2.5.19; you can find the details on the releases page: https://github.com/stanfordnlp/dspy/releases

Perhaps we should save the adapter logic as part of the saved program, actually, so when you load it in the future, it's exactly identical in behavior to your older runs.

What do you think?

(Separately, I wouldn't think too much about the 69 vs 79 scores: since you're working with a valset of only 3 examples, noise has a lot of room.)
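The noise point can be made concrete with a quick back-of-the-envelope calculation (not DSPy code; the per-example spread of 30 points is an assumption, plausible when individual semantic F1 scores range from 0 to 100):

```python
# Rough illustration: how noisy is an average over only 3 validation
# examples? Assume each example's score has a standard deviation of
# ~30 points (an assumption, not measured from this issue).
import math

per_example_sd = 30.0   # assumed spread of individual example scores
n_examples = 3          # size of the valset discussed above

# The standard error of the mean shrinks only with sqrt(n).
standard_error = per_example_sd / math.sqrt(n_examples)
print(round(standard_error, 1))  # ~17.3 points
```

Under that assumption, a 10-point swing (69 vs 79) is well within one standard error of the mean, so the two numbers are hard to distinguish from noise.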

chenmoneygithub commented 4 days ago

@okhat We can save the adapter code with cloudpickle, so it's technically doable. But I don't think our adapter change should cause a performance downgrade in the first place: conceptually, the adapter just parses the input and output, and if there is a true downgrade, it could indicate that we are doing something wrong. So instead of officially supporting serializing a DSPy model together with its adapter code, maybe we should ensure that the newer-version adapter doesn't have a negative performance effect?
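For context, the cloudpickle approach would look roughly like the sketch below. This is not DSPy's actual serialization logic, and `legacy_parse` is a hypothetical stand-in for an older adapter's output parsing; the point is only that cloudpickle serializes the function's code by value, so loading the blob later replays the old behavior even if the installed library's adapter has since changed:

```python
# Minimal sketch (not DSPy's serialization): freeze a piece of
# parsing logic alongside a saved program using cloudpickle.
import cloudpickle

def legacy_parse(raw_output: str) -> str:
    """Hypothetical stand-in for an older adapter's output parsing."""
    return raw_output.strip().removeprefix("Answer:").strip()

# Save: bundle the parsing code itself, not just a name reference.
blob = cloudpickle.dumps(legacy_parse)

# Load: behavior is identical to what was saved, regardless of
# what the currently installed adapter code does.
restored = cloudpickle.loads(blob)
print(restored("Answer: 42"))  # -> 42
```

The trade-off chenmoneygithub raises is that pinning old parsing code this way can also freeze in old bugs, which is why ensuring the new adapter is not a regression may be the better fix.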