stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Cached bad response: ValueError: Expected dict_keys(['answer']) but got dict_keys([]) #1544

Open isaacbmiller opened 1 week ago

isaacbmiller commented 1 week ago

If you're running an evaluation and the model messes up the output formatting, litellm will still cache the response, because it was a valid completion, just not in the correct schema.

If this happens often enough to abort the evaluation, all of your good requests are trapped, because the badly parsed examples are still sitting in the cache.

I think I should be able to specify retries for bad formatting (something like the sketch below in this comment).

Currently I don't increase the number of errors I allow, but that's another possible solution. This seems much more prevalent for less capable models or models without few-shot examples yet, but even with 3 good bootstrapped examples, 3.2 1B wasn't able to get through the evaluation set.

I could see the current caching behavior technically being intended, but it isn't fun from a user-experience perspective.
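Roughly the kind of knob I have in mind, as a sketch (the helper, the `max_format_retries` option, and the model name are just illustrative, not an existing DSPy/litellm API):

```python
import dspy

def answer_with_format_retries(question, max_format_retries=3):
    """Hypothetical helper: retry when the completion doesn't parse into the
    signature's fields, nudging temperature so the retry isn't answered from
    the litellm cache."""
    predict = dspy.Predict("question -> answer")
    for attempt in range(max_format_retries + 1):
        # A slightly different temperature gives a different cache key, so the
        # retry is a fresh request instead of the same cached bad response.
        lm = dspy.LM("openai/gpt-4o-mini", temperature=0.7 + 0.01 * attempt)
        try:
            with dspy.context(lm=lm):
                return predict(question=question)
        except ValueError:
            # e.g. "Expected dict_keys(['answer']) but got dict_keys([])"
            if attempt == max_format_retries:
                raise
```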

cc @okhat @Shangyint

isaacbmiller commented 1 week ago

I can work on this once we agree on a direction

Shangyint commented 1 week ago

I've had a similar experience with caching for other web search APIs, where bad (but valid) HTTP responses got cached too. Is there a way to wrap the caching function with the schema-parsing code, so that an error could be thrown or the call retried internally, preventing caching for the failing cases?
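Roughly what I'm picturing, as a sketch (the `cache_if_parseable` decorator and the `parse_fn` hook are hypothetical, not litellm's actual caching interface):

```python
import functools

def cache_if_parseable(cache, parse_fn):
    """Hypothetical wrapper: only write a response to the cache after it has
    passed schema parsing, so malformed-but-valid responses never get pinned."""
    def decorator(call_lm):
        @functools.wraps(call_lm)
        def wrapper(prompt, **kwargs):
            key = (prompt, tuple(sorted(kwargs.items())))
            if key not in cache:
                raw = call_lm(prompt, **kwargs)
                parse_fn(raw)      # raises (e.g. ValueError) on bad schema -> nothing cached
                cache[key] = raw   # only responses that parsed cleanly are stored
            return cache[key]
        return wrapper
    return decorator
```

That way a formatting failure just surfaces as an exception (or an internal retry) instead of becoming a permanent cache entry.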

okhat commented 1 week ago

@isaacbmiller I'm unsure I understand the issue. Just make sure the retry has a different temperature or a slightly different prompt? Why would you retry the identical prompt request?

isaacbmiller commented 1 week ago

> @isaacbmiller I'm unsure I understand the issue. Just make sure the retry has a different temperature or a slightly different prompt? Why would you retry the identical prompt request?

For any temperature > 0, the fact that a prompt failed once does not mean it will fail again on a second attempt.

For evaluation it's fine to say: yep, the model can't do it, mark it as wrong, move on.

For any less error-tolerant situation (still with temp > 0), I would want to retry until I get a correct parse.

The 1B model only errored on roughly 20/500 examples, so it's clearly capable enough to maybe get it right with another retry.
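Back-of-the-envelope, assuming failures are roughly independent across attempts (which only holds if the retry isn't just handed the cached bad response):

```python
p_fail = 20 / 500        # ~4% of examples fail to parse on a single attempt
print(p_fail ** 2)       # 0.0016  -> ~0.16% still failing after one retry
print(p_fail ** 3)       # 6.4e-05 -> ~0.006% still failing after two retries
```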