stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Evaluate should take multiple metrics #344

Open thomasahle opened 9 months ago

thomasahle commented 9 months ago

Right now Evaluate(...) only takes one metric, but often we have multiple different scores we want to test at the same time, like "accuracy", "gold_passages_retrieved", "q/s", etc.

While it's not obvious how to support multiple metrics for compilation, it should be easier to do for evaluation.
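As a stopgap under the current single-metric API, one can simply run Evaluate once per metric. A minimal sketch, assuming user-defined metric functions `accuracy` and `gold_passages_retrieved` with the usual `(example, pred, trace=None)` signature; each pass re-runs the program, although repeated passes are typically cheap when DSPy's LM caching is on:

```python
# Workaround sketch under today's single-metric Evaluate: one pass per metric.
# `accuracy` and `gold_passages_retrieved` are assumed user-defined metrics.
from dspy.evaluate import Evaluate

def evaluate_with_metrics(program, devset, metrics, num_threads=8):
    scores = {}
    for name, metric in metrics.items():
        evaluator = Evaluate(devset=devset, metric=metric,
                             num_threads=num_threads, display_progress=True)
        scores[name] = evaluator(program)  # average score for this metric
    return scores

# scores = evaluate_with_metrics(program, devset,
#     {"accuracy": accuracy, "gold_passages_retrieved": gold_passages_retrieved})
```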

okhat commented 9 months ago

@thomasahle You're right; you've made two great points about Evaluate.

First, we need the key parallelism logic to be factored out so people can do parallel steps themselves. (Btw, this won't be too hard to make work inside modules; I know the parts that need care, basically dspy.settings, especially dspy.settings.trace at bootstrap time.)

Second, we need to support multi-metric evaluate, which is a smaller change.

Can I help you do a PR? :sweat_smile:
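For the second point, here is a minimal, runnable sketch of what a multi-metric evaluation could boil down to, assuming metric functions with the usual `(example, pred, trace=None)` signature: a single pass over the devset with every metric applied to each prediction, with the parallelism from the first point layered on separately.

```python
# Sketch of single-pass, multi-metric evaluation (sequential; parallelism
# would be layered on top of this loop).
def multi_metric_evaluate(program, devset, metrics):
    totals = {name: 0.0 for name in metrics}
    for example in devset:
        # Assumes Example.inputs() unpacks into the program's input fields.
        pred = program(**example.inputs())
        for name, metric in metrics.items():
            totals[name] += float(metric(example, pred))
    return {name: total / len(devset) for name, total in totals.items()}
```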

thomasahle commented 9 months ago

I'm happy to send some PRs. Right now I'm just a bull in a china shop hitting random obstacles, and creating issue reports to keep track of them. I don't think this one is super important to fix right now, but I just wanted to register it. If I'm creating too much spam on the issue tracker, I'm also happy to just keep a personal list of things to look into down the line :-)

If you add tags to the GitHub issue tracker, I can mark it as "nice to have" or "not important".

bhuvana-ak commented 7 months ago

Hello there, I am also looking for something similar to this. Any recent updates?

dhanishetty commented 3 months ago

metric = evaluate.combine(["accuracy", "recall", "precision", "f1"])
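The snippet above suggests a `combine` helper, which does not appear to exist in DSPy today. A user-side sketch of what such a helper could look like, taking named metric callables (rather than strings) and reducing them to a single weighted-average score so the result still fits the current single-metric Evaluate:

```python
# Sketch of a user-side combine() helper (not an existing DSPy API): wraps
# several metric callables into one scalar metric via a weighted average.
def combine(metrics, weights=None):
    weights = weights or {name: 1.0 for name in metrics}

    def combined(example, pred, trace=None):
        total = sum(weights[name] * float(metric(example, pred, trace))
                    for name, metric in metrics.items())
        return total / sum(weights.values())

    return combined

# metric = combine({"accuracy": accuracy, "recall": recall,
#                   "precision": precision, "f1": f1})
# Evaluate(devset=devset, metric=metric, num_threads=8)(program)
```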