microsoft / promptflow

Build high-quality LLM apps - from prototyping, testing to production deployment and monitoring.
https://microsoft.github.io/promptflow/
MIT License

[Feature Request] Visualizing the evaluations should look different from the promptflow traces, should provide some kind of data visualization #3492

Open · tyler-suard-parker opened this issue 1 week ago

tyler-suard-parker commented 1 week ago

Is your feature request related to a problem? Please describe.
Right now, when we visualize the evaluations, it is not easy to understand the results. For example, the visualization produced by this notebook, promptflow\examples\flex-flows\chat-async-stream\chat-stream-with-async-flex-flow.ipynb, looks like this:

[Screenshot: trace UI page showing the batch and evaluation runs]

It is not easy to see which evals failed and which succeeded, or the proportion of successes to failures.

Describe the solution you'd like
It would be nice to have a clearer visualization for the evaluations, because their purpose is different from that of the traces. For an evaluation we usually just want a simple pass/fail, whereas with a trace we want the full details. Here is an example:

eval report.zip

zhengfeiwang commented 1 week ago

Thank you for your suggestion! Adding a screenshot of your example below:

[Screenshot: the example pass/fail evaluation report from eval report.zip]

@tyler-suard-parker one thing I'd like to confirm: in which step do you get the above trace UI page? I see there are two runs in the URL, so I guess you are getting this from the line pf.visualize([base_run, eval_run])?

If so, how about changing it to pf.visualize(base_run) to see if it looks better? The evaluation run's results will be appended to the corresponding lines. Maybe we should also update our notebook there: pf.visualize was something different before and only recently switched to leveraging the trace UI.
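For reference, here is a minimal sketch of the two calls being discussed; the flow folders, data file, and column mapping are placeholders rather than the exact ones from the notebook:

```python
from promptflow.client import PFClient

pf = PFClient()

# Batch run of the chat flow over a small test dataset (paths are placeholders).
base_run = pf.run(
    flow="./chat-stream",      # flex flow folder
    data="./data.jsonl",       # one test question per line
)

# Evaluation run that scores the base run's outputs.
eval_run = pf.run(
    flow="./eval-checklist",   # evaluation flow folder
    data="./data.jsonl",
    run=base_run,
    column_mapping={"answer": "${run.outputs.output}"},  # mapping is illustrative
)

# What the notebook currently does: open the trace UI with both runs.
pf.visualize([base_run, eval_run])

# Suggested alternative: visualize only the base run; the evaluation
# results are appended to the corresponding lines.
pf.visualize(base_run)
```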

tyler-suard-parker commented 1 week ago

Yes, I am getting this from the line pf.visualize([base_run, eval_run]). I will try using pf.visualize(base_run) and let you know what happens.

I'm glad you like my suggestion; note that in the example report you can click on each question to expand it. Having the traces as you already do is nice, but it would also be helpful to have some kind of quick summary I can look at just to make sure all my evaluations came out OK: for example, a bar chart for each input-output pair showing correctness, etc., where clicking on a bar gives an explanation.
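To make that concrete, here is a rough sketch of the per-question summary I have in mind, built from the evaluation run's line results; the run name and the outputs.correctness column are assumptions, not promptflow's actual schema:

```python
import matplotlib.pyplot as plt
from promptflow.client import PFClient

pf = PFClient()

# Line-level results of the evaluation run as a pandas DataFrame.
details = pf.get_details("<eval-run-name>")  # placeholder run name

# "outputs.correctness" is an assumed score column emitted by the eval flow.
scores = details["outputs.correctness"]

# One bar per input-output pair; clicking a bar for an explanation would need
# a richer UI, this only shows the at-a-glance summary part of the idea.
plt.bar(range(len(scores)), scores)
plt.xlabel("test case")
plt.ylabel("correctness")
plt.title("Evaluation summary")
plt.tight_layout()
plt.show()
```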

tyler-suard-parker commented 1 week ago

I tried running pf.visualize(base_run) and got this. When I enabled the metrics column it looked a little better, but there is still a lot of information I don't need when I'm doing an evaluation:

[Screenshot: trace UI page for the base run, with the metrics column enabled]

I use evaluations as unit tests for my prompt engineering. I have 10 standard questions I ask. Every time I change one of my agent prompts, I run those 10 standard questions again as part of my CI/CD tests and I look at the test report, just to make sure none of my changes caused any incorrect answers. When I'm doing this with every commit, I don't have time to read through all the traces. It would be nice to just have a single diagram that shows me how the entire evaluation batch went.
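For what it's worth, this is roughly the kind of check I script around the report today; the run name, the outputs.correct column, and the inputs.question column are assumptions specific to my eval flow:

```python
import sys
from promptflow.client import PFClient

pf = PFClient()

# Line-level results of the evaluation run as a pandas DataFrame.
details = pf.get_details("<eval-run-name>")  # run name comes from the CI pipeline

# "outputs.correct" is an assumed boolean column produced by my eval flow.
failed = details[~details["outputs.correct"]]

print(f"{len(details) - len(failed)}/{len(details)} evaluation cases passed")
if not failed.empty:
    # Column names are illustrative.
    print(failed[["inputs.question", "outputs.correct"]])
    sys.exit(1)
```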

tyler-suard-parker commented 1 week ago

Something similar to this:

[Screenshot: example summary view showing overall results for the evaluation batch]

zhengfeiwang commented 6 days ago

Thank you for trying it out, and for the description of your scenario! Yes, I think something like a report would serve you better and be more intuitive, while the trace UI page does not support that well for now.

Engaging PM Chenlu @jiaochenlu on this topic.