stephenleo / llm-structured-output-benchmarks

Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc. on tasks like multi-label classification, named entity recognition, and synthetic data generation.
Apache License 2.0

Demo polyfactory framework #1

Open · adrianeboyd opened 1 month ago

adrianeboyd commented 1 month ago

Hi, it's nice to come across a cross-library/model benchmark like this!

When looking at evaluations for structured output libraries, I feel like "valid response" is such a low bar when used on its own as a metric, and I think adding accuracy-related metrics would make these benchmarks more informative.

I fully acknowledge that this is a bit on the ornery side, but since it only took a few lines of code (it was very easy to do in this repo!), I wanted to submit a demo PR for a new framework that uses polyfactory to generate valid responses based on the response model, with 100% reliability and a latency of 0.000, maybe 0.001 on a bad day.
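For anyone curious, the core of the demo is roughly this pattern (a minimal sketch; `MultiLabelPrediction` is a made-up stand-in for the benchmark's actual response models):

```python
from pydantic import BaseModel
from polyfactory.factories.pydantic_factory import ModelFactory


class MultiLabelPrediction(BaseModel):
    # Hypothetical response model for illustration only.
    labels: list[str]


class PredictionFactory(ModelFactory[MultiLabelPrediction]):
    __model__ = MultiLabelPrediction


# Always returns a schema-valid instance with random field values,
# no LLM call involved.
prediction = PredictionFactory.build()
print(prediction)
```

polyfactory simply fabricates a schema-valid instance from the model definition, so a "valid response" metric is trivially satisfied without ever calling an LLM.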

I'd potentially be interested in contributing to work on additional metrics/tasks in the future, in particular named entity recognition!

stephenleo commented 1 month ago

Hey, I fully agree on adding an accuracy metric. The code already supports it, but I haven't published the results because I'm currently using a synthetic dataset whose ground-truth accuracy is ambiguous. In the long run I'd love to report metrics on standard datasets, but I've had difficulty finding a multi-label classification dataset with many possible classes. I'll continue to look! Do submit a PR if you find one.

Love your PR, thanks for the submission. I'll run the benchmarks today and update the README on your branch before merging!

adrianeboyd commented 1 month ago

Sorry, I did come across the hidden accuracy numbers after posting. I think you might want to refactor the scoring so you're evaluating the whole dataset at the end rather than averaging a per-item metric, since many of the standard metrics (micro F1 for NER, etc.) can't be recovered from a per-item average.
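To illustrate the concern, here's a rough sketch using scikit-learn (the labels and predictions are made up) showing that pooled micro F1 is not the same as an average of per-item scores:

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative gold labels and predictions for a multi-label task.
y_true = [{"sports", "politics"}, {"tech"}, {"tech", "health"}]
y_pred = [{"sports"}, {"health"}, {"tech", "health"}]

mlb = MultiLabelBinarizer().fit([set.union(*y_true, *y_pred)])
true_bin, pred_bin = mlb.transform(y_true), mlb.transform(y_pred)

# Whole-dataset micro F1: pools TP/FP/FN across all items first.
micro_f1 = f1_score(true_bin, pred_bin, average="micro")

# Per-item average: scores each item separately, then averages.
per_item_avg = f1_score(true_bin, pred_bin, average="samples")

print(micro_f1, per_item_avg)  # these generally differ
```

The two numbers generally diverge, and the pooled micro F1 can't be reconstructed from the per-item values alone, which is why computing metrics over the whole dataset at the end matters.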

stephenleo commented 1 month ago

Yes, I'm logging the predictions for each iteration, so I can calculate whole-dataset metrics at the end. I'll push the metric calculation code soon, but I won't publish the metrics until I find a good multi-label classification dataset.