Benchmark various LLM structured-output frameworks (Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, and others) on tasks such as multi-label classification, named entity recognition, and synthetic data generation.
Multi-label classification

| Framework | Model | Reliability | Latency p95 (s) |
| --- | --- | --- | --- |
| Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
| Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
| OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
| Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
| Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 7.606* |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
| Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
| Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
| Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |
Named Entity Recognition

| Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
| Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
| Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
| Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
| Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |
Synthetic Data Generation

| Framework | Model | Reliability | Latency p95 (s) | Variety |
| --- | --- | --- | --- | --- |
| Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
| Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
| Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
| Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
| Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
| Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.350 | 3.577* | 1.000 |
| OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
| LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |
\* Latencies marked with an asterisk were measured locally on an NVIDIA GeForce RTX 4080 Super GPU.
Run the benchmark:

1. Install the requirements: `pip install -r requirements.txt`
2. Set your OpenAI API key: `export OPENAI_API_KEY=sk-...`
3. Run the benchmark: `python -m main run-benchmark`
4. Raw results are saved to the `results` directory.
5. Generate the results tables:
    - Multi-label classification: `python -m main generate-results`
    - Named Entity Recognition: `python -m main generate-results --task ner`
    - Synthetic data generation: `python -m main generate-results --task synthetic_data_generation`
6. To see all options for any command, add `--help` after it. E.g., `python -m main run-benchmark --help`
Benchmark methodology:

Multi-label classification

- Data: generated with `python -m data_sources.generate_dataset generate-multilabel-data`. See `python -m data_sources.generate_dataset generate-multilabel-data --help` for more details.
- Prompt: `"Classify the following text: {text}"`
- Reliability: the average of all the `percent_successful` values.
- Experiment details: run each row through the framework `n_runs` number of times and log the percent of successful runs for each row.

Named Entity Recognition

- Data: generated with `python -m data_sources.generate_dataset generate-ner-data`. See `python -m data_sources.generate_dataset generate-ner-data --help` for more details.
- Prompt: `"Extract and resolve a list of entities from the following text: {text}"`
- Reliability: the average of all the `percent_successful` values.
- Experiment details: run each row through the framework `n_runs` number of times and log the percent of successful runs for each row.

Synthetic Data Generation

- Prompt: `"Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose."`
- Reliability: the average of all the `percent_successful` values.
- Experiment details: run the prompt through the framework `n_runs` number of times and log the percent of successful runs.
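For reference, here is a rough sketch of how the two reported columns relate to these logs; the per-row numbers below are hypothetical and not taken from the benchmark:

```python
import numpy as np

# Hypothetical per-row logs for one framework on one task.
percent_successful = [1.0, 1.0, 0.9, 1.0]  # fraction of successful runs per row
latencies = [0.81, 1.12, 0.95, 1.33]       # seconds, one entry per run

reliability = np.mean(percent_successful)   # "Reliability" column
latency_p95 = np.percentile(latencies, 95)  # "Latency p95 (s)" column
print(f"Reliability={reliability:.3f}, Latency p95={latency_p95:.3f}s")
```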
Benchmark your own data:

1. Create a `.pkl` file of your data with the following columns (see the sketch after this list for one way to create it):
    - `text`: the text to be sent to the framework.
    - `labels`: list of labels associated with the text.
    See `data/multilabel_classification.pkl` for an example.
2. Update the `./config.yaml` file under the `source_data_pickle_path` key for all the frameworks you want to test, pointing it to your new `.pkl` file.
3. Run `python -m main run-benchmark` to test the new data on all the frameworks!
4. Run `python -m main generate-results` to generate the results.
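A minimal sketch of preparing such a pickle file, assuming the rows are stored in a pandas DataFrame (the file name and example rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows: each row holds the input text and its list of labels.
df = pd.DataFrame(
    {
        "text": [
            "please set an alarm for 7 am",
            "what is the weather in tokyo tomorrow",
        ],
        "labels": [["alarm_set"], ["weather_query"]],
    }
)

# Save in the same format as data/multilabel_classification.pkl, then point
# source_data_pickle_path in ./config.yaml at this file.
df.to_pickle("data/my_data.pkl")
```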
Add a new framework:

The easiest way to create a new framework is to reference the `./frameworks/instructor_framework.py` file. Detailed steps are as follows (a minimal sketch combining them is shown after this list):

1. Create a new file in the `./frameworks` directory named after your framework, e.g., `instructor_framework.py` for the Instructor framework.
2. In this file, create a class that inherits `BaseFramework` from `frameworks.base`.
3. The class should implement the `init` method that initializes the base class. Here are the arguments the base class expects:
    - `task` (str): the task that the framework is being tested on. Obtained from the `./config.yaml` file. Allowed values are `"multilabel_classification"` and `"ner"`.
    - `prompt` (str): the prompt template used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `llm_model` (str): the LLM model to be used. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `llm_model_family` (str): the LLM model family to be used. Currently supported values are `"openai"` and `"transformers"`. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `retries` (int): number of retries for the framework. Default is 0. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `source_data_pickle_path` (str): path to the source data pickle file. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `sample_rows` (int): number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in `source_data_pickle_path` for the benchmarking. Obtained from the `init_kwargs` in the `./config.yaml` file.
    - `response_model` (Any): the response model to be used. Internally passed by the benchmarking script.
4. The class should implement a `run` method that takes the following arguments:
    - `task`: the task that the framework is being tested on. Obtained from the `task` key in the `./config.yaml` file. E.g., `"multilabel_classification"`.
    - `n_runs`: number of times to repeat each text.
    - `expected_response`: output expected from the framework. Use a default value of `None`.
    - `inputs`: a dictionary of `{"text": str}` where `str` is the text to be sent to the framework. Use a default value of an empty dictionary `{}`.
5. The `run` method should create another `run_experiment` function that takes `inputs` as an argument, runs that input through the framework, and returns the output.
6. The `run_experiment` function should be annotated with the `@experiment` decorator from `frameworks.base`, with `n_runs`, `expected_response` and `task` as arguments.
7. The `run` method should call the `run_experiment` function and return the four outputs: `predictions`, `percent_successful`, `metrics` and `latencies`.
8. Import this new class in `frameworks/__init__.py`.
9. Add a new entry in the `./config.yaml` file with the name of the class as the key. The YAML entry can have the following fields:
    - `task`: the task that the framework is being tested on. Allowed values are `"multilabel_classification"` and `"ner"`.
    - `n_runs`: number of times to repeat each text.
    - `init_kwargs`: all the arguments that need to be passed to the `init` method of the class, including those mentioned in step 3 above.
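Putting these steps together, here is a minimal sketch of what such a file might look like, using the Instructor library as the framework being wrapped. It is an illustration rather than the repository's actual `instructor_framework.py`, and it assumes `BaseFramework` exposes its init arguments as attributes with the same names (`self.prompt`, `self.llm_model`, `self.retries`, `self.response_model`):

```python
# Hypothetical ./frameworks/my_instructor_framework.py -- an illustrative sketch.
from typing import Any

import instructor
from openai import OpenAI

from frameworks.base import BaseFramework, experiment


class MyInstructorFramework(BaseFramework):
    def __init__(self, **kwargs) -> None:
        # Forwards task, prompt, llm_model, llm_model_family, retries,
        # source_data_pickle_path, sample_rows and response_model to BaseFramework.
        super().__init__(**kwargs)
        self.client = instructor.from_openai(OpenAI())

    def run(
        self, task: str, n_runs: int, expected_response: Any = None, inputs: dict = {}
    ) -> tuple:
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Fill the prompt template from ./config.yaml with the row's text and
            # ask the LLM for a response parsed into the response model.
            return self.client.chat.completions.create(
                model=self.llm_model,
                response_model=self.response_model,
                max_retries=self.retries,
                messages=[{"role": "user", "content": self.prompt.format(**inputs)}],
            )

        # The decorator collects predictions, success rate, metrics and latencies.
        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```

The matching `./config.yaml` entry would then use the class name (`MyInstructorFramework`) as the key and set `task`, `n_runs` and the `init_kwargs` described in step 3.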
Framework related tasks:

| Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
| --- | --- | --- | --- |
| OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
| Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
| Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
| Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
| LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
| Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
| Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
| Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
| DSPy | 💭 Planning | 💭 Planning | 💭 Planning |
| Langchain | 💭 Planning | 💭 Planning | 💭 Planning |
Contributions are welcome!
To cite LLM Structured Output Benchmarks in your work, please use the following BibTeX reference:

```bibtex
@software{marie_stephen_leo_2024_12327267,
  author    = {Marie Stephen Leo},
  title     = {{stephenleo/llm-structured-output-benchmarks: Release for Zenodo}},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  version   = {v0.0.1},
  doi       = {10.5281/zenodo.12327267},
  url       = {https://doi.org/10.5281/zenodo.12327267}
}
```
If this work helped you in any way, please consider giving this repository a ⭐ to give me feedback so I can spend more time on this project.