
🧩 LLM Structured Output Benchmarks


Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, and more, on tasks such as multi-label classification, named entity recognition, and synthetic data generation.

๐Ÿ† Benchmark Results [2024-08-25]

  1. Multi-label classification

     | Framework | Model | Reliability | Latency p95 (s) |
     | --- | --- | --- | --- |
     | Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 1.138 |
     | Modelsmith | gpt-4o-mini-2024-07-18 | 1.000 | 1.184 |
     | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 1.201 |
     | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.206 |
     | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 7.606* |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 3.649* |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 0.996 | 0.853 |
     | Marvin | gpt-4o-mini-2024-07-18 | 0.988 | 1.338 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.985 | 1.531 |

  2. Named Entity Recognition

     | Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
     | --- | --- | --- | --- | --- | --- | --- |
     | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 1.000 | 3.459 | 0.834 | 0.748 | 0.789 |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 6.573* | 0.701 | 0.262 | 0.382 |
     | Instructor | gpt-4o-mini-2024-07-18 | 0.998 | 2.438 | 0.776 | 0.768 | 0.772 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.989 | 3.879 | 0.768 | 0.738 | 0.752 |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 0.979 | 5.771 | 0.792 | 0.310 | 0.446 |
     | Marvin | gpt-4o-mini-2024-07-18 | 0.979 | 3.270 | 0.822 | 0.776 | 0.798 |

  3. Synthetic Data Generation

     | Framework | Model | Reliability | Latency p95 (s) | Variety |
     | --- | --- | --- | --- | --- |
     | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.923 | 0.750 |
     | Marvin | gpt-4o-mini-2024-07-18 | 1.000 | 1.496 | 0.010 |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 1.000 | 1.003 | 0.020 |
     | Modelsmith | gpt-4o-mini-2024-07-18 | 0.970 | 2.324 | 0.835 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.790 | 3.383 | 0.886 |
     | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.350 | 3.577* | 1.000 |
     | OpenAI Structured Output | gpt-4o-mini-2024-07-18 | 0.650 | 1.431 | 0.877 |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 0.650 | 2.561* | 0.662 |

* Latencies marked with an asterisk were measured running the model locally on an NVIDIA GeForce RTX 4080 Super GPU.

๐Ÿƒ Run the benchmark

  1. Install the requirements using pip install -r requirements.txt
  2. Set the OpenAI API key: export OPENAI_API_KEY=sk-...
  3. Run the benchmark using python -m main run-benchmark
  4. Raw results are stored in the results directory (see the sketch after this list for one way to inspect them).
  5. Generate the results using:
    • Multilabel classification: python -m main generate-results
    • NER: python -m main generate-results --task ner
    • Synthetic data generation: python -m main generate-results --task synthetic_data_generation
  6. To get help on the command-line arguments, add --help after the command, e.g., python -m main run-benchmark --help
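Once a run completes, you can poke at the raw outputs directly. A minimal sketch, assuming the results directory contains pandas pickle files (the exact file names and columns depend on the task and run):

```python
# Minimal sketch: inspect raw benchmark outputs.
# Assumption: the results directory holds pandas pickle files; the exact
# file names and columns depend on the task and run.
from pathlib import Path

import pandas as pd

for pickle_file in Path("results").glob("**/*.pkl"):
    df = pd.read_pickle(pickle_file)
    print(pickle_file, df.shape)
    print(df.head())
```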

🧪 Benchmark methodology

  1. Multi-label classification:
    • Task: Given a text, predict the labels associated with it.
    • Data:
      • Base data: Alexa intent detection dataset
      • The benchmark is run on synthetic data generated by running: python -m data_sources.generate_dataset generate-multilabel-data.
      • The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, following a configurable distribution of the number of classes per row. See python -m data_sources.generate_dataset generate-multilabel-data --help for more details.
    • Prompt: "Classify the following text: {text}"
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid labels without errors, i.e., the average of the per-row percent_successful values.
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
    • Experiment Details: Run each row through the framework n_runs number of times and log the percent of successful runs for each row.
  2. Named Entity Recognition
    • Task: Given a text, extract the entities present in it.
    • Data:
      • Base data: Synthetic PII Finance dataset
      • The benchmark is run on data sampled by running: python -m data_sources.generate_dataset generate-ner-data.
      • The data is sampled from the base data so that the number of entities per row follows a configurable distribution. See python -m data_sources.generate_dataset generate-ner-data --help for more details.
    • Prompt: Extract and resolve a list of entities from the following text: {text}
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid labels without errors, i.e., the average of the per-row percent_successful values.
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
      3. Precision: The micro average of the precision of the framework on the data.
      4. Recall: The micro average of the recall of the framework on the data.
      5. F1 Score: The micro average of the F1 score of the framework on the data.
    • Experiment Details: Run each row through the framework n_runs number of times and log the percent of successful runs for each row.
  3. Synthetic Data Generation
    • Task: Generate synthetic data conforming to a Pydantic data model schema.
    • Data:
      • A two-level nested User details Pydantic schema (a hedged sketch is shown after this list).
    • Prompt: Generate a random person's information. The name must be chosen at random. Make it something you wouldn't normally choose.
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid output without errors, i.e., the average of the per-row percent_successful values.
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
      3. Variety: The percentage of generated names that are unique out of all names generated.
    • Experiment Details: Run each row through the framework n_runs number of times and log the percent of successful runs.
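For reference, here is a hedged sketch of what a two-level nested User schema and the Variety metric could look like. The class and field names below are illustrative assumptions, not the repository's actual schema:

```python
# Hedged sketch: a hypothetical two-level nested Pydantic schema plus the
# Variety metric (unique names / total names). Field names are illustrative,
# not the repository's actual User schema.
from pydantic import BaseModel


class Address(BaseModel):  # second level of nesting
    street: str
    city: str
    country: str


class User(BaseModel):  # first level of nesting
    name: str
    age: int
    address: Address


def variety(generated_users: list[User]) -> float:
    """Fraction of generated names that are unique across all runs."""
    names = [user.name for user in generated_users]
    return len(set(names)) / len(names) if names else 0.0
```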

📊 Adding new data

  1. Create a new pandas dataframe pickle file with the following columns:
    • text: The text to be sent to the framework
    • labels: List of labels associated with the text
    • See data/multilabel_classification.pkl for an example, or the sketch after this list.
  2. Add the path to the new pickle file in the ./config.yaml file under the source_data_pickle_path key for all the frameworks you want to test.
  3. Run the benchmark using python -m main run-benchmark to test the new data on all the frameworks!
  4. Generate the results using python -m main generate-results
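A minimal sketch of how such a pickle file could be created (the rows and output file name below are made up for illustration):

```python
# Hedged sketch: build a source-data pickle with the expected columns.
# The example rows and the output file name are invented for illustration.
import pandas as pd

df = pd.DataFrame(
    {
        "text": [
            "play some jazz in the living room",
            "what is the weather tomorrow and set an alarm for 7am",
        ],
        "labels": [
            ["play_music"],
            ["weather_query", "set_alarm"],
        ],
    }
)
df.to_pickle("data/my_new_dataset.pkl")  # point source_data_pickle_path here
```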

๐Ÿ—๏ธ Adding a new framework

The easiest way to create a new framework is to reference the ./frameworks/instructor_framework.py file. Detailed steps are as follows:

  1. Create a .py file in the frameworks directory with the name of the framework, e.g., instructor_framework.py for the instructor framework.
  2. In this .py file create a class that inherits BaseFramework from frameworks.base.
  3. The class should define an init method that initializes the base class. Here are the arguments the base class expects:
    • task (str): the task that the framework is being tested on. Obtained from ./config.yaml file. Allowed values are "multilabel_classification" and "ner"
    • prompt (str): Prompt template used. Obtained from the init_kwargs in the ./config.yaml file.
    • llm_model (str): LLM model to be used. Obtained from the init_kwargs in the ./config.yaml file.
    • llm_model_family (str): LLM model family to be used. Currently supported values are "openai" and "transformers". Obtained from the init_kwargs in the ./config.yaml file.
    • retries (int): Number of retries for the framework. Default is 0. Obtained from the init_kwargs in the ./config.yaml file.
    • source_data_pickle_path (str): Path to the source data pickle file. Obtained from the init_kwargs in the ./config.yaml file.
    • sample_rows (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in source_data_pickle_path for the benchmarking. Obtained from the init_kwargs in the ./config.yaml file.
    • response_model (Any): The response model to be used. Internally passed by the benchmarking script.
  4. The class should define a run method that takes four arguments:
    • task: The task that the framework is being tested on. Obtained from the task in the ./config.yaml file, e.g., "multilabel_classification"
    • n_runs: number of times to repeat each text
    • expected_response: Output expected from the framework. Use a default value of None
    • inputs: a dictionary of {"text": str} where str is the text to be sent to the framework. Use a default value of an empty dictionary {}
  5. The run method should define an inner run_experiment function that takes inputs as an argument, runs that input through the framework, and returns the output.
  6. The run_experiment function should be annotated with the @experiment decorator from frameworks.base with n_runs, expected_response, and task as arguments.
  7. The run method should call the run_experiment function and return the four outputs: predictions, percent_successful, metrics, and latencies. See the sketch after this list for how the pieces fit together.
  8. Import this new class in frameworks/__init__.py.
  9. Add a new entry in the ./config.yaml file with the name of the class as the key. The YAML entry can have the following fields:
    • task: the task that the framework is being tested on. Obtained from ./config.yaml file. Allowed values are "multilabel_classification" and "ner"
    • n_runs: number of times to repeat each text
    • init_kwargs: all the arguments that need to be passed to the init method of the class, including those mentioned in step 3 above.
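Putting the steps together, here is a hedged sketch of what such a framework file might look like. It is inferred from the description above only; the actual signatures of BaseFramework and the @experiment decorator live in frameworks/base.py and may differ:

```python
# Hedged sketch of frameworks/my_framework.py, inferred from the steps above.
# The BaseFramework/@experiment signatures, self.prompt, and call_my_framework
# are assumptions or placeholders, not the repository's actual API.
from typing import Any

from frameworks.base import BaseFramework, experiment


class MyFramework(BaseFramework):
    def __init__(self, **kwargs) -> None:
        # Pass task, prompt, llm_model, llm_model_family, retries, etc.
        # through to the base class.
        super().__init__(**kwargs)

    def run(
        self, task: str, n_runs: int, expected_response: Any = None, inputs: dict = {}
    ):
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Run one input through the underlying structured-output framework
            # and return its parsed response (placeholder call).
            return self.call_my_framework(self.prompt.format(**inputs))

        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```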

🧭 Roadmap

  1. Framework related tasks:

     | Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
     | --- | --- | --- | --- |
     | OpenAI Structured Output | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Instructor | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Mirascope | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Fructose | ✅ OpenAI | 🚧 In Progress | 🚧 In Progress |
     | Marvin | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Llamaindex | ✅ OpenAI | ✅ OpenAI | ✅ OpenAI |
     | Modelsmith | ✅ OpenAI | 🚧 In Progress | ✅ OpenAI |
     | Outlines | ✅ HF Transformers | 🚧 In Progress | ✅ HF Transformers |
     | LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | ✅ HF Transformers |
     | Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
     | Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
     | Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
     | DsPy | 💭 Planning | 💭 Planning | 💭 Planning |
     | Langchain | 💭 Planning | 💭 Planning | 💭 Planning |
  2. Others
    • [x] Latency metrics
    • [ ] CI/CD pipeline for benchmark run automation
    • [ ] Async run

💡 Contribution guidelines

Contributions are welcome! Here are the steps to contribute:

  1. Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
  2. Once the issue is assigned to you, please submit a PR with the new framework!

🎓 Citation

To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:

@software{marie_stephen_leo_2024_12327267,
  author       = {Marie Stephen Leo},
  title        = {{stephenleo/llm-structured-output-benchmarks: 
                   Release for Zenodo}},
  month        = jun,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.12327267},
  url          = {https://doi.org/10.5281/zenodo.12327267}
}

๐Ÿ™ Feedback

If this work helped you in any way, please consider giving this repository a ⭐ to give me feedback so I can spend more time on this project.