Comparing LLM Performance: Introducing the Open Source Leaderboard for LLM APIs #766

Open irthomasthomas opened 8 months ago

irthomasthomas commented 8 months ago

Comparing LLM Performance: Introducing the Open Source Leaderboard for LLM APIs

By Anyscale team | December 21, 2023

Introduction

In our previous blog, "Reproducible Performance Metrics for LLM Inference", we introduced LLMPerf, an open source tool designed to bring reproducibility and clarity to the world of LLM performance benchmarks.

Building on this foundation, we're excited to unveil:

LLMPerf Leaderboard 🏆– A Public Open Source Dashboard for LLM Inference

The LLMPerf Leaderboard is a public and open source dashboard, showcasing the performance of leading LLM inference providers in the market. Our goal with this dashboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference providers, informing decisions for future integrations and deployments.

In other technology domains, similar efforts around transparency across organizations have produced highly impactful standards like SQL and J2EE, as well as standardized benchmarks for comparing performance across technologies.

Unfortunately, no such standardized benchmark yet exists for LLM inference performance workloads, so we've drawn on our own experience and the use cases we've seen while running Anyscale Endpoints.

The market is moving fast, and LLM developers need reliable performance metrics to compare alternatives. The LLMPerf Leaderboard aims to fulfill this need. Furthermore, the measurement tooling is fully open source -- meaning that the dashboard results are fully open and can be reproduced by anyone.

Leaderboard mechanics

The LLMPerf Leaderboard ranks LLM inference providers based on a suite of key performance metrics and is updated weekly. These metrics are generated by the open source LLMPerf tool using a representative scenario of 550 input tokens and 150 output tokens, and they are chosen on a best-effort basis to be as fair, transparent, and reproducible as possible.
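To make that setup concrete, here is a hedged sketch of how one such run could be driven with the open source LLMPerf repository. The script name (`token_benchmark_ray.py`), the flag names, and the model identifier below are assumptions based on the public repo and may differ between versions; treat this as illustrative rather than the exact leaderboard configuration.

```python
# Hypothetical driver for one leaderboard-style run: ~550 input tokens and
# ~150 output tokens against a single provider. Script path, flag names, and
# model id are assumptions and may not match the current LLMPerf CLI.
import subprocess

cmd = [
    "python", "token_benchmark_ray.py",
    "--model", "meta-llama/Llama-2-70b-chat-hf",   # example model id
    "--mean-input-tokens", "550",
    "--mean-output-tokens", "150",
    "--stddev-input-tokens", "0",
    "--stddev-output-tokens", "0",
    "--num-concurrent-requests", "1",
    "--max-num-completed-requests", "100",
    "--llm-api", "openai",        # provider adapter; varies by endpoint
    "--results-dir", "results/",
]
subprocess.run(cmd, check=True)
```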

In the same way that you might evaluate cars on performance metrics like 0-60mph, top speed, and fuel economy on highways vs. city streets, the leaderboard presents a clear overview of key metrics, encompassing:

Note that there may be some possible sources of bias or discrepancies relative to the behavior you observe:

Featured LLM inference providers

The LLMPerf leaderboard features a range of LLM inference providers, all evaluated with the same rigor for consistency and fairness (listed in alphabetical order):

If you are an LLM inference provider and would like your API added to this dashboard, please create an issue on the GitHub repository and/or email us at endpoints-help@anyscale.com.

Reproducibility and Transparency

In line with our commitment to transparency, all benchmarking code is open source, and we provide detailed runtime configurations. This empowers users and developers to easily reproduce the benchmarks, facilitating a deeper understanding and independent verification of the results.

The LLMPerf Leaderboard is more than just a measure of performance; it's a resource for the broader AI and LLM community. By providing clear, comparative insights into the top LLM inference providers, we aim to drive forward the field of large language model inference, encouraging innovation and excellence.

Upgraded LLMPerf – New Features and Improvements

The latest version of LLMPerf brings a suite of significant updates designed to provide more in-depth and customizable benchmarking capabilities for LLM inference. These updates include:

Enhanced Metrics Support

LLMPerf v2 now includes a broader range of metrics, including:

Each of these metrics is now reported with quantiles (P25-P99), providing a more nuanced picture of performance.
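As an illustration of how per-request measurements can be rolled up into such quantiles, here is a minimal sketch (not the LLMPerf code itself) that summarizes time-to-first-token and inter-token latency samples with NumPy. The sample values and metric names are placeholders.

```python
# Minimal sketch: summarizing per-request latency samples into quantiles.
# The sample values below are placeholders, not real benchmark results.
import numpy as np

ttft_seconds = np.array([0.21, 0.35, 0.28, 0.90, 0.31, 0.26, 0.40, 0.33])
inter_token_latency_s = np.array([0.012, 0.015, 0.011, 0.030, 0.014, 0.013])

def summarize(name, samples, quantiles=(25, 50, 75, 90, 95, 99)):
    """Print selected percentiles for a metric, mirroring a P25-P99 report."""
    stats = {f"P{q}": float(np.percentile(samples, q)) for q in quantiles}
    print(name, {k: round(v, 4) for k, v in stats.items()})

summarize("ttft_s", ttft_seconds)
summarize("inter_token_latency_s", inter_token_latency_s)
```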

Customizable Parameters

The new version of LLMPerf introduces upgraded parameters, allowing users to define their unique use cases:
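As a hedged sketch of how such parameters might be exercised in practice, the snippet below sweeps a few workload shapes (input/output token counts and concurrency) by re-invoking the same hypothetical benchmark script shown earlier. The flag names and model id are assumptions, not the verified LLMPerf CLI.

```python
# Hypothetical sweep over workload shapes; flag names and model id are
# assumptions based on the open source repo and may differ.
import subprocess

workloads = [
    {"mean_in": 550,  "mean_out": 150, "concurrency": 1},   # leaderboard-like
    {"mean_in": 2000, "mean_out": 200, "concurrency": 4},   # long-context chat
    {"mean_in": 100,  "mean_out": 500, "concurrency": 8},   # generation-heavy
]

for w in workloads:
    subprocess.run([
        "python", "token_benchmark_ray.py",
        "--mean-input-tokens", str(w["mean_in"]),
        "--mean-output-tokens", str(w["mean_out"]),
        "--num-concurrent-requests", str(w["concurrency"]),
        "--model", "example-model",   # placeholder model id
        "--llm-api", "openai",        # placeholder provider adapter
        "--results-dir", f"results/in{w['mean_in']}_out{w['mean_out']}_c{w['concurrency']}",
    ], check=True)
```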

Load Test and Correctness Test

We also have new testing modes for evaluating different characteristics of the hosted models:
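To give a feel for what a correctness-style check can look like, here is an illustrative sketch, not the actual LLMPerf test: it asks an OpenAI-compatible endpoint to restate a spelled-out number as digits and verifies the reply. The base URL, model name, prompt, and pass/fail rule are hypothetical placeholders.

```python
# Illustrative correctness-style check against an OpenAI-compatible endpoint.
# Base URL, model name, prompt, and validation rule are hypothetical
# placeholders, not the actual LLMPerf correctness test.
import re
from openai import OpenAI

client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def digits_roundtrip_ok(model: str = "example-model") -> bool:
    """Ask the model to restate a spelled-out number as digits and check it."""
    expected = "1234"
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Write 'one thousand two hundred thirty-four' using digits only.",
        }],
        temperature=0,
    )
    answer = resp.choices[0].message.content or ""
    return expected in re.findall(r"\d+", answer)  # pass if the digits appear

if __name__ == "__main__":
    print("correctness check passed:", digits_roundtrip_ok())
```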

With these upgrades, LLMPerf stands as a more versatile and powerful tool for benchmarking LLM inference providers. It offers unprecedented insights into the performance, accuracy, and reliability of various LLM products, aiding users and developers in making informed decisions about the right tools for specific needs.

Community Feedback

The launch of the LLMPerf Leaderboard and the updated LLMPerf mark significant milestones in our ongoing quest to provide a clear, reproducible, and transparent hub and source-of-truth for benchmarks in the LLM domain. By continuously refining our tools and methods, we aim to keep pace with the rapid advancements in this field, providing valuable insights for both developers and users of LLMs. We hope that these resources will serve as a cornerstone for objective performance assessment, driving innovation and excellence in the realm of language model inference.

We welcome feedback from everyone and encourage you to join the LLMPerf project community. If you are an LLM inference provider and would like your API added to this dashboard, please create an issue on the GitHub repository and/or email us at endpoints-help@anyscale.com.

Suggested labels

{'gh-repo': 'ai-leaderboards', 'label-description': 'Performance Comparison', 'label-name': 'llm-performance-comparison', 'confidence': 71.15}

irthomasthomas commented 8 months ago

Related content

- #651 (similarity score: 0.91)
- #505 (similarity score: 0.87)
- #763 (similarity score: 0.87)
- #494 (similarity score: 0.86)
- #645 (similarity score: 0.86)
- #317 (similarity score: 0.86)