Comparing LLM Performance: Introducing the Open Source Leaderboard for LLM APIs
By Anyscale team | December 21, 2023
Introduction
In our previous blog, "Reproducible Performance Metrics for LLM Inference", we introduced LLMPerf, an open source tool designed to bring reproducibility and clarity to the world of LLM performance benchmarks.
Building on this foundation, we're excited to unveil:
LLMPerf leaderboard: A public dashboard that highlights the performance of LLM inference providers in the market.
Improvements to the LLMPerf open source project.
LLMPerf Leaderboard 🏆– A Public Open Source Dashboard for LLM Inference
The LLMPerf Leaderboard is a public, open source dashboard showcasing the performance of leading LLM inference providers in the market. Our goal with this dashboard is to equip users and developers with a clear understanding of the capabilities and limitations of LLM inference providers, informing decisions for future integrations and deployments.
In other technology domains, similar efforts around transparency across organizations have produced highly impactful standards like SQL and J2EE, as well as standardized benchmarks for comparing performance across technologies.
Unfortunately, no such standardized benchmark yet exists for LLM inference performance workloads, so we've indexed on our own experience and use cases we've seen in running Anyscale Endpoints.
The market is moving fast, and LLM developers need reliable performance metrics to compare alternatives. The LLMPerf Leaderboard aims to fulfill this need. Furthermore, the measurement tooling is fully open source -- meaning that the dashboard results are fully open and can be reproduced by anyone.
Leaderboard mechanics
The LLMPerf Leaderboard ranks LLM inference providers based on a suite of key performance metrics and gets updated weekly. These metrics are generated by the open source LLMPerf tool, using a representative use case scenario of 550 input tokens and 150 output tokens. These metrics are chosen on a best-effort basis to be as fair, transparent, and reproducible as possible.
In the same way that you might evaluate cars on performance metrics like 0-60mph, top speed, and fuel economy on highways vs. city streets, the leaderboard presents a clear overview of key metrics, encompassing:
Time to first token (TTFT): TTFT represents the time between when the prompt is submitted and when the LLM returns the first token. TTFT is especially important for streaming applications, such as chatbots.
Inter-token latency (ITL): The average time between consecutive tokens. ITL is a strong indicator of the decoding speed of an inference engine and contributes significantly to the user experience of streaming applications and time-sensitive enterprise applications. A sketch of how TTFT and ITL can be measured appears after this list.
Success rate: The proportion of successful responses, where the inference API operates without errors. Failures may occur due to server issues or exceeding the rate limit, and the success rate reflects the reliability and stability of the API.
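To make the two latency metrics above concrete, here is a minimal sketch of how TTFT and ITL can be derived from a streamed response. The stream_tokens function is a hypothetical stand-in for whatever streaming client a provider exposes; this is not LLMPerf's internal code.

```python
import time
import statistics

def measure_request(stream_tokens, prompt):
    """Measure TTFT and ITL for one streamed completion.

    `stream_tokens` is a hypothetical client function that yields tokens
    as they arrive from a provider; it stands in for whatever streaming
    API you are benchmarking and is not part of LLMPerf.
    """
    start = time.perf_counter()
    arrival_times = []
    for _token in stream_tokens(prompt):
        arrival_times.append(time.perf_counter())

    if not arrival_times:
        raise RuntimeError("no tokens were returned")

    ttft = arrival_times[0] - start  # time to first token
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    itl = statistics.mean(gaps) if gaps else 0.0  # average inter-token latency
    return ttft, itl
```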
Note that there are possible sources of bias or discrepancy between our measurements and the behavior you observe:
Our measurement of TTFT depends on client location, and can also be biased by providers that delay the first token in order to improve their measured ITL. Our current measurement location is us-west (Oregon).
Measured ITL does not purely reflect the system's capabilities; it is also affected by the existing system load and provider traffic.
Featured LLM inference providers
The LLMPerf leaderboard features a range of LLM inference providers, all evaluated with the same rigor for consistency and fairness (listed in alphabetical order):
Anyscale Endpoints
AWS Bedrock
Fireworks.ai
Lepton.ai
Perplexity
Replicate
Together.ai
If you are an LLM inference provider and would like your API added to this dashboard, please create an issue on the GitHub repository and/or email us at endpoints-help@anyscale.com.
Reproducibility and Transparency
In line with our commitment to transparency, all benchmarking code is open source, and we provide detailed runtime configurations. This empowers users and developers to easily reproduce the benchmarks, facilitating a deeper understanding and independent verification of the results.
The LLMPerf Leaderboard is more than just a measure of performance; it's a resource for the broader AI and LLM community. By providing clear, comparative insights into the top LLM inference providers, we aim to drive forward the field of large language model inference, encouraging innovation and excellence.
Upgraded LLMPerf – New Features and Improvements
The latest version of LLMPerf brings a suite of significant updates designed to provide more in-depth and customizable benchmarking capabilities for LLM inference. These updates include:
Expanded metrics with quantiles (P25-99): Comprehensive data representation for deeper insights.
Customizable benchmarking parameters: Tailor parameters to fit specific use case scenarios.
Introduction of a load test and a correctness test: Assessing performance and accuracy under stress.
Broad compatibility: Supports a range of products including Anyscale Endpoints, OpenAI, Anthropic, together.ai, Fireworks.ai, Perplexity, Hugging Face, Lepton AI, and various APIs supported by the LiteLLM project.
Extensibility with new providers: Seamless addition of new LLMs via the LLMClient API.
Enhanced Metrics Support
LLMPerf v2 now supports a broader range of metrics, including:
Latency metrics: time to first token and inter-token latency.
Throughput metrics: output tokens per second across requests.
Error rate: the proportion of responses where the inference API returned errors.
Each of these metrics is now reported with quantiles (P25-P99), providing a more nuanced picture of performance.
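As an illustration of how such quantiles can be produced, the sketch below summarizes a set of per-request measurements with NumPy. The latency values are placeholders rather than leaderboard data, and the code is not taken from LLMPerf itself.

```python
import numpy as np

# Illustrative per-request inter-token latencies in seconds
# (placeholder values, not leaderboard data).
inter_token_latencies = [0.021, 0.024, 0.019, 0.035, 0.028, 0.042, 0.026, 0.031]

# Summarize the distribution with the quantiles reported on the leaderboard.
quantiles = {
    f"p{q}": round(float(np.percentile(inter_token_latencies, q)), 4)
    for q in (25, 50, 75, 90, 95, 99)
}
print(quantiles)
```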
Customizable Parameters
The new version of LLMPerf introduces upgraded parameters, allowing users to tailor benchmarks to their own use cases (an example invocation follows the list below):
Model Selection (--model MODEL): Test across different LLM models for performance comparison.
Input Token Configuration:
--mean-input-tokens: Average number of tokens per prompt.
--stddev-input-tokens: Standard deviation of input tokens.
Output Token Configuration:
--mean-output-tokens: Average tokens generated per request.
--stddev-output-tokens: Standard deviation of generated tokens.
Concurrent Requests (--num-concurrent-requests): Set the number of simultaneous requests.
Test Duration and Limits:
--timeout: Duration of the load test.
--max-num-completed-requests: Cap on completed requests before test conclusion.
Additional Sampling Parameters (--additional-sampling-params): Extra parameters for nuanced testing.
Results Directory (--results-dir): Destination for saving test results.
LLM API Selection (--llm-api): Choose from available LLM APIs.
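Putting these flags together, a hypothetical invocation might look like the sketch below. It assumes the load-test entry point is named token_benchmark_ray.py, and uses the representative 550-input / 150-output token scenario mentioned above; the model name, standard deviations, and --llm-api value are illustrative and should be adapted to your provider and environment.

```python
import subprocess

# Hypothetical invocation of the LLMPerf load test using the flags described
# above. The script name, model, and parameter values are illustrative and
# should be adapted to your environment and provider.
subprocess.run(
    [
        "python", "token_benchmark_ray.py",
        "--model", "meta-llama/Llama-2-70b-chat-hf",
        "--mean-input-tokens", "550",
        "--stddev-input-tokens", "150",
        "--mean-output-tokens", "150",
        "--stddev-output-tokens", "10",
        "--num-concurrent-requests", "5",
        "--max-num-completed-requests", "150",
        "--timeout", "600",
        "--results-dir", "result_outputs",
        "--llm-api", "openai",
    ],
    check=True,
)
```

The test results are written to the directory given by --results-dir for later analysis.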
Load Test and Correctness Test
We also have new testing modes for evaluating different characteristics of the hosted models:
Load Test: Assess how LLMs handle high volumes of concurrent requests. This test measures metrics like inter-token latency and throughput, providing a clear picture of performance under load.
Correctness Test: It's not just about how fast an LLM can respond, but also how accurately. This test checks the accuracy of LLMs by verifying their responses against a set of predefined answers, adding another layer of reliability to our benchmarks (a minimal sketch of this idea appears below).
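As a rough illustration of the correctness-test idea, the sketch below checks a model response against a set of predefined answers. It is a simplification for exposition, not LLMPerf's actual matching logic.

```python
def is_correct(response: str, accepted_answers: list[str]) -> bool:
    """Return True if any accepted answer appears in the model's response.

    This is only a sketch of the idea behind a correctness test: compare the
    model output against predefined answers. LLMPerf's actual check may
    normalize or match responses differently.
    """
    normalized = response.strip().lower()
    return any(answer.lower() in normalized for answer in accepted_answers)

# Example: a prompt with a known answer.
assert is_correct("The answer is 42.", ["42", "forty-two"])
```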
With these upgrades, LLMPerf stands as a more versatile and powerful tool for benchmarking LLM inference providers. It offers unprecedented insights into the performance, accuracy, and reliability of various LLM products, aiding users and developers in making informed decisions about the right tools for specific needs.
Community Feedback
The launch of the LLMPerf Leaderboard and the updated LLMPerf mark significant milestones in our ongoing quest to provide a clear, reproducible, and transparent hub and source-of-truth for benchmarks in the LLM domain. By continuously refining our tools and methods, we aim to keep pace with the rapid advancements in this field, providing valuable insights for both developers and users of LLMs. We hope that these resources will serve as a cornerstone for objective performance assessment, driving innovation and excellence in the realm of language model inference.
We welcome anyone to give us feedback and encourage you to join the LLMPerf project community. If you are an LLM inference provider and would like your API added to this dashboard, please create an issue on the GitHub repository and/or email us at endpoints-help@anyscale.com.