Comparison and ranking of API host provider performance for AI LLM models across key metrics, including price, performance/speed (throughput and latency), context window, and others. For more details, including our methodology, see our FAQs.
API host providers compared: OpenAI, Microsoft Azure, Google, Amazon Bedrock, Mistral, Anthropic, Perplexity, Cohere, Together.ai, Anyscale, Deepinfra, Fireworks, Groq, and Lepton.
### #137: kagisearch/pyllms: Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2), with a built-in model performance benchmark
### Details: Similarity score: 0.87
- [ ] [kagisearch/pyllms: Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2), with a built-in model performance benchmark](https://github.com/kagisearch/pyllms)
PyLLMs is a minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, AI21, Cohere, Aleph Alpha, HuggingfaceHub) with a built-in model performance benchmark.
It is ideal for fast prototyping and evaluating different models thanks to:
- Connect to top LLMs in a few lines of code
- Response meta includes tokens processed, cost, and latency, standardized across the models
- Multi-model support: get completions from different models at the same time
- LLM benchmark: evaluate models on quality, speed, and cost
Feel free to reuse and expand. Pull requests are welcome.
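The standardized response metadata described above (tokens, cost, latency) can be sketched generically. This is not PyLLMs' actual API; `measure_completion` and the stand-in model below are hypothetical, shown only to illustrate normalizing metadata across providers:

```python
import time

def measure_completion(complete_fn, prompt, price_per_1k_tokens):
    """Call a completion function and return standardized metadata
    (tokens, cost, latency), similar in spirit to PyLLMs' response meta."""
    start = time.perf_counter()
    text = complete_fn(prompt)
    latency = time.perf_counter() - start
    # Crude whitespace token estimate for illustration; real libraries use a tokenizer.
    tokens = len(prompt.split()) + len(text.split())
    cost = tokens / 1000 * price_per_1k_tokens
    return {"text": text, "tokens": tokens, "cost": cost, "latency": latency}

# Usage with a stand-in model instead of a real provider call:
def echo_model(prompt):
    return "echo: " + prompt

meta = measure_completion(echo_model, "hello world", price_per_1k_tokens=0.01)
```

Because every provider's response passes through the same wrapper, the resulting dicts are directly comparable, which is the point of a standardized meta format.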
### #310: Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
### Details: Similarity score: 0.86
- [ ] [Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
#### Suggested labels
#### { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" } { "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }
### #463: EvalPlus Leaderboard
### Details: Similarity score: 0.86
- [ ] [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html)
# 🏆 EvalPlus Leaderboard 🏆
EvalPlus is an evaluation platform that assesses AI coders using a rigorous set of tests.
#### Suggested labels
#### { "label-name": "leaderboard", "description": "Leaderboard for AI Coders", "repo": "github", "confidence": 94.75 }
### #172: FastChat llm_judge
### Details: Similarity score: 0.85
- [ ] [FastChat/fastchat/llm_judge at main · Intelligent-Systems-Lab/FastChat](https://github.com/Intelligent-Systems-Lab/FastChat/tree/main/fastchat/llm_judge)
LLM Judge
In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
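The judge loop can be sketched as follows. The `[[N]]` rating format mirrors MT-bench's verdict convention, but the template and parser here are illustrative, and `judge_fn` is a stand-in for a call to a strong judge model such as GPT-4:

```python
import re

JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the question on a 1-10 scale.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with 'Rating: [[N]]'."
)

def judge_answer(judge_fn, question, answer):
    """Ask a judge model to score an answer and parse the [[N]] rating."""
    verdict = judge_fn(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else None

# Usage with a stubbed judge in place of a real GPT-4 call:
score = judge_answer(lambda prompt: "Rating: [[8]]", "What is 2+2?", "4")
```

Parsing a fixed delimiter like `[[N]]` rather than free-form text is what makes the grading automatable at scale.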
### #408: llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp
### Details: Similarity score: 0.85
- [ ] [llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md)
### Llama Benchmarking Tool
This is a performance testing tool for llama.cpp. It allows you to test the performance of the library with different models, prompt processing batch sizes, number of threads, number of layers offloaded to the GPU, and output formats.
#### Table of Contents
- [Syntax](#syntax)
- [Examples](#examples)
- [Text generation with different models](#text-generation-with-different-models)
- [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes)
- [Different numbers of threads](#different-numbers-of-threads)
- [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu)
- [Output formats](#output-formats)
#### Syntax
```
usage: ./llama-bench [options]

options:
  -h, --help                  Show this help message and exit
  -m, --model                 (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt              (default: 512)
  -n, --n-gen                 (default: 128)
  -b, --batch-size            (default: 512)
  --memory-f32 <0|1>          (default: 0)
  -t, --threads               (default: 16)
  -ngl N, --n-gpu-layers      (default: 99)
  -mg i, --main-gpu           (default: 0)
  -mmq, --mul-mat-q <0|1>     (default: 1)
  -ts, --tensor_split
  -r, --repetitions           (default: 5)
  -o, --output                (default: md)
  -v, --verbose               (default: 0)
```
Multiple values can be given for each parameter by separating them with `,` or by specifying the parameter multiple times.
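Comma-separated values expand into the cartesian product of all parameter combinations, one benchmark run per combination. A minimal sketch of that expansion (the parameter names echo llama-bench's options, but `expand_runs` is an illustrative helper, not part of the tool):

```python
from itertools import product

def expand_runs(params):
    """Expand comma-separated option values into one config per combination,
    the way `./llama-bench -p 512,1024 -t 8,16` runs four benchmarks."""
    keys = list(params)
    value_lists = [str(params[k]).split(",") for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

runs = expand_runs({"n_prompt": "512,1024", "threads": "8,16"})
# 2 values x 2 values = 4 benchmark configurations
```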
#### Examples
* Testing the performance of the model with default settings:
```
./llama-bench
```
* Testing the performance of the model with a specific batch size:
```
./llama-bench -b 1024
```
* Testing the performance of the model with a specific model file:
```
./llama-bench -m models/7B/ggml-model-q4_1.gguf
```
* Testing the performance of the model with a specific number of prompt and generated tokens:
```
./llama-bench -p 1024 -n 2048
```
* Testing the performance of the model with a specific number of threads:
```
./llama-bench -t 8
```
* Testing the performance of the model with a specific number of layers offloaded to the GPU:
```
./llama-bench -ngl 64
```
* Testing the performance of the model with a specific output format:
```
./llama-bench -o json
```
#### Text generation with different models
You can test the performance of the library with different models by specifying the model file using the `-m` or `--model` option.
#### Prompt processing with different batch sizes
You can test the performance of the library with different batch sizes by specifying the batch size using the `-b` or `--batch-size` option.
#### Different numbers of threads
You can test the performance of the library with different numbers of threads by specifying the number of threads using the `-t` or `--threads` option.
#### Different numbers of layers offloaded to the GPU
You can test the performance of the library with different numbers of layers offloaded to the GPU by specifying the number of GPU layers using the `-ngl` or `--n-gpu-layers` option.
#### Output formats
The benchmarking tool supports the following output formats:
- Markdown (`md`)
- CSV (`csv`)
- JSON (`json`)
- SQL (`sql`)
You can specify the output format using the `-o` or `--output` option.
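As a sketch of what these formats look like for a single result row (the column names here are illustrative, not llama-bench's exact schema):

```python
import csv
import io
import json

row = {"model": "7B q4_0", "n_prompt": 512, "t/s": 35.2}

def to_markdown(r):
    """Render one result dict as a Markdown table."""
    keys = list(r)
    header = "| " + " | ".join(keys) + " |"
    sep = "| " + " | ".join("---" for _ in keys) + " |"
    values = "| " + " | ".join(str(r[k]) for k in keys) + " |"
    return "\n".join([header, sep, values])

def to_csv(r):
    """Render one result dict as a CSV header plus data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(r))
    writer.writeheader()
    writer.writerow(r)
    return buf.getvalue()

md = to_markdown(row)
js = json.dumps(row)  # JSON output is just the serialized dict
```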
### #491: can-ai-code: Self-evaluating interview for AI coders
### Details: Similarity score: 0.85
- [ ] [the-crypt-keeper/can-ai-code: Self-evaluating interview for AI coders](https://github.com/the-crypt-keeper/can-ai-code)
A self-evaluating interview for AI coding models, written by humans and taken by AI.
## Key Ideas
- Interview questions written by humans, test taken by AI
- Inference scripts for all common API providers and CUDA-enabled quantization runtimes
- Sandbox environment (Docker-based) for untrusted Python and Node.js code validation
- Evaluate effects of prompting techniques and sampling parameters on LLM coding performance
- Evaluate LLM coding performance degradation due to quantization
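The core loop, human-written checks run against AI-written code, can be sketched as below. This toy `evaluate_candidate` is a hypothetical helper that uses a bare `exec` for brevity; isolating exactly this step is what the project's Docker sandbox exists for:

```python
def evaluate_candidate(source, func_name, tests):
    """Execute candidate code in a scratch namespace and score it
    against (args, expected) test cases. Toy version: no isolation."""
    namespace = {}
    try:
        exec(source, namespace)
        fn = namespace[func_name]
    except Exception:
        return 0.0  # code did not compile or define the function
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(tests)

# An AI-written answer to the interview question "write add(a, b)":
score = evaluate_candidate("def add(a, b):\n    return a + b", "add",
                           [((1, 2), 3), ((0, 0), 0)])
```

Scoring pass rate per question rather than pass/fail lets partially correct answers still register, which matters when comparing quantized variants of the same model.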
## News
- 2023-01-23: Evaluate `mlabonne/Beyonder-4x7B-v2` (AWQ only, FP16 was mega slow).
#### Suggested labels
#### { "label-name": "interview-evaluation", "description": "Self-evaluating interview for AI coding models", "repo": "the-crypt-keeper/can-ai-code", "confidence": 96.49 }
### LLM API Host Leaderboard | Artificial Analysis
URL: Artificial Analysis - LLM API Host Leaderboard
#### Suggested labels
#### {'label-name': 'api-host-provider-comparison', 'label-description': 'Comparison of API host providers for AI LLM models', 'confidence': 57.48}