irthomasthomas / undecidability


LLM API Host Leaderboard | Artificial Analysis #651

Open irthomasthomas opened 4 months ago

irthomasthomas commented 4 months ago

LLM API Host Leaderboard | Artificial Analysis

DESCRIPTION: "Artificial Analysis


LLM API Hosts Leaderboard

Comparison and ranking of API host provider performance for AI LLM models across key metrics including price, performance / speed (throughput & latency), context window & others. For more details, including our methodology, see our FAQs.

API host providers compared: OpenAI, Microsoft Azure, Google, Amazon Bedrock, Mistral, Anthropic, Perplexity, Cohere, Together.ai, Anyscale, Deepinfra, Fireworks, Groq, and Lepton."

URL: Artificial Analysis - LLM API Host Leaderboard
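The speed metrics behind this leaderboard reduce to two measurements per host: time to first token (latency) and output tokens per second (throughput). Below is a minimal, hedged sketch of how such a measurement could be taken against a single OpenAI-compatible endpoint; the base URL, model id, and API-key variable are placeholders rather than anything from Artificial Analysis, and real rankings rely on exact token counts and many repeated runs.

```python
# Sketch: measure time-to-first-token and a rough throughput proxy against one
# OpenAI-compatible /chat/completions endpoint. Endpoint, model, and key names
# are hypothetical placeholders.
import json
import os
import time

import requests

BASE_URL = "https://api.example-host.com/v1"   # hypothetical provider endpoint
MODEL = "example-model"                        # hypothetical model id
API_KEY = os.environ.get("PROVIDER_API_KEY", "")

def measure_once(prompt: str) -> dict:
    """Stream one completion and record first-token latency and chunk rate."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
            "max_tokens": 256,
        },
        stream=True,
        timeout=120,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if delta.get("content"):
            chunks += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()
    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at or end) - start,
        # streamed chunks are only a proxy for tokens; real rankings count tokens
        "chunks_per_s": chunks / max(end - (first_token_at or start), 1e-6),
    }

if __name__ == "__main__":
    print(measure_once("Explain what an LLM API host leaderboard measures."))
```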

Suggested labels

{'label-name': 'api-host-provider-comparison', 'label-description': 'Comparison of API host providers for AI LLM models', 'confidence': 57.48}

irthomasthomas commented 4 months ago

Related issues

### #137: kagisearch/pyllms: Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2) with a built-in model performance benchmark

### Details
Similarity score: 0.87 - [ ] [kagisearch/pyllms: Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2) with a built-in model performance benchmark](https://github.com/kagisearch/pyllms)

> PyLLMs is a minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, AI21, Cohere, Aleph Alpha, HuggingfaceHub) with a built-in model performance benchmark. It is ideal for fast prototyping and evaluating different models thanks to:
>
> - Connect to top LLMs in a few lines of code
> - Response meta includes tokens processed, cost and latency, standardized across the models
> - Multi-model support: get completions from different models at the same time
> - LLM benchmark: evaluate models on quality, speed and cost
>
> Feel free to reuse and expand. Pull requests are welcome.
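A minimal sketch of PyLLMs usage as described in the quote: `llms.init()` and `.complete()` follow the project's README, while the exact contents of `result.meta` are an assumption and may vary across pyllms versions.

```python
# Sketch of connecting to one model via PyLLMs and reading standardized metadata.
import llms  # pip install pyllms

# Provider API keys are read from environment variables (e.g. OPENAI_API_KEY);
# see the pyllms README for the exact variable names per provider.
model = llms.init("gpt-3.5-turbo")

result = model.complete("What does an LLM API host leaderboard rank?")
print(result.text)   # the completion
print(result.meta)   # standardized metadata: tokens processed, cost, latency (assumed keys)
```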
### #310: Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
### Details
Similarity score: 0.86 - [ ] [Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

#### Suggested labels

{ "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }
{ "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }
### #463: EvalPlus Leaderboard
### Details
Similarity score: 0.86 - [ ] [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html)

# 🏆 EvalPlus Leaderboard 🏆

EvalPlus is an evaluation platform that assesses AI coders using a rigorous set of tests.

#### Suggested labels

{ "label-name": "leaderboard", "description": "Leaderboard for AI Coders", "repo": "github", "confidence": 94.75 }
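EvalPlus's core idea is to re-score generated code against a much larger test suite than the original benchmark ships with. The sketch below is a generic illustration of that idea under invented toy tests; it is not EvalPlus's actual API or dataset.

```python
# Generic illustration of test-augmented evaluation in the spirit of EvalPlus:
# a candidate passes only if it clears both the base and the extended ("plus")
# test cases. Names and tests here are illustrative, not from EvalPlus.
from typing import Callable, Iterable, Tuple

def evaluate(candidate: Callable[[int], int],
             base_tests: Iterable[Tuple[int, int]],
             plus_tests: Iterable[Tuple[int, int]]) -> dict:
    """Return pass/fail under the base suite and the augmented suite."""
    def passes(tests):
        return all(candidate(x) == expected for x, expected in tests)
    base_ok = passes(base_tests)
    # The augmented suite catches solutions that overfit the visible tests.
    plus_ok = base_ok and passes(plus_tests)
    return {"base": base_ok, "plus": plus_ok}

def buggy_abs(x: int) -> int:
    # Overfits the small base suite: wrong for negative inputs.
    return x if x >= 0 else 0

print(evaluate(buggy_abs,
               base_tests=[(3, 3), (0, 0)],
               plus_tests=[(-5, 5), (-1, 1)]))   # {'base': True, 'plus': False}
```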
### #172: FastChat llm_judge
### Details
Similarity score: 0.85 - [ ] [FastChat/fastchat/llm_judge at main · Intelligent-Systems-Lab/FastChat](https://github.com/Intelligent-Systems-Lab/FastChat/tree/main/fastchat/llm_judge)

LLM Judge | Paper | Leaderboard

In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
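The LLM-as-a-judge pattern described above can be sketched generically: send the question and a candidate answer to a strong model with a grading rubric and parse its score. The judging prompt, model name, and 1-10 scale below are illustrative assumptions, not FastChat's MT-bench templates.

```python
# Minimal sketch of LLM-as-a-judge grading with a generic rubric prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user question on a scale of 1-10 for helpfulness, accuracy, and depth.
Respond with only the number.

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

print(judge("What causes tides?", "Tides are caused mainly by the Moon's gravity."))
```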
### #408: llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp
### Details
Similarity score: 0.85 - [ ] [llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md)

### Llama Benchmarking Tool

This is a performance testing tool for llama.cpp. It allows you to test the performance of the library with different models, prompt processing batch sizes, numbers of threads, numbers of layers offloaded to the GPU, and output formats.

#### Table of Contents

- [Syntax](#syntax)
- [Examples](#examples)
- [Text generation with different models](#text-generation-with-different-models)
- [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes)
- [Different numbers of threads](#different-numbers-of-threads)
- [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu)
- [Output formats](#output-formats)

#### Syntax

```
usage: ./llama-bench [options]

options:
  -h, --help                 Show this help message and exit
  -m, --model                (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt             (default: 512)
  -n, --n-gen                (default: 128)
  -b, --batch-size           (default: 512)
  --memory-f32 <0|1>         (default: 0)
  -t, --threads              (default: 16)
  -ngl N, --n-gpu-layers     (default: 99)
  -mg i, --main-gpu          (default: 0)
  -mmq, --mul-mat-q <0|1>    (default: 1)
  -ts, --tensor_split
  -r, --repetitions          (default: 5)
  -o, --output               (default: md)
  -v, --verbose              (default: 0)
```

Multiple values can be given for each parameter by separating them with `,` or by specifying the parameter multiple times.

#### Examples

* Testing the performance of the model with default settings:

  ```
  ./llama-bench
  ```

* Testing the performance of the model with a specific batch size:

  ```
  ./llama-bench -b 1024
  ```

* Testing the performance of the model with a specific model file:

  ```
  ./llama-bench -m models/7B/ggml-model-q4_1.gguf
  ```

* Testing the performance of the model with a specific number of prompt and generated tokens:

  ```
  ./llama-bench -p 1024 -n 2048
  ```

* Testing the performance of the model with a specific number of threads:

  ```
  ./llama-bench -t 8
  ```

* Testing the performance of the model with a specific number of layers offloaded to the GPU:

  ```
  ./llama-bench -ngl 64
  ```

* Testing the performance of the model with a specific output format:

  ```
  ./llama-bench -o json
  ```

#### Text generation with different models

You can test the performance of the library with different models by specifying the model file using the `-m` or `--model` option.

#### Prompt processing with different batch sizes

You can test the performance of the library with different batch sizes by specifying the batch size using the `-b` or `--batch-size` option.

#### Different numbers of threads

You can test the performance of the library with different numbers of threads by specifying the number of threads using the `-t` or `--threads` option.

#### Different numbers of layers offloaded to the GPU

You can test the performance of the library with different numbers of layers offloaded to the GPU by specifying the number of GPU layers using the `-ngl` or `--n-gpu-layers` option.

#### Output formats

The benchmarking tool supports the following output formats:

- Markdown (`md`)
- CSV (`csv`)
- JSON (`json`)
- SQL (`sql`)

You can specify the output format using the `-o` or `--output` option.

#### Suggested labels
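Since multiple comma-separated values can be swept in one invocation, it is convenient to drive llama-bench from a script and collect its JSON output for comparison. The sketch below assumes the binary is in the working directory and that `-o json` prints a JSON array of result records to stdout; the exact field names in those records depend on the llama.cpp version.

```python
# Sketch: run llama-bench with JSON output and collect the result records.
import json
import subprocess

def run_llama_bench(model_path: str, extra_args: list[str] | None = None) -> list[dict]:
    cmd = ["./llama-bench", "-m", model_path, "-o", "json"] + (extra_args or [])
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)  # assumed: a JSON array of per-configuration records

if __name__ == "__main__":
    # Hypothetical model file; sweep two batch sizes in one invocation via ','.
    for record in run_llama_bench("models/7B/ggml-model-q4_0.gguf", ["-b", "512,1024"]):
        print(record)
```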
### #491: can-ai-code: Self-evaluating interview for AI coders
### Details
Similarity score: 0.85 - [ ] [the-crypt-keeper/can-ai-code: Self-evaluating interview for AI coders](https://github.com/the-crypt-keeper/can-ai-code)

# Title: the-crypt-keeper/can-ai-code: Self-evaluating interview for AI coders

A self-evaluating interview for AI coding models, written by humans and taken by AI.

## Key Ideas

- Interview questions written by humans, test taken by AI
- Inference scripts for all common API providers and CUDA-enabled quantization runtimes
- Sandbox environment (Docker-based) for untrusted Python and NodeJS code validation
- Evaluate effects of prompting techniques and sampling parameters on LLM coding performance
- Evaluate LLM coding performance degradation due to quantization

## News

- 2023-01-23: Evaluate `mlabonne/Beyonder-4x7B-v2` (AWQ only, FP16 was mega slow).
- 2

#### Suggested labels

{ "label-name": "interview-evaluation", "description": "Self-evaluating interview for AI coding models", "repo": "the-crypt-keeper/can-ai-code", "confidence": 96.49 }
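The Docker-based sandbox idea can be illustrated generically: run the untrusted, model-generated snippet in a throwaway container with no network and a memory cap. This sketch is not can-ai-code's actual harness; the image name and resource limits are illustrative.

```python
# Generic illustration of sandboxed validation of untrusted model-generated code.
import subprocess
import tempfile
from pathlib import Path

def run_untrusted_python(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Execute a snippet inside a throwaway container with no network access."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        return subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",           # no outbound network
                "--memory", "256m",            # cap memory
                "-v", f"{workdir}:/work:ro",   # mount the snippet read-only
                "python:3.11-slim",
                "python", "/work/candidate.py",
            ],
            capture_output=True, text=True, timeout=timeout_s,
        )

result = run_untrusted_python("print(sum(range(10)))")
print(result.stdout.strip(), "exit:", result.returncode)
```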