irthomasthomas / undecidability


LLM API Host Leaderboard | Artificial Analysis #651

Open irthomasthomas opened 4 months ago

irthomasthomas commented 4 months ago

LLM API Host Leaderboard | Artificial Analysis

DESCRIPTION: "Artificial Analysis


LLM API Hosts Leaderboard

Comparison and ranking of API host provider performance for AI LLM models across key metrics including price, performance / speed (throughput & latency), context window & others. For more details, including our methodology, see our FAQs.

API host providers compared: OpenAI, Microsoft Azure, Google, Amazon Bedrock, Mistral, Anthropic, Perplexity, Cohere, Together.ai, Anyscale, Deepinfra, Fireworks, Groq, and Lepton."

URL: Artificial Analysis - LLM API Host Leaderboard
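The speed metrics behind this leaderboard reduce to two measurements per host: time to first token (latency) and output tokens per second (throughput). Below is a minimal, hedged sketch of how such a measurement could be taken against a single OpenAI-compatible endpoint; the base URL, model id, and API-key variable are placeholders rather than anything from Artificial Analysis, and real rankings rely on exact token counts and many repeated runs.

```python
# Sketch: measure time-to-first-token and a rough throughput proxy against one
# OpenAI-compatible /chat/completions endpoint. Endpoint, model, and key names
# are hypothetical placeholders.
import json
import os
import time

import requests

BASE_URL = "https://api.example-host.com/v1"   # hypothetical provider endpoint
MODEL = "example-model"                        # hypothetical model id
API_KEY = os.environ.get("PROVIDER_API_KEY", "")

def measure_once(prompt: str) -> dict:
    """Stream one completion and record first-token latency and chunk rate."""
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
            "max_tokens": 256,
        },
        stream=True,
        timeout=120,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0].get("delta", {})
        if delta.get("content"):
            chunks += 1
            if first_token_at is None:
                first_token_at = time.perf_counter()
    end = time.perf_counter()
    return {
        "ttft_s": (first_token_at or end) - start,
        # streamed chunks are only a proxy for tokens; real rankings count tokens
        "chunks_per_s": chunks / max(end - (first_token_at or start), 1e-6),
    }

if __name__ == "__main__":
    print(measure_once("Explain what an LLM API host leaderboard measures."))
```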

Suggested labels

{'label-name': 'api-host-provider-comparison', 'label-description': 'Comparison of API host providers for AI LLM models', 'confidence': 57.48}

irthomasthomas commented 4 months ago

Related issues

### #137: kagisearch/pyllms: Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2) with a built-in model performance benchmark

### Details
Similarity score: 0.87 - [ ] [kagisearch/pyllms: Minimal Python library to connect to LLMs (OpenAI, Anthropic, AI21, Cohere, Aleph Alpha, HuggingfaceHub, Google PaLM2) with a built-in model performance benchmark](https://github.com/kagisearch/pyllms)

> PyLLMs is a minimal Python library to connect to LLMs (OpenAI, Anthropic, Google, AI21, Cohere, Aleph Alpha, HuggingfaceHub) with a built-in model performance benchmark. It is ideal for fast prototyping and evaluating different models thanks to:
>
> - Connect to top LLMs in a few lines of code
> - Response meta includes tokens processed, cost and latency, standardized across the models
> - Multi-model support: get completions from different models at the same time
> - LLM benchmark: evaluate models on quality, speed and cost
>
> Feel free to reuse and expand. Pull requests are welcome.
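A minimal sketch of PyLLMs usage as described in the quote: `llms.init()` and `.complete()` follow the project's README, while the exact contents of `result.meta` are an assumption and may vary across pyllms versions.

```python
# Sketch of connecting to one model via PyLLMs and reading standardized metadata.
import llms  # pip install pyllms

# Provider API keys are read from environment variables (e.g. OPENAI_API_KEY);
# see the pyllms README for the exact variable names per provider.
model = llms.init("gpt-3.5-turbo")

result = model.complete("What does an LLM API host leaderboard rank?")
print(result.text)   # the completion
print(result.meta)   # standardized metadata: tokens processed, cost, latency (assumed keys)
```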
### #310: Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
### Details
Similarity score: 0.86 - [ ] [Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

#### Suggested labels

{ "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }
{ "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }
### #463: EvalPlus Leaderboard
### Details
Similarity score: 0.86 - [ ] [EvalPlus Leaderboard](https://evalplus.github.io/leaderboard.html)

# 🏆 EvalPlus Leaderboard 🏆

EvalPlus is an evaluation platform that assesses AI coders using a rigorous set of tests.

#### Suggested labels

{ "label-name": "leaderboard", "description": "Leaderboard for AI Coders", "repo": "github", "confidence": 94.75 }
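EvalPlus's core idea is to re-score generated code against a much larger test suite than the original benchmark ships with. The sketch below is a generic illustration of that idea under invented toy tests; it is not EvalPlus's actual API or dataset.

```python
# Generic illustration of test-augmented evaluation in the spirit of EvalPlus:
# a candidate passes only if it clears both the base and the extended ("plus")
# test cases. Names and tests here are illustrative, not from EvalPlus.
from typing import Callable, Iterable, Tuple

def evaluate(candidate: Callable[[int], int],
             base_tests: Iterable[Tuple[int, int]],
             plus_tests: Iterable[Tuple[int, int]]) -> dict:
    """Return pass/fail under the base suite and the augmented suite."""
    def passes(tests):
        return all(candidate(x) == expected for x, expected in tests)
    base_ok = passes(base_tests)
    # The augmented suite catches solutions that overfit the visible tests.
    plus_ok = base_ok and passes(plus_tests)
    return {"base": base_ok, "plus": plus_ok}

def buggy_abs(x: int) -> int:
    # Overfits the small base suite: wrong for negative inputs.
    return x if x >= 0 else 0

print(evaluate(buggy_abs,
               base_tests=[(3, 3), (0, 0)],
               plus_tests=[(-5, 5), (-1, 1)]))   # {'base': True, 'plus': False}
```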
### #172: FastChat llm_judge
### Details
Similarity score: 0.85 - [ ] [FastChat/fastchat/llm_judge at main · Intelligent-Systems-Lab/FastChat](https://github.com/Intelligent-Systems-Lab/FastChat/tree/main/fastchat/llm_judge)

LLM Judge | Paper | Leaderboard

In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
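The LLM-as-a-judge pattern described above can be sketched generically: send the question and a candidate answer to a strong model with a grading rubric and parse its score. The judging prompt, model name, and 1-10 scale below are illustrative assumptions, not FastChat's MT-bench templates.

```python
# Minimal sketch of LLM-as-a-judge grading with a generic rubric prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the assistant's answer to the
user question on a scale of 1-10 for helpfulness, accuracy, and depth.
Respond with only the number.

[Question]
{question}

[Assistant's Answer]
{answer}
"""

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

print(judge("What causes tides?", "Tides are caused mainly by the Moon's gravity."))
```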
### #408: llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp
### Details
Similarity score: 0.85 - [ ] [llama.cpp/examples/llama-bench/README.md at master · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/README.md)

### Llama Benchmarking Tool

This is a performance testing tool for llama.cpp. It allows you to test the performance of the library with different models, prompt processing batch sizes, numbers of threads, numbers of layers offloaded to the GPU, and output formats.

#### Table of Contents

- [Syntax](#syntax)
- [Examples](#examples)
- [Text generation with different models](#text-generation-with-different-models)
- [Prompt processing with different batch sizes](#prompt-processing-with-different-batch-sizes)
- [Different numbers of threads](#different-numbers-of-threads)
- [Different numbers of layers offloaded to the GPU](#different-numbers-of-layers-offloaded-to-the-gpu)
- [Output formats](#output-formats)

#### Syntax

```
usage: ./llama-bench [options]

options:
  -h, --help                 Show this help message and exit
  -m, --model                (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt             (default: 512)
  -n, --n-gen                (default: 128)
  -b, --batch-size           (default: 512)
  --memory-f32 <0|1>         (default: 0)
  -t, --threads              (default: 16)
  -ngl N, --n-gpu-layers     (default: 99)
  -mg i, --main-gpu          (default: 0)
  -mmq, --mul-mat-q <0|1>    (default: 1)
  -ts, --tensor_split
  -r, --repetitions          (default: 5)
  -o, --output               (default: md)
  -v, --verbose              (default: 0)
```

Multiple values can be given for each parameter by separating them with `,` or by specifying the parameter multiple times.

#### Examples

* Testing the performance of the model with default settings:

  ```
  ./llama-bench
  ```

* Testing the performance of the model with a specific batch size:

  ```
  ./llama-bench -b 1024
  ```

* Testing the performance of the model with a specific model file:

  ```
  ./llama-bench -m models/7B/ggml-model-q4_1.gguf
  ```

* Testing the performance of the model with a specific number of prompt and generated tokens:

  ```
  ./llama-bench -p 1024 -n 2048
  ```

* Testing the performance of the model with a specific number of threads:

  ```
  ./llama-bench -t 8
  ```

* Testing the performance of the model with a specific number of layers offloaded to the GPU:

  ```
  ./llama-bench -ngl 64
  ```

* Testing the performance of the model with a specific output format:

  ```
  ./llama-bench -o json
  ```

#### Text generation with different models

You can test the performance of the library with different models by specifying the model file using the `-m` or `--model` option.

#### Prompt processing with different batch sizes

You can test the performance of the library with different batch sizes by specifying the batch size using the `-b` or `--batch-size` option.

#### Different numbers of threads

You can test the performance of the library with different numbers of threads by specifying the number of threads using the `-t` or `--threads` option.

#### Different numbers of layers offloaded to the GPU

You can test the performance of the library with different numbers of layers offloaded to the GPU by specifying the number of GPU layers using the `-ngl` or `--n-gpu-layers` option.

#### Output formats

The benchmarking tool supports the following output formats:

- Markdown (`md`)
- CSV (`csv`)
- JSON (`json`)
- SQL (`sql`)

You can specify the output format using the `-o` or `--output` option.

#### Suggested labels
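Since multiple comma-separated values can be swept in one invocation, it is convenient to drive llama-bench from a script and collect its JSON output for comparison. The sketch below assumes the binary is in the working directory and that `-o json` prints a JSON array of result records to stdout; the exact field names in those records depend on the llama.cpp version.

```python
# Sketch: run llama-bench with JSON output and collect the result records.
import json
import subprocess

def run_llama_bench(model_path: str, extra_args: list[str] | None = None) -> list[dict]:
    cmd = ["./llama-bench", "-m", model_path, "-o", "json"] + (extra_args or [])
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)  # assumed: a JSON array of per-configuration records

if __name__ == "__main__":
    # Hypothetical model file; sweep two batch sizes in one invocation via ','.
    for record in run_llama_bench("models/7B/ggml-model-q4_0.gguf", ["-b", "512,1024"]):
        print(record)
```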
### #491: can-ai-code: Self-evaluating interview for AI coders
### Details
Similarity score: 0.85 - [ ] [the-crypt-keeper/can-ai-code: Self-evaluating interview for AI coders](https://github.com/the-crypt-keeper/can-ai-code)

# Title: the-crypt-keeper/can-ai-code: Self-evaluating interview for AI coders

A self-evaluating interview for AI coding models, written by humans and taken by AI.

## Key Ideas

- Interview questions written by humans, test taken by AI
- Inference scripts for all common API providers and CUDA-enabled quantization runtimes
- Sandbox environment (Docker-based) for untrusted Python and NodeJS code validation
- Evaluate effects of prompting techniques and sampling parameters on LLM coding performance
- Evaluate LLM coding performance degradation due to quantization

## News

- 2023-01-23: Evaluate `mlabonne/Beyonder-4x7B-v2` (AWQ only, FP16 was mega slow).
- 2

#### Suggested labels

{ "label-name": "interview-evaluation", "description": "Self-evaluating interview for AI coding models", "repo": "the-crypt-keeper/can-ai-code", "confidence": 96.49 }
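The Docker-based sandbox idea can be illustrated generically: run the untrusted, model-generated snippet in a throwaway container with no network and a memory cap. This sketch is not can-ai-code's actual harness; the image name and resource limits are illustrative.

```python
# Generic illustration of sandboxed validation of untrusted model-generated code.
import subprocess
import tempfile
from pathlib import Path

def run_untrusted_python(code: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    """Execute a snippet inside a throwaway container with no network access."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code)
        return subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",           # no outbound network
                "--memory", "256m",            # cap memory
                "-v", f"{workdir}:/work:ro",   # mount the snippet read-only
                "python:3.11-slim",
                "python", "/work/candidate.py",
            ],
            capture_output=True, text=True, timeout=timeout_s,
        )

result = run_untrusted_python("print(sum(range(10)))")
print(result.stdout.strip(), "exit:", result.returncode)
```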