merplumander commented 3 weeks ago

Language Model Overview

OpenAI

	gpt-4o	gpt-4o-mini	gpt-4-turbo	o1-preview	o1-mini
Description	Our high-intelligence flagship model for complex, multi-step tasks	Our affordable and intelligent small model for fast, lightweight tasks	The previous set of high-intelligence models	Reasoning model designed to solve hard problems across domains	Faster and cheaper reasoning model particularly good at coding, math, and science
Training data cut-off	Up to Oct 2023	Up to Oct 2023	Up to Dec 2023	Up to Oct 2023	Up to Oct 2023

Logprobs: yes
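
Where logprobs are available, they can be turned into a probability distribution over candidate answers. A minimal sketch, assuming the (token, logprob) pairs have already been extracted from the API response (the Chat Completions API returns them when the request sets `logprobs` and `top_logprobs`). The function name and the sample values are hypothetical and are not taken from this repository.

```python
import math


def normalize_top_logprobs(top_logprobs):
    """Convert (token, logprob) pairs into probabilities that sum to 1.

    Only the returned top-k tokens are considered, so the result is a
    renormalized distribution over those tokens, not over the full vocabulary.
    """
    probs = {token: math.exp(logprob) for token, logprob in top_logprobs}
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


# Hypothetical sample values for a yes/no forecasting question.
sample = [("Yes", math.log(0.6)), ("No", math.log(0.2))]
distribution = normalize_top_logprobs(sample)
print(distribution)
```

Renormalizing over the returned tokens is one possible design choice; using the raw `exp(logprob)` values instead keeps the probability mass of unlisted tokens implicit.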

Anthropic

	Claude 3.5 Sonnet	Claude 3.5 Haiku
Description	Most intelligent model	fastest model
API model name	claude-3-5-sonnet-20241022	claude-3-5-haiku-20241022
Training data cut-off	Apr 2024	July 2024

Logprobs: no

Gemini

Problem: Knowledge cut-off information not available

	Gemini 1.5 Flash	Gemini 1.5 Flash-8B	Gemini 1.5 Pro
Description	Fast and versatile performance across a diverse variety of tasks	High volume and lower intelligence tasks	Complex reasoning tasks requiring more intelligence
API model name	gemini-1.5-flash	gemini-1.5-flash-8b	gemini-1.5-pro
Versions and Release Dates	gemini-1.5-flash-001 (2024-05-24), gemini-1.5-flash-002 (2024-09-24)	gemini-1.5-flash-8b-001 (2024-10-24)	gemini-1.5-pro-001 (2024-05-24), gemini-1.5-pro-002 (2024-09-24)

Logprobs: no

Llama

	Llama 3.2 1B	Llama 3.2 3B	Llama 3.2 11B	Llama 3.2 90B	Llama 3.1 8B	Llama 3.1 70B	Llama 3.1 405B
API name	llama3.2-1b	llama3.2-3b	llama3.2-11b-vision	llama3.2-90b-vision	llama3.1-8b	llama3.1-70b	llama3.1-405b
Training data cut-off	Dec 2023	Dec 2023	Dec 2023	Dec 2023	Dec 2023	Dec 2023	Dec 2023

Grok

grok-beta

No further information is available without X Premium.

Mistral

No logprobs available from the API.

	Mistral Large 2	Mistral Small	Ministral 8B	Ministral 3B
Description	Top-tier reasoning for high-complexity tasks, for your most sophisticated needs.	Cost-efficient, fast, and reliable option for use cases such as translation, summarization, and sentiment analysis.	Powerful model for on-device use cases.	Even smaller model.
Models and release dates	mistral-large-2407 (2024-07-24)	mistral-small-2409 (2024-09-27), mistral-small-2402 (2024-02-26)	ministral-8b-2410 (2024-10-09)	ministral-3b-2410 (2024-10-09)

Qwen

	qwen-max	qwen-plus	qwen-turbo

Very good documentation for the open-source, downloadable models, but almost none for the commercial models available via the API. It would be great to run the open-source versions, but they are too large to run without additional resources.

lreining commented 2 weeks ago

Based on the information above, we can run an evaluation with data starting from August 1:

OpenAI:
- gpt-4o
- gpt-4o-mini
- gpt-4-turbo: not as intelligent as the o-models, but has a later training data cut-off
Anthropic:
- claude-3-5-sonnet-20241022
- claude-3-5-haiku-20241022
Gemini:
- gemini-1.5-pro-001
- gemini-1.5-flash-001
Llama:
- llama3.1-405b: largest Llama model
- llama3.2-90b-vision: largest Llama 3.2 model
- llama3.2-11b-vision: small Llama 3.2 model
Mistral:
- mistral-large-2407
- mistral-small-2402
Grok:
- grok-beta
Qwen:
- qwen-max
- qwen-plus
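
The selection rule behind this list (keep a model only if its knowledge cut-off, or failing that its release date, falls before the start of the evaluation data) can be sketched as follows. This is a sketch under stated assumptions: the year 2024 for the August 1 start is assumed, month-only cut-offs are mapped to the last day of the month, and for Gemini and Mistral the release date stands in as an upper bound on the unknown cut-off. Grok and Qwen are omitted because the overview gives no dates for them. The names `EVAL_START`, `MODEL_DATES`, and `eligible` are hypothetical.

```python
from datetime import date

# Assumption: the thread says "August (the 1st)" without a year; 2024 is assumed.
EVAL_START = date(2024, 8, 1)

# Dates copied from the overview. Month-only cut-offs are mapped to the last
# day of that month (an assumption). For Gemini and Mistral the value is the
# release date, used as an upper bound on the undocumented cut-off.
MODEL_DATES = {
    "gpt-4o": date(2023, 10, 31),
    "gpt-4o-mini": date(2023, 10, 31),
    "gpt-4-turbo": date(2023, 12, 31),
    "claude-3-5-sonnet-20241022": date(2024, 4, 30),
    "claude-3-5-haiku-20241022": date(2024, 7, 31),
    "gemini-1.5-pro-001": date(2024, 5, 24),
    "gemini-1.5-pro-002": date(2024, 9, 24),
    "gemini-1.5-flash-001": date(2024, 5, 24),
    "gemini-1.5-flash-002": date(2024, 9, 24),
    "gemini-1.5-flash-8b-001": date(2024, 10, 24),
    "llama3.1-405b": date(2023, 12, 31),
    "llama3.2-90b-vision": date(2023, 12, 31),
    "llama3.2-11b-vision": date(2023, 12, 31),
    "mistral-large-2407": date(2024, 7, 24),
    "mistral-small-2402": date(2024, 2, 26),
    "mistral-small-2409": date(2024, 9, 27),
    "ministral-8b-2410": date(2024, 10, 9),
}


def eligible(model_dates, eval_start):
    """Return the models whose cut-off or release date precedes eval_start."""
    return sorted(name for name, d in model_dates.items() if d < eval_start)


selected = eligible(MODEL_DATES, EVAL_START)
print(selected)
```

With these dates, the `-002` Gemini versions, `gemini-1.5-flash-8b-001`, `mistral-small-2409`, and `ministral-8b-2410` are filtered out because they were released after the assumed start date.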

lreining commented 1 week ago

After a first test, we decided to exclude gpt-4-turbo (too expensive), claude-3-5-haiku (worse than claude-3-5-sonnet), and llama3.2-11b (as we already have the 90b version).

merplumander / ai-forecasting

Language Models and Knowledge Cut-offs #1