openai/evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
14.35k stars, 2.54k forks
issues
Bugged Tools Eval
#1486
ojaffe
closed
3 months ago
0
Error Recovery Eval
#1485
ojaffe
closed
3 months ago
0
Support multiple completions for ModelbasedClassify
#1484
tom-christie
opened
3 months ago
0
Updates on existing evals; readmes; solvers
#1483
ojaffe
closed
3 months ago
0
Address sporadic hanging of evals on certain samples
#1482
thesofakillers
closed
3 months ago
0
Drop two datasets from steganography
#1481
thesofakillers
closed
3 months ago
0
Add info about logging and link to logviz
#1480
JunShern
closed
3 months ago
0
Investigate failing Assistants test
#1479
JunShern
closed
3 months ago
0
Investigating failing AssistantsSolver test
#1478
JunShern
closed
3 months ago
0
Drop two datasets from steganography
#1477
thesofakillers
closed
3 months ago
2
Polymer hswang
#1476
HongshuaiWang1
closed
3 months ago
0
add a new eval:needle_in_a_matrix
#1475
gordbegli
closed
1 month ago
1
PL
#1474
syusuke9999
closed
4 months ago
0
Create Chatgpt.New.Era
#1473
Myahr2088
closed
4 months ago
1
Update Biomedicine brach
#1472
Linmj-Judy
closed
4 months ago
0
tasks and metrics for Biomedicine from lmj
#1471
Linmj-Judy
closed
4 months ago
0
Extending to Azure OpenAI implementation
#1470
pkt1583
opened
4 months ago
1
Support for Azure OpenAI client
#1469
pkt1583
opened
4 months ago
2
Suppress 'HTTP/1.1 200 OK' logs from openai library
#1468
JunShern
closed
4 months ago
1
enzyme benchmark
#1466
Type59pro
closed
5 months ago
1
Wandb report bugfixes on failed evals
#1465
TablewareBox
closed
5 months ago
0
Fix small typos and inconsistencies in README
#1464
kwinkunks
closed
4 months ago
1
Skip ThreadPool in sequential mode
#1463
Tradunsky
closed
5 months ago
1
feat: run completion like we run in correlation
#1462
AdamGold
closed
5 months ago
0
Updates for Solvers
#1461
JunShern
closed
5 months ago
0
Logged spec now includes overridden args
#1460
ojaffe
closed
5 months ago
0
Local run doesn't save logs to disk
#1459
charles-somm
closed
5 months ago
1
Add mbpp dataset
#1458
GauravRanganath
closed
5 months ago
0
Tagged Release For 2.0.0
#1456
michaelAlvarino
closed
5 months ago
1
Gaurav/20240110/add gsm eval dataset
#1455
GauravRanganath
closed
5 months ago
0
Add ARC-Challenge
#1454
jacobbieker
closed
5 months ago
0
Add eval yaml for Theory of Mind eval
#1453
ojaffe
closed
5 months ago
0
Add run_id to final_report from LocalRecorder
#1452
ianmckenzie-oai
closed
5 months ago
0
Fix formatting/typing so pre-commit hooks pass
#1451
ianmckenzie-oai
closed
5 months ago
0
Improve MMMU performance with prompt engineering
#1450
etr2460
closed
6 months ago
0
Log model and usage stats in `record.sampling`
#1449
JunShern
closed
3 months ago
1
Request to change arithmetical_puzzles prompting
#1448
ArcticBeat05
opened
6 months ago
0
Randomly select MMMU answer when none is returned from the model
#1447
etr2460
closed
6 months ago
0
Fix Pydantic warning on data_test run
#1445
inwaves
closed
6 months ago
0
Release 2.0.0
#1444
etr2460
closed
6 months ago
0
Use the API key for testing evals in CI
#1443
etr2460
closed
6 months ago
0
Add MMMU evals and runner
#1442
etr2460
closed
6 months ago
0
Run tests on all commits to main
#1441
etr2460
closed
6 months ago
0
Fix branch tests with empty API Key
#1440
etr2460
closed
6 months ago
0
Fix make decision prompt in ballots to send from system, not assistant
#1439
james-aung
closed
6 months ago
0
Fix small typo in oaieval run function
#1438
inwaves
closed
6 months ago
0
Possibility to sell high quality benchmarks
#1437
guliashvili
closed
1 month ago
1
Add complete list of errors to MakeMeSay utils
#1436
inwaves
closed
6 months ago
0
Change wrong kwargs name
#1435
LoryPack
closed
6 months ago
0
Mismatch between LangChainChatModelCompletionFn code and registry
#1434
LoryPack
closed
6 months ago
3