openai evals issues - Githubissues

openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Other

14.35k stars 2.54k forks

issues

Bugged Tools Eval

#1486 ojaffe closed 3 months ago
0
Error Recovery Eval

#1485 ojaffe closed 3 months ago
0
Support multiple completions for ModelbasedClassify

#1484 tom-christie opened 3 months ago
0
Updates on existing evals; readmes; solvers

#1483 ojaffe closed 3 months ago
0
Address sporadic hanging of evals on certain samples

#1482 thesofakillers closed 3 months ago
0
Drop two datasets from steganography

#1481 thesofakillers closed 3 months ago
0
Add info about logging and link to logviz

#1480 JunShern closed 3 months ago
0
Investigate failing Assistants test

#1479 JunShern closed 3 months ago
0
Investigating failing AssistantsSolver test

#1478 JunShern closed 3 months ago
0
Drop two datasets from steganography

#1477 thesofakillers closed 3 months ago
2
Polymer hswang

#1476 HongshuaiWang1 closed 3 months ago
0
add a new eval:needle_in_a_matrix

#1475 gordbegli closed 1 month ago
1
PL

#1474 syusuke9999 closed 4 months ago
0
Create Chatgpt.New.Era

#1473 Myahr2088 closed 4 months ago
1
Update Biomedicine brach

#1472 Linmj-Judy closed 4 months ago
0
tasks and metrics for Biomedicine from lmj

#1471 Linmj-Judy closed 4 months ago
0
Extending to Azure OpenAI implementation

#1470 pkt1583 opened 4 months ago
1
Support for Azure OpenAI client

#1469 pkt1583 opened 4 months ago
2
Suppress 'HTTP/1.1 200 OK' logs from openai library

#1468 JunShern closed 4 months ago
1
enzyme benchmark

#1466 Type59pro closed 5 months ago
1
Wandb report bugfixes on failed evals

#1465 TablewareBox closed 5 months ago
0
Fix small typos and inconsistencies in README

#1464 kwinkunks closed 4 months ago
1
Skip ThreadPool in sequential mode

#1463 Tradunsky closed 5 months ago
1
feat: run completion like we run in correlation

#1462 AdamGold closed 5 months ago
0
Updates for Solvers

#1461 JunShern closed 5 months ago
0
Logged spec now includes overridden args

#1460 ojaffe closed 5 months ago
0
Local run doesn't save logs to disk

#1459 charles-somm closed 5 months ago
1
Add mbpp dataset

#1458 GauravRanganath closed 5 months ago
0
Tagged Release For 2.0.0

#1456 michaelAlvarino closed 5 months ago
1
Gaurav/20240110/add gsm eval dataset

#1455 GauravRanganath closed 5 months ago
0
Add ARC-Challenge

#1454 jacobbieker closed 5 months ago
0
Add eval yaml for Theory of Mind eval

#1453 ojaffe closed 5 months ago
0
Add run_id to final_report from LocalRecorder

#1452 ianmckenzie-oai closed 5 months ago
0
Fix formatting/typing so pre-commit hooks pass

#1451 ianmckenzie-oai closed 5 months ago
0
Improve MMMU performance with prompt engineering

#1450 etr2460 closed 6 months ago
0
Log model and usage stats in `record.sampling`

#1449 JunShern closed 3 months ago
1
Request to change arithmetical_puzzles prompting

#1448 ArcticBeat05 opened 6 months ago
0
Randomly select MMMU answer when none is returned from the model

#1447 etr2460 closed 6 months ago
0
Fix Pydantic warning on data_test run

#1445 inwaves closed 6 months ago
0
Release 2.0.0

#1444 etr2460 closed 6 months ago
0
Use the API key for testing evals in CI

#1443 etr2460 closed 6 months ago
0
Add MMMU evals and runner

#1442 etr2460 closed 6 months ago
0
Run tests on all commits to main

#1441 etr2460 closed 6 months ago
0
Fix branch tests with empty API Key

#1440 etr2460 closed 6 months ago
0
Fix make decision prompt in ballots to send from system, not assistant

#1439 james-aung closed 6 months ago
0
Fix small typo in oaieval run function

#1438 inwaves closed 6 months ago
0
Possibility to sell high quality benchmarks

#1437 guliashvili closed 1 month ago
1
Add complete list of errors to MakeMeSay utils

#1436 inwaves closed 6 months ago
0
Change wrong kwargs name

#1435 LoryPack closed 6 months ago
0
Mismatch between LangChainChatModelCompletionFn code and registry

#1434 LoryPack closed 6 months ago
3
