teetone closed this 2 years ago
We can have every scenario add the following line to its __init__() method:
self.random: random.Random = random.Random(SEED)
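As a minimal sketch of what this could look like in a scenario class: `ToyScenario`, the value of `SEED`, and the body of `get_instances()` below are hypothetical and only illustrate the pattern; they are not taken from the codebase.

```python
import random
from typing import List

SEED = 0  # assumed value; the real constant may differ


class ToyScenario:
    """Hypothetical scenario that owns its own random generator."""

    def __init__(self) -> None:
        # Private generator: its state is independent of the global `random` module.
        self.random: random.Random = random.Random(SEED)

    def get_instances(self) -> List[int]:
        # Shuffle with the scenario's own generator, never the global one.
        items = list(range(10))
        self.random.shuffle(items)
        return items
```

Because the generator is private to the instance, calls to the global `random` module made elsewhere should not change what a freshly constructed scenario returns.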
Scenarios whose randomness has been checked to be self-contained:
The Pile
ICE
TwitterAAE
WikiText-103
NaturalQuestions
HellaSwag
OpenBookQA
MMLU
BoolQ
NewsQA
NarrativeQA
MS MARCO Passage Ranking
DROP
QuAC
Summarization, XSUM
Summarization, DailyMail
RAFT
IMDB
BLIMP
CoLA
WikidataFact
GSM8K
HumanEval
APPS
bAbI
LSAT
MATH
pattern induction
synthetic matching
synthetic substitution
synthetic fact matching
synthetic fact deduction
Dyck-n language
number relationship induction
RealToxicityPrompts
BOLD
CivilComments
Copyright
BBQ
TruthfulQA
Disinformation-Reiteration
Disinformation-Wedging
Dialogue
(subset of Common Sense Dialogues, Empathetic Dialogues, and Persona Chat)

I went over all the scenarios listed above as well as all the files matching src/benchmark/*_scenario*.
There were a few things that were not robust, so I submitted https://github.com/stanford-crfm/benchmarking/pull/351 to fix them. The issue was that randomness was set in __init__() but used in get_instances(), and we have no control over what happens in between. Still, this has not been an issue in practice, since in our current codebase (as of https://github.com/stanford-crfm/benchmarking/commit/8b94ab1e59310bc80c38f5852f66ba35c42d9a86) nothing happens between these two calls (see https://github.com/stanford-crfm/benchmarking/blob/main/src/benchmark/runner.py#L67).
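To illustrate the concern, here is a hedged sketch contrasting a fragile pattern with one possible robust alternative. It assumes the fragile case seeds the global `random` module in `__init__()`; the class names, `SEED`, and the `run()` helper are hypothetical, and this is not necessarily what the PR does.

```python
import random
from typing import List

SEED = 0  # assumed value for illustration


class FragileScenario:
    """Seeds the shared global generator in __init__()."""

    def __init__(self) -> None:
        random.seed(SEED)

    def get_instances(self) -> List[int]:
        # Depends on the global state, which other code may have advanced.
        return random.sample(range(100), 5)


class RobustScenario:
    """Creates a fresh, private generator at the point of use."""

    def get_instances(self) -> List[int]:
        rng = random.Random(SEED)
        return rng.sample(range(100), 5)


def run(scenario, disturb: bool) -> List[int]:
    # `disturb` stands in for unrelated code running between
    # construction and get_instances().
    if disturb:
        random.random()
    return scenario.get_instances()
```

With `RobustScenario`, `run(..., disturb=True)` and `run(..., disturb=False)` return the same sample, whereas `FragileScenario` is only reproducible as long as nothing touches the global generator in between.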
Feel free to close when 351 merges.
Ensure results are reproducible