stanford-crfm/helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Double check everywhere we fix the random seed #315

teetone closed this issue 2 years ago

teetone commented 2 years ago

Ensure results are reproducible

dilarasoylu commented 2 years ago

Proposal

We can have every scenario add the following line to its __init__() method:

        self.random: random.Random = random.Random(SEED)
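For context, a minimal, self-contained sketch of the proposed pattern (the scenario name and data are illustrative, not from the codebase; real scenarios subclass the benchmark Scenario class):

    import random

    SEED = 0  # module-level constant shared by the scenario


    class MyScenario:  # hypothetical example scenario
        def __init__(self):
            # A per-scenario RNG keeps sampling reproducible without
            # touching the global random module state.
            self.random: random.Random = random.Random(SEED)

        def get_instances(self):
            # All sampling goes through self.random.
            data = list(range(10))
            return self.random.sample(data, 3)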

Scenario Checklist

Scenarios whose randomness has been checked to be self-contained:

dtsip commented 2 years ago

I went over all the scenarios listed above as well as all the files matching src/benchmark/*_scenario*.

There were a few things that were not robust, so I submitted https://github.com/stanford-crfm/benchmarking/pull/351 to fix them. The underlying issue was that randomness was seeded in __init__() but consumed in get_instances(), and we have no control over what happens in between. In practice this has not caused problems, since in the current codebase (as of https://github.com/stanford-crfm/benchmarking/commit/8b94ab1e59310bc80c38f5852f66ba35c42d9a86) nothing runs between the two calls (see https://github.com/stanford-crfm/benchmarking/blob/main/src/benchmark/runner.py#L67).
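To illustrate the more robust pattern (a hedged sketch of the idea, not the actual diff in PR 351; names are illustrative): seeding at the point of use makes the output independent of whatever runs between construction and sampling.

    import random

    SEED = 0  # fixed seed, as before


    class MyScenario:  # hypothetical illustration
        def get_instances(self):
            # Seeding here, rather than in __init__(), guarantees that code
            # running between construction and get_instances() cannot
            # advance or reseed this RNG.
            rng = random.Random(SEED)
            data = list(range(10))
            return rng.sample(data, 3)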

Feel free to close when #351 merges.