stanford-crfm/helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Double check everywhere we fix the random seed #315

teetone closed this issue 2 years ago

teetone commented 2 years ago

Ensure results are reproducible

dilarasoylu commented 2 years ago

Proposal

We can have every scenario add the following line to its __init__() method:

        self.random: random.Random = random.Random(SEED)
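For context, a minimal, self-contained sketch of the proposed pattern (the scenario name and data are illustrative, not from the codebase; real scenarios subclass the benchmark Scenario class):

    import random

    SEED = 0  # module-level constant shared by the scenario


    class MyScenario:  # hypothetical example scenario
        def __init__(self):
            # A per-scenario RNG keeps sampling reproducible without
            # touching the global random module state.
            self.random: random.Random = random.Random(SEED)

        def get_instances(self):
            # All sampling goes through self.random.
            data = list(range(10))
            return self.random.sample(data, 3)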

Scenario Checklist

Scenarios whose randomness has been checked to be self-contained:

dtsip commented 2 years ago

I went over all the scenarios listed above as well as all the files matching src/benchmark/*_scenario*.

There were a few things that were not robust, so I submitted https://github.com/stanford-crfm/benchmarking/pull/351 to fix them. The underlying issue was that randomness was seeded in __init__() but consumed in get_instances(), and we have no control over what happens in between. In practice this has not caused problems, since in the current codebase (as of https://github.com/stanford-crfm/benchmarking/commit/8b94ab1e59310bc80c38f5852f66ba35c42d9a86) nothing runs between the two calls (see https://github.com/stanford-crfm/benchmarking/blob/main/src/benchmark/runner.py#L67).
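To illustrate the more robust pattern (a hedged sketch of the idea, not the actual diff in PR 351; names are illustrative): seeding at the point of use makes the output independent of whatever runs between construction and sampling.

    import random

    SEED = 0  # fixed seed, as before


    class MyScenario:  # hypothetical illustration
        def get_instances(self):
            # Seeding here, rather than in __init__(), guarantees that code
            # running between construction and get_instances() cannot
            # advance or reseed this RNG.
            rng = random.Random(SEED)
            data = list(range(10))
            return rng.sample(data, 3)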

Feel free to close when #351 merges.