lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Improve reproducibility in utils_math.py #46

Open dustalov opened 1 month ago

dustalov commented 1 month ago

This update makes the following improvements to utils_math.py:

  1. Replaced the raw tqdm(range()) call with tqdm.auto.trange(), which selects a suitable progress bar for the current environment (e.g., notebooks).
  2. Added explicit random state handling in get_bootstrap_result to make sampling reproducible. Rather than relying on the global random state, each sampling round now uses its own seed value.
  3. Refactored the random seed initialization in get_bootstrap_result_style_control to use np.random.default_rng, so that random number generation is handled consistently across the module.

As a result, the confidence intervals (CIs) become more consistent across runs.
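
For illustration, here is a minimal sketch of the per-round seeding idea described above. It is not the code from utils_math.py: the names bootstrap_means and percentile_ci, the default seed, and the use of np.random.SeedSequence.spawn to derive one seed per round are all assumptions made for this example. The statistic is a plain mean rather than the model fitting done in the actual bootstrap.

```python
import numpy as np

try:
    # tqdm.auto picks a notebook or console progress bar as appropriate.
    from tqdm.auto import trange
except ImportError:
    # Fallback so the sketch still runs when tqdm is not installed.
    trange = range


def bootstrap_means(values, num_rounds=100, seed=42):
    """Hypothetical helper: bootstrap the mean with one seed per round."""
    values = np.asarray(values, dtype=float)
    # Derive an independent child seed for every round from one base seed,
    # so results do not depend on the global NumPy random state.
    child_seeds = np.random.SeedSequence(seed).spawn(num_rounds)
    means = np.empty(num_rounds)
    for i in trange(num_rounds):
        rng = np.random.default_rng(child_seeds[i])
        # Resample with replacement, keeping the original sample size.
        sample = rng.choice(values, size=values.size, replace=True)
        means[i] = sample.mean()
    return means


def percentile_ci(estimates, alpha=0.05):
    """Hypothetical helper: percentile confidence interval of the estimates."""
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return float(lower), float(upper)
```

Calling bootstrap_means twice with the same seed returns identical arrays, even if np.random.seed is changed in between, so the CI computed from them is the same on every run.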