stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Standardize multiple choice #269

Closed dtsip closed 2 years ago

dtsip commented 2 years ago

Should binary tasks (e.g., BoolQ, toxicity detection) be multiple choice with answers A/B, or use raw answers (i.e., Yes/No, True/False)?
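For concreteness, here is a minimal sketch of the two prompt styles under discussion for a BoolQ-style instance. This is not HELM code; the passage and question are invented for illustration:

```python
passage = "The Matrix was released in 1999."
question = "Was The Matrix released in the 1990s?"

# Style 1: multiple choice with lettered answers.
mc_prompt = (
    f"{passage}\n"
    f"Question: {question}\n"
    "A. Yes\n"
    "B. No\n"
    "Answer:"
)
# Expected model output: "A"

# Style 2: generation with raw answers.
gen_prompt = (
    f"{passage}\n"
    f"Question: {question}\n"
    "Answer:"
)
# Expected model output: "Yes"
```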

Decision:

dtsip commented 2 years ago

There are two parts to this issue:

  1. Make sure the overall prompting strategy is correct (i.e., the prompts we generate follow this logic).
  2. Refactor all the tasks to move this behavior into the adapter (see the sketch below).

For the first, we should make sure we are consistent across scenarios before the next run. (We already are for the scenarios I have visibility into, and a quick skim suggests that most others are correct as well.)

For the second, I am not entirely sure I can do it before the big run, but it may not be a priority at the moment anyway.
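A hypothetical sketch of what item 2 could look like: scenarios stop emitting lettered options themselves, and the adapter derives them from each instance's references. All names here are illustrative stand-ins, not the actual adapter API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Reference:
    """Illustrative stand-in for a HELM reference: one candidate answer."""
    output: str
    is_correct: bool

def format_multiple_choice(question: str, references: List[Reference]) -> str:
    """Render lettered options from raw references, so scenarios only
    supply the raw answers (e.g., Yes/No) and the adapter owns the
    A/B/C/D formatting logic in one place."""
    letters = "ABCDEFGH"
    lines = [question]
    for letter, ref in zip(letters, references):
        lines.append(f"{letter}. {ref.output}")
    lines.append("Answer:")
    return "\n".join(lines)

# Example usage:
refs = [Reference("Yes", True), Reference("No", False)]
print(format_multiple_choice("Was The Matrix released in the 1990s?", refs))
```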

percyliang commented 2 years ago

Is this done?

dtsip commented 2 years ago

Not yet. We moved this to P2 since it does not affect the actual queries we are making.

dtsip commented 2 years ago

Also, this should wait on https://github.com/stanford-crfm/benchmarking/issues/225.

dtsip commented 2 years ago

Thinking about it a bit more, I am not sure there is much to do here.

If we interpret the adaptation method `multiple_choice` as always meaning lettered answers (A/B/C/D), we are fine. That is, there is no reason to mark scenarios like IMDB as multiple choice, since generation is a more natural strategy for them.
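In other words, the choice is made per scenario. A small illustrative sketch (the scenario names and method strings are assumptions for illustration, not the exact constants in the codebase):

```python
# Hypothetical per-scenario configuration following the decision above:
# "multiple_choice" always means lettered A/B/C/D options, and tasks whose
# natural answers are raw labels use "generation" instead.
ADAPTATION_METHODS = {
    "mmlu": "multiple_choice",  # inherently lettered-choice questions
    "imdb": "generation",       # model outputs the raw label directly
}
```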