dtsip closed 2 years ago
There are two parts to this issue:
For the first, we should make sure we are consistent across scenarios before the next run. (We already are for scenarios I have visibility on and a quick skim tells me that most other things are also correct.)
For the second, I am not entirely sure I can do it before the big run, but it might not be a priority at the moment?
Is this done?
Not yet. We moved this to p2 since it does not affect the actual queries we are making.
Also, this wants to wait on https://github.com/stanford-crfm/benchmarking/issues/225.
Thinking about it a bit more, I am not sure there is much to do here.
If we interpret the adaptation method multiple_choice to be a/b/c/d, we are fine. That is, there is no reason to mark scenarios like IMDB as multiple choice, since generation is a more natural strategy.
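To make the distinction concrete, here is a rough sketch of the two prompt styles for a binary question. The function names and prompt layout are made up for illustration and are not taken from the benchmarking codebase; the only point is that the lettered format expects a letter as the reference answer, while the generation format expects the raw answer text.

```python
from string import ascii_uppercase


def format_multiple_choice(question, options):
    """Hypothetical helper: render options as lettered choices (A, B, ...)."""
    lines = [question]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def format_generation(question):
    """Hypothetical helper: ask the question and let the model generate the raw answer."""
    return f"{question}\nAnswer:"


def reference_output(options, correct_index, multiple_choice):
    """Expected completion: a letter for multiple choice, the raw text otherwise."""
    if multiple_choice:
        return ascii_uppercase[correct_index]
    return options[correct_index]


if __name__ == "__main__":
    question = "Is the sky blue?"
    options = ["Yes", "No"]
    print(format_multiple_choice(question, options))
    print(reference_output(options, 0, multiple_choice=True))
    print(format_generation(question))
    print(reference_output(options, 0, multiple_choice=False))
```

Under this reading, a scenario like IMDB would use the second style, with the sentiment label itself as the reference.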
Should binary tasks (e.g. BoolQ, toxicity detection) be multiple choice with answers A, B or use raw answers (i.e. Yes/No, True/False)?
Decision: