stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Solidify all decisions for commonsense-related question answering #526

Closed · rishibommasani closed this issue 2 years ago

rishibommasani commented 2 years ago

@michiyasunaga could you look at this sometime tomorrow (Wednesday)? We need to solidify these decisions before the run.

michiyasunaga commented 2 years ago
percyliang commented 2 years ago

i. Those two seem good to me.

ii. Yes, this is what we discussed, but I didn't realize that the CausalLM scoring method is way better for HellaSwag (like a 50-point difference!). I think we should move the CLM code into the adapter (which we discussed before) and use that for any LM-like multiple-choice task, which would be useful for BLIMP as well.

iii. Let's rename to "commonsense" then.
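For reference, the CausalLM scoring method in item ii is the standard causal-LM approach to multiple choice: concatenate the question with each candidate answer, sum the model's token log-probabilities over the candidate tokens, and pick the highest-scoring choice. Below is a minimal sketch of that scoring method, assuming a HuggingFace-style causal LM (`gpt2` here is purely illustrative); it is not HELM's adapter code.

```python
# Minimal sketch of causal-LM (CLM) scoring for multiple choice.
# NOT HELM's adapter implementation; model choice is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def clm_score(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position t predict token t+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the continuation tokens, not the shared context.
    # NB: tokenizing context and context+continuation separately can
    # misalign at a BPE boundary; a real adapter must handle that case.
    cont_start = context_ids.shape[1]
    cont_tokens = full_ids[0, cont_start:]
    token_log_probs = log_probs[cont_start - 1 :].gather(1, cont_tokens.unsqueeze(1))
    return token_log_probs.sum().item()


def pick_choice(context: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest summed log-probability."""
    return max(range(len(choices)), key=lambda i: clm_score(context, choices[i]))


# Illustrative usage with a made-up HellaSwag-style item:
context = "Question: Where would you put a pie to cool? Answer:"
choices = [" on a windowsill", " in the oven", " under the bed"]
print(pick_choice(context, choices))
```

A common variant normalizes the summed log-probability by continuation length so that shorter answers are not favored; whether and how to normalize is exactly the kind of decision the adapter refactor would centralize.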

percyliang commented 2 years ago

For ii, tracking in #550

rishibommasani commented 2 years ago

Closing, with remainder handled in #550