stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Solidify all decisions for commonsense-related question answering #526

Closed · rishibommasani closed this issue 2 years ago

rishibommasani commented 2 years ago

@michiyasunaga could you look at this sometime tomorrow (Wednesday)? We need to solidify these decisions before the run.

michiyasunaga commented 2 years ago
percyliang commented 2 years ago

i. Those two seem good to me.

ii. Yes, this is what we discussed, but I didn't realize that the CausalLM scoring method is way better for HellaSwag (like a 50-point difference!). I think we should move the CLM code into the adapter (which we discussed before) and use that for any LM-like multiple-choice task, which would be useful for BLIMP as well.

iii. Let's rename to "commonsense" then.
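For reference, the CausalLM scoring method in item ii is the standard causal-LM approach to multiple choice: concatenate the question with each candidate answer, sum the model's token log-probabilities over the candidate tokens, and pick the highest-scoring choice. Below is a minimal sketch of that scoring method, assuming a HuggingFace-style causal LM (`gpt2` here is purely illustrative); it is not HELM's adapter code.

```python
# Minimal sketch of causal-LM (CLM) scoring for multiple choice.
# NOT HELM's adapter implementation; model choice is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def clm_score(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position t predict token t+1, so drop the last position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the continuation tokens, not the shared context.
    # NB: tokenizing context and context+continuation separately can
    # misalign at a BPE boundary; a real adapter must handle that case.
    cont_start = context_ids.shape[1]
    cont_tokens = full_ids[0, cont_start:]
    token_log_probs = log_probs[cont_start - 1 :].gather(1, cont_tokens.unsqueeze(1))
    return token_log_probs.sum().item()


def pick_choice(context: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest summed log-probability."""
    return max(range(len(choices)), key=lambda i: clm_score(context, choices[i]))


# Illustrative usage with a made-up HellaSwag-style item:
context = "Question: Where would you put a pie to cool? Answer:"
choices = [" on a windowsill", " in the oven", " under the bed"]
print(pick_choice(context, choices))
```

A common variant normalizes the summed log-probability by continuation length so that shorter answers are not favored; whether and how to normalize is exactly the kind of decision the adapter refactor would centralize.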

percyliang commented 2 years ago

For ii, tracking in #550

rishibommasani commented 2 years ago

Closing, with remainder handled in #550