stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

StrategyQA and some ideas #1868

Closed AdamSobieski closed 1 month ago

AdamSobieski commented 11 months ago

Hello. I am interested in the evaluation of AI systems and LLMs both in general and specifically with respect to reading comprehension, story comprehension (e.g., NarrativeQA), question-answering, and question-answering strategies.

With respect to AI evaluation in general, I find the following forefront R&D topics interesting:

  1. automatic item generation, e.g., using templates, LLMs [1][2], or both (e.g., Guidance); a minimal template-based sketch follows this list,
  2. item response theory and other means of measuring, comparing, and evaluating items,
  3. adaptivity and parallelism: while evaluating multiple instances of an AI system, items could be selected based on the performance of those instances on previous items.
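
To make the first topic concrete, here is a minimal sketch of template-based item generation for yes/no strategy questions. The template, slot fillers, and size annotations are all made up for illustration; in practice an LLM could propose or paraphrase the templates and fillers.

```python
import itertools
import random

# Hypothetical template and slot fillers for yes/no strategy questions;
# the toy size annotations (rough size in meters) drive the gold answer.
TEMPLATE = "Could {entity} fit inside {container}?"
ENTITIES = {"a housecat": 0.5, "an elephant": 3.0, "a smartphone": 0.15}
CONTAINERS = {"a shoebox": 0.35, "a shipping container": 12.0}


def generate_items(seed: int = 0):
    """Instantiate the template for every slot combination and derive the gold
    answer from the size annotations attached to the slot fillers."""
    items = []
    for (entity, e_size), (container, c_size) in itertools.product(
        ENTITIES.items(), CONTAINERS.items()
    ):
        items.append(
            {
                "question": TEMPLATE.format(entity=entity, container=container),
                "answer": "Yes" if e_size < c_size else "No",
            }
        )
    random.Random(seed).shuffle(items)
    return items


if __name__ == "__main__":
    for item in generate_items():
        print(item["question"], "->", item["answer"])
```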

With respect to question-answering strategies in particular, has the StrategyQA dataset been considered for HELM? Thank you.

References

[1] Laverghetta Jr., Antonio, and John Licato. "Generating better items for cognitive assessments using large language models." (2023).

[2] Olney, Andrew M. "Generating multiple choice questions from a textbook: LLMs match human performance on most metrics." In AIED Workshops. 2023.

yifanmai commented 11 months ago

Thanks for the suggestions! StrategyQA is very relevant to HELM for assessing implicit reasoning, so I think adding it would be a good idea.
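
For concreteness, a rough sketch of what a StrategyQA scenario could look like. This assumes HELM's `Scenario` / `Instance` / `Reference` interface and a locally downloaded copy of the StrategyQA training JSON (records with `question` and boolean `answer` fields); the exact imports, the `get_instances` signature, and the train/test split shown here are assumptions and may differ from the current codebase.

```python
import json
import os
from typing import List

# Assumes HELM's scenario interface; exact imports and the get_instances
# signature may differ between HELM versions.
from helm.benchmark.scenarios.scenario import (
    Scenario, Instance, Reference, Input, Output,
    CORRECT_TAG, TRAIN_SPLIT, TEST_SPLIT,
)


class StrategyQAScenario(Scenario):
    """Boolean implicit-reasoning questions from StrategyQA (Geva et al., 2021)."""

    name = "strategy_qa"
    description = "Implicit multi-step reasoning questions with yes/no answers."
    tags = ["question_answering", "reasoning"]

    def get_instances(self, output_path: str) -> List[Instance]:
        # Hypothetical local copy of the StrategyQA training file, which contains
        # gold boolean answers; for illustration it is simply split in two.
        with open(os.path.join(output_path, "strategyqa_train.json")) as f:
            records = json.load(f)
        instances: List[Instance] = []
        for i, record in enumerate(records):
            answer = "Yes" if record["answer"] else "No"
            split = TRAIN_SPLIT if i % 2 == 0 else TEST_SPLIT
            instances.append(
                Instance(
                    input=Input(text=record["question"]),
                    references=[Reference(Output(text=answer), tags=[CORRECT_TAG])],
                    split=split,
                )
            )
        return instances
```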

  1. automatic item generation: We aren't working on this in HELM, but another researcher in the Stanford NLP group is working on a paper on something similar.
  2. item response theory: We don't have any scenarios with item difficulty annotations, but if we did, we should be able to support them in HELM (i.e., as additional per-item metadata) and use them for computing metrics; a toy sketch follows this list. One could also imagine using LLMs to produce the item difficulty annotations.
  3. adaptivity: (copied and pasted from our discussion elsewhere) I can see this being useful, especially for making evaluations more efficient by reducing the number of instances you would have to present to the LLM. However, it is tricky because (1) it would require a benchmark with difficulty annotations on the instances, and (2) HELM's architecture does not support it: we sample the evaluation instances randomly before the evaluation, so there is no way to pick an evaluation instance based on previous results. A self-contained sketch of adaptive item selection also follows this list.
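
To illustrate the second point, a toy example of carrying per-item difficulty as extra metadata and folding it into metric aggregation. The `AnnotatedItem` class and the difficulty weights are hypothetical, not HELM's actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotatedItem:
    """An evaluation instance with hypothetical per-item metadata (e.g., IRT difficulty)."""
    question: str
    correct_answer: str
    metadata: Dict[str, float] = field(default_factory=dict)


def difficulty_weighted_accuracy(items: List[AnnotatedItem], predictions: List[str]) -> float:
    """Accuracy where each item contributes proportionally to its annotated difficulty,
    so solving hard items moves the score more than solving easy ones."""
    total_weight = 0.0
    earned = 0.0
    for item, prediction in zip(items, predictions):
        weight = item.metadata.get("difficulty", 1.0)
        total_weight += weight
        if prediction.strip().lower() == item.correct_answer.strip().lower():
            earned += weight
    return earned / total_weight if total_weight > 0 else 0.0


items = [
    AnnotatedItem("Could a sloth win a 100m sprint against a human?", "No", {"difficulty": 0.3}),
    AnnotatedItem("Did Aristotle use a laptop?", "No", {"difficulty": 1.7}),
]
print(difficulty_weighted_accuracy(items, ["No", "Yes"]))  # easy item right, hard item wrong
```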
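
And for the third point, a self-contained sketch (outside HELM's current random-sampling architecture) of adaptive item selection: a two-parameter logistic IRT model, an ability estimate updated after each response, and the next item chosen to maximize Fisher information. The item bank and its parameters are made up, and the system under test is a stub.

```python
import math
from typing import Dict, List


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability a system with ability theta answers an item with
    discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def update_theta(theta: float, responses: List[Dict], lr: float = 0.5, steps: int = 50) -> float:
    """Crude maximum-likelihood update of theta by gradient ascent on the 2PL log-likelihood."""
    for _ in range(steps):
        grad = sum(r["a"] * (r["correct"] - p_correct(theta, r["a"], r["b"])) for r in responses)
        theta += lr * grad / max(len(responses), 1)
    return theta


def adaptive_evaluation(item_bank: List[Dict], answer_fn, num_items: int = 3) -> float:
    """Repeatedly pick the unanswered item with the highest information at the current
    ability estimate, query the system under test, and re-estimate theta."""
    theta, responses, remaining = 0.0, [], list(item_bank)
    for _ in range(min(num_items, len(remaining))):
        item = max(remaining, key=lambda it: fisher_information(theta, it["a"], it["b"]))
        remaining.remove(item)
        correct = answer_fn(item["question"]) == item["answer"]
        responses.append({"a": item["a"], "b": item["b"], "correct": 1.0 if correct else 0.0})
        theta = update_theta(theta, responses)
    return theta


# Hypothetical item bank with made-up 2PL parameters (a = discrimination, b = difficulty).
bank = [
    {"question": "Is the sky blue on a clear day?", "answer": "Yes", "a": 1.2, "b": -2.0},
    {"question": "Did Aristotle use a laptop?", "answer": "No", "a": 1.0, "b": 0.0},
    {"question": "Would a vegan eat a hamburger made of beef?", "answer": "No", "a": 1.5, "b": 1.0},
]
print(adaptive_evaluation(bank, lambda q: "No"))  # stub system that always answers "No"
```
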
yifanmai commented 1 month ago

Closing this issue due to staleness - feel free to reopen if you're planning to work on StrategyQA further.