HELMET: How to Evaluate Long-context Language Models Effectively and Thoroughly

[Paper]

HELMET (How to Evaluate Long-context Models Effectively and Thoroughly) is a comprehensive benchmark for long-context language models covering seven diverse categories of tasks. The datasets are application-centric and are designed to evaluate models at different lengths and levels of complexity. Please check out the paper for more details, and this repo will detail how to run the evaluation.

Release Progress

See CHANGELOG.md for updates and more details.

[x] HELMET Code
[x] HELMET data
[x] VLLM Support
[ ] Correlation analysis notebook
[ ] Retrieval setup

Setup

Please install the necessary packages with

pip install -r requirements.txt

Additionally, if you wish to use the API models, you will need to install the package corresponding to the API you wish to use

pip install openai # OpenAI API
pip install anthropic # Anthropic API
pip install google-generativeai # Google GenerativeAI API
pip install together # Together API

You should also set the environmental variables accordingly so the API calls can be made correctly. To see the variable that you should set up, check out model_utils.py and the corresponding class (e.g., GeminiModel).

Data

You can download the data with the script:

bash scripts/download_data.sh

This will first download the .tar.gz file and then decompress it to the data directory.

The data is hosted on this Huggingface repo, which stores our preprocessed data in jsonl files and is about 34GB in storage. For Recall, RAG, Passage Re-ranking, and ALCE, we either generate the data ourselves or do retrieval, so these are stored in jsonl files, whereas our script will load the data from Huggingface for the other tasks, LongQA, Summ, and ICL. The data also contains the key points extracted for evaluating summarization with model-based evaluation.

In the future, we will add support for simply loading from Huggingface with all the input-outputs formatted, so you can plug in your own evaluation pipeline easily, stay tuned!

Running evaluation

To run the evaluation, simply use one of the config files in the configs directory, you may also overwrite any arguments in the config file or add new arguments simply through the command line (see arguments.py):

python eval.py --config configs/cite.yaml --model_name_or_path {local model path or huggingface model name} --output_dir {output directory, defaults to output/{model_name}}

This will output the results file under the output directory in two files: .json contains all the data point details while .json.score only contain the aggregated metrics.

You may also run the whole suite with a simple bash statement:

bash scripts/run_eval.sh
bash scripts/run_api.sh # for the API models, note that API models results may vary due to the randomness in the API calls

Check out the script file for more details! See Others for the slurm scripts, easily collecting all the results, and using VLLM.

The full results from our evaluation are here.

Tested model that we didn't? Please email me the result files and I will add them to the spreadsheet! See Contacts for my email.

Model-based evaluation

To run the model-based evaluation for LongQA and Summarization, please make sure that you have set the environmental variables for OpenAI so you can make calls to GPT-4o, then you can run:

python scripts/eval_gpt4_longqa.py
python scripts/eval_gpt4_summ.py

# Alternatively, if you want to shard the process
bash scripts/eval_gpt4_longqa.sh
bash scripts/eval_gpt4_summ.sh

To specify which model/paths you want to run model-based evaluation for, check out the python scripts and modify the model_to_check field. You may also use Claude, Gemini, or other models for model-based evaluation by modifying the class but we have tested for gpt-4o-2024-05-13.

Adding new models

The existing code supports using HuggingFace-supported models and API models (OpenAI, Anthropic, Google, and Together). To add a new model or use a different framework (other than HuggingFace), you can modify the model_utils.py file. Specifically, you need to create a new class that implements prepare_inputs (how the inputs are processed) and generate functions. Then, you can add a new case to load_LLM. Please refer to the existing classes for examples.

Adding new tasks

To add a new task/dataset, you just need to modify the data.py file:

Create a function that specifies how to load the data:

Specify the string templates for the task through user_template, system_template, and prompt_template (which is usually just the concatenation of the two)
Process each sample to fit the specified templates (the tokenization code will call user_template.format(**test_sample) and same for system_template). Importantly, each sample should have a context field, which will be truncated automatically if the input is too long (e.g., for QA, this is the retrieved passages; for NarrativeQA, this is the book/script). You should use the question and answer field to make evaluation/printing easier.
Optionally, add a post_process function to process the model output (e.g., for MS MARCO, we use a ranking parse function; for RULER, we calculate the recall). There is also a default_post_process function that parses and calculate simple metrics like EM and F1 that you may use. This function should take in the model output and the test sample and return a tuple of (metrics, changed_output), the metrics (e.g., EM, ROUGE) are aggregated across all samples, and the changed_output are added to the test_sample and saved to the output file.
The function should return {'data': [list of data samples], 'prompt_template': prompt_template, 'user_template': user_template, 'system_template': system_template, 'post_process': [optional custom function]}.

Finally, simply add a new case to the load_data function that calls the function that you just wrote to load your data. You can refer to the existing tasks for examples (e.g., load_json_kv, load_narrativeqa, and load_msmarco_rerank).

Dataset correlation analysis

We also analyze the correlation between performance on different datasets. The code will be released soon.

Others

Collecting results

To quickly collect all the results, you can use the script: ```bash python scripts/collect_results.py ``` Please check out the script and modify the specific fields to fit your needs. For example, you can change the models, task configs, output directories, tags, and more.

Slurm scripts

I have also included the slurm scripts for running all the experiments from the paper. You can run the scripts with: ```bash sbatch scripts/run_eval_slurm.sh sbatch scripts/run_short_slurm.sh sbatch scripts/run_api.sh ``` Note that you may need to modify the script to fit your cluster setup. For example: - `--array 0-1` specifies the number of jobs to run, this index corresponds to the model index in the array. - You may also specify which set of models to run with `MNAME="${S_MODELS[$M_IDX]}"` or `MNAME="${L_MODELS[$M_IDX]}"` for the short and long models respectively. - `--gres=gpu:1` specifies the number of GPUs you want to use, for the larger models, you may need more GPUs (we use up to 8x80GB GPUs). - `--mail-user` specifies the email address to send the job status to. - `source env/bin/activate` specifies the virtual environment to use. - `MODEL_NAME="/path/to/your/model/$MNAME"` you should specify the path to your model here.

Using VLLM

To use VLLM to run the evaluation, you can simply add the `--use_vllm` flag to the command line like so: ```bash python eval.py --config configs/cite.yaml --use_vllm ``` Disclaimer: VLLM can be much faster than using the native HuggingFace generation; however, we found that the results can be slightly different, so we recommend using the native HuggingFace generation for the final evaluation. All reported results in the paper are from the native HuggingFace generation. The speedup is much more noticable for tasks that generates more tokens (e.g., summarization may see up to 2x speedup), whereas the speedup is less noticable for tasks that generate fewer tokens (e.g., JSON KV may see less than 5% speedup).

Contacts

If you have any questions, please email me at hyen@cs.princeton.edu. If you encounter any problems, you can also open an issue here. Please try to specify the problem with details so we can help you better and quicker!

Citation

If you find our work useful, please cite us:

@misc{yen2024helmetevaluatelongcontextlanguage,
      title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly}, 
      author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
      year={2024},
      eprint={2410.02694},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.02694}, 
}

Please also cite the original dataset creators, listed below:

Citations

``` @article{Liu2023LostIT, title={Lost in the Middle: How Language Models Use Long Contexts}, author={Nelson F. Liu and Kevin Lin and John Hewitt and Ashwin Paranjape and Michele Bevilacqua and Fabio Petroni and Percy Liang}, journal={Transactions of the Association for Computational Linguistics}, year={2023}, volume={12}, pages={157-173}, url={https://api.semanticscholar.org/CorpusID:259360665} } @inproceedings{ hsieh2024ruler, title={{RULER}: What{\textquoteright}s the Real Context Size of Your Long-Context Language Models?}, author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Boris Ginsburg}, booktitle={First Conference on Language Modeling}, year={2024}, url={https://openreview.net/forum?id=kIoBbc76Sy} } @inproceedings{mallen-etal-2023-trust, title = "When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories", author = "Mallen, Alex and Asai, Akari and Zhong, Victor and Das, Rajarshi and Khashabi, Daniel and Hajishirzi, Hannaneh", editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki", booktitle = acl, month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.acl-long.546", doi = "10.18653/v1/2023.acl-long.546", pages = "9802--9822", } @inproceedings{yang-etal-2018-hotpotqa, title = "{H}otpot{QA}: A Dataset for Diverse, Explainable Multi-hop Question Answering", author = "Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William and Salakhutdinov, Ruslan and Manning, Christopher D.", editor = "Riloff, Ellen and Chiang, David and Hockenmaier, Julia and Tsujii, Jun{'}ichi", booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing", month = oct # "-" # nov, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/D18-1259", doi = "10.18653/v1/D18-1259", pages = "2369--2380", } @inproceedings{joshi2017triviaqa, title = "{T}rivia{QA}: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension", author = "Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke", editor = "Barzilay, Regina and Kan, Min-Yen", booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = jul, year = "2017", address = "Vancouver, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/P17-1147", doi = "10.18653/v1/P17-1147", pages = "1601--1611", } @inproceedings{petroni-etal-2021-kilt, title = "{KILT}: a Benchmark for Knowledge Intensive Language Tasks", author = {Petroni, Fabio and Piktus, Aleksandra and Fan, Angela and Lewis, Patrick and Yazdani, Majid and De Cao, Nicola and Thorne, James and Jernite, Yacine and Karpukhin, Vladimir and Maillard, Jean and Plachouras, Vassilis and Rockt{\"a}schel, Tim and Riedel, Sebastian}, booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies", month = jun, year = "2021", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2021.naacl-main.200", doi = "10.18653/v1/2021.naacl-main.200", pages = "2523--2544", } @article{kwiatkowski2019natural, title = "Natural Questions: A Benchmark for Question Answering Research", author = "Kwiatkowski, Tom and Palomaki, Jennimaria and Redfield, Olivia and Collins, Michael and Parikh, Ankur and Alberti, Chris and Epstein, Danielle and Polosukhin, Illia and Devlin, Jacob and Lee, Kenton and Toutanova, Kristina and Jones, Llion and Kelcey, Matthew and Chang, Ming-Wei and Dai, Andrew M. and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav", editor = "Lee, Lillian and Johnson, Mark and Roark, Brian and Nenkova, Ani", journal = "Transactions of the Association for Computational Linguistics", volume = "7", year = "2019", address = "Cambridge, MA", publisher = "MIT Press", url = "https://aclanthology.org/Q19-1026", doi = "10.1162/tacl_a_00276", pages = "452--466", } @inproceedings{gao2023alce, title = "Enabling Large Language Models to Generate Text with Citations", author = "Gao, Tianyu and Yen, Howard and Yu, Jiatong and Chen, Danqi", editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika", booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing", month = dec, year = "2023", address = "Singapore", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.emnlp-main.398", doi = "10.18653/v1/2023.emnlp-main.398", pages = "6465--6488", } @inproceedings{stelmakh2022asqa, title = "{ASQA}: Factoid Questions Meet Long-Form Answers", author = "Stelmakh, Ivan and Luan, Yi and Dhingra, Bhuwan and Chang, Ming-Wei", booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing", month = dec, year = "2022", address = "Abu Dhabi, United Arab Emirates", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.emnlp-main.566", doi = "10.18653/v1/2022.emnlp-main.566", pages = "8273--8288", } @inproceedings{fan-etal-2019-eli5, title = "{ELI}5: Long Form Question Answering", author = "Fan, Angela and Jernite, Yacine and Perez, Ethan and Grangier, David and Weston, Jason and Auli, Michael", booktitle = acl, year = "2019", url = "https://aclanthology.org/P19-1346", doi = "10.18653/v1/P19-1346", pages = "3558--3567", } @article{rubin2022qampari, title={{QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs}}, author={Rubin, Samuel Joseph Amouyal Ohad and Yoran, Ori and Wolfson, Tomer and Herzig, Jonathan and Berant, Jonathan}, journal={arXiv preprint arXiv:2205.12665}, year={2022}, url="https://arxiv.org/abs/2205.12665" } @misc{bajaj2018ms, title={MS MARCO: A Human Generated MAchine Reading COmprehension Dataset}, author={Payal Bajaj and Daniel Campos and Nick Craswell and Li Deng and Jianfeng Gao and Xiaodong Liu and Rangan Majumder and Andrew McNamara and Bhaskar Mitra and Tri Nguyen and Mir Rosenberg and Xia Song and Alina Stoica and Saurabh Tiwary and Tong Wang}, year={2018}, eprint={1611.09268}, archivePrefix={arXiv}, primaryClass={cs.CL}, url="https://arxiv.org/abs/1611.09268" } @article{kocisky2018narrativeqa, title = "The {N}arrative{QA} Reading Comprehension Challenge", author = "Ko{\v{c}}isk{\'y}, Tom{\'a}{\v{s}} and Schwarz, Jonathan and Blunsom, Phil and Dyer, Chris and Hermann, Karl Moritz and Melis, G{\'a}bor and Grefenstette, Edward", journal = "Transactions of the Association for Computational Linguistics", volume = "6", year = "2018", address = "Cambridge, MA", publisher = "MIT Press", url = "https://aclanthology.org/Q18-1023", doi = "10.1162/tacl_a_00023", pages = "317--328" } @inproceedings{ shen2022multilexsum, title={Multi-LexSum: Real-world Summaries of Civil Rights Lawsuits at Multiple Granularities}, author={Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey}, booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track}, year={2022}, url={https://openreview.net/forum?id=z1d8fUiS8Cr} } @misc{zhang2024inftybenchextendinglongcontext, title={$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens}, author={Xinrong Zhang and Yingfa Chen and Shengding Hu and Zihang Xu and Junhao Chen and Moo Khai Hao and Xu Han and Zhen Leng Thai and Shuo Wang and Zhiyuan Liu and Maosong Sun}, year={2024}, eprint={2402.13718}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2402.13718}, } @inproceedings{li-roth-2002-learning, title = "Learning Question Classifiers", author = "Li, Xin and Roth, Dan", booktitle = "{COLING} 2002: The 19th International Conference on Computational Linguistics", year = "2002", url = "https://aclanthology.org/C02-1150", } @article{Liu2019BenchmarkingNL, title={Benchmarking Natural Language Understanding Services for building Conversational Agents}, author={Xingkun Liu and Arash Eshghi and Pawel Swietojanski and Verena Rieser}, journal={ArXiv}, year={2019}, volume={abs/1903.05566}, url={https://api.semanticscholar.org/CorpusID:76660838} } @inproceedings{casanueva-etal-2020-efficient, title = "Efficient Intent Detection with Dual Sentence Encoders", author = "Casanueva, I{\~n}igo and Tem{\v{c}}inas, Tadas and Gerz, Daniela and Henderson, Matthew and Vuli{\'c}, Ivan", editor = "Wen, Tsung-Hsien and Celikyilmaz, Asli and Yu, Zhou and Papangelis, Alexandros and Eric, Mihail and Kumar, Anuj and Casanueva, I{\~n}igo and Shah, Rushin", booktitle = "Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI", month = jul, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2020.nlp4convai-1.5", doi = "10.18653/v1/2020.nlp4convai-1.5", pages = "38--45", } @inproceedings{larson-etal-2019-evaluation, title = "An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction", author = "Larson, Stefan and Mahendran, Anish and Peper, Joseph J. and Clarke, Christopher and Lee, Andrew and Hill, Parker and Kummerfeld, Jonathan K. and Leach, Kevin and Laurenzano, Michael A. and Tang, Lingjia and Mars, Jason", editor = "Inui, Kentaro and Jiang, Jing and Ng, Vincent and Wan, Xiaojun", booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)", month = nov, year = "2019", address = "Hong Kong, China", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/D19-1131", doi = "10.18653/v1/D19-1131", pages = "1311--1316", } ```

princeton-nlp / HELMET

readme